Data Science and Data Analytics (En)
Edited by
Amit Kumar Tyagi
School of Computer Science and Engineering Vellore Institute of
Technology
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
© 2022 selection and editorial matter, Amit Kumar Tyagi; individual chapters, the contributors
Reasonable efforts have been made to publish reliable data and information, but the author and
publisher cannot assume responsibility for the validity of all materials or the consequences of their
use. The authors and publishers have attempted to trace the copyright holders of all material
reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let
us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access
www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive,
Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact
[email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
Typeset in Times
by MPS Limited, Dehradun
Contents
Preface
Editors
Contributors
Index
Contributors
Aishwarya
Department of Computer Science and Engineering
Nitte Mahalinga Adyanthaya Memorial Institute of Technology
Karnataka, India
Felix Albu
Department of Electronics
Valahia University of Targoviste
Targoviste, Romania
Paul Anand
University of Florida
Gainesville, Florida, USA
Arunakumari B.N.
BMS Institute of Technology and Management
Bengaluru, India
Nebojsa Bacanin
Singidunum University
Belgrade, Serbia
Zubair Baig
School of Information Technology
Faculty of Science
Engineering and Built Environment
Deakin University, Australia
Raswitha Bandi
Vallurupalli Nageswara Rao Vignana Jyothi Institute of Engineering and
Technology
Hyderabad, India
Kanchipuram BasavaRaju
Sreenidhi Institute of Science and Technology
Hyderabad, India
Sumita Basu
Department of Mathematics
Bethune College
Kolkata, India
Elizabeth Behrman
Department of Physics and Mathematics
Wichita State University
Wichita, Kansas, USA
Timea Bezdan
Singidunum University
Belgrade, Serbia
Pulak Kanti Bhowmick
Mawlana Bhashani Science and Technology University
Santosh, Bangladesh
S. Kumar Chandar
School of Business and Management
CHRIST (Deemed to be University)
Bangalore, India
A. Chandrasekar
Department of Computer Science Engineering
St. Joseph’s College of Engineering
Chennai, India
Sumika Chauhan
Department of Electrical and Instrumentation Engineering
Sant Longowal Institute of Engineering and Technology
Longowal, India
Niranjan N. Chiplunkar
Department of Computer Science and Engineering
Nitte Mahalinga Adyanthaya Memorial Institute of Technology
Karnataka, India
Aleksa Cuk
Singidunum University
Belgrade, Serbia
Soumi Dutta
Institute of Engineering and Management
Kolkata, India
Roshan Fernandes
Department of Computer Science and Engineering
Nitte Mahalinga Adyanthaya Memorial Institute of Technology
Karnataka, India
R. Gayathri
Department of Electronics and Communication Engineering
Bannari Amman Institute of Technology
Tamilnadu, India
Arijit Ghosal
St. Thomas’ College of Engineering and Technology
Kolkata, India
Sreeya Ghosh
Department of Applied Mathematics
University of Calcutta
Kolkata, India
Shreyas Hingmire
Department of Information Technology
Atharva College of Engineering
Mumbai, India
Nazrul Islam
Mawlana Bhashani Science and Technology University
Santosh, Bangladesh
Nazura Javed
St. Francis College
Bangalore Central University
Bengaluru, India
S. Jeyanthi
Vellore Institute of Technology
Chennai, India
V. Kakulapati
Sreenidhi Institute of Science and Technology
Hyderabad, India
P. Shiva Kalyan
Accenture
Hyderabad, India
Jawwad Khan
Department of Information Technology
Atharva College of Engineering
Mumbai, India
Abhishek Krishnaswami
Vellore Institute of Technology
Chennai, India
Sanjay Kumar
National Institute of Technology, Raipur
Raipur, India
Yuvaraj L.
Vellore Institute of Technology
Chennai, India
Pradeep M.
Vellore Institute of Technology
Chennai, India
Joyston Menezes
Department of Computer Science and Engineering
Nitte Mahalinga Adyanthaya Memorial Institute of Technology
Karnataka, India
Aparna Mohan
Vellore Institute of Technology
Chennai, India
M. Leeban Moses
Department of Electronics and Communication Engineering
Bannari Amman Institute of Technology
Tamilnadu, India
Jehan Murugadhas
Information Technology Department
University of Technology and Applied Sciences
Nizwa, Oman
Vijaya Padmanabha
Department of Mathematics and Computer Science
Modern College of Business and Science
Muscat, Oman
Tannistha Pal
Department of Electronics and Communication Engineering
Institute of Engineering and Management
Kolkata, India
Ashutosh Pandey
Department of Information Technology
Atharva College of Engineering
Mumbai, India
Aruna Pavate
Department of Information Technology
Thakur College of Engineering
Mumbai University
and
Department of Information Technology
Atharva College of Engineering
Mumbai, India
T. Perarasi
Department of Electronics and Communication Engineering
Bannari Amman Institute of Technology
Tamilnadu, India
Srinivas Prasad
Gandhi Institute of Technology and Management
Visakhapatnam, India
Appiah Prince
ITMO University
St. Petersburg, Russia
Hitesh Punjabi
K.J. Somaiya Institute of Management and Research
Mumbai, India
Maheswari R.
Vellore Institute of Technology
Chennai, India
Shashidhar R.
JSS Science and Technology University
Mysuru, India
S. Radhika
School of Electrical and Electronics Engineering
Sathyabama Institute of Science and Technology
Chennai, India
Aman Rai
BMS Institute of Technology and Management
Bengaluru, India
Ratnavel Rajalakshmi
Vellore Institute of Technology
Chennai, India
Tarik A. Rashid
Computer Science and Engineering Department
University of Kurdistan Hewler
Erbil, Iraq
D. Anantha Reddy
National Institute of Technology, Raipur
Raipur, India
Bapuji Rao
CSEA
Indira Gandhi Institute of Technology, Sarang
Dhenkanal, India
Anisha P. Rodrigues
Department of Computer Science and Engineering
Nitte Mahalinga Adyanthaya Memorial Institute of Technology
Karnataka, India
Prantik Roy
St. Thomas’ College of Engineering and Technology
Kolkata, India
Jeyakrishna S.
Vellore Institute of Technology
Chennai, India
Sophia S.
Sri Krishna College of Engineering and Technology
Coimbatore, India
Arijit Santra
St. Thomas’ College of Engineering and Technology
Kolkata, India
Niloy Sarkar
The Neotia University
Kolkata, India
Siladitya Sarkar
St. Thomas’ College of Engineering and Technology
Kolkata, India
Manmohan Singh
Department of Electrical and Instrumentation Engineering
Sant Longowal Institute of Engineering and Technology
Longowal, India
G. Suganya
School of Computer Science and Engineering
Vellore Institute of Technology
Chennai, India
K. Tejaswini
Vallurupalli Nageswara Rao Vignana Jyothi Institute of Engineering and
Technology
Hyderabad, India
P. Thamaraiselvi
School of Management
Sri Krishna College of Engineering and Technology
Coimbatore, India
Ciza Thomas
Directorate of Technical Education
Government of Kerala, India
Kerala, India
Rakesh Tripathi
National Institute of Technology, Raipur
Raipur, India
Sridevi U.K.
PSG College of Technology
Coimbatore, India
Jia Uddin
Technology Studies Department
Endicott College
Woosong University
Daejeon, South Korea
Kanchana Devi V.
School of Computer Science and Engineering
Vellore Institute of Technology, Chennai
Chennai, India
K. Venkatachalam
School of Computer Science and Engineering
VIT Bhopal University
Bhopal, India
B. Vignesh
School of Computer Science and Engineering
Vellore Institute of Technology
Chennai, India
B. Vinoth
National Taiwan Normal University
Taipei, Taiwan
Ritika Yadav
St. Thomas’ College of Engineering and Technology
Kolkata, India
Miodrag Zivkovic
Singidunum University
Belgrade, Serbia
Section I
Introduction about Data Science
and Data Analytics
1
Data Science and Data Analytics: Artificial Intelligence
and Machine Learning Integrated Based Approach
CONTENTS
1.1 Introduction
1.2 Artificial Intelligence
1.3 Machine Learning (ML)
1.3.1 Regression
1.3.1.1 Linear Regression
1.3.1.2 Logistic Regression
1.3.2 Support Vector Machine (SVM)
1.4 Deep Learning (DL)
1.4.1 Methods for Deep Learning
1.4.1.1 Convolutional Neural Networks (CNNs)
1.4.1.2 Extreme Learning Machine
1.4.1.3 Transfer Learning (TL)
1.5 Bio-inspired Algorithms for Data Analytics
1.6 Conclusion
References
1.1 Introduction
Over the past few decades, companies have produced large amounts of data from
different sources: their own business applications, social media and other web outlets,
smartphones and client computing devices, and Internet of Things sensors and software.
This data is highly valuable for companies that have the resources in place to build on it.
The overall toolbox of methods for doing so is called data analytics.
Data analytics denotes the methods that extract an essential structure from the data. It
can be classified into four categories: descriptive, predictive, diagnostic, and prescriptive
data analytics. Of these, predictive analytics is the most dynamic approach; it combines
advanced statistical methods with Artificial Intelligence–based algorithms. Predictive
analytics (PA) is a branch of advanced analytics that is broadly used to predict uncertain
future events. A variety of data analysis, statistical modeling, and theoretical approaches
bring management, information technology, and business process modeling together to
produce these forecasts. The trends contained in historical and transactional data may be
used to identify future threats and opportunities. PA models capture relationships among
several variables and assign a score or weighting to assess the risk associated with a
complex set of conditions.
Predictive analytics helps companies anticipate outcomes and actions and act on evidence
rather than on a hunch or expectations. The value chain of predictive analytics is shown in
Figure 1.1.
1.3.1 Regression
Regression is one of the most powerful statistical methods in data analytics; it seeks to
describe the strength and nature of the relationship between one dependent variable and a
series of independent variables. Various types of regression are available in the literature
[2]. A few of them are discussed as follows.
\[
\hat{y}_i = \beta_0 + \sum_{j=1}^{m} x_{ij}\,\beta_j \tag{1.1}
\]
Here $\beta_0$ is the intercept, also called the bias in machine learning, and
$\beta = (\beta_1, \beta_2, \ldots, \beta_m)^T$ is the coefficient vector. The values of all
input variates should be numeric for the feasible computation of the covariate values. The
equation can be rewritten as
\[
\hat{Y} = X^T \hat{\beta} \tag{1.2}
\]
\[
RSS(\beta) = \sum_{i=1}^{N} \left( y_i - x_i^T \beta \right)^2 \tag{1.3}
\]
RSS(β) is a quadratic function of the parameters; thus, its minimum always exists. The
solution is easily obtained in matrix representation, written as
\[
RSS(\beta) = (y - X\beta)^T (y - X\beta) \tag{1.4}
\]
The minimization of the above-mentioned equation can be obtained by setting the first
derivative of RSS(β) equal to zero. Differentiating w.r.t. β, the obtained normal equation is
given as
\[
X^T (y - X\beta) = 0 \tag{1.5}
\]
If $X^T X$ is non-singular, then a unique solution is obtained by
\[
\hat{\beta} = (X^T X)^{-1} X^T y \tag{1.6}
\]
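As a hedged illustration of Equation (1.6), the sketch below estimates the coefficients for a small synthetic dataset with NumPy. The function name `fit_ols`, the toy data, and the decision to solve the linear system of Equation (1.5) directly instead of forming the inverse explicitly are assumptions made for this example, not part of the chapter.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients from the normal equations X^T X beta = X^T y."""
    # Prepend a column of ones so that the first coefficient is the intercept beta_0.
    X1 = np.column_stack([np.ones(len(X)), X])
    # Solving the system avoids computing (X^T X)^{-1} explicitly.
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

# Toy data generated from y = 1 + 2x (hypothetical values).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

beta_hat = fit_ols(X, y)
print(beta_hat)  # intercept first, then the slope
```

Because the toy data lie exactly on a line, the recovered intercept and slope match the values used to generate them; with noisy data the same call returns the least-squares fit.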
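\[
\log \frac{\Pr(y_i = 1 \mid X_i)}{\Pr(y_i = 0 \mid X_i)} = \sum_{k=0}^{m} x_{ik}\,\beta_k = X_i \beta \tag{1.7}
\]
The value of $x_{i0}$ is 1 and $\beta_0$ represents the intercept. As we know, in the case of
two-class classification the two class probabilities sum to 1, so
\[
\Pr(y_i = 1 \mid X_i) = \frac{\exp(X_i \beta)}{1 + \exp(X_i \beta)} \tag{1.8}
\]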
Parameter estimation in logistic regression models is accomplished by maximizing the
likelihood function. The joint conditional probability for all N points in the training data is
\[
\prod_{i=1}^{N} \Pr(y = y_i \mid X_i) \tag{1.9}
\]
where $y_i$, $i = 1, 2, \ldots, N$, are the labels in the training set. The log-likelihood for N
observations is
\[
L(\beta) = \sum_{i=1}^{N} \log\left[ \Pr(y = y_i \mid X_i) \right] \tag{1.10}
\]
For labels $y_i \in \{0, 1\}$, substituting Equation (1.8) gives
\[
L(\beta) = \sum_{i=1}^{N} \left\{ X_i \beta \cdot y_i - \log\left[ 1 + \exp(X_i \beta) \right] \right\} \tag{1.12}
\]
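Generally, the Newton-Raphson method is utilized for maximizing this log-likelihood, where
the coefficient vector is updated as
\[
\beta^{(t+1)} = \beta^{(t)} - \left[ \frac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta^T} \right]^{-1} \frac{\partial L(\beta)}{\partial\beta} \tag{1.13}
\]
where
\[
\frac{\partial L(\beta)}{\partial\beta} = \sum_{i=1}^{N} X_i \left( y_i - \frac{\exp(X_i \beta)}{1 + \exp(X_i \beta)} \right) \tag{1.14}
\]
\[
\frac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta^T} = -\sum_{i=1}^{N} X_i X_i^T \frac{\exp(X_i \beta)}{\left[ 1 + \exp(X_i \beta) \right]^2} \tag{1.15}
\]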
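The Newton-Raphson update of Equations (1.13) to (1.15) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function name `fit_logistic_newton`, the fixed iteration count, the zero starting point, and the toy data (chosen so that the classes overlap and the maximum is finite) are all hypothetical and not taken from the chapter.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25):
    """Maximize the log-likelihood of Equation (1.12) by Newton-Raphson steps."""
    # Column of ones so that x_i0 = 1 and beta_0 is the intercept.
    X1 = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))        # Pr(y_i = 1 | X_i), Eq. (1.8)
        gradient = X1.T @ (y - p)                    # Eq. (1.14)
        weights = p * (1.0 - p)                      # exp(z) / (1 + exp(z))^2
        hessian = -(X1 * weights[:, None]).T @ X1    # Eq. (1.15)
        beta = beta - np.linalg.solve(hessian, gradient)  # Eq. (1.13)
    return beta

# Toy one-feature data with overlapping classes (hypothetical values).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

beta_hat = fit_logistic_newton(x.reshape(-1, 1), y)
print(beta_hat)
```

The sketch has no safeguard against perfectly separable data, where the likelihood has no finite maximum and the Hessian becomes ill-conditioned; a practical implementation would add a convergence test and some form of regularization.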
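\[
\Pr(y_i = j \mid X_i) = \frac{\exp(X_i \beta_j)}{\sum_{k \in L} \exp(X_i \beta_k)} \tag{1.16}
\]
where $j, k \in L$ and L is the label index. Therefore, with j denoting the observed label of
the ith sample, the log-likelihood for N observations can be written as
\[
L(\beta) = \sum_{i=1}^{N} \left[ X_i \beta_j - \log\left( \sum_{k \in L} \exp(X_i \beta_k) \right) \right] \tag{1.17}
\]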
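\[
\begin{aligned}
\log \frac{\Pr(y = 1 \mid X_i)}{\Pr(y = C \mid X_i)} &= X_i \beta_1 \\
\log \frac{\Pr(y = 2 \mid X_i)}{\Pr(y = C \mid X_i)} &= X_i \beta_2 \\
&\;\;\vdots \\
\log \frac{\Pr(y = C-1 \mid X_i)}{\Pr(y = C \mid X_i)} &= X_i \beta_{C-1}
\end{aligned} \tag{1.18}
\]
It should be noted that for a particular $X_i$ the sum of all posterior probabilities is equal to 1.
\[
\begin{aligned}
\Pr(y = k \mid X_i) &= \frac{\exp(X_i \beta_k)}{1 + \sum_{j=1}^{C-1} \exp(X_i \beta_j)}, \quad k = 1, 2, \ldots, C-1 \\
\Pr(y = C \mid X_i) &= \frac{1}{1 + \sum_{j=1}^{C-1} \exp(X_i \beta_j)}
\end{aligned} \tag{1.19}
\]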
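To make Equation (1.19) concrete, the following sketch computes the C class probabilities from C − 1 coefficient vectors, with class C acting as the reference class. The function name `class_probabilities` and the coefficient values are assumptions made for illustration only.

```python
import numpy as np

def class_probabilities(x, betas):
    """Posterior probabilities of Equation (1.19).

    x     : feature vector X_i of length m
    betas : array of shape (C - 1, m), one row per non-reference class
    Returns an array of length C; the last entry belongs to reference class C.
    """
    scores = np.exp(betas @ x)      # exp(X_i beta_k) for k = 1, ..., C - 1
    denominator = 1.0 + scores.sum()
    return np.append(scores / denominator, 1.0 / denominator)

# Hypothetical coefficients for a three-class problem with two features.
betas = np.array([[0.5, -1.0],
                  [1.5, 0.2]])
x_i = np.array([1.0, 2.0])

probabilities = class_probabilities(x_i, betas)
print(probabilities, probabilities.sum())
```

Whatever coefficients are supplied, the returned probabilities sum to 1, which is the property noted before Equation (1.19).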
binary classes is accomplished using –1 and 1 entities like $y_i \in \{-1, 1\}$. For the
optimum hyperplane, the distance should be maximum [4]. To obtain the solution of the
optimization problem for the generalized separating optimum hyperplane, the following
equation is used:
\[
\text{Minimize } \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to } y_i(\langle \omega, x_i \rangle + b) \ge 1 - \xi_i;\ \xi_i \ge 0,\ i = 1, 2, \ldots, m \tag{1.20}
\]
Here ω represents a vector of dimension m, and b is the bias value. C indicates the penalty
for the error. $\xi_i$ represents the slack variable, which measures the distance between the
hyperplane and the misclassified data points. Including the constraints with the multipliers
α and γ, the Lagrangian of this problem is given as
\[
\iota(\omega, b, \xi, \alpha, \gamma) = \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{m} \xi_i
- \sum_{i=1}^{m} \alpha_i \left[ y_i(\langle \omega, x_i \rangle + b) - 1 + \xi_i \right]
- \sum_{i=1}^{m} \gamma_i \xi_i \tag{1.21}
\]
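To satisfy the KKT conditions and to minimize ι, Equation (1.21) is converted into the
dual quadratic problem described as
\[
\text{Maximize } \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \tag{1.22}
\]
\[
\text{s.t. } 0 \le \alpha_i \le C;\ i = 1, 2, \ldots, m; \qquad \sum_{i=1}^{m} \alpha_i y_i = 0 \tag{1.23}
\]
The resulting decision function is
\[
f(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i \langle x_i, x \rangle + b \right) \tag{1.24}
\]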
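A short sketch can show how a decision function of this kind is evaluated once the multipliers are known. The two training points, their multipliers, and the bias below are hand-derived values for a toy hard-margin problem and are assumptions of this example; the function name `svm_decision` is likewise hypothetical. The sketch classifies a new point x using inner products with the training points.

```python
import numpy as np

def svm_decision(x, train_x, train_y, alpha, b):
    """Sign of sum_i alpha_i * y_i * <x_i, x> + b for a new point x."""
    score = np.sum(alpha * train_y * (train_x @ x)) + b
    return int(np.sign(score))

# Toy problem: one point per class, symmetric about the origin.
train_x = np.array([[1.0, 1.0], [-1.0, -1.0]])
train_y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])   # satisfies sum_i alpha_i * y_i = 0
b = 0.0

# The primal weight vector implied by the multipliers: omega = sum_i alpha_i y_i x_i.
omega = (alpha * train_y) @ train_x

print(svm_decision(np.array([2.0, 0.0]), train_x, train_y, alpha, b))
print(svm_decision(np.array([-3.0, 1.0]), train_x, train_y, alpha, b))
```

With these values both training points lie exactly on the margin, so each satisfies $y_i(\langle \omega, x_i \rangle + b) = 1$; in practice the multipliers would come from a quadratic programming solver.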
F (X, W ) = Y (1.25)
Here, W represents the weights that describe the strength of the connections between the
neurons of neighboring layers. CNNs are used in image classification problems. The hidden
layers of a CNN consist of the elements shown in Figure 1.3: the convolution layer, the
pooling layer, and the fully connected layer [5]. In the standard architecture, an initial
convolution layer is followed by a pooling layer. The fully connected layers connect the
neurons of successive layers, as in an ANN structure, and the last of them is connected to
the output layer.
FIGURE 1.3 General model of CNN.
Forward propagation is accomplished by converting input data into output data using these
layers. The working procedure of each element is discussed here.
Convolution Layer
The convolution layer is the basic element of the CNN architecture. It extracts features by
combining linear and nonlinear operations, namely convolution and an activation function,
respectively. In convolution, a kernel (i.e., an array of numbers) is applied across an input
(a tensor). At each location, an element-wise product between the kernel elements and the
input tensor elements is computed and summed to obtain the output value at the
corresponding position of the output tensor, termed a feature map, as shown in Figure 1.4.
The same process is repeated with multiple kernels to generate an arbitrary number of
feature maps that represent different characteristics of the input tensor, so that each kernel
can be viewed as a feature extractor. The size and number of kernels are the two primary
hyperparameters that define the convolution operation.
Weight sharing in the convolution process has the following effects: (i) the local feature
patterns extracted by a kernel become translation invariant, because the kernel travels
across all image positions and detects the learned local patterns; (ii) spatial hierarchies of
feature patterns are learned by down-sampling in combination with a pooling operation, so
that a progressively wider field of view is captured; and (iii) model efficiency improves
because fewer parameters must be learned than in fully connected neural networks.
The convolution operation is performed using
\[
a_{ij} = \sigma\left( (W * X)_{ij} + b \right) \tag{1.26}
\]
Here, X is the input given to the layer, W is the kernel that slides over the input, and b
represents the bias.
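The operation in Equation (1.26) can be sketched with plain NumPy loops. Several choices here are assumptions of the example: a single-channel input, stride 1 with no padding, ReLU as the activation σ, and the kernel applied without flipping (cross-correlation, which is what most deep learning libraries call convolution). The names `conv2d` and `relu` are hypothetical.

```python
import numpy as np

def relu(z):
    """Rectified linear unit used here as the activation sigma."""
    return np.maximum(z, 0.0)

def conv2d(X, W, b=0.0):
    """Feature map a_ij = sigma((W * X)_ij + b) for a 2-D input and kernel."""
    kh, kw = W.shape
    out_h = X.shape[0] - kh + 1
    out_w = X.shape[1] - kw + 1
    a = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the kernel with the patch under it, then summed.
            a[i, j] = np.sum(W * X[i:i + kh, j:j + kw]) + b
    return relu(a)

# 4 x 4 input whose entries increase by 5 along the diagonal direction.
X = np.arange(16, dtype=float).reshape(4, 4)
# Kernel that subtracts the top-left value from the bottom-right value of each patch.
W = np.array([[-1.0, 0.0],
              [0.0, 1.0]])

feature_map = conv2d(X, W)
print(feature_map)
```

A 2 × 2 kernel on a 4 × 4 input yields a 3 × 3 feature map, and the same nine positions share the four kernel weights, which is the weight sharing described above.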
FIGURE 1.5 Activation functions for NNs: a) ReLU, b) sigmoid, and c) tanh.
The rectified linear unit (ReLU) is the most widely used nonlinear activation function for
three reasons. First, the partial derivative of ReLU is simple to compute. Second, in terms
of training time, saturating nonlinearities such as sigmoid are slower than non-saturating
nonlinearities such as ReLU. The ReLU is mathematically described as
\[
f(x) = \max(0, x) \tag{1.27}
\]
Third, ReLU does not suffer from vanishing gradients. However, a large gradient flowing
through the network can reduce the efficacy of ReLU: it may update the weight parameters
in such a way that the affected neurons are never activated again. This is known as the
dying ReLU problem. It is addressed by the leaky ReLU, which is defined by the following
conditions:
if x > 0, the function activates as f (x) = x, but
if x < 0, the function activates as f (x) = αx, where α is a constant with a small value.
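The two conditions above translate directly into code. In this minimal sketch the function name `leaky_relu` and the default slope α = 0.01 are assumptions; the text only requires α to be a small constant.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Return x where x > 0 and alpha * x elsewhere."""
    return np.where(x > 0, x, alpha * x)

def relu(x):
    """Standard ReLU for comparison: negative inputs are mapped to zero."""
    return np.maximum(x, 0.0)

inputs = np.array([-2.0, 0.0, 3.0])
print(relu(inputs))
print(leaky_relu(inputs))
```

For negative inputs the leaky variant keeps a small nonzero slope, so a gradient can still flow through the neuron, whereas the standard ReLU output and gradient are both zero there.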
Pooling Layer
A pooling layer introduces translation invariance to minor shifts and distortions. Its
down-sampling procedure decreases the in-plane dimensionality of the feature maps and
thereby also reduces the number of subsequent learnable parameters. For this purpose, a
window is chosen, and the pooling function is applied to the input elements lying in that
window. Max pooling and global average pooling are widely used pooling strategies
[5,6].
Max pooling – It is a widely used approach that extracts patches from the input feature
map, outputs the maximum value of each patch, and discards all other values. It reduces
the feature map size significantly.
Global average pooling – It down-samples each feature map into a 1 × 1 array by taking
the average of all the elements in that feature map, keeping the depth constant.
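Both pooling strategies can be sketched with NumPy reshaping and reductions. The assumptions in this example are a non-overlapping square window whose stride equals its size, feature maps stored as (height, width, depth), and the hypothetical function names `max_pool2d` and `global_average_pool`.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size patch."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Drop trailing rows/columns that do not fill a whole window.
    trimmed = feature_map[:h_out * size, :w_out * size]
    # Axes 1 and 3 index the position inside each patch.
    patches = trimmed.reshape(h_out, size, w_out, size)
    return patches.max(axis=(1, 3))

def global_average_pool(feature_maps):
    """Average each (height, width) map to a single value; depth is unchanged."""
    return feature_maps.mean(axis=(0, 1), keepdims=True)

fmap = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool2d(fmap)

stack = np.stack([fmap, 2.0 * fmap], axis=-1)   # two feature maps, shape (4, 4, 2)
averaged = global_average_pool(stack)

print(pooled)
print(averaged.shape)
```

The 4 × 4 map shrinks to 2 × 2 under max pooling, while global average pooling reduces each map in the stack to a 1 × 1 array and leaves the depth of 2 untouched.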
which approximates these N samples with zero error. The output of the network with L
hidden nodes and activation function $G(\alpha_i, b_i, X_j)$ is mathematically modeled as
\[
f_L(X_j) = \sum_{i=1}^{L} \beta_i \, G(\alpha_i \cdot X_j + b_i) = t_j, \quad j = 1, \ldots, N \tag{1.28}
\]
Here, $\alpha_i$ and $b_i$ are learning parameters of the hidden nodes: $\alpha_i$ is the weight vector
connecting the input nodes to the ith hidden node, and $b_i$ denotes the threshold of the ith
hidden node. $\beta_i$ and $t_j$ represent the output weight and the target values, whereas the
activation function $G(\alpha_i, b_i, X_j)$ gives the output of the ith hidden node. The equation is
given as
\[
H\beta = T \tag{1.29}
\]
where
\[
H(\alpha_1, \ldots, \alpha_L, b_1, \ldots, b_L, X_1, \ldots, X_N) =
\begin{bmatrix}
G(\alpha_1, b_1, X_1) & \cdots & G(\alpha_L, b_L, X_1) \\
\vdots & & \vdots \\
G(\alpha_1, b_1, X_N) & \cdots & G(\alpha_L, b_L, X_N)
\end{bmatrix}_{N \times L},
\quad
\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix},
\quad
T = \begin{bmatrix} T_1^T \\ \vdots \\ T_N^T \end{bmatrix}
\]
The output weights are obtained as
\[
\beta = H^{*} T \tag{1.30}
\]
where $H^{*}$ is the inverse of the output matrix H. The Moore-Penrose generalized inverse is
utilized for this purpose.
To obtain an enhanced and stable result from this network, a regularization term is added
to the β [8]. If the hidden layer has fewer neurons than there are training samples, β can
be represented as
\[
\beta = \left( \frac{I}{\lambda} + H^T H \right)^{-1} H^T T \tag{1.31}
\]
If the hidden layer has more nodes than there are training samples, β can be expressed as
\[
\beta = H^T \left( \frac{I}{\lambda} + H H^T \right)^{-1} T \tag{1.32}
\]
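The regularized extreme learning machine described above can be sketched as follows. Everything specific in this example is an assumption: tanh as the activation G, normally distributed random input weights and thresholds, the value of λ, the toy sine-fitting task, and the function names `train_elm` and `predict_elm`. Only the output weights β are computed; the hidden-node parameters stay at their random values.

```python
import numpy as np

def hidden_output(X, alpha, b):
    """Hidden-layer output matrix H (N x L) with G chosen as tanh."""
    return np.tanh(X @ alpha + b)

def train_elm(X, T, n_hidden=20, lam=1e3, seed=0):
    """Random hidden layer plus regularized least-squares output weights."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=(X.shape[1], n_hidden))   # input weights, never trained
    b = rng.normal(size=n_hidden)                      # hidden-node thresholds
    H = hidden_output(X, alpha, b)
    N, L = H.shape
    if L <= N:
        # Fewer hidden nodes than samples: solve an L x L system.
        beta = np.linalg.solve(np.eye(L) / lam + H.T @ H, H.T @ T)
    else:
        # More hidden nodes than samples: solve an N x N system.
        beta = H.T @ np.linalg.solve(np.eye(N) / lam + H @ H.T, T)
    return alpha, b, beta

def predict_elm(X, alpha, b, beta):
    return hidden_output(X, alpha, b) @ beta

# Toy regression task: approximate sin(x) on [-3, 3] (hypothetical data).
X = np.linspace(-3.0, 3.0, 30).reshape(-1, 1)
T = np.sin(X).ravel()

alpha, b, beta = train_elm(X, T)
mse = np.mean((predict_elm(X, alpha, b, beta) - T) ** 2)
print(mse)
```

The branch on L and N only selects the smaller linear system to solve; the sketch uses `np.linalg.solve` rather than an explicit matrix inverse for numerical stability.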
The basic requirement of any AI-based algorithm is that the training and target data must
lie in the same feature space and follow the same distribution. However, this assumption
does not hold in certain real-life applications. In such cases, an effective transfer of
knowledge can significantly improve learning performance by avoiding expensive
data-labeling efforts. If the feature space and/or the data distribution changes, a new model
must be created, and generating a new model from the ground up whenever a new dataset
arrives is costly. TL therefore reduces the need and effort to recollect vast volumes of
training data. Transfer learning is the process of transferring and using knowledge gained
in one or more source tasks to facilitate the learning of a related target task [9–11].
Traditional algorithms for data mining and deep learning make predictions on future data
using statistical models. These models are trained on previously collected labeled or
unlabeled training data. The aim of TL is to exploit data from a first (source) setting to
extract knowledge that may help when predictions are required in a second (target)
setting. The distinction between conventional learning methods and transfer learning
strategies is shown in Figure 1.7.
Standard machine learning methods learn each task from scratch, while transfer learning
transfers knowledge from previous tasks to a target task when the latter has less
high-quality training data.
On the basis of “what to transfer,” transfer learning methods in the three different settings
can be divided into four groups. The first group, the instance-transfer approach, assumes
that some part of the source domain data can be reused for learning in the target domain
[12,13]. In this case, instance reweighting and importance sampling techniques are used.
The second category is the feature-representation-transfer approach [14–17]. The concept
behind this method is to learn a good feature representation for the target domain. The
knowledge passed between domains in this group is encoded into the learned feature
representation, which is expected to substantially improve the performance of the target
task.
The third category is the parameter-transfer approach [18,19], in which certain parameters
or priors of the models are shared between the source and target tasks. The transferred
knowledge is encoded in the shared parameters or priors. Finally, the last category is the
relational-knowledge-transfer approach, which handles relations between different domains.
TABLE 1.1
Bio-inspired Algorithms for Data Analytics
Salp swarm algorithm and its variants [20,21] | Genetic algorithm and its variants [32,33] | Biogeography-based optimization (BBO) [36]
Slime mold algorithm [22] | Simulated annealing [34] | Artificial ecosystem-based optimization (AEO)
Particle swarm optimization (PSO) and its variants [23] | Cuckoo search algorithm | Invasive weed colony (IWC)
Coral reef optimizer (CRO) | Evolutionary strategy | Multi-species optimizer (PS2O)
Artificial bee colony (ABC) [24] | Genetic programming |
Squirrel search algorithm [25] | Differential Evolution (DE) [35] |
Whale optimization and its variants [26,27] | |
Grey wolf optimizer (GWO) and its variants [28] | |
Crow search algorithm [29] | |
Boosting salp swarm | |
Firefly swarm optimization | |
Cat swarm optimization (CSO) [30] | |
Ant colony optimization (ACO) [31] | |
1.6 Conclusion
Data analytics transforms the data collected by numerous organizations into practical
knowledge. Artificial intelligence has a huge scope in data analytics for processing and
analyzing data. Its advantages in data analytics are as follows: (i) automation becomes
easy with the application of AI; (ii) learning is progressive: AI algorithms can train a
machine to perform the desired operation; and (iii) neural networks make it easy to train
machines because the networks learn from their input data, analyze it, and identify the
correct dataset. Optimization algorithms are also a part of AI and are very useful for
obtaining enhanced results in different applications, such as biomedical signal processing
and the fault diagnosis of machines, where they help identify and predict faults.
REFERENCES
1. S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Englewood
Cliffs, New Jersey: Prentice Hall, 1995.
2. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data
Mining, Inference, and Prediction, 2nd ed. Springer Series in Statistics, 2009.
3. V. Kumar, Healthcare Data Analytics. Taylor & Francis, 2015.
4. A. Kumar and R. Kumar, “Time-frequency analysis and support vector machine in
automatic detection of defect from vibration signal of centrifugal pump”, Meas. J. Int.
Meas. Confed., vol. 108, no. April, pp. 119–133, 2017, doi:
10.1016/j.measurement.2017.04.041.
5. R. Yamashita, M. Nishio, R. K. G. Do, and K. Togashi, “Convolutional neural
networks: An overview and application in radiology”, Insights Imaging, vol. 9, pp. 611–
629, 2018.
6. S. Indolia, A. Kumar Goswami, S. P. Mishra, and P. Asopa, “Conceptual understanding
of convolutional neural network – A deep learning approach”, Procedia Comput. Sci.,
vol. 132, pp. 679–688, 2018, doi: 10.1016/j.procs.2018.05.069.
7. G. Huang, Q. Zhu, and C. Siew, “Extreme learning machine: Theory and applications”,
Neurocomputing, vol. 70, pp. 489–501, 2006, doi: 10.1016/j.neucom.2005.12.126.
8. D. Xiao, B. Li, and Y. Mao, “A multiple hidden layers extreme learning machine
method and its application”, Math. Probl. Eng., vol. 2017, pp 1–10, 2017.
9. S. J. Pan and Q. Yang, “A survey on transfer learning”, IEEE Trans. Knowl. Data Eng.,
vol. 22, no. 10, pp. 1345–1359, 2010, doi: 10.1109/TKDE.2009.191.
10. M. Kaboli, “A review of transfer learning algorithms”. Diss. Technische Universität
München, 2017.
11. D. Sarkar, R. Bali, and T. Ghosh, Hands-On Transfer Learning with Python: Implement
Advanced Deep Learning and Neural Network Models Using Tensor Flow and Keras.
Packt Publishing Ltd, 2018.
12. W. Dai, Q. Yang, and G.-R. Xue, “Boosting for transfer learning”, Proc. 24th Int. Conf.
Mach. Learn., pp. 193–200, 2007.
13. W. Dai, G.-R. Xue, Q. Yang, and Y. Yu, “Transferring naive Bayes classifiers for text
classification”, Proceedings – 22nd Assoc. Adv. Artif. Intell., pp. 540–545, 2007.
14. A. Argyriou, T. Evgeniou, and M. Pontil, “Multi-task feature learning”, Adv. Neural Inf.
Process. Syst., vol. 19, pp. 41–48, 2007.
15. S. I. Lee, V. Chatalbashev, D. Vickrey, and D. Koller, “Learning a meta-level prior for
feature relevance from multiple related tasks”, Proc. 24th Int. Conf. Mach. Learn., pp.
489–496, 2007.
16. T. Jebara, “Multi-task feature and kernel selection for SVMs”, Proc. 21st Int. Conf.
Mach. Learn., p. 55, 2004.
17. C. Wang and S. Mahadevan, “Manifold alignment using Procrustes analysis”, Proc.
25th Int. Conf. Mach. Learn., pp. 1120–1127, 2008.
18. E. V. Bonilla, K. M. A. Chai, and C. K. I. Williams, “Multi-task Gaussian process
prediction”, Adv. Neural Inf. Process. Syst., pp. 153–160, 2008.
19. N. D. Lawrence and J. C. Platt, “Learning to Learn with the informative vector
machine”, Proc. 21st Int. Conf. Mach. Learn., p. 65, 2004.
20. S. J. Pan and Q. Yang “A survey on transfer learning”, IEEE Trans. Knowl. Data Eng.,
vol. 22, pp. 1345–1359, 2009.
21. R. K. Ando and T. Zhang, “A framework for learning predictive structures from
multiple tasks and unlabeled data”, J. Mach. Learn. Res., vol. 6, pp. 1817–1853, 2005.
22. J. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with
Applications to Biology, Control, and Artificial Intelligence. Massachusetts, USA: MIT
Press, 1975.
23. S. Chauhan, M. Singh, and A. K. Aggarwal, “Diversity driven multi-parent evolutionary
algorithm with adaptive non-uniform mutation”, J. Exp. Theor. Artif. Intell., no. 2020,
pp. 1–32, 2020.
24. S. Zhao and P. N. Suganthan, “Empirical investigations into the exponential crossover
of differential evolutions”, Swarm Evol. Comput., vol. 9, pp. 27–36, 2013, doi:
10.1016/j.swevo.2012.09.004.
2
IoT Analytics/Data Science for IoT
CONTENTS
2.1 Preface
2.1.1 Data Science Components
2.1.2 Method for Data Science
2.1.3 The Internet of Stuff
2.1.3.1 Difficulties in the Comprehension of Stuff on the
Internet
2.1.3.2 Sub-domain of Data Science for IoT
2.1.3.3 IoT and Relationship with Data
2.1.3.4 IoT Applications in Data Science Challenges
2.1.3.5 Ways to Distribute Algorithms in Computer Science
to IoT Data
2.2 Computational Methodology-IoT Science for Data Science
2.2.1 Regression
2.2.2 Set of Trainings
2.2.3 Pre-processing
2.2.4 Sensor Fusion Leverage for the Internet of Things
2.3 Methodology-IoT Mechanism of Privacy
2.3.1 Principles for IoT Security
2.3.2 IoT Architecture Offline
2.3.3 Offline IoT Architecture
2.3.4 Online IoT Architecture
2.3.5 IoT Security Issues
2.3.6 Applications
2.4 Consummation
References
2.1 Preface
Data science is a multidisciplinary field that extracts information from
data in various forms. Data science is defined as “unifying information,
data processing, machine learning, domain knowledge and interrelated
techniques” to “comprehend and evaluate real phenomena” with data.
Jim Gray envisioned it as a “fourth paradigm” of science and argued that,
owing to the effects of information technology and the data deluge,
everything about science is changing.
Big data is rapidly becoming a critical instrument for corporations and
enterprises of every size. The availability and interpretation of big data
have transformed the business models of existing industries and have
allowed new ones to be developed. The technologies and techniques used
for data science depend on the application. The steps are as follows:
Step 1: Discovery: The discovery phase involves collecting data from all
known internal and external sources that help you address the business
question. The data may be web server logs, data collected from social
media, census datasets, or data streamed from online sources using APIs.
Step 2: Preparation of Data: Data may have several inconsistencies, such
as missing values, blank columns, and incorrect data formats, that need to
be cleaned. Prior to modeling, you need to store, explore, and condition
the data. The cleaner your data is, the better your forecasts.
Step 3: Planning of Models: In this step, the method and technique for
drawing the relationship between input variables must be determined.
Planning for a model is carried out using various mathematical formulas
and visualization tools. Some of the tools used for this purpose are SQL
Analysis Services, R, and SAS/ACCESS.
Step 4: Constructing Models: The actual model-building process begins
in this phase. Here, data scientists split the data into training and testing
datasets. Techniques such as association, classification, and clustering are
applied to the training dataset. Once prepared, the model is checked
against the “testing” dataset.
Step 5: Operationalize: At this stage, you deliver the final baseline model
with reports, code, and technical documents. The model is deployed in a
real-time production environment after rigorous testing.
Step 6: Communicate Results: In this step, the main results are conveyed
to all stakeholders. This helps you determine whether the project
outcomes are a success or a failure based on the model inputs.
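The data preparation and model construction steps can be illustrated with a compact sketch. It rests on several assumptions that are not in the text: missing values are encoded as NaN, incomplete rows are simply dropped, 25% of the rows are held out for testing, and a straight line fitted with `np.polyfit` stands in for the model. The function names and the synthetic records are hypothetical.

```python
import numpy as np

def drop_incomplete_rows(data):
    """Data preparation: remove rows that contain a missing value (NaN)."""
    return data[~np.isnan(data).any(axis=1)]

def split_train_test(data, test_fraction=0.25, seed=0):
    """Model construction: shuffle the rows, then split them into two sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    n_test = int(len(data) * test_fraction)
    return data[order[n_test:]], data[order[:n_test]]

# Synthetic (x, y) records following y = 3x + 2, with two damaged rows.
x = np.arange(20, dtype=float)
raw = np.column_stack([x, 3.0 * x + 2.0])
raw[4, 1] = np.nan
raw[11, 0] = np.nan

clean = drop_incomplete_rows(raw)
train, test = split_train_test(clean)

# Fit on the training set only, then check the model against the testing set.
slope, intercept = np.polyfit(train[:, 0], train[:, 1], deg=1)
test_mse = np.mean((slope * test[:, 0] + intercept - test[:, 1]) ** 2)
print(len(clean), len(train), len(test), test_mse)
```

Keeping the testing rows out of the fitting call is the point of the split: the error measured on them indicates how the model behaves on data it has not seen.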
TABLE 2.1
IoT Applications and Challenges in Data Science
Challenges | Actions
Data management and analysis | Data storage and data analysis; hard drives used for data storage
Credited with a good set of data | Data warehouses and data centers; charge of extracting the data attained from the working frameworks; multi-faceted computational design, susceptibility; information discovery and computational complexities
Scalability and Visualization of data | Adaptability and protection of data scalability and visualization; proven to be necessary for handling certain dangerous datasets, certainly when execution problems occur
Poor data quality | Poor data quality requires the use of highly established industry principles and the detection of ceaseless irregularities
Too much data | Too much data overload will potentially create a large number of problems that counteract substantial progress
Data structures | Data structures relations between pieces of information collected at explicit intervals of time
Multiple data formats together | Knowledge on time arrangements has developed protocols and procedures for coping with experiences and experiences
Balance scale and speed | It may not be suitable for circumstances needing a lot of knowledge to be managed gradually, and adaptability of the cloud
i. The essence of IoT: Compared with big data, the IoT is gaining
traction. Numerous consumer use cases and new applications change how
the environment functions and how people interact with it. The market
cycle must be broken down [21] in order to see how IoT has progressed
over time.
ii. IoT problems: There are a number of difficulties with IoT:
Getting devices from different manufacturers to talk to one another
Design prerequisites and communication protocols
Information confidentiality
Stage 1: Define the problem. Describe the problem in machine learning
terms.
Stage 2: Prepare your data.
Stage 3: Spot-check algorithms.
Stage 4: Improve the results.
Stage 5: Present the results.
2.2.1 Regression
The best-known model is regression, which is used to estimate the
relationships between variables, whereas classification models identify the
group to which an observation belongs, as in Figure 2.6. Regression methods
range from simple linear regression to more complex techniques, including
gradient boosting and neural networks.
FIGURE 2.6 Regression for weather analysis.
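As an illustration of the simplest case mentioned above, the following sketch fits a straight line by ordinary least squares using only the Python standard library. The humidity and temperature readings are hypothetical values invented for the example, and `fit_line` is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical weather readings: relative humidity (%) and temperature (deg C).
humidity = [30.0, 40.0, 50.0, 60.0, 70.0]
temperature = [35.0, 32.0, 29.0, 26.0, 23.0]
slope, intercept = fit_line(humidity, temperature)
predicted = slope * 55.0 + intercept  # estimate for an unseen humidity value
```

Because the sample points lie exactly on a line, the fitted slope and intercept reproduce that line; with real measurements the fit would only approximate the data.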
2.2.3 Pre-processing
Data pre-processing is an umbrella concept that encompasses a number of
operations that data scientists can use to bring their data into a shape
better suited to what they want to do with it. Pre-processing does not
occur in a vacuum, however. There is no single easy recipe: there are best
practices, and one can build an intuition, but the effect of pre-processing
normally has to be judged in context. Scaling the data is one operation that
is common to most workflows.
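Scaling is one common pre-processing operation. The sketch below shows min-max scaling to the range [0, 1] with the standard library only; the helper name `min_max_scale` and the sample values are assumptions for illustration, and other scalings (such as standardization) may suit a given model better.

```python
def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Avoid division by zero when every value is identical.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature column.
scaled = min_max_scale([10.0, 20.0, 15.0, 30.0])
print(scaled)
```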
Figure 2.8 Sensory information is obtained from one’s surroundings
(vision, sound, smell, taste, and touch) and passes to the brain for
processing and reaction via the peripheral nervous system. Sensor fusion
takes the simultaneous input from several sensors when integrating all of
these technologies, processes the input, and produces an output that is
greater than the sum of its parts (i.e., sensor fusion removes the
shortcomings of each individual sensor by using special algorithms and
filtering techniques, similar to how the human body works as mentioned
previously).
Sensor fusion offers a variety of features that simplify our lives and allow
us to leverage these features in a variety of services. Sensor fusion also
applies to a combination of 3D accelerometers, 3D gyros, and 3D
magnetometers. This configuration is called a 9-axis system and provides
the user with 9 degrees of freedom (9-DoF). In 2012, Freescale released the
Xtrinsic 12-axis sensor platform for Windows 8, which provides a
fusion solution for 12-DoF sensors. This is done by adding the features
of a barometer sensor, a thermometer sensor, and ambient light detection.
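One simple way to combine two sensors so that each compensates for the other's weakness is a complementary filter. The chapter does not name a specific fusion algorithm, so the following is only an illustrative sketch under that assumption: it fuses a gyroscope rate (smooth, but drifting) with an accelerometer tilt estimate (noisy, but drift-free). The weight `alpha`, the sample period, and the simulated readings are all hypothetical.

```python
import math

def complementary_filter(angle_prev, gyro_rate, accel_x, accel_z, dt, alpha=0.98):
    """Fuse one gyro and one accelerometer sample into a tilt estimate (radians)."""
    gyro_angle = angle_prev + gyro_rate * dt     # integrated rate: smooth but drifts
    accel_angle = math.atan2(accel_x, accel_z)   # gravity direction: noisy but drift-free
    return alpha * gyro_angle + (1.0 - alpha) * accel_angle

# Simulated device held still at a 0.1 rad tilt; the gyro has a 0.01 rad/s bias.
angle = 0.0
for _ in range(500):
    angle = complementary_filter(angle, 0.01, math.sin(0.1), math.cos(0.1), dt=0.01)
```

Integrating the biased gyro alone would drift without bound, whereas the fused estimate stays close to the true tilt.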
2.3.6 Applications
2.4 Consummation
Data science is a field of study that involves extracting information from
vast amounts of data using a variety of scientific methods, algorithms, and
processes. This emerging field in industry and education has plenty of
applications and benefits. When data science is combined with the
Internet of Things, many kinds of technology become simpler to use.
This sort of analytics will improve technology at a global
level.
REFERENCES
1. Abu-Elkheir, M., Hayajneh, M., Ali, N. A.: Data management for the
internet of things: design primitives and solution. Sensors 13(11),
15582–15612 (2013).
2. Riggins, F. J., Wamba, S.F.: Research directions on the adoption,
usage, and impact of the internet of things through the use of big data
analytics. In: Proceedings of 48th Hawaii International Conference on
System Sciences (HICSS’15), pp. 1531–1540. IEEE (2015).
3. Cheng, B., Papageorgiou, A., Cirillo, F., Kovacs, E.: Geelytics: geo-
distributed edge analytics for large scale IoT systems based on
dynamic topology. In: 2015 IEEE 2nd World Forum on Internet of
Things (WF-IoT), pp. 565–570. IEEE (2015).
4. Fang, H.: Managing data lakes in big data era: what’s a data lake and
why has it become popular in data management ecosystem. In: 2015
IEEE International Conference on Cyber Technology in Automation,
Control, and Intelligent Systems (CYBER), pp. 820–824. IEEE (2015).
5. Desai, P., Sheth, A., Anantharam, P.: Semantic gateway as a service
architecture for IoT interoperability. In: 2015 IEEE International
Conference on Mobile Services (MS), pp. 313–319. IEEE (2015).
6. Hu, S.: Research on data fusion of the internet of things. In: 2015
International Conference on Logistics, Informatics and Service
Sciences (LISS), pp. 1–5. IEEE (2015).
7. Schmidhuber, J.: Deep learning in neural networks: an overview.
Neural Netw. 61, 85–117 (2015).
8. Tsai, C.-W., Lai, C.-F., Chiang, M.-C., Yang, L.T.: Data mining for
internet of things: a survey. IEEE Commun. Surveys Tuts. 16(1), 77–97
(2014).
9. Sun, Y., et al: Organizing and querying the big sensing data with
event-linked network in the internet of things. Int. J. Distrib. Sensor
Netw. 11, 1–11 (2014).
10. Provost, F., Fawcett, T.: Data Science for Business-What you need to
Know About Data Mining and Data-Analytic Thinking. O’Reilly
(2013), ISBN 978-1-449-36132-7.
11. Dhar, V.: Data science and prediction. Comm. ACM 56(12), 64–73
(2013).
12. Mattmann, C.A.: Computing: a vision for data science. Nature
493(7433), 473–475 (2013).
13. Tiropanis, T.: Network science web science and internet science.
Comm. ACM 58(8), 76–82 (2015).
14. Tinati, R., et al: Building a real-time web observatory. IEEE Internet
Comput. 19(6), 36–45 (2015).
15. Sun, Y., Yan, H., Lu, C., Bie, R., Zhou, Z.: Constructing the web of
events from raw data in the web of things. Mobile Inf. Syst. 10(1),
105–125 (2014).
16. Mehta, Brijesh, Rao, Udai Pratap: Privacy preserving unstructured big
data analytics: issues and challenges. Procedia Comput. Sci. 78, 120–
124 (2016).
17. Al-Fuqaha, A., Guizani, M., Mohammadi, M., Aledhari, M., Ayyash,
M.: Internet of things: a survey on enabling technologies protocols and
applications. IEEE Commun. Surveys Tuts. 17(4), 2347–2376 (2015).
18. Mohammadi, M., Al-Fuqaha, A.: Enabling cognitive smart cities using
big data and machine learning: approaches and challenges. IEEE
Commun. Mag. 56(2), 94–101 (2018).
19. Chen, M., Mao, S., Zhang, Y., Leung, V.C.: Big Data: Related
Technologies Challenges and Future Prospects, Heidelberg. Springer,
Germany (2014).
20. Fadlullah, Z.M., et al: State-of-the-art deep learning: evolving machine
intelligence toward tomorrow’s intelligent network traffic control
systems. IEEE Commun. Surv. Tuts. 19(4), 2432–2455 (2017).
21. Lee, J., Ardakani, H.D., Yang, S., Bagheri, B.: Industrial big data
analytics and cyber-physical systems for future maintenance & service
innovation. Procedia CIRP 38, 3–7 (2015).
22. Hu, H., Wen, Y., Chua, T.-S., Li, X.: Toward scalable systems for big
data analytics: a technology tutorial. IEEE Access 2, 652–687 (2014).
23. Xia, F., Yang, L.T., Wang, L., Vinel, A.: “Internet of things”. Int. J.
Commun. Syst. 25, 1101–1109 (2012).
24. Zaslavsky, A., Perera, C., Georgakopoulos, D.: Sensing as a service
and big data. Proceedings of the International Conference on Advances
in Cloud Computing (ACC) 2, 1–8 (2013).
3
A Model to Identify Agriculture Production
Using Data Science Techniques
CONTENTS
3.1 Agriculture System Application Based on GPS/GIS Gathered
Information
3.1.1 Important Tools Required for Developing GIS/GPS-Based
Agricultural System
3.1.1.1 Information (Gathered Data)
3.1.1.2 Map
3.1.1.3 System Apps
3.1.1.4 Data Analysis
3.1.2 GPS/GIS in Agricultural Conditions
3.1.2.1 GIS System in Agriculture
3.1.3 System Development Using GIS and GPS Data
3.2 Design of Interface to Extract Soil Moisture and Mineral Content in
Agricultural Lands
3.2.1 Estimating Level of Soil Moisture and Mineral Content Using
COSMIC-RAY (C-RAY) Sensing Technique
3.2.1.1 Cosmic
3.2.2 Soil Moisture and Mineral Content Measurement Using Long
Duration Optical Fiber Grating (LDOPG)
3.2.3 Moisture Level and Mineral Content Detection System Using
a Sensor Device
3.2.4 Soil Moisture Experiment
3.2.4.1 Dataset Description
3.2.5 Experimental Result
3.3 Analysis and Guidelines for Seed Spacing
3.3.1 Correct Spacing
3.3.2 System Components
3.3.2.1 Electronic Compass
3.3.2.2 Optical Flow Sensor
3.3.2.3 Motor Driver
3.3.2.4 Microcontroller
3.4 Analysis of Spread of Fertilizers
3.4.1 Relationship between Soil pH Value and Nutrient Availability
3.4.2 Methodology
3.4.2.1 Understand Define Phase
3.4.2.2 Analysis and Quick Design Phase
3.4.2.3 Prototype Development Phase
3.4.2.4 Testing Phase
3.4.3 System Architecture
3.4.4 Experimental Setup
3.4.5 Implementation Phase
3.4.6 Experimental Results
3.5 Conclusion and Future Work
References
3.1.1.2 Map
A map can be defined as a collection of images. Basically, maps are
geographical containers of different types of layered data and their
interconnections. GPS maps are easy to copy or transfer by embedding them
in a similar application, and they are readily accessible to anyone,
anytime and anywhere.
FIGURE 3.1 System design using GIS- and GPS-based expert system.
The decision process before cultivation has two steps:
1. Based on climate, soil fertility, and product condition, the
scientific and technological basis for what to grow in that field is
established and the required crop is decided.
2. The actual fertilizer requirement and usage of the field are determined.
3.2.1.1 Cosmic
The relationship between the measured total neutron count and the water
content of the soil can be described with the help of the
cosmic-ray soil moisture physical model, known as the
"COSMIC-RAY" (C-RAY) model. The number of fast neutrons arriving at the
COSMOS near-surface measurement point, N_COSMOS, is calculated with the
following equation:
$$N_{\mathrm{COSMOS}} = N \int_{0}^{\infty} \left\{ A(z)\,[\alpha \rho_s(z) + \rho_w(z)] \times \exp\left(-\left[\frac{m_s(z)}{L_1} + \frac{m_w(z)}{L_2}\right]\right) \right\} dz \tag{3.1}$$
The soil profile is discretized into 250 layers over a depth of 4 m. The
water content of these layers is represented by the C-RAY physical model.
Then, the average rate at which neutrons are generated in every layer is
determined. Finally, the average soil moisture depth and the effective
sensing depth of the C-RAY soil moisture probe are also determined by the
COSMIC system.
2020-02-23 00:00:00,67.92,0,55.72,0,1.56,1,26.57,1,19.52,55.04,101.5,2.13,6.3,225
2020-02-23 00:05:00,67.89,0,55.74,0,1.51,1,26.58,1,19.49,55.17,101.5,2.01,10.46,123.75
2020-02-23 00:10:00,67.86,0,55.77,0,1.47,1,26.59,1,19.47,55.3,101.51,1.9,14.63,22.5
2020-02-23 00:15:00,67.84,0,55.79,0,1.42,1,26.61,1,19.54,54.2,101.51,2.28,16.08,123.75
2020-02-23 00:20:00,67.81,0,55.82,0,1.38,1,26.62,1,19.61,53.09,101.51,2.66,17.52,225
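Records of this kind can be loaded for analysis with a few lines of Python. The sketch below is only illustrative: the excerpt does not state what each column means, so the code treats the first field as a timestamp and the remaining fields as unnamed numeric readings, and `parse_row` is a hypothetical helper name.

```python
from datetime import datetime

# Two sample records copied from the listing above.
raw = """2020-02-23 00:00:00,67.92,0,55.72,0,1.56,1,26.57,1,19.52,55.04,101.5,2.13,6.3,225
2020-02-23 00:05:00,67.89,0,55.74,0,1.51,1,26.58,1,19.49,55.17,101.5,2.01,10.46,123.75"""

def parse_row(line):
    """Split one record into a timestamp and a list of numeric readings."""
    stamp, *values = line.split(",")
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), [float(v) for v in values]

records = [parse_row(line) for line in raw.splitlines()]
interval = records[1][0] - records[0][0]  # sampling interval between records
```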
3.3.2.4 Microcontroller
Microchip provides the multi-feature microcontroller needed for the
project: the dsPIC30F4013, a powerful 16-bit microcontroller that provides
accuracy and stability to our project. This microcontroller acts as an
interface between the optical flow (OF) sensor and the electronic compass.
Its SPI module provides the interface to the compass and the OF sensor for
collecting data from them. The input to the motor driver is provided by four
pulse-width modulation (PWM) outputs of this microcontroller.
3.4 Analysis of Spread of Fertilizers
Manual fertilizer feeding and spreading is mostly inefficient and hazardous
because of inaccurate measurement and unequal spreading, and because direct
contact with fertilizer can be dangerous. The experiments performed up
until now indicate the need for a system that spreads fertilizer better and
more accurately. Such a system can improve crop yield and productivity.
Thus, this work is intended to give farmers a basic and profitable way to
carry out fertilizer feeding and spreading accurately.
3.4.2 Methodology
The system methodology that is adopted in this study is the combination of
Rapid Application Development (RAD) and Design Thinking (DT).
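Algorithm_path_planner
import os
import shutil
import numpy as np

def make_dir(parts):
    # Create the directory given by the path parts if it does not exist.
    os.makedirs(os.path.join(*parts), exist_ok=True)

train_img_path = "data/train"
train_labels = os.listdir(train_img_path)

# Count the images per label; the validation size is 20% of the smallest class.
num_per_label = [len(os.listdir(os.path.join(train_img_path, label)))
                 for label in train_labels]
num_valid = int(min(num_per_label) * 0.2)

for label in train_labels:
    images = os.listdir(os.path.join(train_img_path, label))
    # Sample without replacement so that no image is selected twice.
    idx_valid = np.random.choice(images, num_valid, replace=False)
    make_dir(["data", "valid", label])
    # Move the selected images from the training to the validation folder.
    for img in idx_valid:
        shutil.move(os.path.join(train_img_path, label, img),
                    os.path.join("data", "valid", label, img))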
REFERENCES
1. P. Kumar, S. Suman, and S. Mishra, “Shortest route finding by ant
system algorithm in web geographical information system-based
advanced traveller information system”. The Journal of Engineering,
vol. 2014, no. 10, pp. 563–573, 2014, doi: 10.1049/joe.2014.0190.
2. Y. Dong et al, “Automatic system for crop pest and disease dynamic
monitoring and early forecasting”. IEEE Journal of Selected Topics in
Applied Earth Observations and Remote Sensing, vol. 13, pp. 4410–
4418, 2020, doi: 10.1109/JSTARS.2020.3013340.
3. C. A. Martínez Félix, G. E. Vázquez Becerra, J. R. Millán Almaraz, F.
Geremia-Nievinski, J.R. Gaxiola Camacho, and Á. Melgarejo Morales,
“In-field electronic based system and methodology for precision
agriculture and yield prediction in seasonal maize field”. IEEE Latin
America Transactions, vol. 17, no. 10, pp. 1598–1606, Oct. 2019, doi:
10.1109/TLA.2019.8986437.
4. E. R. Hunt, C. S. T. Daughtry, S. B. Mirsky, and W. D. Hively,
“Remote sensing with simulated unmanned aircraft imagery for
precision agriculture applications”. IEEE Journal of Selected Topics in
Applied Earth Observations and Remote Sensing, vol. 7, no. 11, pp.
4566–4571, Nov. 2014, doi: 10.1109/JSTARS.2014.2317876.
5. D. Gomez-Candon et al, “Semiautomatic detection of artificial
terrestrial targets for remotely sensed image georeferencing”. IEEE
Geoscience and Remote Sensing Letters, vol. 10, no. 1, pp. 184–188,
Jan. 2013, doi: 10.1109/LGRS.2012.2197729.
6. A. H. S. Solberg, T. Taxt, and A. K. Jain, “A Markov random field
model for classification of multisource satellite imagery”. IEEE
Transactions on Geoscience and Remote Sensing, vol. 34, no. 1, pp.
100–113, Jan. 1996, doi: 10.1109/36.481897.
7. H. McNairn et al, “The soil moisture active passive validation
experiment 2012 (SMAPVEX12): Prelaunch calibration and validation
of the SMAP soil moisture algorithms”. IEEE Transactions on
Geoscience and Remote Sensing, vol. 53, no. 5, pp. 2784–2801, May
2015, doi: 10.1109/TGRS.2014.2364913.
8. A. Loew and W. Mauser, “On the disaggregation of passive microwave
soil moisture data using a priori knowledge of temporally persistent
soil moisture fields”. IEEE Transactions on Geoscience and Remote
Sensing, vol. 46, no. 3, pp. 819–834, March 2008, doi:
10.1109/TGRS.2007.914800.
9. X. Han, R. Jin, X. Li, and S. Wang, “Soil moisture estimation using
cosmic-ray soil moisture sensing at heterogeneous farmland”. IEEE
Geoscience and Remote Sensing Letters, vol. 11, no. 9, pp. 1659–1663,
Sept. 2014, doi: 10.1109/LGRS.2014.2314535.
10. M. S. Burgin et al, “A comparative study of the SMAP passive soil
moisture product with existing satellite-based soil moisture products”.
IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 5,
pp. 2959–2971, May 2017, doi: 10.1109/TGRS.2017.2656859.
11. A. Eliran, N. Goldshleger, A. Yahalom, E. Ben-Dor, and M. Agassi,
“Empirical model for backscattering at millimeter-wave frequency by
bare soil subsurface with varied moisture content”. IEEE Geoscience
and Remote Sensing Letters, vol. 10, no. 6, pp. 1324–1328, Nov. 2013,
doi: 10.1109/LGRS.2013.2239603.
12. Y. Wang, S. Wang, S. Yang, L. Zhang, H. Zeng, and D. Zheng, “Using
a remote sensing driven model to analyze effect of land use on soil
moisture in the Weihe River Basin, China”. IEEE Journal of Selected
Topics in Applied Earth Observations and Remote Sensing, vol. 7, no.
9, pp. 3892–3902, Sept. 2014, doi: 10.1109/JSTARS.2014.2345743.
13. K. C. Kornelsen and P. Coulibaly, “Design of an optimal soil moisture
monitoring network using SMOS retrieved soil moisture”. IEEE
Transactions on Geoscience and Remote Sensing, vol. 53, no. 7, pp.
3950–3959, July 2015, doi: 10.1109/TGRS.2014.2388451.
14. Wang Yecheng and Qiu Lichun, “Research of new type cell wheel feed
precision seed-metering device”. 2011 International Conference on
New Technology of Agricultural, Zibo, 2011, pp. 102–105, doi:
10.1109/ICAE.2011.5943759.
15. P. V. S. Jayakrishna, M. S. Reddy, N. J. Sai, N. Susheel, and K. P.
Peeyush, “Autonomous seed sowing agricultural robot”. 2018
International Conference on Advances in Computing, Communications
and Informatics (ICACCI), Bangalore, 2018, pp. 2332–2336, doi:
10.1109/ICACCI.2018.8554622.
16. R. S. Dionido and M. C. Ramos, “Autonomous seed-planting vehicle”.
2017 7th IEEE International Conference on Control System,
Computing and Engineering (ICCSCE), Penang, 2017, pp. 121–126,
doi: 10.1109/ICCSCE.2017.8284391.
17. Kwok Pui Choi, Fanfan Zeng, and Louxin Zhang, “Good spaced seeds
for homology search”. Proceedings. Fourth IEEE Symposium on
Bioinformatics and Bioengineering, Taichung, Taiwan, 2004, pp. 379–
386, doi: 10.1109/BIBE.2004.1317368.
18. K. Ramesh, K. T. Prajwal, C. Roopini, M. Gowda M.H., and V. V. S.
N. S. Gupta, “Design and development of an agri-bot for automatic
seeding and watering applications”. 2020 2nd International Conference
on Innovative Mechanisms for Industry Applications (ICIMIA),
Bangalore, India, 2020, pp. 686–691, doi:
10.1109/ICIMIA48430.2020.9074856.
19. A. I. Zainal Abidin, F. A. Fadzil, and Y. S. Peh, “Micro-controller
based fertilizer dispenser control system”. 2018 IEEE Conference on
Wireless Sensors (ICWiSe), Langkawi, Malaysia, 2018, pp. 17–22,
doi: 10.1109/ICWISE.2018.8633277.
20. R. Eatock and M. R. Inggs, “The use of differential GPS in field data
extraction and spatially variable fertilizer application”. Proceedings of
IGARSS '94 - 1994 IEEE International Geoscience and Remote
Sensing Symposium, Pasadena, CA, USA, 1994, pp. 841–843, vol. 2,
doi: 10.1109/IGARSS.1994.399280.
21. M. S. Islam, A. Islam, M. Z. Islam, and E. Basher, “Feasibility analysis
of deploying biogas plants for producing electricity and bio-fertilizer
commercially at different scale of poultry farms in Bangladesh”. 2014
3rd International Conference on the Developments in Renewable
Energy Technology (ICDRET), Dhaka, 2014, pp. 1–6, doi:
10.1109/ICDRET.2014.6861654.
22. S. Villette, C. Gée, E. Piron, R. Martin, D. Miclet, and M.
Paindavoine, “An efficient vision system to measure granule velocity
and mass flow distribution in fertiliser centrifugal spreading”. 2010
2nd International Conference on Image Processing Theory, Tools and
Applications, Paris, 2010, pp. 543–548, doi:
10.1109/IPTA.2010.5586738.
23. M. Chakroun, G. Gogu, M. Pradel, F. Thirion, and S. Lacour, “Eco-
design in the field of spreading technologies”. 2010 IEEE Green
Technologies Conference, Grapevine, TX, 2010, pp. 1–6, doi:
10.1109/GREEN.2010.5453796.
24. K. D. Sowjanya, R. Sindhu, M. Parijatham, K. Srikanth, and P.
Bhargav, “Multipurpose autonomous agricultural robot”. 2017
International Conference of Electronics, Communication and
Aerospace Technology (ICECA), Coimbatore, 2017, pp. 696–699, doi:
10.1109/ICECA.2017.8212756.
25. A. K. Mariappan and J. A. Ben Das, “A paradigm for rice yield
prediction in Tamilnadu”. 2017 IEEE Technological Innovations in
ICT for Agriculture and Rural Development (TIAR), Chennai, 2017,
pp. 18–21, doi: 10.1109/TIAR.2017.8273679.
26. A. Manjula and G. Narsimha, “XCYPF: A flexible and extensible
framework for agricultural Crop Yield Prediction”. 2015 IEEE 9th
International Conference on Intelligent Systems and Control (ISCO),
Coimbatore, 2015, pp. 1–5, doi: 10.1109/ISCO.2015.7282311.
27. Y. Gandge and Sandhya, “A study on various data mining techniques
for crop yield prediction”. 2017 International Conference on Electrical,
Electronics, Communication, Computer, and Optimization Techniques
(ICEECCOT), Mysuru, 2017, pp. 420–423, doi:
10.1109/ICEECCOT.2017.8284541.
28. P. S. Vijayabaskar, R. Sreemathi, and E. Keertanaa, “Crop prediction
using predictive analytics”. 2017 International Conference on
Computation of Power, Energy Information and Commuincation
(ICCPEIC), Melmaruvathur, 2017, pp. 370–373, doi:
10.1109/ICCPEIC.2017.8290395.
29. P. S. Nishant, P. Sai Venkat, B. L. Avinash, and B. Jabber, “Crop yield
prediction based on Indian agriculture using machine learning”. 2020
International Conference for Emerging Technology (INCET),
Belgaum, India, 2020, pp. 1–4, doi:
10.1109/INCET49848.2020.9154036.
30. S. Bang, R. Bishnoi, A. S. Chauhan, A. K. Dixit, and I. Chawla,
“Fuzzy logic based crop yield prediction using temperature and rainfall
parameters predicted through ARMA, SARIMA, and ARMAX
models”. 2019 Twelfth International Conference on Contemporary
Computing (IC3), Noida, India, 2019, pp. 1–6, doi:
10.1109/IC3.2019.8844901.
31. R. Medar, V. S. Rajpurohit, and S. Shweta, “Crop yield prediction
using machine learning techniques”. 2019 IEEE 5th International
Conference for Convergence in Technology (I2CT), Bombay, India,
2019, pp. 1–5, doi: 10.1109/I2CT45611.2019.9033611.
4
Identification and Classification of Paddy Crop Diseases Using Big
Data Machine Learning Techniques
Anisha P. Rodrigues, Joyston Menezes, Roshan Fernandes, Aishwarya, Niranjan N. Chiplunkar, and Vijaya
Padmanabha
CONTENTS
4.1 Introduction
4.1.1 Overview of Paddy Crop Diseases
4.1.2 Overview of Big Data
4.1.2.1 Features of Big Data
4.1.3 Overview of Machine Learning Techniques
4.1.3.1 K-Nearest Neighbor
4.1.3.2 Support Vector Machine
4.1.3.3 K-Means
4.1.3.4 Fuzzy C-Means
4.1.3.5 Decision Tree
4.1.4 Overview of Big Data Machine Learning Tools
4.1.4.1 Hadoop
4.1.4.2 Hadoop Distributed File System (HDFS)
4.1.4.3 YARN (“Yet Another Resource Negotiator”)
4.2 Related Work
4.2.1 Image Recognition/Processing
4.2.2 Classification and Feature Extraction
4.2.3 Problems and Diseases
4.3 Proposed Architecture
4.3.1 Image Acquisition
4.3.2 Image Enhancement
4.3.3 Image Segmentation
4.3.4 Feature Extraction
4.3.5 Classification
4.4 Proposed Algorithms and Implementation Details
4.4.1 Image Preprocessing
4.4.2 Image Segmentation and the Fuzzy C-Means Model Using Spark
4.4.3 Feature Extraction
4.4.4 Classification
4.4.4.1 Support Vector Machine (SVM)
4.4.4.2 Naïve Bayes
4.4.4.3 Decision Tree and Random Forest
4.5 Result Analysis
4.5.1 Comparison of Speed-up Performance between the Spark-Based and Hadoop-Based FCM Approach
4.5.2 Comparison of Scale-up Performance between the Spark-Based and Hadoop-Based FCM Approach
4.5.3 Result Analysis of Various Segmentation Techniques
4.5.4 Results of Disease Identification
4.6 Conclusion and Future Work
References
4.1 Introduction
India is an agricultural country in which various crops are grown in different regions, and it often serves as a host
to a multitude of emerging and invasive crop diseases [1]. The climate of most of India ranges from tropical to
subtropical, which favors the development of disease-causing insects and microorganisms that result in 20–30% yield
losses of principal food sources; hence, prevention and early diagnosis are critical to limit the destruction they
cause [2]. Oryza sativa (paddy/rice) is one of the major food crops and is the staple food of more than 2.8 billion
people. Undetected infection of a rice crop can cause an immense loss of crop production and lower yields.
This chapter shows how machine learning techniques and big data can be applied to the identification of
infection in paddy crops. The generalized architecture of crop infection detection is depicted in Figure 4.1,
which outlines the important stages of image processing that need to be followed to achieve the desired results.
Existing approaches may not be adequate because different infections can show indistinguishable patterns; also,
these infections can vary widely with regional conditions and crop varieties.
FIGURE 4.1 Overall approach of paddy crop disease detection using machine learning.
Structured data: Data that can be easily sorted and analyzed; for example, words and numbers. It
is typically produced by sensors embedded in electronic devices such as Global Positioning System
(GPS) receivers, smartphones, and similar devices.
Semi-structured data: This type of data does not conform to a fixed, explicit schema. The
data is inherently self-describing and contains tags and markers that impose hierarchies of fields and
records within the datasets.
Unstructured data: It consists of highly complex data; for instance, videos and photos posted on social
media, customer reviews on commercial websites, and likes on social networks.
Volume: It relates to the collection and generation of a huge number of datasets; data size keeps increasing.
Variety: It covers all kinds of datasets. Google remarks that the growth rate of unstructured data is 15
times that of structured information, which expands at a 20.0% compound annual growth rate.
Velocity: Data velocity is the rate at which data is produced and delivered, which is
another feature of big data.
4.1.3.3 K-Means
The K-means approach is one of the conventional clustering techniques. This approach initially selects K data points
arbitrarily as the initial cluster centers. Each remaining point is then assigned to the cluster it most resembles,
as measured by its distance to the cluster center, and the center of every cluster is recalculated [7]. The
objective function is:
$$J_K = \sum_{k=1}^{K} \sum_{i \in C_k} (X_i - m_k)^2 \tag{4.1}$$
In Equation (4.1), X = (X_1, X_2, …, X_n) is the data matrix, m_k = \sum_{i \in C_k} X_i / n_k is the centroid of
cluster C_k, and n_k is the number of points within C_k.
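The assignment and update steps described above can be sketched in plain Python. This is a minimal illustration on two-dimensional points, not the chapter's implementation; taking the first k points as initial centers (instead of an arbitrary choice), the fixed iteration count, and the sample points are assumptions made to keep the example deterministic.

```python
def kmeans(points, k, iterations=10):
    """Plain K-means on a list of (x, y) tuples; returns (centers, labels)."""
    centers = list(points[:k])  # assumption: first k points as initial centers
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the nearest center
        # (squared Euclidean distance, as in Equation (4.1)).
        labels = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Update step: each center becomes the mean m_k of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, labels

# Hypothetical points forming two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, labels = kmeans(points, k=2)
```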
$$E = \sum_{j=1}^{C} \sum_{i=1}^{N} \mu_{ij}^{k} \, \lVert X_i - C_j \rVert^{2} \tag{4.2}$$
Here, μij stands for the fuzzy membership of pixel (or sample) Xi in the cluster identified by its
center Cj, and the constant k controls the fuzziness of the resulting partition.
TABLE 4.1
Advantages and Disadvantages of Crop Disease Analysis Techniques
Algorithm | Advantages | Disadvantages
Support Vector Machine | Locates the optimal separating hyperplane. Able to deal with very high-dimensional data. Typically works very well. | Needs both negative and positive examples. Requires the choice of a good kernel function. Needs plenty of CPU time and memory.
K-Nearest Neighbor | Very simple to understand because there are a small number of predictor variables. Helpful for constructing models that involve non-standard data types, for example texts. | Has large storage requirements. Sensitive to the choice of the similarity function used to compare instances. Lacks a principled way to choose K, apart from cross-validation or similar.
Fuzzy C-Means | Permits data points to be assigned to several clusters. Behavior of genes is represented more naturally. | The number of clusters C needs to be defined. Membership cutoff values need to be determined.
K-Means | Low complexity. | Requires K to be specified. Susceptible to outlier data points and noise.
Decision Tree | Straightforward to interpret and understand. Needs little data preparation. Able to handle both categorical and numerical data. Uses a white-box approach. | The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality, even for simple concepts. Decision tree algorithms can generate trees that are overly complex.
4.1.4.1 Hadoop
The terms Hadoop and MapReduce are considered synonymous by most individuals, but this is not entirely
true. Hadoop was initially released in 2007 as an open-source implementation of the MapReduce processing
engine coupled to a distributed file system. Since then, it has developed into a large ecosystem
of projects related to every phase of the workflow for massive datasets, including data collection, storage,
processing, and so on [11].
a. Color – It is the most significant characteristic in a picture, because it can differentiate one disease from
another. Pandey et al. [15] surveyed image processing and machine learning approaches implemented in
crop grading methods. Fruit quality grading is done according to the texture, color, size, calyx, and stem of
the fruit, and the fruits are sorted using shape.
b. Shape – Each infection may have a different shape; some common shape features are axis, angle, and area.
c. Texture – It describes how patterns of color are distributed in a picture. Dhaygude S. B. et al. [16] have built
a system using four stages. The initial stage generates a color-transformed version of the RGB picture. The
transformed RGB is then used to produce HSI pictures. The disadvantage of feature extraction algorithms is that
their computational cost is high; also, because they depend on explicit facts, generalization is not easy.
TABLE 4.2
Comparison between Different Segmentation Approaches
4.3.5 Classification
The proposed system used various classification techniques, namely, Naïve Bayes, DT, SVM, and Random Forest
using Spark as a cluster computing platform.
$$r = \frac{R}{R+G+B}, \qquad g = \frac{G}{R+G+B}, \qquad b = \frac{B}{R+G+B} \tag{4.3}$$
The standardized H, S, and V constituents can be acquired by using (4.4), (4.5), (4.6), and (4.7) [23–26].
$$h = \cos^{-1}\left\{\frac{0.5\,[(r-g)+(r-b)]}{\sqrt{(r-g)^2+(r-b)(g-b)}}\right\}; \quad h \in [0, \pi] \ \text{for } b \le g \tag{4.4}$$
$$h = 2\pi - \cos^{-1}\left\{\frac{0.5\,[(r-g)+(r-b)]}{\sqrt{(r-g)^2+(r-b)(g-b)}}\right\}; \quad h \in [\pi, 2\pi] \ \text{for } b > g \tag{4.5}$$
$$s = 1 - 3 \cdot \min(r, g, b); \quad s \in [0, 1] \tag{4.6}$$
$$v = \frac{R+G+B}{3 \cdot 255}; \quad v \in [0, 1] \tag{4.7}$$
Equation (4.8) is used to change the HSV values to the more convenient ranges of [0,360], [0,100], and [0,255],
respectively.
$$H = \frac{h \times 180}{\pi}; \quad S = s \times 100; \quad V = v \times 255 \tag{4.8}$$
Equation (4.3) is used to obtain the normalized RGB matrix of the pictures. Figure 4.7 shows the leaf illustration
of bacterial leaf blight.
FIGURE 4.7
Equations (4.4) to (4.8) are used to convert the RGB image into the HSV color space, which improves the
effectiveness of the image processing and reduces computation, as shown in Figure 4.8.
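The conversion in Equations (4.3) to (4.8) can be sketched for a single pixel as follows. This is an illustrative reading of the chapter's formulas, not the authors' code; the function name `rgb_to_hsv_pixel` and the handling of black and gray pixels (where the hue is undefined) are assumptions. Note that these formulas differ from the HSV definition used by Python's standard `colorsys` module.

```python
import math

def rgb_to_hsv_pixel(R, G, B):
    """Convert 8-bit R, G, B to (H, S, V) in [0,360], [0,100], [0,255]."""
    total = R + G + B
    if total == 0:
        return 0.0, 0.0, 0.0  # assumption: black pixel, hue and saturation set to 0
    r, g, b = R / total, G / total, B / total              # Eq. (4.3)
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt(max(0.0, (r - g) ** 2 + (r - b) * (g - b)))
    # Assumption: gray pixels (r == g == b) give den == 0; use hue 0 there.
    theta = math.acos(max(-1.0, min(1.0, num / den))) if den > 0 else 0.0
    h = theta if b <= g else 2 * math.pi - theta           # Eqs. (4.4), (4.5)
    s = 1 - 3 * min(r, g, b)                               # Eq. (4.6)
    v = total / (3 * 255)                                  # Eq. (4.7)
    return h * 180 / math.pi, s * 100, v * 255             # Eq. (4.8)

red = rgb_to_hsv_pixel(255, 0, 0)
green = rgb_to_hsv_pixel(0, 255, 0)
blue = rgb_to_hsv_pixel(0, 0, 255)
```

For the three primary colors the sketch yields hues of about 0, 120, and 240 degrees, which matches the usual layout of the hue circle.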
FIGURE 4.8 Alteration of RGB image to HSV.
4.4.2 Image Segmentation and the Fuzzy C-Means Model Using Spark
The FCM approach achieves a good partitioning by searching for the best cluster
centers, that is, the centers that minimize an objective function, Jm, which is defined as follows.
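$$J_m(U, V, X) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, d_{ij}^{2} \tag{4.9}$$
$$u_{ij} = \left[ \sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{kj}} \right)^{\frac{2}{m-1}} \right]^{-1}, \qquad 0 \le u_{ij} \le 1, \qquad \sum_{i=1}^{c} u_{ij} = 1 \tag{4.10}$$
$$v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} \, x_j}{\sum_{j=1}^{n} u_{ij}^{m}} \tag{4.11}$$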
where X = {x1, x2, …, xn} represents the dataset and V = {v1, v2, …, vc} represents the set of cluster centers. m
is a real number that controls the fuzziness of the clustering, that is, the degree to which each data element is
associated with each cluster. Generally, the value of m is 2. uij denotes the entries of the c × n membership matrix,
where n is the size of the data and c is the number of clusters. The nearness of the data element xj to the cluster
center vi is measured by dij = ∥xj − vi∥2. The result of FCM clustering and the infected part of the rice leaf are shown in
Figures 4.9 and 4.10, respectively.
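The alternating updates in Equations (4.10) and (4.11) can be sketched in plain Python as follows. This is a serial, illustrative sketch only, not the Spark-based implementation evaluated in this chapter; the function name `fcm`, the fixed iteration count, the sample points, and the handling of points that coincide with a center are all assumptions.

```python
import math


def fcm(points, centers, m=2.0, iterations=50):
    """Fuzzy C-Means: return (membership matrix u[i][j], cluster centers)."""
    c, n, dim = len(centers), len(points), len(points[0])
    u = [[0.0] * n for _ in range(c)]
    for _ in range(iterations):
        # d[i][j]: Euclidean distance between center i and point j.
        d = [[math.dist(x, v) for x in points] for v in centers]
        u = [[0.0] * n for _ in range(c)]
        for j in range(n):
            hits = [i for i in range(c) if d[i][j] == 0.0]
            if hits:
                # Point sits exactly on a center (assumption: full membership).
                for i in hits:
                    u[i][j] = 1.0 / len(hits)
                continue
            for i in range(c):
                # Membership update, Equation (4.10).
                u[i][j] = 1.0 / sum(
                    (d[i][j] / d[k][j]) ** (2.0 / (m - 1.0)) for k in range(c)
                )
        # Center update, Equation (4.11).
        new_centers = []
        for i in range(c):
            w = [u[i][j] ** m for j in range(n)]
            total = sum(w)
            new_centers.append(
                tuple(sum(w[j] * points[j][k] for j in range(n)) / total
                      for k in range(dim))
            )
        centers = new_centers
    return u, centers


# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
memberships, centers = fcm(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

In practice the loop would stop when the change in Jm or in the centers falls below a tolerance rather than after a fixed number of iterations.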
I. Shape feature extraction: Shape is one of the significant characteristics of a picture. One can easily recognize
and differentiate an entity by visualizing its shape.
II. Color feature extraction: Texture characteristics, namely skewness, cluster prominence, cluster
shade, and kurtosis, are used. The texture features are extracted using the grey-level co-occurrence matrix.
4.4.4 Classification
4.5.1 Comparison of Speed-up Performance between the Spark-Based and Hadoop-Based FCM
Approach
In this section, we evaluate the Spark-based FCM algorithm using speed-up. Here, speed-up denotes
how much faster a parallel algorithm is than the equivalent serial approach. The following equation is used to
define speed-up:
$$\text{Speedup} = \frac{T_1}{T_p} \tag{4.12}$$
In the above equation, T1 refers to the execution time of the sequential algorithm on a single node, p refers to
the number of nodes, and Tp is the execution time of the parallel algorithm on p nodes. We use test images of rice
leaves to validate the two Fuzzy C-Means algorithms by comparing the speed-up of the Spark-based and Hadoop-based
FCM approaches. The experimental results are presented in Figure 4.11 in the form of a graph that shows the
overall performance improvement for the Spark-based and Hadoop-based algorithms. The following equation
is used to describe the increase rate:
$$\text{rate\_inc} = \frac{\text{Spark\_speedup} - \text{Hadoop\_speedup}}{\text{Hadoop\_speedup}} \times 100\% \tag{4.13}$$
FIGURE 4.11 Comparison of the speed-up between Spark-based and Hadoop-based FCM algorithm.
From the Performance improvement column in Table 4.3, it is clear that the percentage
improvement ranges from 76.53% to 163.15%, and the average increase rate reaches 115.32%. These
experimental outcomes indicate that the Spark-based FCM approach achieves better performance for
all pictures compared with the Hadoop-based FCM approach.
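Equations (4.12) and (4.13) translate directly into code. The sketch below uses made-up timings purely for illustration; they are not the measurements reported in Table 4.3, and the function names are hypothetical.

```python
def speedup(t1, tp):
    """Equation (4.12): serial time divided by parallel time."""
    return t1 / tp


def rate_inc(spark_speedup, hadoop_speedup):
    """Equation (4.13): relative gain of Spark over Hadoop, in percent."""
    return (spark_speedup - hadoop_speedup) / hadoop_speedup * 100.0


# Hypothetical timings in seconds for one image (not from the experiments).
t_serial = 120.0   # sequential FCM on a single node
t_spark = 20.0     # Spark-based FCM on p nodes
t_hadoop = 48.0    # Hadoop-based FCM on the same p nodes

spark_su = speedup(t_serial, t_spark)
hadoop_su = speedup(t_serial, t_hadoop)
print(f"Spark speed-up:  {spark_su:.2f}")
print(f"Hadoop speed-up: {hadoop_su:.2f}")
print(f"Increase rate:   {rate_inc(spark_su, hadoop_su):.2f}%")
```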
TABLE 4.3
The Performance Improvement Summary
4.5.2 Comparison of Scale-up Performance between the Spark-Based and Hadoop-Based FCM
Approach
Scale-up is used to examine how well the system scales when both the dataset size and the system size grow. It is the
capability of an x-times larger system to execute an x-times larger job within the same run time as the
original system. The following equation is used to describe the scale-up rate:
$$\text{Scaleup}(data, x) = \frac{T_1}{T_{xx}} \tag{4.14}$$
Here, T1 denotes the execution time for processing the data on a single node, and Txx denotes the execution time
for processing x times the data on x computing nodes. Scale-up performances of the datasets are shown in Figure 4.12.
Image Img005 sustains up to 68% scalability, whereas Image Img005 also sustains up to 61% scale-up. Hence,
from Figure 4.12, it is clear that the Spark-based FCM approach scales well.
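As a small worked example of Equation (4.14), the sketch below computes a scale-up curve. The timings are hypothetical and chosen only to show the calculation; a value of 1.0 would mean perfect scale-up.

```python
def scaleup(t1, txx):
    """Equation (4.14): time on one node over time for x-times data on x nodes."""
    return t1 / txx


# Hypothetical run times in seconds, keyed by x (nodes and data multiplier).
run_times = {1: 100.0, 2: 110.0, 4: 125.0, 8: 160.0}

t1 = run_times[1]
curve = {x: scaleup(t1, txx) for x, txx in run_times.items()}
for x, value in curve.items():
    print(f"x = {x}: scale-up = {value:.2f}")
```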
FIGURE 4.12 Comparison between scale-up between the Spark-based and Hadoop-based FCM algorithm.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4.15}$$
Sensitivity: This measures the percentage of actual positives that are correctly recognized. It is also
known as the true positive rate.
$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{4.16}$$
Specificity: This measures the percentage of actual negatives that are correctly recognized. It is also
known as the true negative rate.
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{4.17}$$
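Equations (4.15) to (4.17) can be computed from the four confusion-matrix counts, as in the sketch below. The counts are hypothetical and are not taken from Table 4.4; the function names are our own.

```python
def accuracy(tp, tn, fp, fn):
    """Equation (4.15): share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, fn):
    """Equation (4.16): true positive rate."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Equation (4.17): true negative rate."""
    return tn / (tn + fp)


# Hypothetical counts for a diseased-versus-healthy leaf classifier.
tp, tn, fp, fn = 40, 50, 5, 5
print(f"Accuracy:    {accuracy(tp, tn, fp, fn):.4f}")
print(f"Sensitivity: {sensitivity(tp, fn):.4f}")
print(f"Specificity: {specificity(tn, fp):.4f}")
```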
In our research, we used four well-known classifiers and compared the performance of SVM, Naïve Bayes,
Random Forest, and Decision Tree with the intention of detecting rice plant diseases. We can observe from
Table 4.4 that the Random Forest classifier achieves better sensitivity than all the other classifiers.
From Table 4.4, we can clearly see that the Decision Tree and Random Forest schemes have
approximately equal specificity, with 98.01% and 98.57%, respectively. However, Naïve Bayes has the lowest
specificity among all the algorithms. Among the approaches, the Naïve Bayes approach also has the lowest accuracy,
with 72.45%. Random Forest performs well in terms of accuracy, with 98.89%. Result analysis of various
segmentation techniques is shown in Figure 4.13.
TABLE 4.4
Results of Various Classifiers
Switching Bacterial Leaf Blight – forbearing rice seeds such as Rc54, Rc150, and 98.49%
Rc170. Treating paddy seeds with zinc sulfate and calcium hypochlorite to reduce the
disease.
Choose forbearing varieties, for instance NSIC and Rc170 PSB Rc82. Extreme exercise 99.06%
of fertilizer must be reduced as it leads to the occurrence of leaf blast. Remove blast-
infected leaf regularly to avoid spread.
As the fungi are transmitted through seeds, treatment of the seed using hot water (54– 99.73%
55°C) for 11–13 minutes may be effective prior to sowing. By doing this treatment,
primary infection can be controlled at the initial stages. Appropriate managing of
manure by using calcium silicate slag.
REFERENCES
1. Shah, Jitesh P., Harshadkumar B. Prajapati, and Vipul K. Dabhi. “A survey on detection and classification of
rice plant diseases”. In 2016 IEEE International Conference on Current Trends in Advanced Computing
(ICCTAC), pp. 1–8. IEEE, 2016.
2. Gianessi, Leonard P. “Importance of pesticides for growing rice in South and South East Asia”. International
Pesticide Benefit Case Study 108 (2014): 1–4.
3. Buneman, Peter. “Semistructured data”. [Online] Available from:
http://homepages.inf.ed.ac.uk/opb/papers/PODS1997a.pdf.
4. Sethy, Prabira Kumar, Nalini Kanta Barpanda, Amiya Kumar Rath, and Santi Kumari Behera. “Image
processing techniques for diagnosing rice plant disease: A survey”. Procedia Computer Science 167 (2020):
516–530.
5. Suthaharan, Shan. “Big data classification: Problems and challenges in network intrusion prediction with
machine learning”. ACM SIGMETRICS Performance Evaluation Review 41, no. 4 (2014): 70–73.
6. Manocha, S., and Mark A. Girolami. “An empirical analysis of the probabilistic K-nearest neighbour
classifier”. Pattern Recognition Letters 28, no. 13 (2007): 1818–1824.
7. Ding, Chris, and Xiaofeng He. “K-means clustering via principal component analysis”. In Proceedings of the
Twenty-first International Conference on Machine Learning, p. 29, 2004.
8. Dunn, Joseph C. “A fuzzy relative of the ISODATA process and its use in detecting compact well-separated
clusters”. Journal of Cybernetics 3 (1973): 32–57.
9. Bezdek, James C. Pattern recognition with fuzzy objective function algorithms. Springer Science & Business
Media, New York, 2013.
10. Quinlan, J. Ross. C4.5: Programs for machine learning. Elsevier, USA, 2014.
11. White, Tom. Hadoop: The definitive guide. O'Reilly Media, Inc., USA, 2012.
12. Landset, Sara, Taghi M. Khoshgoftaar, Aaron N. Richter, and Tawfiq Hasanin. “A survey of open source tools
for machine learning with big data in the Hadoop ecosystem”. Journal of Big Data 2, no. 1 (2015): 24.
13. Friis, Ib, and Henrik Balslev, eds. Plant Diversity and Complexity Patterns: Local, Regional, and Global
Dimensions: Proceedings of an International Symposium held at the Royal Danish Academy of Sciences and
Letters in Copenhagen, Denmark, 25–28 May, 2003. Vol. 55. Kgl. Danske Videnskabernes Selskab, 2005.
14. Y. Nam and E. Hwang, “A representation and matching method for shape-based leaf image retrieval”, Journal
of KIISE: Software and Applications 32, no. 11 (2005): 1013–1021.
15. Pandey, Rashmi, Sapan Naik, and Roma Marfatia. “Image processing and machine learning for automated
fruit grading system: a technical review”. International Journal of Computer Applications 81, no. 16 (2013):
29–39.
16. Dhaygude, Sanjay B., and Nitin P. Kumbhar. “Agricultural plant leaf disease detection using image
processing”. International Journal of Advanced Research in Electrical, Electronics and Instrumentation
Engineering 2, no. 1 (2013): 599–602.
17. Nidhis, A. D., Chandrapati Naga Venkata Pardhu, K. Charishma Reddy, and K. Deepa. “Cluster based paddy
leaf disease detection, classification and diagnosis in crop health monitoring unit”. In Computer Aided
Intervention and Diagnostics in Clinical and Medical Images, pp. 281–291. Springer, Cham, 2019.
18. Barik, Lipsa. “A survey on region identification of rice disease using image processing”. International
Journal of Research and Scientific Innovation 5, no. 1 (2018).
19. Burgueño, J., J. Crossa, P. L. Cornelius, and R.-C. Yang. “Using factor analytic models for joining
environments and genotypes without crossover genotype× environment interaction”. Crop Science 48 (2008):
1291.
20. Burgueño, Juan, José Crossa, José Miguel Cotes, Felix San Vicente, and Biswanath Das. “Prediction
assessment of linear mixed models for multienvironment trials”. Crop Science 51, no. 3 (2011): 944–954.
21. Badage, Anuradha. “Crop disease detection using machine learning: Indian agriculture”. International
Research Journal of Engineering and Technology (IRJET) 5, no. 9 (2018): 866–869.
22. Orillo, John William, Jennifer Dela Cruz, Leobelle Agapito, Paul Jensen Satimbre, and Ira Valenzuela.
“Identification of diseases in rice plant (oryza sativa) using back propagation Artificial Neural Network”. In
2014 International Conference on Humanoid, Nanotechnology, Information Technology, Communication and
Control, Environment and Management (HNICEM), pp. 1–6. IEEE, 2014.
23. Orillo, John William, Jennifer Dela Cruz, Leobelle Agapito, Paul Jensen Satimbre, and Ira Valenzuela.
“Identification of diseases in rice plant (oryza sativa) using back propagation Artificial Neural Network”. In
2014 International Conference on Humanoid, Nanotechnology, Information Technology, Communication and
Control, Environment and Management (HNICEM), pp. 1–6. IEEE, 2014.
24. Gadekallu, Thippa Reddy, Neelu Khare, Sweta Bhattacharya, Saurabh Singh, Praveen Kumar Reddy
Maddikunta, and Gautam Srivastava. “Deep neural networks to predict diabetic retinopathy”. Journal of
Ambient Intelligence and Humanized Computing (2020): 1–14. https://doi.org/10.1007/s12652-020-01963-7
25. Gadekallu, Thippa Reddy, Dharmendra Singh Rajput, M. Praveen Kumar Reddy, Kuruva Lakshmanna, Sweta
Bhattacharya, Saurabh Singh, Alireza Jolfaei, and Mamoun Alazab. “A novel PCA–whale optimization-based
deep neural network model for classification of tomato plant diseases using GPU”. Journal of Real-Time
Image Processing (2020): 1–14. https://doi.org/10.1007/s11554-020-00987-8
26. Reddy, G. Thippa, M. Praveen Kumar Reddy, Kuruva Lakshmanna, Rajesh Kaluri, Dharmendra Singh
Rajput, Gautam Srivastava, and Thar Baker. “Analysis of dimensionality reduction techniques on big data”.
IEEE Access 8 (2020): 54776–54788.
Section II
Algorithms, Methods, and Tools for
Data Science and Data Analytics
5
Crop Models and Decision Support Systems
Using Machine Learning
CONTENTS
5.1 Introduction
5.1.1 Decision Support System
5.1.2 Decision Support System for Crop Yield
5.1.3 What Is Crop Modeling?
5.1.4 Necessity of Crop Modeling
5.1.5 Recent Trends in Crop Modeling
5.2 Methodologies
5.2.1 Machine-Learning-Based Techniques
5.2.2 Deep-Learning-Based Techniques
5.2.3 Hyper-Spectral Imaging
5.2.4 Popular Band Selection Techniques
5.2.5 Leveraging Conventional Neural Network
5.3 Role of Hyper-Spectral Data
5.3.1 Farm Based
5.3.2 Crop Based
5.3.3 Advanced HSI Processing
5.4 Potential Challenges and Strategies to Overcome the Challenges
5.5 Current and Future Scope
5.6 Conclusion
References
5.1 Introduction
The world's human population is estimated to reach around 9
billion by the year 2050, as per the prediction of the Food and Agriculture
Organization (United Nations). Therefore, the demand for food and
agriculture-based commodities will increase. The estimated food demand
increases gradually at a rate of approximately 2% per year. By the year
2050, the food demand will be 70% higher than the present-day food
demand. Therefore, the agricultural yield should be multiplied manifold to
meet the increasing food demand (i.e., the food production should increase
to 13.5 billion tonnes a year from the current food production, which is 8.4
billion tonnes a year) [1].
Such a huge target cannot be achieved using traditional farming methods
because of the present-day challenges and environmental problems like
climate change, biodiversity loss, erosion, pesticide resistance, fertilizers
and eutrophication, water depletion, soil salinization, urban sprawl,
pollution, and silt. Hence, traditional farming methods are unable to keep
pace with the increasing food production demand [2]. Under such
conditions, the increased food security threat will lead to global instability.
A sustainable and long-term solution to this problem is “smart farming,”
especially in countries like India that have a growing population. Smart
farming is a more productive and sustainable way of farming. It
improves the prediction of yields, reduces the wastage of resources and raw
materials, and moderates the economic and security risks so as to
improve efficiency and reduce agricultural uncertainty.
Smart farming uses intelligent and digital tools that optimize the farm
operations and facilitate the use of resources in order to improve the overall
quantity and quality of farm yield. The main resources for farming are land,
seed, water, fertilizers, and pesticides. Thus, smart farming helps to predict
and manage the imbalances between crop production, processing, and food
consumption. Using smart farming technology, natural resources like water
and land are effectively used, along with appropriate selection of seeds and
fertilizers. This improves overall quantitative and qualitative measures
involved in prediction of farm yield.
In traditional farming, the decisions about the crops and farming methods
are taken based on the regional conditions, historical data, and previous
experience of the farmers. The same approach is not sufficient for the
present-day agriculture that faces numerous challenges. Present-day
agriculture should be effective, efficient, and sustainable.
In contrast to traditional farming, precision farming uses data from
modern technologies like sensors used as part of IoT, robotics, GPS
tracking, mapping tools, and data-analytics software, and acts as a decision
support system that provides customized solutions for each and every case.
Smart farming overcomes the challenges of traditional farming and
provides effective practical solutions. Smart farming technology also acts as
a systematic tool that predicts and solves unforeseen problems [3].
Moreover, precision farming increases the outcome with minimum input
resources and without polluting the soil and the environment. Based on the
United States Department of Agriculture (USDA) report, it is evident that
the precision agriculture technology adoption increases operating profit
margins, by increasing the return and crop yield [4].
This chapter explains the view of smart farming as knowledge-based
agriculture. It also discusses how smart farming drives agricultural
productivity when a data-intensive approach is used in collaboration with
machine learning (ML) and deep learning (DL). It outlines how crop
models play a potential role in increasing the quantity and quality of
agricultural products.
Development of more precise, efficient, and reliable pipelines for
seamless analysis of large crop model phenotyping datasets is described
briefly. The crop model, along with DSS techniques, is capable of detecting
biotic and abiotic stress in plants with a finite level of accuracy.
Hyper-spectral image (HSI) processing along with crop management is
employed for the effective analysis and classification of crops based on the
variety, location, and sustainable growth over a period of time. Even a
small change in the crop can be recorded and processed for building an
evolvable crop model. The models can be frequently improved according to
changes and increases in demand. Increases in the technical processing
capabilities of HSI will provide additional levers, such as
multidimensional hyperspectral capabilities, to filter and process the
required data precisely. Finally, the process and necessary stages for
implementing a Decision Support System (DSS) for smart farming are
discussed so that it will be more useful to the farming community.
5.1.1 Decision Support System
Smart farming is empowered by a computerized decision support system
(CDSS) in which machine learning tools with high-precision
algorithms are embedded. The CDSS uses real-time and historical data, along
with machine learning algorithms, to make specific decisions. It takes
real-time data from field video cameras and sensors, together with
micro-meteorological data, for monitoring. It analyzes the collected data and
gives predictions and warnings about early signs of disease, weed growth,
pests, and crop yield. The highly trained precision algorithms notice
changes that are not easily noticed by human beings.
The decision support systems process and convert the raw data into
useful information using machine learning tools. This useful information
helps in improving the precision of farming. To increase the agricultural
productivity, crop models are developed and incorporated into
computerized decision support systems.
Implementing crop models based on the crop behavioral data along with the
understanding of biotic and abiotic components will impact the yield.
Moreover, there are other qualified key attributes such as environment and
climate conditions that enable sustainable growth that need to be included.
The key benefits of data-driven agriculture are
5.2 Methodologies
The crop models, with live data like meteorological data, soil information,
and hyper-spectral images of the crops, are fed as input to the machine-
learning-based Computerized Decision Support Systems (CDSS). Based on
these inputs, the CDSS produces accurate analysis and precise predictions.
The efficiency of the CDSS depends on how well it produces sensible
information from the real-time raw data. The CDSS collects the real-time
data, interprets it, and compares the results. Based on the comparison, it
gives predictions and thereby acts as a decision support system.
A model-based DSS can be leveraged to predict the yield. An overview
of the steps involved in smart farming is shown in Figure 5.2. The
agriculture-based dataset is prepared by collecting the necessary
information about a crop. The crop data is enriched by preprocessing
techniques along with standardization. The crop model is built over the
critical attributes as a statistical function. The statistical functions can be
fine-tuned by reinforcement learning of the model. Once the errors are
reduced, the models are executed based on rules for effective prediction.
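The steps above (prepare data, standardize, fit a statistical function, check the error, apply a rule) can be sketched as follows. Everything in this sketch is an assumption made for illustration: the rainfall and yield figures are invented, a one-variable least-squares line stands in for the "statistical function", no reinforcement learning is performed, and the advisory rule and its threshold are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def standardize(values):
    """Z-score standardization of one attribute."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def advise(predicted_yield, target=3.0):
    """Hypothetical decision rule applied to the model output."""
    return "on target" if predicted_yield >= target else "review inputs"


# Invented seasonal data: rainfall (mm) and yield (tonnes per hectare).
rainfall_mm = [300, 350, 400, 450, 500]
yield_t_ha = [2.0, 2.5, 3.0, 3.5, 4.0]

z = standardize(rainfall_mm)
slope, intercept = fit_line(z, yield_t_ha)
predicted = [slope * zi + intercept for zi in z]
rmse = sqrt(mean((p - a) ** 2 for p, a in zip(predicted, yield_t_ha)))

print(f"RMSE: {rmse:.4f}")
for rain, pred in zip(rainfall_mm, predicted):
    print(f"{rain} mm -> {pred:.2f} t/ha ({advise(pred)})")
```

A real decision support system would use many attributes and a validated model; the point here is only the order of the stages.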
By using these methods the visible and near infrared spectral data are
processed and models for crops are built. The wavelengths and spectral
reflections vary based on the crop types, plant growth, and locations. Crop
models are built for each variety of crop based on the spectral range and
time (days after planting) [6]. The hyper-spectral data is collected over a
period of time and a multidimensional effective crop model is built. Some
examples of crop models built from hyper-spectral data include
From the hyperspectral data, a suitable bandwidth for a crop and location is
identified. Over a period of time, the classification and analysis of crop
models are improved and crop discrimination is empowered. Image
processing and correction techniques are used to reduce noise arising from
sources such as sensor calibration, platform motion, geo-referencing of
images, and system calibration.
Remote inspection of plots, crop density and yield, and optimize plot
size for yield improvement - optimum plot size (1 × 2.4 m)
The coefficient is 0.79 and root mean square variation up to 5.9 gm
variation for a plant size of 2.5 m2
Improved normalized difference vegetation index (NDVI)
The effect of plant density can be studied along with high spatial
resolution, in order to arrive at the yield and growth rate.
Impact due to the side trimming for crop and field yields needs to be
assessed across various varieties
Identification and investigation of multiple diseases and resistance across
the crop species
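One item in the list above refers to the normalized difference vegetation index. As a reminder of what that index measures, the sketch below uses the standard definition, NDVI = (NIR − Red) / (NIR + Red); the chapter's "improved" variant is not specified, so it is not shown. The reflectance values are invented, and returning 0 when both bands are zero is an assumption.

```python
def ndvi(nir, red):
    """Standard NDVI from near-infrared and red reflectance."""
    denominator = nir + red
    if denominator == 0:
        return 0.0  # no signal in either band (assumption)
    return (nir - red) / denominator


# Invented reflectance pairs (NIR, Red) for three pixels.
pixels = {"dense canopy": (0.50, 0.08),
          "sparse canopy": (0.30, 0.15),
          "bare soil": (0.20, 0.18)}
for name, (nir, red) in pixels.items():
    print(f"{name}: NDVI = {ndvi(nir, red):.3f}")
```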
Overall accuracy
Average accuracy
Kappa coefficient
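The three measures listed above are commonly computed from a confusion matrix. The sketch below uses their standard definitions, with rows taken as true classes and columns as predicted classes; the matrix is invented for illustration, and the function name is hypothetical. It assumes every class has at least one true sample.

```python
def classification_scores(confusion):
    """Return (overall accuracy, average accuracy, kappa coefficient)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    diagonal = [confusion[i][i] for i in range(k)]
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    overall = sum(diagonal) / total
    # Average accuracy: mean of the per-class recalls.
    average = sum(diagonal[i] / row_totals[i] for i in range(k)) / k
    # Expected agreement by chance, from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / total ** 2
    kappa = (overall - expected) / (1 - expected)
    return overall, average, kappa


# Invented 3-class confusion matrix (rows: true class, columns: predicted).
confusion = [[50, 2, 3],
             [4, 40, 6],
             [1, 5, 39]]
oa, aa, kappa = classification_scores(confusion)
print(f"Overall accuracy: {oa:.4f}")
print(f"Average accuracy: {aa:.4f}")
print(f"Kappa:            {kappa:.4f}")
```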
5.6 Conclusion
Several unanticipated scenarios and problems may arise while applying the
technology to multi-domain agriculture enrichment studies. Further
problems arise while applying deep learning algorithms to real-time data in
order to design and develop a new application. Once the initial hiccups and
problems are overcome with deep domain expertise, these precision deep
learning algorithms can be embedded inside the decision support system to
provide more promising predictions for agricultural systems at large
scale. The Convolutional Neural Network has gained significance owing to
the drastic increase in its use for building reliable crop models from
agriculture-based data. The improved CNN versions are very promising for
calibrating the real-time applications required for agricultural needs.
Spectral information processing enables categorical identification and
classification of insights. This will enable the system in the future to
involve augmented and virtual reality in real-time crop classification,
semantic segmentation, object detection, disease identification, and remote
yield prediction. Evidently, a hybrid CNN is capable of providing an
enhanced real-time model for agriculture-based technologies.
As discussed in the previous sections, precision farming offers plenty of
benefits and has a wide scope at present and in the near future. Though
initial glitches and challenges may be encountered, they can be overcome
using site-specific and problem-specific strategies. Precision farming
transforms the traditional farming methods into a more profitable and more
productive way of farming. It also improves the economy while meeting the
growing food demand with good crop quality.
REFERENCES
1. Website: http://www.fao.org/state-of-food-security-nutrition/en/
2. Singh, Rinku and G.S. Singh, “Traditional agriculture: A climate-smart
approach for sustainable food production”. Energ. Ecol. Environ. 2(5)
(2017): 296–316. doi: 10.1007/s40974-017-0074-7.
3. Saiz-Rubio, Verónica and Francisco Rovira-Más, “From smart farming
towards agriculture 5.0: A review on crop data management”.
Agronomy 10 (2020): 207. doi: 10.3390/agronomy10020207.
4. Schimmelpfennig, D., “Farm profits and adoption of precision
agriculture”. USDA 217 (2016): 1–46.
5. McQueen, Robert J., Stephen R. Garner, Craig G. Nevill-Manning, Ian
H. Witten. “Applying machine learning to agricultural data”. Comput.
Electron. Agric. 12 (1995): 275–293.
6. Wilson, Jeffrey H., Chunhua Zhang, and John M. Kovacs. “Separating
crop species in northeastern Ontario using hyperspectral data”. Remote
Sens. 6(2) (2014): 925–945.
7. Haboudane, D., J.R. Miller, E. Pattey, P. Zarco-Tejada, I.B. Strachan
“Hyperspectral vegetation indices and novel algorithms for predicting
green LAI of crop canopies: Modeling and validation in the context of
precision agriculture”. Remote Sens. Environ. 90 (2004): 337–352.
8. Thenkabail, P.S., E.A. Enclona, M.S. Ashton, B. van Der Meer.
“Accuracy assessments of hyperspectral waveband performance for
vegetation analysis applications”. Remote Sens. Environ. 91 (2004):
345–376.
9. Gray, C.J., D.R. Shaw, L.M. Bruce. “Utility of hyperspectral
reflectance for differentiating soybean (Glycine max) and six weed
species”. Weed Technol. 23 (2009): 108–119.
10. Martin, M.P., L. Barreto, D. Riano, C. Fernandez-Quintanilla, P.
Vaughan. “Assessing the potential of hyperspectral remote sensing for
the discrimination of grassweeds in winter cereal crops”. Int. J. Remote
Sens. 32 (2011): 49–67.
11. Lin, W.-S., C.-M. Yang, B.-J. Kuo. “Classifying cultivars of rice
(Oryza sativa L.) based on corrected canopy reflectance spectra data
using the orthogonal projections to latent structures (O-PLS) method”.
Chemometr. Intell. Lab. Syst. 115 (2012): 25–36.
12. Pena-Barragan, J.M., F. Lopez-Granados, M. Jurado-Exposito, L.
Carcia-Torres. “Spectral discrimination of Ridolfia segetum and
sunflower as affected by phenological stage”. Weed Res. 46 (2006):
10–21.
13. Zhang, H., Y. Lan, C.P. Suh, J.K. Westbrook, R. Lacey, W.C.
Hoffmann. “Differentiation of cotton from other crops at different
growth stages using spectral properties and discriminant analysis”.
Trans. ASABE 55 (2012): 1623–1630.
14. Shibayama, M., and A. Tsuyoshi. “Estimating grain yield of maturing
rice canopies using high spectral resolution reflectance
measurements”. Remote Sens. Environ. 36 (1991): 45–53.
15. Caporaso, Nicola, Martin B. Whitworth, and Ian D. Fisk. “Protein
content prediction in single wheat kernels using hyperspectral
imaging”. Food Chem. 240 (2018): 32–42.
16. S.G. Bajwa, P. Bajcsy, P. Groves, L.F. Tian, “Hyperspectral image data
mining for band selection in agricultural applications”. Trans. Am. Soc.
Agric. Eng. 47 (2004): 895–907.
17. Burai, Péter, et al. “Classification of herbaceous vegetation using
airborne hyperspectral imagery”. Remote Sens. 7(2) (2015): 2046–
2066.
18. Moghimi, Ali, Ce Yang, and James A. Anderson. “Aerial hyperspectral
imagery and deep neural networks for high-throughput yield
phenotyping in wheat”. arXiv preprint arXiv:1906.09666 (2019).
19. Moghimi, Ali, Ce Yang, and Peter M. Marchetto. “Ensemble feature
selection for plant phenotyping: A journey from hyperspectral to
multispectral imaging”. IEEE Access 6 (2018): 56870–56884.
20. Moghimi, Ali, et al. “A novel approach to assess salt stress tolerance in
wheat using hyperspectral imaging”. Front. Plant Sci. 9 (2018): 1182.
21. Singh, Asheesh Kumar, et al. “Deep learning for plant stress
phenotyping: Trends and future perspectives”. Trends Plant Sci. 23(10)
(2018): 883–898.
22. Fuentes, Alvaro, Sook Yoon, and Dong Sun Park. “Deep learning-
based phenotyping system with glocal description of plant anomalies
and symptoms”. Front. Plant Sci. 10 (2019): 1–19.
23. Jin, Xiu, et al. “Classifying wheat hyperspectral pixels of healthy
heads and Fusarium head blight disease using a deep neural network in
the wild field”. Remote Sens. 10(3) (2018): 395.
24. Caporaso, Nicola, Martin B. Whitworth, and Ian D. Fisk. “Near-
infrared spectroscopy and hyperspectral imaging for non-destructive
quality assessment of cereal grains”. Appl. Spectrosc. Rev. 53(8)
(2018): 667–687.
25. Roy, Swalpa Kumar, et al. “Hybridsn: Exploring 3-D-2-D CNN
feature hierarchy for hyperspectral image classification”. IEEE Geosci.
Remote Sens. Lett. 17(2) (2019): 277–281.
26. Lowe, Amy, Nicola Harrison, and Andrew P. French. “Hyperspectral
image analysis techniques for the detection and classification of the
early onset of plant disease and stress”. Plant Methods 13(1) (2017):
80.
27. Liakos, Konstantinos G., et al. “Machine learning in agriculture: A
review”. Sensors 18(8) (2018): 2674.
6
An Ameliorated Methodology to Predict Diabetes Mellitus Using
Random Forest
CONTENTS
6.1 Motivation to Use the “R” Language to Predict Diabetes Mellitus?
6.2 Related Work
6.3 Collection of Datasets
6.3.1 Implementation Methods
6.3.1.1 Decision Tree
6.3.1.2 Random Forest
6.3.1.3 Naïve Bayesian Algorithm
6.3.1.4 Support Vector Machine (SVM)
6.4 Visualization
6.5 Correlation Matrix
6.6 Training and Testing the Data
6.7 Model Fitting
6.8 Experimental Analysis
6.9 Results and Analysis
6.10 Conclusion
References
i. Data wrangling
The data we encounter in real-life problems is often messy and unstructured. Data wrangling involves
refining this data, which is a lengthy process. The complex data is converted into a simpler, more relevant
data set which is easier to consume and analyze. Some R packages that help in this are:
dplyr Package – used for data exploration and transformation
data.table Package – allows for quick manipulation of data with less coding, thereby reducing compute
times and simplifying data aggregation.
readr Package – helps in reading various forms of data into R language at fast speeds
ii. Popularity in academic circles
The R language is extensively used and popular in academia. Data science experimentation is carried
out using the R language, and scholars and researchers tend to use it for statistical analysis. As a
consequence, a great number of people are proficient in the R language. Because so many individuals have used
it since their academic years, a large pool of skilled individuals capable of using R
in industry exists. This further makes the language apt for data analysis.
iii. Effective visualization of data
Unorganized data can be put into cleaner, more structured forms, and an accurate graphical representation of
this data can be made using the tools provided by the R language, which include the popular packages ggplot2
and ggedit. These are used for plotting data: ggplot2 is used for the data visualization itself,
whereas ggedit adds to its capabilities by adjusting the aesthetics of plots.
iv. Machine learning capabilities
To make predictions possible, it is essential to train algorithms and enable automated learning. R
provides a myriad of packages focused on machine learning, such as party and rpart (data partitioning),
mice (missing value correction), caret (regression, classification), and randomForest (decision trees).
All these, and many more, make machine learning applications a real possibility using the R language.
v. Open source and freely available
R is platform independent, meaning it isn’t restricted to certain operating systems. Also, it is open source,
which makes it free to use. It is cost effective, as it is distributed under the GNU General Public License. As such,
development work is continuous in the R community, which keeps the language ever-expanding and
improving. There are also various resources that aid budding programmers in learning this language. Hence,
it is easy and inexpensive to recruit and employ R developers.
The aforementioned reasons are some, among many, that give R its popularity. The R language is going
to expand further in the domains of statistical and graphical analysis, as well as machine learning and big data
applications. It is easy to learn and ever-expanding, making it the perfect choice for data science and, in this case,
the prediction of diabetes.
import pandas as pd
Dataset_pima = pd.read_csv('../input/diabetes.csv')
Dataset_pima.head()
The head() function returns the first few rows of the dataset.
TABLE 6.1
List of Attributes in the Dataset
Pregnancies | Glucose | Blood Pressure | Skin Thickness | Insulin | BMI | Diabetes Pedigree Function | Age | Outcome
The large dataset, which may be greater in volume than needed, is broken down via the process of
data mining. This is done to segregate useful data that is instrumental in making correct choices and selecting
predictive parameters. Connections between different pieces of information can also be made, and correlations
derived, which ultimately help determine answers to crucial issues.
$$P(c \mid X) = P(x_1 \mid c) \times P(x_2 \mid c) \times \cdots \times P(x_n \mid c) \times P(c)$$
where
P(c) = prior probability of the class
P(c|x) = posterior probability of the class (target) given the predictor
P(x|c) = probability of the predictor given the class
P(x) = prior probability of the predictor
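The product rule above can be sketched for categorical predictors as follows. This is an illustrative sketch, not the chapter's R implementation: the training rows are invented, the predictors are simplified to "high" or "normal" categories, no smoothing is applied, and the final division by the sum of the scores plays the role of P(x) so that the posteriors add up to 1.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> values
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            feature_counts[(label, index)][value] += 1
    return class_counts, feature_counts


def posterior(row, class_counts, feature_counts):
    """Return P(c | X) for every class c."""
    n = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / n                                   # P(c)
        for index, value in enumerate(row):
            score *= feature_counts[(label, index)][value] / count  # P(x_i | c)
        scores[label] = score
    total = sum(scores.values())
    if total == 0:
        return scores  # unseen combination; smoothing would avoid this
    return {label: score / total for label, score in scores.items()}


# Invented training data: (glucose level, BMI level) -> outcome.
rows = [("high", "high"), ("high", "normal"), ("high", "high"),
        ("high", "normal"), ("normal", "high"), ("normal", "normal")]
labels = ["diabetic", "diabetic", "diabetic",
          "healthy", "healthy", "healthy"]

class_counts, feature_counts = train(rows, labels)
result = posterior(("high", "high"), class_counts, feature_counts)
print(result)
```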
6.4 Visualization
A histogram helps visualize the range of ages. A histogram is essentially a technique to display numerical
data using bars of various heights, as shown in Figure 6.3.
In the above code, we first call the ggplot() function, which takes two arguments. The first one is the data we
will work upon, which in this case is the age. The aes(x = age) part, where aes stands for aesthetic, specifies the
mapping of the data variables to the visual properties. The “+” operator is used in ggplot to add extra
components. The last two methods, xlab() and ylab(), simply add labels to the x and y axes. The code above
finally produces the following output.
Another method for visualization is a bar plot. Each entity is allotted a bar, and the height of the bar
indicates its numeric value: the taller the bar, the higher the value, and vice versa. We can
visualize the data using a bar plot as follows. Again the ggplot() function takes in two parameters, dbt and
x = age_1, for the input data and its mapping. The geom_bar() function, short for geometric bar, is
another function like geom_histogram() that constructs the bar graph. In the previous example, we used fill = 'red',
which filled the histogram with a red color, as shown in Figure 6.3. Here we set fill = "blue";
hence the bars are blue in color, as shown in Figure 6.4.
FIGURE 6.4 Visualization of data using bar plot.
This is what the output will look like for the above piece of code.
Yet another visualization technique is the boxplot. It is a method to show the distribution of data over some
range, as shown in Figure 6.5. In the code below, the x- and y-axis parameters are age and BMI, respectively.
Then the geom_boxplot() method is called with an interesting new parameter, the outlier color. Outliers are
values that lie unusually or abnormally far outside the usual range of values of that parameter. They are
represented by red dots in the following boxplot. The coord_cartesian() method has ylim set from 0 to 80,
so the y-axis ranges from 0 to 80 units of BMI.
FIGURE 6.5 Visualization of distribution of data range using boxplot.
The code below produces a boxplot that plots age category on x-axis versus the BMI on y-axis for an
individual.
The red dots denote outliers, which are evidently most numerous in the 21–30 age category, as shown in
Figure 6.5.
The correlation matrix results help us infer that there is no inherent correlation among the variables themselves,
as shown in Figure 6.7.
Eighty percent of this value will be the number of training data rows, which can be confirmed with the next
command.
The remaining 20% are the test data rows.
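The 80/20 split described above can be illustrated with a short plain-Python sketch. It is not the chapter's own code: the helper name train_test_split_rows, the seed, and the 1,000 stand-in rows are assumptions made purely for illustration.

```python
import random

def train_test_split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # reproducible shuffle
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

rows = list(range(1000))  # hypothetical stand-in for the dataset records
train, test = train_test_split_rows(rows)
```

Shuffling before cutting avoids any ordering in the original file leaking into the split.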
Next, let’s perform an operation to determine the average prediction for the two outcomes (0 and 1).
Finally, we will discuss the Receiver Operating Characteristic (ROC). In predicting binary outcomes, the
Receiver Operating Characteristic is a curve used to assess how well a continuous score separates the two classes. It is a
measure of how well a classification model performs at all classification thresholds.
There are two parameters – the True and False Positive Rates.
The Area Under Curve (AUC) is the 2-D area found under the ROC curve. To analyze which of the models
being used predicts the classes best, we use the AUC parameter.
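To make the relation between the ROC curve and the AUC concrete, the following plain-Python sketch sweeps a threshold over the predicted scores, records the false and true positive rates, and integrates the resulting curve with the trapezoidal rule. The labels and scores are made-up values, and the function names are hypothetical; the chapter itself relies on the ROCR package in R.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

labels = [1, 1, 0, 1, 0, 0]               # hypothetical true outcomes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical predicted probabilities
curve = roc_points(labels, scores)
area = auc(curve)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.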
The ROCR package of R helps combine up to 25 performance metrics to form a performance curve. The
prediction() function returns a standardized prediction object, which is the first step in evaluating the classifier. The
parameters passed to it are the predictions as the first argument and the true labels as the second. The
performance() function then takes that prediction object as a parameter, along with tpr and
fpr, which denote the true and false positive rates.
Next, we will compute the AUC.
The result is the graph shown in Figure 6.9. Seeing the AUC value of 0.84, we can infer that the accuracy
rate is about 84% on the training data. Next, we will apply our model to the test data as follows. The accuracy rate is 74%,
which can be improved as follows (see Figure 6.10).
FIGURE 6.9 ROC performance curve.
Thus, as is evident from the above result, the model is accurately able to predict if someone has diabetes or not
with 82% accuracy.
6.10 Conclusion
It is evident that the R language provides many methods to accurately and quickly perform data analysis on
various measures and predict the outcome, as was proved while predicting diabetes using our dataset. There is
sufficient provision for various visualization techniques to correctly represent the data. In conclusion, data
analysis can be efficiently performed using R.
REFERENCES
1. Shaoming Qiu, Jiahao Li Bo Chen et al, An Improved Prediction Method for Diabetes Based on a Feature-
based Least Angle Regression Algorithm, published in Association for Computing Machinery digital
library, January 2019.
2. Hassan Uraibi, Habshah Midi, and Sohel Rana, “Robust Multivariate Least Angle Regression”. ScienceAsia
43 (2017): 56–60.
3. Faizan Zafar, Saad Raza et al, “Predictive Analytics in Healthcare for Diabetes Prediction”, published in
Association for Computing Machinery, ICBET ’19, March 28–30, 2019, Tokyo, Japan © 2019.
4. Abir Al-Sideiri, Zaihisma Binti Che Cob, and Sulfeeza Bte Mohd Drus, “Machine Learning Algorithms for
Diabetes Prediction: A Review Paper”, in Proceedings of the 2019 International Conference on Artificial
Intelligence, Robotics and Control December 2019 Pages 27–32.
5. Jyotismita Chaki, S. Thillai Ganesh, S.K. Cidham et al, “Machine Learning, and Artificial Intelligence
Based Diabetes Mellitus Detection and Self-Management: A Systematic Review”. Journal of King Saud
University – Computer and Information Sciences, 2020. ISSN 1319-1578. DOI:
https://doi.org/10.1016/j.jksuci.2020.06.013
6. Muhammad Azeem Sarwar, Nasir Kamal et al, Prediction of Diabetes Using Machine Learning Algorithms
in Healthcare, Proceedings of the 24th International Conference on Automation & Computing, Newcastle
University, Newcastle upon Tyne, UK, 6–7 September 2018.
7. Manal Alghamdi, Mouaz Al-Mallah et al, “Predicting Diabetes Mellitus Using SMOTE and Ensemble
Machine Learning Approach: The Henry Ford ExercIse Testing (FIT) Project”. PLoS ONE 12(7) (2017):
e0179805. 10.1371/journal.pone.0179805.
8. Md. Kamrul Hasan, Md. Ashraful Alam et al, Diabetes Prediction Using Ensembling of Different Machine
Learning Classifiers, published in IEEE, 2017, DOI 10.1109/ACCESS.2020.2989857.
9. Zhongxian Xu, Zhiliang Wang et al, A Risk Prediction Model for Type 2 Diabetes Based on Weighted
Feature Selection of Random Forest and XGBoost Ensemble Classifier, published in IEEE, the 11th
International Conference on Advanced Computational Intelligence, June 7–9, 2019, China.
10. W. Kerner and J. Brckel, “Definition Classification and Diagnosis of Diabetes Mellitus”. Experimental and
Clinical Endocrinology & Diabetes 122(7)(2014): 384–386.
11. A. Misra, H. Gopalan, R. Jayawardena, A. P. Hills, M. Soares, A. A. RezaAlbarrán, and K. L. Ramaiya,
“Diabetes in Developing Countries”. Journal of Diabetes 11(7)(2019 Mar): 522–539.
12. Talha Mahboob Alama, Muhammad Atif Iqbal et al, A Model for Early Prediction of Diabetes, Informatics
in Medicine Unlocked 16 (2019) 100204 2352-9148/© 2019 Published by Elsevier Ltd.
13. Hongxia Xu, Yonghui Kong, and Shaofeng Tan, “Predictive Modeling of Diabetic Kidney Disease using
Random Forest Algorithm along with Features Selection, Published in ISAIMS 2020”. Proceedings of the
2020 International Symposium on Artificial Intelligence in Medical Sciences, September 2020 Pages 23–27.
14. Ebru Pekel Özmen and Tuncay Özcan, “Diagnosis of Diabetes Mellitus using Artificial Neural Network and
Classification and Regression Tree Optimized with Genetic Algorithm”. Journal of Forecasting 39(2020):
661–670. wileyonlinelibrary.com/journal/for © 2020 John Wiley & Sons, Ltd.
15. N. Nai-Arun and R. Moungmai, “Comparison of Classifiers for the Risk of Diabetes Prediction”, Procedia
Computer Science 69 (2015): 132–142.
7
High Dimensionality Dataset Reduction Methodologies in Applied
Machine Learning
CONTENTS
7.1 Problems Faced with High Dimensionality Data: An Introduction
7.2 Dimensionality Reduction Algorithms with Visualizations
7.2.1 Feature Selection Using Covariance Matrix
7.2.1.1 Importing the Modules
7.2.1.2 The Boston Housing Dataset
7.2.1.3 Perform Basic Data Visualization
7.2.1.4 Pearson Coefficient Correlation Matrix
7.2.1.5 Detailed Correlation Matrix Analysis
7.2.1.6 3-Dimensional Data Visualization
7.2.1.7 Extracting the Features and Target
7.2.1.8 Feature Scaling
7.2.1.9 Create Training and Testing Datasets
7.2.1.10 Training and Evaluating Regression Model with Reduced Dataset
7.2.1.11 Limitations of the Correlation Matrix Analysis
7.2.2 t-Distributed Stochastic Neighbor Embedding (t-SNE)
7.2.2.1 The MNIST Handwritten Digits Dataset
7.2.2.2 Perform Exploratory Data Visualization
7.2.2.3 Random Sampling of the Large Dataset
7.2.2.4 t-Distributed Stochastic Neighbor Embedding (t-SNE) – An Introduction
7.2.2.5 Probability and Mathematics behind t-SNE
7.2.2.6 Implementing and Visualizing t-SNE in 2-D
7.2.2.7 Implementing and Visualizing t-SNE in 3-D
7.2.2.8 Applying k-Nearest Neighbors (k-NN) on the t-SNE MNIST Dataset
7.2.2.9 Data Preparation – Extracting the Features and Target
7.2.2.10 Create Training and Testing Dataset
7.2.2.11 Choosing the k-NN hyperparameter – k
7.2.2.12 Model Evaluation – Jaccard Index, F1 Score, Model Accuracy, and Confusion Matrix
7.2.2.13 Limitations of the t-SNE Algorithm
7.2.3 Principal Component Analysis (PCA)
7.2.3.1 The UCI Breast Cancer Dataset
7.2.3.2 Perform Basic Data Visualization
7.2.3.3 Create Training and Testing Dataset
7.2.3.4 Principal Component Analysis (PCA): An Introduction
7.2.3.5 Transposing the Data for Usage into Python
7.2.3.6 Standardization – Finding the Mean Vector
7.2.3.7 Computing the n-Dimensional Covariance Matrix
7.2.3.8 Calculating the Eigenvalues and Eigenvectors of the Covariance Matrix
7.2.3.9 Sorting the Eigenvalues and Corresponding Eigenvectors Obtained
7.2.3.10 Construct Feature Matrix – Choosing the k Eigenvectors with the Largest Eigenvalues
7.2.3.11 Data Transformation – Derivation of New Dataset by PCA – Reduced Number of
Dimensions
7.2.3.12 PCA Using Scikit-Learn
7.2.3.13 Verification of Library and Stepwise PCA
7.2.3.14 PCA – Captured Variance and Data Lost
7.2.3.15 PCA Visualizations
7.2.3.16 Splitting the Data into Test and Train Sets
7.2.3.17 An Introduction to Classification Modeling with Support Vector Machines (SVM)
7.2.3.18 Types of SVM
7.2.3.19 Limitations of PCA
7.2.3.20 PCA vs. t-SNE
Conclusion
In the field of artificial intelligence, data explosion has created a plethora of input data and
features to be fed into machine learning algorithms. Since most of the real-world data is
multi-dimensional in nature, data scientists and data analysts require the core concepts of
dimensionality reduction mechanisms for better:
This chapter introduces the practical working implementation of these reduction algorithms
in applied machine learning.
Multiple features make it difficult to obtain valuable insights into data, as the visualization
plots obtained can be 3-dimensional at most. Due to this limitation, dependent
properties/operations such as Outlier Detection and Noise Removal become more and more
non-intuitive to perform on these humongous datasets. Therefore, applying dimensionality
reduction helps in identifying these properties more effortlessly.
Due to this reduced/compressed form of data, mathematical operations such as Scaling,
Classification, and Regression can be performed faster. Also, the data is cleaner, which helps mitigate
the issue of overfitting a model.
Dimensionality reduction can be broadly classified into:
i. Feature Selection Techniques: Feature selection attempts to train the machine learning model by
selectively choosing a subset of the original feature set based on some criteria. Hence, redundant and
obsolete characteristics could be eliminated without much information loss. Examples – Correlation
Matrix Thresholding and Chi-Squared Test Selection.
ii. Feature Extraction/Projection Techniques: This method projects the original input features from the high
dimensional space by summarizing most statistics and removing or transforming redundant data to create
new relevant output features with reduced dimensionality (a lower-dimensional space). Examples –
Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-distributed Stochastic
Neighbour Embedding (t-SNE), and Isometric Mapping (IsoMap).
However, we have limited our discussion to Correlation Matrices, PCA, and t-SNE only, as
covering all such techniques is beyond the scope of this book chapter.
Firstly, we will import all the necessary libraries that we will be requiring for dataset reductions.
TABLE 7.1
The Boston Housing Dataset
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT M
0 0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98 2
1 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.9 9.14 2
2 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 3
3 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 3
Library Information:
The variable boston_dataset is a dictionary-like Python object returned by the scikit-learn library with the following keys:
You can access each key’s values using boston_dataset.key_name, as we did to create a pandas dataframe. You
can read the official scikit-learn datasets documentation (https://scikit-learn.org/stable/datasets/index.html#toy-datasets)
to learn about the embedded datasets.
Alternatively:
You can also run the following alternative code, provided you have cloned our GitHub repository
(https://github.com/khanfarhan10/DIMENSIONALITY_REDUCTION).
df = pd.read_excel("/content/DIMENSIONALITY_REDUCTION/data/Boston_Data.xlsx")
Also, with a working internet connection, you can run:
df = pd.read_excel("https://raw.githubusercontent.com/khanfarhan10/DIMENSIONALITY_REDUCTION/master/data/Boston_Data.xlsx")
–OR–
df = pd.read_excel("https://github.com/khanfarhan10/DIMENSIONALITY_REDUCTION/blob/master/data/Boston_Data.xlsx?raw=true")
Data Insights:
You might want to try df.isnull().sum(), df.info(), and df.describe() to get the column-wise null counts, dataframe
information, and column-wise summary statistics, respectively. However, the data provided here is clean and free from such
issues, which would otherwise need to be handled after inspection.
The Pearson Correlation Coefficient (also known as the Pearson R Test) is a very useful statistical formula
that measures the strength of the linear relationship between two features.
Mathematically,
$$r_{xy} = \frac{N \sum xy - (\sum x)(\sum y)}{\sqrt{\left[N \sum x^2 - (\sum x)^2\right]\left[N \sum y^2 - (\sum y)^2\right]}}$$
where
$r_{xy}$ = Pearson’s Correlation Coefficient between variables x and y
N = number of pairs of x and y variables in the data
Σxy = sum of products between x and y variables
Σx = sum of x values
Σy = sum of y values
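The formula can be checked by hand with a short plain-Python sketch that mirrors it term by term. The function name pearson_r and the sample values (a rooms-versus-price toy example) are hypothetical and are not taken from the Boston dataset.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed directly from the sums in the formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

rooms = [4.0, 5.0, 6.0, 7.0, 8.0]        # hypothetical feature values
price = [12.0, 17.0, 21.0, 30.0, 35.0]   # hypothetical target values
r = pearson_r(rooms, price)              # close to +1: strong positive linear relation
```

Values of r near +1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.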
We will now use Pandas to get the correlation matrix and plot a heatmap using Seaborn (Figure 7.2).
FIGURE 7.2 Correlation matrix plot from Seaborn heatmap for the Boston Dataset
Home prices (MEDV) tend to decrease with the increase in LSTAT. The curve follows a linear – semi-
quadratic equation in nature.
Home prices (MEDV) tend to increase with the increase in RM linearly. There are few outliers present in the
dataset as clearly portrayed by the 3-D visualization.
$$x' = \frac{x - \mu}{\sigma}$$
where
x = Input Feature Variable
x′ = Standardized Value of x
μ = Mean of x
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$ (Standard Deviation)
$x_i$ = Each value in x
N = No. of Observations in x (Size of x)
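As a small worked illustration of standardization, the sketch below scales the four LSTAT values shown in Table 7.1 using the population standard deviation (division by N). The function name standardize is hypothetical, and the chapter's own scaling code may use a library routine instead.

```python
import math

def standardize(values):
    """Return (x - mu) / sigma for every x, with sigma the population std."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]

lstat = [4.98, 9.14, 4.03, 2.94]  # first four LSTAT values from Table 7.1
scaled = standardize(lstat)       # zero mean, unit variance
```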
Multivariate Linear Regression is a linear approach to modelling the relationship (mapping) between
various independent input feature variables and the dependent output target variable.
$$h_\Theta(x) = \Theta_0 + \Theta_1 x_1 + \Theta_2 x_2$$
where
y = Output Target Variable MEDV
We perform Ordinary Least Squares (OLS) Regression using the scikit-learn library to obtain $\Theta_i$.
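To show what an OLS fit computes for a two-feature hypothesis of this form, here is a minimal plain-Python sketch that solves the normal equations (XᵀX)Θ = Xᵀy by Gauss-Jordan elimination. It is an assumption-laden illustration rather than the scikit-learn routine used in the chapter: the helper names and the synthetic data, generated from known coefficients, are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(x1, x2, y):
    """Return [theta0, theta1, theta2] minimising the squared error."""
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]  # design matrix with bias column
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)

# Synthetic data generated from theta = (1, 2, -3), so the fit should recover it
x1 = [0.0, 1.0, 0.0, 1.0, 2.0]
x2 = [0.0, 0.0, 1.0, 1.0, 1.0]
y = [1.0 + 2.0 * a - 3.0 * b for a, b in zip(x1, x2)]
theta = fit_ols(x1, x2, y)
```

Because the synthetic targets contain no noise, the recovered coefficients match the generating ones up to rounding error.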
Model Evaluation – Regression Metrics:
We need to calculate the following values in order to evaluate our model.
Correlation coefficients are a vital parameter when applying linear regression on your datasets. However, this
approach is limited because:
Only LINEAR RELATIONSHIPS are considered as candidates for mapping the target to the
features. However, most mappings are non-linear in nature.
Ordinary Least Squares (OLS) Regression is SUSCEPTIBLE TO OUTLIERS and may learn an inaccurate
hypothesis from noisy data.
There may be non-linear variables other than the ones chosen with Pearson Correlation Coefficient
Thresholding, which have been discarded but do PARTIALLY INFLUENCE the output variable.
A strong correlation assumes that a direct change in the input variable is reflected immediately in the
output variable, but there exist some variables that are SELECTIVELY INDEPENDENT in nature yet
provide a suitably high value of the correlation coefficient.
t-SNE is a nonlinear dimensionality reduction algorithm that is commonly used to reduce complex problems
with linearly nonseparable data.
Linearly nonseparable data refers to data that cannot be separated by any straight line, such as the examples in Figure 7.6.
FIGURE 7.6 A few examples of linearly nonseparable datasets
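For $i \neq j$,
$$p_{j \mid i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$
For $i = j$,
$$p_{i \mid i} = 0$$
$$p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2N}$$
$$\sum_j p_{j \mid i} = 1 \quad \text{for all } i$$
$$p_{ij} = p_{ji}, \qquad p_{ii} = 0, \qquad \sum_{i,j} p_{ij} = 1$$
Also, $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a
Gaussian centered at $x_i$.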
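The properties listed above (rows of the conditional matrix summing to 1, a symmetric joint matrix, and a total of 1) can be verified numerically with a plain-Python sketch. Note the simplifying assumption: a single shared bandwidth sigma is used for every point, whereas t-SNE tunes a separate sigma for each point from the perplexity. The function names and the four sample points are hypothetical.

```python
import math

def conditional_probabilities(points, sigma=1.0):
    """p[i][j] = p_{j|i}: Gaussian similarity of x_j as seen from x_i."""
    n = len(points)
    p = []
    for i in range(n):
        weights = [
            0.0 if k == i  # p_{i|i} is defined as 0
            else math.exp(-math.dist(points[i], points[k]) ** 2 / (2 * sigma ** 2))
            for k in range(n)
        ]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def joint_probabilities(cond):
    """Symmetrise: p_ij = (p_{j|i} + p_{i|j}) / (2N)."""
    n = len(cond)
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)] for i in range(n)]

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]  # hypothetical inputs
cond = conditional_probabilities(points)
joint = joint_probabilities(cond)
```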
t-Distributed stochastic neighbor embedding (t-SNE) minimizes the divergence between two distributions: a
distribution that measures pairwise similarities of the input objects and a distribution that measures pairwise
similarities of the corresponding low-dimensional points in the embedding.
– Van der Maaten and Hinton
Hence, the t-SNE algorithm generates the reduced feature set by synchronizing the probability distributions of the
original data and the best represented low dimensional data.
t-SNE Detailed Information:
For detailed visualization and hyperparameter tuning (perplexity, number of iterations) for t-SNE, visit Distill
(https://distill.pub/2016/misread-tsne/)!
TABLE 7.2
2 Components TSNE on MNIST Dataset
y TSNE-2d-One TSNE-2d-Two
0 8 –1.00306 –0.128447
1 4 –4.42499 3.18461
2 8 –6.95557 –3.65212
3 7 –1.80255 7.62499
Color Palettes: Seaborn
For a list of color variations assigned to the plot shown, visit Seaborn Color Palettes
(https://seaborn.pydata.org/tutorial/color_palettes.html).
TABLE 7.3
Detailed 3-Dimensional TSNE for MNIST Dataset
The k-Nearest Neighbors (k-NN) classifier determines the category of an observed data point by majority
vote of the k closest observations around it.
The measure of this “closeness” of the data is obtained mathematically using some distance metric. For our
purpose, we will be using the Euclidean Distance (d), which is simply the length of the straight line connecting two
points $p_1$ and $p_2$.
In one dimension:
$$d(p_1, p_2) = \sqrt{(x_1 - x_2)^2} = |x_1 - x_2|$$
In two dimensions:
$$d(p_1, p_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
In three dimensions:
$$d(p_1, p_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$
Based on the distance calculated from the x, y, and z coordinates, the algorithm pulls out the closest k neighbors and
then takes a majority vote for the prediction. However, the value of k strongly affects the algorithm and is an
important hyperparameter.
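The distance computation and the majority vote can be sketched in plain Python as follows. This is a toy illustration, not the classifier used on the t-SNE output in this chapter: the function name knn_predict, the six 3-D training points, and their labels are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Two hypothetical, well-separated clusters in 3-D
train_points = [(0.0, 0.0, 0.0), (0.5, 0.2, 0.1), (0.1, 0.4, 0.3),
                (5.0, 5.0, 5.0), (5.2, 4.8, 5.1), (4.9, 5.3, 4.7)]
train_labels = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_points, train_labels, (0.2, 0.2, 0.2))
near_five = knn_predict(train_points, train_labels, (5.0, 5.0, 5.2))
```

An odd k is a common choice for two-class problems because it avoids tied votes.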
Model Evaluation – Jaccard Index, F1 Score, Model Accuracy, and Confusion Matrix
Model Accuracy: It measures the accuracy of the classifier based on the predicted labels and the true labels and is
defined as:
$$\alpha(y, \hat{y}) = \frac{\text{No. of Correctly Classified Predictions}}{\text{Total No. of Predictions}}$$
Jaccard Index:
Given the predicted values of the target variable as $\hat{y}$ and the true/actual values as $y$, the Jaccard index is defined
as:
$$j(y, \hat{y}) = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|} = \frac{|y \cap \hat{y}|}{|y| + |\hat{y}| - |y \cap \hat{y}|}$$
Simplifying,
$$j(y, \hat{y}) = \frac{\text{No. of Correctly Classified Predictions}}{\text{No. of True Samples} + \text{No. of Predicted Samples} - \text{No. of Correctly Classified Predictions}}$$
Confusion Matrix:
The confusion matrix is used to provide information about the performance of a categorical classifier on a set of
test data for which true values are known beforehand (Table 7.4).
TABLE 7.4
Generalized Confusion Matrix
                        Predicted Values
Actual Values           Positive (1)           Negative (0)
Positive (1)            True Positive (TP)     False Negative (FN)
Negative (0)            False Positive (FP)    True Negative (TN)
True Positive (TP): Model correctly predicted Positive cases as Positive. Disease is diagnosed as present and
truly is present.
False Positive (FP): Model incorrectly predicted Negative cases as Positive. Disease is diagnosed as present
but is actually absent. (Type I error)
False Negative (FN): Model incorrectly predicted Positive cases as Negative. Disease is diagnosed as absent
but is actually present. (Type II error)
True Negative (TN): Model correctly predicted Negative cases as Negative. Disease is diagnosed as absent and
is truly absent.
F1 Score:
The F1 score is a measure of model accuracy and is calculated from the precision and recall of each
category as the harmonic mean of the Precision and Sensitivity (Recall). Precision is the ratio of
correctly predicted positive samples to all samples predicted as positive, and recall is the ratio of correctly
predicted positive samples to all samples that are actually positive.
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall (Sensitivity)} = \frac{TP}{TP + FN}$$
$$F_1\ \text{Score} = 2\left(\frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\right)$$
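The metrics of this section can be computed from the four confusion-matrix counts, as the following plain-Python sketch shows for a binary problem. The labels are made up, the function names are hypothetical, and the Jaccard index here is taken over the positive class only, which is an assumption; for the ten-class MNIST task the metrics would be computed per class and then averaged.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn

def evaluate(y_true, y_pred):
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "jaccard": tp / (tp + fp + fn),  # |y ∩ ŷ| / (|y| + |ŷ| - |y ∩ ŷ|)
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # hypothetical actual values
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # hypothetical predictions
metrics = evaluate(y_true, y_pred)
```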
FIGURE 7.10 Confusion matrix for k-NN classification on applied t-SNE for MNIST handwritten digit dataset
7.2.2.13 Limitations of the t-SNE Algorithm
Although t-SNE is particularly well suited for the visualization of high-dimensional datasets, it has the
following pitfalls:
t-SNE scales QUADRATICALLY in the number of objects $N$ and hence it is COMPUTATIONALLY
EXPENSIVE & MEMORY INEFFICIENT.
As compared to other dimensionality reduction algorithms, it is often very time consuming and beyond a
thousand objects, it is found to be TOO SLOW TO BE PRACTICAL.
Often in the case of very high dimensional data, you may need to APPLY ANOTHER
DIMENSIONALITY REDUCTION TECHNIQUE (such as PCA for dense data or Truncated SVD for
sparse data) before using t-SNE.
compactness = perimeter² / area − 1.0
The features obtained from these inputs are captured in the dataframe shown at the end of this section’s code
snippet.
About Breast Cancer:
Breast Cancer develops in breast cells. It can occur in both men and women, though after skin cancer it is one
of the most common cancers diagnosed in females. It begins when the cells in the breast start to grow
uncontrollably. Eventually these cells form tumors that can be detected via X-ray or felt as lumps near the breast
area.
The main challenge is to classify these tumors as malignant (cancerous) or benign (non-cancerous). A tumor is
considered malignant if the cells expand into adjacent tissues or migrate to distant regions of the body. A benign
tumor does not invade nearby tissue or spread to other parts of the body the way cancerous tumors
can. But benign tumors may still be serious if they press on structures such as heart muscles or neurons.
Machine learning techniques can significantly improve the level of breast cancer diagnosis. Analysis shows that
skilled medical professionals can detect cancer with 79% precision, while machine learning algorithms can reach
91% (sometimes up to 97%) accuracy.
Information on Breast Cancer:
For more information, visit Wikipedia: Breast Cancer (https://en.wikipedia.org/wiki/Breast_cancer).
Importing the Dataset:
Construct a column named label that contains the string values of the target mapping (Table 7.5):
TABLE 7.5
UCI Breast Cancer Dataset Overview
     Mean Radius  Mean Texture  Mean Perimeter  Mean Area  Mean Smoothness  Mean Compactness  Mean Concavity  …  Worst Symmetry  Worst Fractal Dimension
565 20.13 28.25 131.2 1261 0.0978 0.1034 0.144 … 0.2572 0.06637
566 16.6 28.08 108.3 858.1 0.08455 0.1023 0.09251 … 0.2218 0.0782
567 20.6 29.33 140.1 1265 0.1178 0.277 0.3514 … 0.4087 0.124
568 7.76 24.54 47.92 181 0.05263 0.04362 0 … 0.2871 0.07039
Alternative:
You can also run the following code, provided you have cloned our GitHub repository.
df = pd.read_excel("/content/DIMENSIONALITY_REDUCTION/data/UCI_Breast_Cancer_Data.xlsx")
Also, with a working internet connection, you can run:
df = pd.read_excel("https://raw.githubusercontent.com/khanfarhan10/DIMENSIONALITY_REDUCTION/master/data/UCI_Breast_Cancer_Data.xlsx")
–OR–
df = pd.read_excel("https://github.com/khanfarhan10/DIMENSIONALITY_REDUCTION/blob/master/data/UCI_Breast_Cancer_Data.xlsx?raw=true")
Count Plot:
A countplot counts the number of observations for each class of target variable and shows it using bars on each
categorical bin (Figure 7.12).
FIGURE 7.12 Seaborn countplot for UCI breast cancer – target field
In higher dimensions, where it is not possible to see the patterns between the data points of a dataset,
Principal Component Analysis (PCA) helps to find the correlations and patterns between the data points
so that the data can be compressed from a higher dimensional space to a lower dimensional space
by reducing the number of dimensions without any major loss of information.
This algorithm helps in better data intuition and visualization and is efficient in segregating Linearly Separable
Data.
Chronological Steps to Compute PCA:
$$cov(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
Note that cov(x, y) = cov(y, x); hence, the covariance matrix is symmetric across the central diagonal. Also,
covariance(x, x) = variance(x)
$$S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
where
$S^2$ = sample variance
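For 3-dimensional datasets (dimensions x, y, z), we have to calculate cov(x, y), cov(y, z), and cov(x, z), and the
covariance matrix will look like:
$$C = \begin{bmatrix} cov(x, x) & cov(x, y) & cov(x, z) \\ cov(y, x) & cov(y, y) & cov(y, z) \\ cov(z, x) & cov(z, y) & cov(z, z) \end{bmatrix}$$
and similar calculations follow for higher-order datasets. For n-dimensional datasets, the number of covariance
values = $\frac{n!}{2(n-2)!}$.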
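A small plain-Python sketch makes the structure of the covariance matrix visible for a 3-dimensional dataset. The three sample columns and the function names are hypothetical; the sketch uses the sample covariance with the n − 1 denominator given above.

```python
def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def covariance_matrix(columns):
    """Entry (i, j) is the covariance between column i and column j."""
    return [[covariance(ci, cj) for cj in columns] for ci in columns]

# Hypothetical 3-dimensional dataset with four observations
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 6.0]
z = [9.0, 7.0, 4.0, 2.0]
c = covariance_matrix([x, y, z])  # 3 x 3, symmetric, variances on the diagonal
```

For n = 3 there are 3!/(2 · 1!) = 3 distinct off-diagonal covariance values, matching the count formula above.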
7.2.3.8 Calculating the Eigenvalues and Eigenvectors of the Covariance Matrix
Eigenvalues:
A covariance matrix is always a square matrix, from which we calculate the eigenvalues and eigenvectors. Let the
covariance matrix be denoted by $C$; then the characteristic equation of this covariance matrix is $|C - \lambda I| = 0$. The
roots (i.e., λ values) of the characteristic equation are called the eigenvalues or characteristic roots of the square
matrix $C_{n \times n}$. Therefore, the eigenvalues of $C$ are the roots of the characteristic polynomial $\det(C - \lambda I) = 0$.
Eigenvectors:
A non-zero vector $X$ such that $(C - \lambda I)X = 0$, or $CX = \lambda X$, is called an eigenvector or characteristic vector
of the matrix $C_{n \times n}$ corresponding to this λ.
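For a 2 × 2 symmetric matrix the characteristic equation is a quadratic, so the eigenvalues and eigenvectors can be worked out by hand. The sketch below does exactly that in plain Python; it is only an illustration under the assumption of a non-zero off-diagonal entry, the function name and the example matrix are hypothetical, and larger matrices would normally be handled by a numerical library.

```python
import math

def eigen_2x2_symmetric(c):
    """Eigenvalues (descending) and unit eigenvectors of [[a, b], [b, d]], b != 0."""
    (a, b), (_, d) = c
    if b == 0:
        raise ValueError("diagonal matrix: the eigenvectors are the coordinate axes")
    # |C - lambda*I| = lambda^2 - (a + d)*lambda + (a*d - b*b) = 0
    trace, det = a + d, a * d - b * b
    root = math.sqrt(trace ** 2 - 4 * det)
    eigenvalues = [(trace + root) / 2, (trace - root) / 2]
    eigenvectors = []
    for lam in eigenvalues:
        vx, vy = b, lam - a            # solves (C - lambda*I) X = 0
        norm = math.hypot(vx, vy)
        eigenvectors.append((vx / norm, vy / norm))
    return eigenvalues, eigenvectors

c = [[2.0, 1.0], [1.0, 2.0]]  # hypothetical covariance matrix
eigenvalues, eigenvectors = eigen_2x2_symmetric(c)
```

Sorting the eigenvalues in descending order is what later allows the k eigenvectors with the largest eigenvalues to be selected.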
TABLE 7.6
All Eigenvalues and Eigenvectors from the PCA Dataframe
TABLE 7.7
Two Selected Eigenvalues and Eigenvectors from the PCA Dataframe
443782.61 7310.1001
0 0.005086 0.009287
1 0.002197 -0.002882
2 0.035076 0.062748
3 0.516826 0.851824
Now we know that the value of the theoretic PCA calculated from the stepwise calculation matches the PCA
from the Scikit-Learn Library. However, since there are various steps that are extensive in nature, we will now use
the SKLearn PCA henceforth.
7.2.3.14 PCA – Captured Variance and Data Lost
The explained variance ratio is the ratio between the variance attributed to each selected principal component and the
total variance. The total variance is the sum of the variances of all individual principal components. Multiplying this
explained variance ratio by 100%, we get the percentage of variance ascribed to each chosen principal
component, and subtracting the sum of the explained variance ratios from 1 gives us the total loss in variance.
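Since the variance along each principal component equals its eigenvalue, the explained variance ratio and the variance lost can be illustrated with a few lines of plain Python. The eigenvalues below are hypothetical round numbers chosen for readability, not the values obtained for the breast cancer dataset, and the function name is an assumption.

```python
def explained_variance_ratio(eigenvalues, k):
    """Ratios for the k largest eigenvalues and the fraction of variance lost."""
    total = sum(eigenvalues)  # total variance over all components
    ratios = [value / total for value in sorted(eigenvalues, reverse=True)[:k]]
    lost = 1.0 - sum(ratios)
    return ratios, lost

eigenvalues = [6.0, 3.0, 0.75, 0.25]  # hypothetical eigenvalues
ratios, lost = explained_variance_ratio(eigenvalues, k=2)
# ratios is approximately [0.6, 0.3]; about 10% of the variance is discarded
```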
Hence, the PCA components variance are as follows:
TABLE 7.8
Pandas PCA Dataframe Containing 2-Component and 3-Component PCA Values – An Overview
The primary objective of an SVM Classifier is to find out a decision boundary in the n- dimensional space
(where n is number of features) that will segregate the space into two regions where in one region the
hypothesis predicts that y=1 and in another region the hypothesis predicts that y=0.
This decision boundary is also called a hyper-plane. There could be many possible hyper-planes but the goal of
SVM is to choose those extreme points or vectors which will help to create one hyper-plane that will have
maximum margin i.e., maximum distance between the data points of two regions or classes. The hyper-plane
with maximum margin is termed as optimal hyper-plane. Those extreme points or vectors are called support
vectors and hence, this algorithm is termed Support Vector Machine.
Given a training data of n points: $(\vec{x}_i, y_i)$ where $y_i = 0$ or $1$
where,
$\vec{w}$ is the normal vector to the hyper-plane.
$b \geq 0$ is the distance from the origin to the plane (or line) (Figure 7.15).
FIGURE 7.15 SVM working principle visualization in geometric coordinates
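The role of the normal vector and the offset can be illustrated with a short plain-Python sketch. It assumes that the hyper-plane is written as w · x − b = 0 and that w has unit length, so that b is the distance from the origin; both the sign convention and the numbers are assumptions for illustration, and no training is performed here.

```python
import math

def decision_value(w, b, x):
    """Signed value of w . x - b; zero exactly on the hyper-plane."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b

def predict(w, b, x):
    """Predict y = 1 on the side the normal vector points to, else y = 0."""
    return 1 if decision_value(w, b, x) >= 0 else 0

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from x to the hyper-plane."""
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))

w = (0.6, 0.8)  # hypothetical unit normal vector
b = 2.0         # hypothetical offset: distance of the line from the origin

label_far = predict(w, b, (3.0, 4.0))     # lies on the positive side
label_origin = predict(w, b, (0.0, 0.0))  # lies on the negative side
```

Training an SVM amounts to choosing w and b so that the smallest of these distances over the training points is as large as possible.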
PCA is a Linear Dimensionality Reduction Algorithm and hence chooses eigenvectors with correspondingly
high eigenvalues. In some cases, such as highly nonlinear data spaces, this approach will fail because the
NONLINEAR COMPONENTS will be TRUNCATED (disregarded as noise) and will contribute little to
the model variance. It assumes that a large variance results in a low covariance, which in turn implies
high importance, which may not always be true. For such cases we need to shift to Kernel PCA (KPCA),
which requires a lot of space.
PCA is a SCALE VARIANT Algorithm, and hence any change of scale in any of the variables will affect the
PCA values accordingly.
For some data distributions, the MEAN AND COVARIANCE DESCRIPTION IS INACCURATE. The
algorithm performs well for Gaussian/Normal data distributions, but this
may not hold for other distributions.
TABLE 7.9
Differences between PCA and TSNE
PCA T-SNE
Conclusion
In this chapter, we applied the concepts of dimensionality reduction in applied machine learning to various
datasets. The authors recommend trying out other datasets as well, to practice and gain a firm understanding of the
algorithms used in this chapter for reducing high dimensionality datasets.
Some reliable data sources are as follows:
CONTENTS
8.1 Introduction
8.2 Basic Concepts
8.2.1 Cellular Automaton
8.3 Discussions on CA Evolutions
8.3.1 Relation between Local and Global Transition Function of a Spatially Hybrid CA
8.4 CA Modeling of Dynamical Systems
8.4.1 Spatially Hybrid CA Models
8.4.2 Temporally Hybrid CA Models
8.4.3 Spatially and Temporally Hybrid CA Models
8.5 Conclusion
References
8.1 Introduction
A dynamical system is a mathematical model for computing changes over time of any physical,
biological, economic, or social phenomenon [1,2]. Also, the evolution of a data sequence can be viewed as the
evolution of a dynamical system. Usually a dynamical system is described mathematically by differential
or difference equations. Another widely used computation model is Cellular Automata (CA) [3]. This
model was introduced by J. von Neumann and S. Ulam in the 1940s for designing self-replicating systems,
and it later saw applications in physics, biology, and computer science.
Von Neumann conceived a CA as a two-dimensional mesh of finite state machines, called cells, which are
locally interconnected with each other. Each cell changes its state synchronously depending on
the states of some neighbouring cells (for details see [4–6] and references therein). Stephen Wolfram’s
work in the 1980s contributed to a systematic study of one-dimensional CA, providing the first
qualitative classification of their behavior (reported in [7,8]).
A CA is capable of modeling systems where change at the micro level of each point of a surface is
triggered by neighboring points [9–12,19]. The totality of these changes at the micro level generates an
evolution pattern at the macro level. If the local (micro level) changes are identical, then the CA is
homogeneous; otherwise the CA is hybrid. A spatially hybrid CA can be visualized as being composed of
finite celled blocks, where each block has a different local transition function and, within any block, all the
cells follow the same local transition function for all time-steps. For a temporally hybrid CA, although
the transition function of the cells at a particular time may be the same, the global and local transition
functions vary over different time-steps.
In this chapter, we have considered only synchronous CA (all the cell states are updated simultaneously)
where the underlying topology is a one-dimensional grid line which evolves over discrete time-steps
[13,14].
Section 8.2 is devoted to fundamental results used in this chapter. In Section 8.3, we report our work
on local and global transition functions of a hybrid CA. We also discuss evolution patterns of some
transition functions of a finite celled homogeneous CA with periodic boundaries which constitute a
hybrid CA. Spatially and/or temporally hybrid CA models of some discrete dynamical systems are
designed in Section 8.4.
Thus, a CA is a computation model where a finite or countably infinite number of cells are arranged in an
ordered n-dimensional grid. Each cell receives input from the neighboring cells and changes according
to the transition function. The transitions at each of the cells together induce a change of the grid pattern
[16].
Here we have considered only synchronous, homogeneous one-dimensional CA. A typical one-
dimensional CA is given below (Figure 8.1).
A CA does not have any external input and hence is self-evolving. However, the different possible
combinations of the state of a cell at any ith grid point along with the states of its adjacent cells can be
considered as inputs for the cell at the ith grid point.
Each cell works synchronously leading to evolution of the entire grid through a number of discrete
time steps. If the set of memory elements of each FSSA is {0,1}, then a typical pattern evolved over time
t (represented along horizontal axis) may be as shown in Table 8.1.
TABLE 8.1
$C_t$ Is the Configuration of the CA (Represented along Vertical Axis) at Time $t$
Grid Position (i) ↓ \ Time →    t = 0    t = 1    …
$A_{i+2}$                       0        0        …
$A_{i+1}$                       0        1        …
$A_i$                           1        0        …
$A_{i-1}$                       0        1        …
⋮                               ⋮        ⋮
Configuration →                 $C_0$    $C_1$    …
DEFINITION 1.2.2: Let $Q$ be a finite set of memory elements, also called the state set. The
memory elements of the cells belonging to the set $Q$ are placed on an ordered
line.
A global configuration is a mapping from the group of integers $\mathbb{Z}$ to the set $Q$ given by
$C : \mathbb{Z} \to Q$.
A CA (denoted by $C^{Q}_{\tau}$) is a triplet $(Q, Q^{\mathbb{Z}}, \tau)$ where $Q$ is the finite state set, $Q^{\mathbb{Z}}$ is the set of all
configurations, and $\tau$ is the global transition function.
REMARK: For a particular state set $Q$ and a particular global transition function $\tau$, a triple
$(Q, Q^{\mathbb{Z}}, \tau)$ denoted by $C^{Q}_{\tau}$ evolves from an initial configuration $C_0 \in Q^{\mathbb{Z}}$ and $\tau(C_t) = C_{t+1}$. For example,
$C_0 = \ldots 001000 \ldots$; $\tau(C_0) = \tau(\ldots 0100 \ldots) = \ldots 1010 \ldots = C_1$; $\tau(C_1) = \ldots 0101 \ldots = C_2$, etc.
The CA defined above has the same global transition function $\tau : Q^{\mathbb{Z}} \to Q^{\mathbb{Z}}$ for all time $t$. However,
there is a special class of CA called temporally hybrid CA where the global transition function varies
over time. The formal definition is given with reference to Definition 1.2.2.
A temporally hybrid CA is a triplet $(Q, Q^{\mathbb{Z}}, \tau_t)$ where $Q$
is the finite state set, $Q^{\mathbb{Z}}$ is the set of all configurations, and $\tau_t$ is a global
transition function.
Evolution of a CA is mathematically expressed by the global transition function. However, this global
transition is induced by transitions of the cells at each grid point of the CA. The transition of the state of
the cell at the $i$th grid point of a CA at a particular time depends on the state of the $i$th cell $C(i)$ or $c_i$, and
its adjacent cells. These adjacent cells constitute the neighborhood of the $i$th cell. The transition of the
cell at each grid point is called local transition.
DEFINITION 1.2.4: For $i \in \mathbb{Z}$, $r \in \mathbb{N}$, let $S_i = \{i - r, \ldots, i - 1, i, i + 1, \ldots, i + r\} \subseteq \mathbb{Z}$. $S_i$ is the $i$th
neighborhood of radius $r$.
The mapping $\mu_i : Q^{S_i} \to Q$ is known as a local transition function for the $i$th automaton having
radius $r$. Thus, $\forall i \in \mathbb{Z}$, $\mu_i(c_i) \in Q$. So, if the local configuration of the $i$th cell at time $t$ is …
2. $\tau(C) = \tau(\ldots, c_{i-1}, c_i, c_{i+1}, \ldots) = \ldots \mu_{i-1}(c_{i-1}) . \mu_i(c_i) . \mu_{i+1}(c_{i+1}) \ldots$
4. At any time, if all $\mu_i$'s are not identical, then the CA is called a Spatially Hybrid CA.
DEFINITION 1.2.6: If for a particular CA, |Q| = 2 so that we can write Q = {0, 1}, then the CA
is said to be a binary CA or a Boolean CA.
DEFINITION 1.2.9: A local transition function $\mu$ is an m-place left shift function denoted by $\mu_{Lm}$
where $m \in \mathbb{N}$ is finite, if the state of the $i$th automaton $c_i$ shifts $m$ places
leftwards. So, $\forall i \in \mathbb{Z}$, $\mu_{Lm}(c_i) = c_{i+m}$.
DEFINITION 1.2.10: A local transition function $\mu$ is an m-place right shift function denoted by
$\mu_{Rm}$ where $m \in \mathbb{N}$ is finite, if the state of the $i$th automaton $c_i$ shifts $m$ places
rightwards. So, $\forall i \in \mathbb{Z}$, $\mu_{Rm}(c_i) = c_{i-m}$.
$\tau(C_i) = C_j \Leftrightarrow \tau^{-1}(C_j) = C_i$
2. $\tau_q$ is stable from the initial step or after the first transition step for some $q \in Q$.
Proof.
1. From the definition of $\tau_e$, the result follows.
Then $\tau_q(C) = C^{*}$ where $C^{*}(i) = q\ \forall i \in \mathbb{Z}$ and the CA becomes stable after the first step.
Proof. Let an $n$-celled CA with periodic boundaries have the transition function $\tau_{Lm}$ where $m < n$.
As the transition pattern continues, the cell states shift $m$ places leftwards at each discrete time step.
So, $\forall i \in \{1, 2, \ldots, n\}$,
$\tau_{Lm}(C(i)) = C(i + m)$
and the initial configuration reappears after $\frac{n}{\gcd(m,n)}$ time steps.
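The periodicity of the m-place left shift can be checked numerically. The following is a minimal Python sketch under stated assumptions, not part of the original proof: the helper names `left_shift` and `period` are illustrative, and the initial configuration has a single active cell so that it is not block repetitive.

```python
from math import gcd

def left_shift(cells, m):
    """Apply the m-place left shift with periodic boundaries: new c_i = old c_(i+m)."""
    m %= len(cells)
    return cells[m:] + cells[:m]

def period(cells, m):
    """Count the time steps until the initial configuration reappears."""
    current, steps = left_shift(cells, m), 1
    while current != cells:
        current, steps = left_shift(current, m), steps + 1
    return steps

# One active cell guarantees that the configuration is not block repetitive.
for n, m in [(9, 3), (5, 2), (10, 5), (12, 8)]:
    cells = [1] + [0] * (n - 1)
    assert period(cells, m) == n // gcd(m, n)
```

For every pair (n, m) tried here, the observed period agrees with n / gcd(m, n).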
1. $s$-periodic where
$$s = \begin{cases} \dfrac{k}{\gcd(m,k)} & \text{if } m < k < n \\ \dfrac{k}{\gcd(m-k,k)} & \text{if } k < m < n \end{cases}$$
2. stationary from initial step if $m = k < n$.
Proof. A homogeneous CA with k-celled repetitive block configuration for some finite k ∈ N can be
treated as a k-celled CA with periodic boundaries having initial configuration that is not block repetitive.
1. Thus, under $\tau_{Lm}$, the initial configuration reappears at every
$s = \dfrac{k}{\gcd(m,k)}$ step if $m < k < n$
$s = \dfrac{k}{\gcd(m-k,k)}$ step if $k < m < n$
2. Again, if m = k < n, then since the cell states shift k places leftwards at every discrete time step,
the initial configuration remains stationary always.
Hence the result follows.
some finite m ∈ N can be obtained where the cell states shift m places rightwards at each
discrete time step.
8.3.1 Relation between Local and Global Transition Function of a Spatially Hybrid CA
Here it is assumed that homogeneous blocks have periodic boundaries and that the evolution of the blocks
is independent of each other. Hence, the leftmost cell and the rightmost cell of each block are
considered to be neighbors. The evolution of the hybrid CA globally depends on the evolution of the
composing homogeneous blocks.
$\tau^{-1}(\tau(C)(i)) = C(i) = c_i$
EXAMPLE 1.3.1: Let us consider a hybrid CA with periodic boundaries with an initial
configuration $C_0 = \overbrace{1010}^{B_1}\,\overbrace{010}^{B_2}$ such that cells of block $B_1$ follow $\mu_{R1}$ and
cells of $B_2$ follow $\mu_{L1}$.
Now $(\mu_{R1})^{-1} = \mu_{L1}$. But for the cells of block $B_1$, $\mu_{L1}(\mu_{R1}(c_i))$ give $\overbrace{1011}^{B_1}$ and not $\overbrace{1010}^{B_1}$.
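Example 1.3.1 can be replayed in a few lines of Python. This is a hedged sketch: it assumes that each cell reads its neighbor across block borders and that only the whole 7-cell row has periodic boundaries, a reading chosen because it reproduces the 1011 stated in the example. The names `hybrid_step`, `forward`, and `backward` are illustrative.

```python
def hybrid_step(cells, rules):
    """One synchronous step of a spatially hybrid CA with periodic boundaries.

    rules[i] is the offset of the cell whose state cell i copies:
    -1 for the 1-place right shift (mu_R1), +1 for the 1-place left shift (mu_L1).
    """
    n = len(cells)
    return [cells[(i + rules[i]) % n] for i in range(n)]

R1, L1 = -1, +1
c0 = [1, 0, 1, 0, 0, 1, 0]          # block B1 = 1010, block B2 = 010
forward = [R1] * 4 + [L1] * 3       # B1 follows mu_R1, B2 follows mu_L1
backward = [L1] * 4 + [R1] * 3      # cell-wise inverses of the local rules

c1 = hybrid_step(c0, forward)
restored = hybrid_step(c1, backward)
print(restored[:4])                 # block B1 after applying mu_L1 to mu_R1(c_i)
```

Applying the cell-wise inverse rules does not restore the initial configuration, because the cells at the block borders copy states that were produced by a different local rule.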
$y_{t+1} = f(y_t) \Leftrightarrow C_{t+1} = \tau(C_t)$
Since $C_t \equiv (\ldots, c_{i-2}(t), c_{i-1}(t), c_i(t), c_{i+1}(t), c_{i+2}(t), \ldots)$, $\forall i \in \mathbb{Z}$, at any particular time $t$, the $i$th cell …
Some examples of one-dimensional discrete dynamical systems and their corresponding CA models
have been depicted here.
We will restrict ourselves to the following three types of synchronous hybrid CA.
1. Spatially hybrid CA composed of finite celled blocks of homogeneous CA whose transition
functions are independent of time.
2. Temporally hybrid CA composed of countable cells that evolve homogeneously over blocks of
finite time intervals.
3. Hybrid CA that are spatial as well as temporal in nature.
For a spatially hybrid CA, composed of a finite number of cells $A_0, \ldots, A_n$, the configuration $C_t$ at a
particular time $t$ is
$C_t \equiv (c_0(t), \ldots, c_n(t))$
such that for $i = 0, 1, \ldots, n$, $c_i(t)$ is the state of the $i$th cell $A_i$ at time $t$.
Here, the evolution of the system is independent of time and the transition functions of different blocks
of cells are different. Thus at any time,
$C_{t+1} = \tau(C_t) \equiv (\mu_0(c_0(t)), \ldots, \mu_i(c_i(t)), \ldots, \mu_n(c_n(t)))$
Let $y_t$ denote a binary data sequence of finite length at a particular time $t$. Each bit of this binary
sequence corresponds to a cell of the CA such that the $i$th bit is represented by the state of the cell
$A_i$.
The 32-bit sequence and the transition functions of its four blocks are
$\underbrace{110100110}_{\mu_{R3}}\ \underbrace{10101}_{\mu_{L2}}\ \underbrace{10111110}_{\mu_{e}}\ \underbrace{1100010110}_{\mu_{R5}}$
Block I is periodic with $\frac{9}{\gcd(3,9)}$ period, i.e., 3-period.
Block II is periodic with $\frac{5}{\gcd(2,5)}$ period, i.e., 5-period.
Block III is stable from the initial step, i.e., 1-period.
Block IV is periodic with $\frac{10}{\gcd(5,10)}$ period, i.e., 2-period.
Thus, the 32-bit sequence is globally periodic with lcm(3, 5, 1, 2) period, i.e., 30-period.
A spatially hybrid CA model of the data sequence is shown in Table 8.2.
TABLE 8.2
Evolution of Binary Data Sequence
↓ $A_1 \ldots A_5 \ldots A_9$ | $A_{10} \ldots A_{14}$ | $A_{15} \ldots A_{18} \ldots A_{22}$ | $A_{23} \ldots A_{28} \ldots A_{32}$
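The 30-step global period of the binary data sequence can be reproduced with a short simulation. The sketch below is illustrative only: the block lengths (9, 5, 8, 10) are read off the cell ranges in the header of Table 8.2, each block is assumed to have its own periodic boundaries, and the names `BLOCKS`, `shift_block`, `step`, and `global_period` are assumptions.

```python
from math import gcd
from functools import reduce

SEQUENCE = "11010011010101101111101100010110"
# (block length, shift direction, shift size) for Blocks I to IV
BLOCKS = [(9, "R", 3), (5, "L", 2), (8, "e", 0), (10, "R", 5)]

def shift_block(block, direction, m):
    """Rotate one block with its own periodic boundaries."""
    m %= len(block)
    if direction == "e" or m == 0:
        return block
    if direction == "L":
        return block[m:] + block[:m]
    return block[-m:] + block[:-m]      # "R": states move m places rightwards

def step(config):
    """Apply each block's local transition function once."""
    out, start = [], 0
    for length, direction, m in BLOCKS:
        out.append(shift_block(config[start:start + length], direction, m))
        start += length
    return "".join(out)

def global_period(config):
    """Count the time steps until the whole sequence reappears."""
    current, t = step(config), 1
    while current != config:
        current, t = step(current), t + 1
    return t

block_periods = [9 // gcd(3, 9), 5 // gcd(2, 5), 1, 10 // gcd(5, 10)]
expected = reduce(lambda a, b: a * b // gcd(a, b), block_periods)
print(global_period(SEQUENCE), expected)
```

Both numbers agree: the simulated return time equals the least common multiple of the four block periods.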
Let $y_t$ denote a DNA sequence of finite length at a particular time $t$. Each nucleotide of the DNA
sequence corresponds to a cell of the CA [20]. The 4 bases Adenine, Cytosine, Guanine, and
Thymine, represented by A, C, G, and T, respectively, correspond to the 4 possible states of a CA
cell.
Let a DNA sequence of length 33 initially be … (with block transition functions $\mu_{R2}$ and $\mu_e$).
… bases of cells $A_1$ to $A_{28}$ shift 2 places rightwards and the bases of the cells $A_{29}$ and $A_{30}$ are acquired by …
Block II of length 3-bases follows identity transition $\mu_e$. Therefore, the bases of cells $A_{31}$, $A_{32}$, and …
A spatially hybrid CA model of the evolution of the DNA sequence is shown in Table 8.3.
TABLE 8.3
Evolution of a DNA Sequence
↓ A 1 …A 4 …A 7 …A 10 …A 13 …A 16 …A 19 … A 22 …A 25 …A 28 …A 30 A 31 …A 33
For a temporally hybrid CA, composed of a finite number of cells $A_0, \ldots, A_n$, the configuration $C_t$ at
a particular time $t$ is
$C_t \equiv (c_0(t), \ldots, c_n(t))$
such that for $i = 0, 1, \ldots, n$, $c_i(t)$ is the state of the $i$th cell $A_i$ at time $t$.
Here, the evolution of the system is time-dependent and all the cells at a particular time $t$ follow the
same transition function $\mu^{(t)}$. Thus at time $t$,
$C_{t+1} = \tau^{(t)}(C_t) \equiv (\mu^{(t)}(c_0(t)), \ldots, \mu^{(t)}(c_n(t)))$
… to a cell of the CA. If at time $t$, the elevator is at the $p$th floor, then the state $c_p(t)$ of cell $A_p$ will be 1.
Let a trajectory followed by the elevator starting from any floor and returning to that floor of an n-storey
building be given as follows:
Initially the elevator is at the $i$th floor. During the journey, the elevator stops at each floor it passes.
From the $i$th floor it travels up to the $k$th floor in $m_1$ time steps. At the $k$th floor it waits until time $m_2$
and then it comes down to the $j$th floor at time $m_3$. It further goes up to the $l$th floor at time $m_4$ and it
comes down to the $i$th floor at time $m_5$, where $0 \le i < j < k < l \le n$ and …
This trajectory is modeled by a temporally hybrid CA with cells $A_0, \ldots, A_n$ as follows: Initially cell
$A_i$ is at state 1.
TABLE 8.4
Trajectory of an Elevator in a Building
t=0 0 … 0 1 0 … 0 … 0 … 0 … 0
t=1 0 … 0 0 1 0 0 … 0 … 0 … 0
⋮ ⋮
… ⋮
… … ⋮
… ⋮
… ⋮
… ⋮
⋱
t = m1 0 … … 0 … … … 0 1 0 … … 0
⋮ ⋮
… … ⋮
… … … ⋮ ⋮ ⋮
… … ⋮
t = m2 0 … … 0 … … … 0 1 0 … … 0
⋮
0 … … 0 … … … 1 0 0 … … 0
⋮
0 … … 0 … … … 0 … … … 0
t = m3
⋮
… … ⋮
… … 1 0 ⋮ ⋮
… … ⋮
⋮
0 … … 0 … … 0 1 0 0 … … 0
⋮
0 … … 0 … … … 0 0 … … 0
⋱
t = m4 0 … … 0 … … … 0 ⋮
0 1 0 0
⋮
0 … … 0 … … … … 0 1 0 … 0
⋮
0 … … 0 … 1 0 … … … 0 … 0
t = m5 − 1 0 … … 0 1 0 0 … 0 … … … 0
t = m5 0 … 0 1 0 … … … … … … … 0
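The elevator trajectory can be sketched as a temporally hybrid CA in which every cell follows the same shift at a given time step, while the shift itself changes over time. The floor numbers, the waiting time, and the names `shift` and `schedule` below are hypothetical values chosen for illustration; moving up one floor is taken to be the 1-place right shift, as in the first two rows of Table 8.4.

```python
def shift(cells, offset):
    """offset = +1 moves the active cell one floor up (mu_R1),
    -1 moves it one floor down (mu_L1), 0 is the identity (mu_e)."""
    n = len(cells)
    return [cells[(p - offset) % n] for p in range(n)]

i, j, k, l, wait = 1, 3, 4, 6, 2      # hypothetical floors and waiting time
n_cells = 8                           # floors 0..7

schedule = ([+1] * (k - i)      # up from floor i to floor k (reached at t = m1)
            + [0] * wait        # wait at floor k until t = m2
            + [-1] * (k - j)    # down to floor j (t = m3)
            + [+1] * (l - j)    # up to floor l (t = m4)
            + [-1] * (l - i))   # down to floor i (t = m5)

cells = [1 if p == i else 0 for p in range(n_cells)]
trajectory = [cells.index(1)]
for offset in schedule:
    cells = shift(cells, offset)
    trajectory.append(cells.index(1))
print(trajectory)
```

Each entry of `trajectory` is the index of the single active cell, i.e., the floor of the elevator at that time step.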
Example 8.4: Pattern of the Daily Number of COVID-19 Active Cases in India during the
Month of September 2020 (as given in [21]) Modeled by a Temporally Hybrid
CA
Let $y_t$ denote the number of COVID-19 active cases on a particular day $t$. Each cell of the CA …
On any particular day $t$, the number of active cases is reflected by its corresponding cell being in
state 1, while the other cells have state 0 at day $t$.
According to [21], the number of active cases on August 31st is 785127 and on September 1st is
800127. Thus, 785127 corresponds to cell $A_{78}$ and 800127 corresponds to cell $A_{80}$.
If August 31st is considered to be day 0, then on day 0, cell $A_{78}$ will have state 1 and the other cells state
0. Clearly, the transition function of this CA on day 0 is the 2-place right shift transition $\mu_{R2}$, since on day 1,
cell $A_{80}$ acquires state 1 and all other cells state 0.
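The mapping from case counts to cells and shifts can be sketched as follows. The interval width of 10,000 cases per cell is an assumption inferred from the two mappings quoted above (785127 to $A_{78}$ and 800127 to $A_{80}$); the function names are illustrative.

```python
def cell_index(active_cases, interval=10_000):
    """Map a case count to the cell covering its interval (785127 -> 78)."""
    return active_cases // interval

def shift_rule(cases_today, cases_tomorrow):
    """Name the shift that moves the single active cell between two days."""
    m = cell_index(cases_tomorrow) - cell_index(cases_today)
    if m == 0:
        return "mu_e"
    return f"mu_R{m}" if m > 0 else f"mu_L{-m}"

print(shift_rule(785127, 800127))   # day 0 -> day 1
```

A day on which the count stays within the same interval yields the identity transition, and a falling count yields a left shift.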
The number of active cases in India (from [21]) during the month of September 2020 has been given
in Table 8.5.
TABLE 8.5
Number of COVID-19 Active Cases in India during September 2020
Therefore the pattern of the active cases during September 2020 can be reflected between cells A 78
REMARK 5: This model can also be used for modeling a continuous data set by a CA when the
data can be subdivided into intervals and the dynamical system is such that the intervals evolve
over time [22].
TABLE 8.6
Pattern of COVID-19 Active Cases in India during September 2020
31/08/2020 (Day 0) …0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …
01/09/2020 (Day 1) …0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …
02/09/2020 (Day 2) …0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …
03/09/2020 (Day 3) …0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …
04/09/2020 (Day 4) …0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …
05/09/2020 (Day 5) …0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …
06/09/2020 (Day 6) …0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …
07/09/2020 (Day 7) …0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …
08/09/2020 (Day 8) …0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 …
09/09/2020 (Day 9) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …
10/10/2020 (Day 10) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 …
11/09/2020 (Day 11) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 …
12/09/2020 (Day 12) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 …
13/09/2020 (Day 13) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 …
14/09/2020 (Day 14) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 …
15/09/2020 (Day 15) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 …
16/09/2020 (Day 16) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 …
17/09/2020 (Day 17) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 …
18/09/2020 (Day 18) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 …
19/09/2020 (Day 19) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 …
20/09/2020 (Day 20) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 …
21/09/2020 (Day 21) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 …
22/09/2020 (Day 22) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 …
23/09/2020 (Day 23) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 …
24/09/2020 (Day 24) …0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 …
Date ↓ …A 77 A 78 … … … … … … … … … … … … A 101 A 102 …
For a spatially and temporally hybrid CA, composed of a finite number of cells $A_0, \ldots, A_n$, the
configuration $C_t$ at a particular time $t$ is
$C_t \equiv (c_0(t), \ldots, c_n(t))$
such that for $i = 0, 1, \ldots, n$, $c_i(t)$ is the state of the $i$th cell $A_i$ at time $t$.
Here the evolution of the system is time-dependent and at a particular time $t$ different blocks of
cells follow different transition functions. Thus at time $t$,
$C_{t+1} = \tau^{(t)}(C_t) \equiv (\mu_0^{(t)}(c_0(t)), \ldots, \mu_i^{(t)}(c_i(t)), \ldots, \mu_n^{(t)}(c_n(t)))$
Moreover, at a different time $t^{*}$ the transition function of the cells also changes. Thus at time $t^{*}$,
$C_{t^{*}+1} = \tau^{(t^{*})}(C_{t^{*}}) \equiv (\mu_0^{(t^{*})}(c_0(t^{*})), \ldots, \mu_i^{(t^{*})}(c_i(t^{*})), \ldots, \mu_n^{(t^{*})}(c_n(t^{*})))$
Example 8.5: Signaling System of a Toll Plaza Consisting of a Finite Row of Toll Booths,
Modeled by a Hybrid CA
Let $y_t$ denote the signaling configurations (also see [23]) of the entire row of tollbooths at a
particular time $t$. Each tollbooth corresponds to a cell of the CA. Let the tollbooths be in one of the
three signals red (denoted by R), yellow (denoted by Y), or green (denoted by G), representing Stop,
Pay the toll, or Go states, respectively. At time $t$, the state of the signal of the $i$th tollbooth
corresponds to the state $c_i(t)$ of cell $A_i$.
The toll plaza, consisting of a row of 11 tollbooths, initially has the signal sequence
R Y R Y Y R Y Y R R Y
R G R Y G R Y Y R R G
The entire row of tollbooths is assumed to be composed of 3 blocks with 1, 6 , and 4 tollbooths as
follows:
R G R Y G R YY R R G
Let the configuration of the toll plaza be such that tollbooth 1 is reserved for ambulances and VIPs.
Tollbooths 2 to 6 are kept for light vehicles and tollbooths 7 to 10 are for heavy vehicles. The signal of
tollbooth 1 is usually red and it changes to green whenever an ambulance or VIP’s vehicle passes
through it. At that time, say jth time, all the other tollbooths change to the red signal to restrict general
vehicular movement temporarily and ensure smooth movement of ambulances and VIPs. Again from the
next time step (j + 1), the initial configuration is restored.
Let a signaling system of the toll plaza be given as follows:
From time $t = 1$, Block I with 1 tollbooth follows identity transition $\mu_e$;
Block III with 4 tollbooths follows 1-place right shift transition $\mu_{R1}$.
At time $t = j$, when an ambulance or VIP vehicle arrives, the CA follows constant transition such
that Block I has green signal, i.e., cell $A_1$ has state G;
Block II and III have red signals, i.e., cells $A_2, \ldots, A_{10}$ have state R.
TABLE 8.7
Toll Plaza Signaling
↓ A1 A2 A3 A4 A5 A6 A7 A 8 A 9 A 10 A 11
REFERENCES
1. Meiss J. D.: Discrete Dynamical Systems, Dept. of Applied Maths, University of Colorado (Nov.
2008).
2. Tabrizian P.: Shifts as Dynamical Systems, Expository paper, University of California, Berkeley
(Dec. 2010).
3. Knutson J. D.: A survey of the use of cellular automata and cellular automata-like models for
simulating a population of biological cells (2011), Graduate Theses and Dissertations, Iowa State
University, Paper 10133.
4. Banks: Cellular Automata. AI Memo 198, MIT Artificial Intelligence Laboratory, Cambridge, MA
(1970).
5. Neumann J. von: Theory of Self-Reproducing Automata, University of Illinois Press, Illinois (1966).
Edited and completed by A. W. Burks.
6. Ulam S. M.: “On some mathematical problems connected with patterns of growth of figures”; Proc.
Symp. Appl Math., Amer. Math. Soc. 14, 215–224 (1962).
7. Wolfram Stephen: Theory and Applications of Cellular Automata, World Scientific, Singapore
(1986).
8. Wolfram S.: A New Kind of Science, Wolfram Media (2002).
9. Burks A.: Essays on Cellular Automata, Univ. of Illinois Press (1970).
10. Codd: Cellular Automata, Academic Press, New York (1968).
11. Ilachinski Andrew: Cellular Automata, a Discrete Universe (2001).
12. Schiff Joel L.: Cellular Automata – A Discrete View of the World, John Wiley and Sons (2008).
13. Ghosh S., Basu S.: “Evolution patterns of some boolean cellular automata having atmost one active
cell to model simple dynamical systems”, Bulletin of the Calcutta Mathematical Society 108 (6):
449–464 (2016).
14. Ghosh S., Basu S.: “Evolution patterns of finite celled synchronous cellular automata having atmost
one active cell”, Proceedings of The 10th International Conference MSAST 2016, pp. 154–164
(Dec. 2016).
15. Hopcroft, Ullman: Introduction to Automata Theory, Languages, and Computation, Addison-
Wesley (1979).
16. Kari J.: “Theory of cellular automata: A survey”. Theoretical Computer Science 334, 3–33 (2005).
17. Ghosh S., Basu S.: “Some algebraic properties of linear synchronous cellular automata”,
arXiv:1708.09751v1 [nlin CG] (30 Aug. 2017).
18. Liberti Leo: “Structure of the invertible CA transformations group”. Journal of Computer and
System Sciences 59: 521–536 (1999).
19. Ghosh S.: “Evolutions of some one-dimensional homogeneous cellular automata”, Complex
Systems. 30(1), 75–92 (2021). https://doi.org/10.25088/ComplexSystems.30.1.75.
20. Mizas C., Sirakoulis G. C., Mardiris V. et al: “Reconstruction of DNA sequences using genetic
algorithms and cellular automata: Towards mutation prediction?” BioSystems 92, 61–68 (2008).
21. Worldometer For COVID-19, url: https://www.worldometers.info (2020).
22. Basu S., Ghosh S.: “Fuzzy cellular automata model for discrete dynamical system representing
spread of MERS and COVID-19 virus”, Book Chapter in Internet of Medical Things for Smart
Healthcare, 267-304, Springer Nature, 2020, 10.1007/978-981-15-8097-0_11.
23. Ning Bin, Li Ke-Ping, Gao Zi-You: “Modeling fixed-block railway signaling system using cellular
automata model”, International Journal of Modern Physics C 16(11), 1793–1801 (2005).
https://doi.org/10.1142/S0129183105008308.
9
An Efficient Imputation Strategy Based on
Adaptive Filter for Large Missing Value
Datasets
CONTENTS
9.1 Introduction
9.1.1 Motivation
9.2 Literature Survey
9.3 Proposed Algorithm
9.4 Experiment Procedure
9.4.1 Data Collection
9.4.2 Data Preprocessing
9.4.3 Classification
9.4.4 Evaluation
9.5 Experiment Results and Discussion
9.6 Conclusions and Future Work
References
9.1 Introduction
In recent years, the applications of artificial-intelligence-based
algorithms have been increasing at a tremendous rate. Their fields of
application include medicine, the stock market, finance, education,
advertising, etc. One of the important requirements for the efficient
operation of any machine learning or deep learning algorithm is the
dataset. Even when a dataset is very large, it often suffers from missing
values. The presence of missing or incomplete values in the dataset is
unavoidable, especially in real time, as these data are often collected
from a sensor, a mobile phone, or an open environment [1]. In such
situations, environmental effects, device measurement error, and human
error are major contributors to the incompleteness of the data. Such
missing values can significantly degrade the quality of the dataset and
may result in an inefficient model for the machine learning algorithm.
Thus, data preprocessing or data preparation is done prior to applying
any machine learning algorithm. It is an important task, as the efficiency
and reliability of any algorithm depend largely on the data preprocessing.
The problem of missing values in the data preparation phase is generally
attacked either by removing them or by imputation [2]. Discarding the
missing values is not efficient, as it is prone to bias and degrades the
efficiency of the algorithm. Moreover, it is not suitable when the dataset
is very small, where even a small amount of data may be useful for
further processing, or when the number of missing values is large
compared to the data size. Therefore, the missing values are generally
imputed. The process of adding a suitable value in the place of a missing
value is called imputation [3,4].
9.1.1 Motivation
The main advantage of the imputation method is that the bias produced is
lower and the efficiency of the machine learning algorithm is improved [5].
Secondly, the imputation method can be applied to any data type and for
any percentage of missing values in the dataset [6]. Thirdly, the information
content present in the data can be retained [7].
Generally data imputation methods are classified as single and multiple
imputation methods. In the single imputation method, the complete dataset
is obtained at one time instant, whereas multiple imputation is an iterative
procedure where the imputation is done several times before the final result
is obtained [8].
The simplest method of imputing missing values is to replace them by
the mean of the available data. This method produces a biased dataset
with less variance and completely ignores the correlation between features.
Another well-known method is the Expectation–Maximization method. It is
a two-step approach: in the E-step, the expectation of the data and the
parameter estimates are obtained from the observed data, whereas in the
M-step the parameters are updated using a maximum likelihood approach.
The problem with this method is that it is not suitable for all data types
and for real data applications [9].
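The variance reduction caused by mean imputation can be seen on a toy column. The following is a minimal sketch, assuming that missing entries are marked with `None`; the data values and the helper name `mean_impute` are hypothetical.

```python
from statistics import mean, pvariance

def mean_impute(column):
    """Replace every missing entry (None) by the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

column = [2.0, None, 4.0, None, 9.0]          # hypothetical attribute values
observed = [v for v in column if v is not None]
imputed = mean_impute(column)

print(imputed)
print(pvariance(observed), pvariance(imputed))
```

The imputed column keeps the mean of the observed data but has a smaller variance, because every filled-in value sits exactly at the mean.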
The K-Nearest Neighbor (KNN) algorithm is another widely used
method of data imputation, where the nearest neighbors are used to compute
the missing values. It suffers from the problem of estimating the number
of neighbors, and the distance measure plays a major role in the performance
of the KNN algorithm [10]. Other well-known methods include the singular
value decomposition (SVD) method and the Least Squares (LS) method [3]. All
of the above-mentioned single imputation methods suffer from poor
accuracy because the missing values are predicted in a single run.
The multiple imputation method predicts the missing values based on a
function that is suitable for the dataset. It is an iterative
procedure where several iterations are performed until the predicted
value is close to the actual value [11]. Some of the well-known
state-of-the-art algorithms include the Markov Chain Monte Carlo (MCMC)
algorithm and the family of stochastic gradient, least squares, and linear
regression algorithms, where multiple iterations are performed to obtain the
imputed data. Many machine learning algorithms and deep learning
algorithms fall under the category of data imputation [12]. As they capture
the interaction between data, they are more flexible and give more accurate
results. Moreover, they are more suitable when the dataset is large,
complex, and unsupervised, where the single imputation method cannot be
used efficiently.
Thus, this chapter deals with a multiple imputation strategy based on
adaptive filters. Even though several adaptive filters are available, the
Least Mean Square (LMS) algorithm is found to be the most widely used
because of its simplicity and faster convergence. In this chapter, a
novel adaptive LMS filter called the Data Imputation LMS (DI-LMS)
algorithm is proposed. The chapter is organized as follows. In Section 9.2,
the state-of-the-art methods available for data preprocessing are discussed,
the adaptive filter used for prediction is introduced, and problems
with the current approach are analyzed. In Section 9.3, the proposed
method of imputation using an adaptive filter, the data imputation LMS
adaptive algorithm, is described together with its mathematical
formulation and flow process. A detailed experiment setup is explained in
Section 9.4. In Section 9.5, the datasets are described and the results,
comparison, and discussion are presented. In Section 9.6, the conclusion
and future scope are discussed.
Later on, several adaptive filters were used to predict the data in the
context of a Wireless Sensor Network (WSN). Most of the literature
focused on reducing energy consumption by transferring fewer data
packets in situations where the received signal is damaged or the data
is lost. It can be concluded that the use of an adaptive filter results in better
prediction accuracy when compared to other stochastic processes
[15,16,17]. For the prediction of missing truck travel time data, the
Recursive Least Square (RLS) algorithm was successfully employed [12]. The
experimental results indicated that the adaptive filters were very successful
in missing value imputation of truck travel time. Other papers successfully
investigated the prediction of the stock market [18,19,20]. Later, a
correlation-based factor was introduced into the LMS prediction algorithm to
remove the large error due to weather fluctuations [21].
Thus, from the open literature, it is evident that an LMS-based prediction
algorithm is suitable for prediction of data. It is also found that the
conventional LMS algorithm is not suitable as it stands and that there is a
need to incorporate a correlation-based factor in the update recursion.
Moreover, very few papers are available on data imputation using an
adaptive filter.
Therefore, the main objectives of the chapter are as follows:
1. To propose an adaptive filter called a data imputation LMS algorithm
that can efficiently impute missing values.
2. To propose a new parameter that is based on a correlation factor
between the reference and available value to update the coefficients of
the adaptive filter.
3. To check the validity of the proposed technique for different types of
simulated missing values of varying degrees as well as for a real-time
dataset with missing values. Finally, the proposed technique is
validated by analyzing the performance with different classifiers.
difference between the desired and estimated output given by $e(n) = d(n) - y(n)$.
The update equation for LMS [13] is
$w(n + 1) = w(n) + \mu e(n) x(n)$ (9.1)
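Update (9.1) can be illustrated with a conventional LMS one-step-ahead predictor. This is a minimal sketch of the standard LMS recursion, not the proposed DI-LMS algorithm: the filter order, the step size `mu`, the synthetic test signal, and the function name `lms_predict` are assumptions chosen for illustration.

```python
import math

def lms_predict(signal, order=2, mu=0.05):
    """One-step-ahead LMS prediction of a sequence.

    x(n) holds the `order` most recent samples, y(n) = w(n) . x(n),
    e(n) = d(n) - y(n), and w(n + 1) = w(n) + mu * e(n) * x(n).
    """
    w = [0.0] * order
    errors = []
    for n in range(order, len(signal)):
        x = signal[n - order:n][::-1]          # most recent sample first
        y = sum(wi * xi for wi, xi in zip(w, x))
        e = signal[n] - y                      # d(n) is the true next sample
        w = [wi + mu * e * xi for wi, xi in zip(w, x)]
        errors.append(e)
    return w, errors

# Hypothetical test signal: a slowly varying sinusoid with values in [0.1, 0.9].
signal = [0.5 + 0.4 * math.sin(0.1 * n) for n in range(2000)]
w, errors = lms_predict(signal)
print(abs(errors[0]), abs(errors[-1]))
```

The first prediction error is large because the weights start at zero; after adaptation the error stays small, which is the behavior an imputation scheme relies on when it replaces a missing sample by the filter output.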
The error and the weight update recursion are given by the following two
equations:
Thus, it can be seen that depending on the correlation vector C (n), the
influence of uncorrelated data on the update weight prediction is reduced.
The algorithm in the form of flow diagram is given in Figure 9.3.
FIGURE 9.3 Proposed data imputation LMS algorithm.
In the data preprocessing stage, the datasets are first normalized to the
range [0, 1] to remove the difference between the maximum and minimum
values of the attributes. In this step, data imputation is performed. For a
fair comparison, the general state-of-the-art methods of imputation, namely
zero imputation, mean imputation, simple tree, KNN imputation, and Support
Vector Machines (SVM), were compared with the proposed method [1]. The
dataset is then divided into two sets, an incomplete dataset and a complete
dataset. The initial weight vector is calculated based on the correlation
between the available inputs and the present input. The missing value, once
calculated, is directly imputed using the proposed method. The proposed
algorithm is depicted as pseudo code, as shown in Table 9.1.
TABLE 9.1
Dataset Information
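The [0, 1] normalization used in the preprocessing stage can be sketched as a min-max scaling of each attribute. This is an illustrative assumption about how the step might be implemented: missing entries are marked with `None` and are left untouched so that they can be imputed afterwards, and the function name is hypothetical.

```python
def min_max_normalize(column):
    """Scale the observed values of one attribute to [0, 1]; None marks a missing value."""
    observed = [v for v in column if v is not None]
    lo, hi = min(observed), max(observed)
    span = hi - lo
    # A constant attribute (span == 0) is mapped to 0.0 to avoid division by zero.
    return [None if v is None else ((v - lo) / span if span else 0.0)
            for v in column]

print(min_max_normalize([2.0, None, 4.0, 6.0]))
```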
9.4.3 Classification
This step involves the classification and evaluation with the imputed
datasets. Four different classifiers, namely KNN, SVM, Multilayer
Perceptron (MLP), and Random Forests (RF) are used for the classification
purpose.
9.4.4 Evaluation
Accuracy and root mean square error (RMSE) are used for evaluation. The
accuracy is the ratio of the number of correctly classified instances to the
total number of instances, and the RMSE is the square root of the mean of
the squared error between predicted and actual values.
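The two evaluation measures can be written down directly. The sketch below uses hypothetical labels and values; the function names `accuracy` and `rmse` are illustrative.

```python
from math import sqrt

def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def rmse(actual, predicted):
    """Square root of the mean squared difference between predicted and actual values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

In the experiments, accuracy is reported as a percentage, i.e., this ratio multiplied by 100.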
TABLE 9.2
The Dataset with the MAR Missing Mechanism
Heart Disease
Method Degree of Missing (%)
10 20 30 40 50
Zero imputation 85.6 85.53 85.23 85.11 85.10
Mean imputation 85.64 85.2 85.31 83.01 84.87
Simple tree 86.2 86.1 86.00 86.23 86.56
KNN imputation 86.55 86.6 86.45 86.24 86.47
SVM 86.32 86.4 86.35 86.7 86.21
Proposed DI-LMS 86.7 86.8 86.40 86.4 86.44
Banknote Authentication
Method Degree of Missing (%)
10 20 30 40 50
Zero imputation 99.78 99.8 99.7 99.80 99.75
Mean imputation 99.65 99.70 99.68 99.57 99.60
Simple tree 99.75 99.70 99.64 99.58 99.49
Heart Disease
TABLE 9.3
The Dataset with the MCAR Missing Mechanism
Heart Disease
Method Degree of Missing (%)
10 20 30 40 50
Zero imputation 85.40 85.3 85.10 85.01 85.04
Mean imputation 85.52 85.21 85.22 83.20 84.77
Simple tree 86.08 86.17 85.89 86.07 86.14
KNN imputation 86.09 86.12 86.21 86.14 86.24
SVM 86.65 86.4 86.35 86.47 86.31
Proposed DI-LMS 86.45 86.63 86.40 86.1 86.50
Banknote Authentication
Method Degree of Missing (%)
10 20 30 40 50
Zero imputation 99.64 99.65 99.64 99.70 99.31
Mean imputation 99.35 99.42 99.47 99.27 99.50
Simple tree 99.65 99.63 99.51 99.18 99.62
KNN imputation 99.77 99.55 99.51 99.63 99.45
SVM 99.82 99.78 99.79 99.78 99.80
Heart Disease
TABLE 9.4
The Dataset with the MNAR Missing Mechanism
Heart Disease
Method Degree of Missing (%)
10 20 30 40 50
Zero imputation 84.6 84.5 84.6 84.10 84.02
Mean imputation 84.64 84.66 84.45 84.41 84.11
Simple tree 84.2 84.21 84.12 84.10 84.01
KNN imputation 84.55 84.45 84.44 84.41 84.31
SVM 84.32 84.44 84.51 84.51 84.60
Proposed DI-LMS 84.7 84.45 84.56 84.55 84.74
Banknote Authentication
Method Degree of Missing (%)
10 20 30 40 50
Zero imputation 97.68 97.41 97.53 97.41 97.12
Mean imputation 97.65 97.12 97.54 97.56 97.41
Simple tree 97.75 97.64 97.62 97.21 97.01
KNN imputation 97.87 97.85 97.74 97.72 97.62
SVM 97.89 97.78 97.84 97.81 97.80
Proposed DI-LMS 97.44 97.41 97.87 97.82 97.81
Occupancy Detection Dataset
Method Degree of Missing (%)
10 20 30 40 50
Zero imputation 88.25 88.02 88.25 88.12 88.03
Mean imputation 89.20 89.18 89.11 89.21 89.74
Simple tree 89.30 89.30 89.27 89.17 89.45
KNN imputation 89.51 89.41 89.21 89.11 89.10
SVM 90.02 90.11 91.00 91.01 91.00
Proposed DI-LMS 91.01 91.01 91.02 91.15 90.03
Telecom Customer-Churn
Method Degree of Missing (%)
10 20 30 40 50
Zero imputation 87.25 87.20 87.10 86.89 87.90
Mean imputation 87.21 87.26 87.28 87.40 87.20
Heart Disease
TABLE 9.5
Accuracy Evaluation of Proposed Method of Imputation for Horse Colic
Dataset
Method Classifiers
TABLE 9.6
RMSE Analysis for Different Imputation Strategies for Heart Disease
Dataset
10 20 30 40 50
Thus, it is evident from Table 9.6 that zero imputation has the highest
RMSE and the proposed imputation has the smallest RMSE (except in the
40% case) when compared to the other imputation methods. It should also
be noted that as the degree of missing values increases, the RMSE
increases for all imputation methods because of the growing uncertainty
that missing values introduce into the dataset. A further advantage
claimed for the proposed method is that its complexity is lower than that
of conventional LMS, as only highly correlated data are used for
prediction.
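The RMSE comparison above can be illustrated with a minimal sketch. The values below are hypothetical and are not taken from the chapter's datasets; the function simply compares held-out true entries with their imputed estimates.

```python
import math

def rmse(true_values, imputed_values):
    """Root mean squared error between true entries and their imputed estimates."""
    if len(true_values) != len(imputed_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical held-out entries (illustrative only).
truth = [1.0, 2.0, 3.0, 4.0]
zero_imputed = [0.0, 0.0, 0.0, 0.0]   # zero imputation
mean_imputed = [2.5, 2.5, 2.5, 2.5]   # mean imputation (mean of truth)

rmse_zero = rmse(truth, zero_imputed)
rmse_mean = rmse(truth, mean_imputed)
```

On this toy data, zero imputation gives the larger error, which matches the ordering reported for zero imputation in the text.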
REFERENCES
1. Cheng, Ching-Hsue, Jing-Rong Chang, and Hao-Hsuan Huang. “A
novel weighted distance threshold method for handling medical
missing values”. Computers in Biology and Medicine 122 (2020):
103824.
2. Bertsimas, Dimitris, Colin Pawlowski, and Ying Daisy Zhuo. “From
predictive methods to missing data imputation: an optimization
approach”. The Journal of Machine Learning Research 18, no. 1
(2017): 7133–7171.
3. Lin, Wei-Chao, and Chih-Fong Tsai. “Missing value imputation: a
review and analysis of the literature (2006–2017)”. Artificial
Intelligence Review 53, no. 2 (2020): 1487–1509.
4. Wu, Pan, Lunhui Xu, and Zilin Huang. “Imputation methods used in
missing traffic data: a literature review”. In International Symposium
on Intelligence Computation and Applications, pp. 662–677.
Singapore: Springer, 2019.
5. Rado, Omesaad, Muna Al Fanah, and Ebtesam Taktek. “Performance
analysis of missing values imputation methods using machine learning
techniques”. In Intelligent Computing-Proceedings of the Computing
Conference, pp. 738–750. Springer, Cham, 2019.
6. Baraldi, Amanda N., and Craig K. Enders. “An introduction to modern
missing data analyses”. Journal of School Psychology 48, no. 1 (2010):
5–37.
7. Chiu, Chia-Chun, Shih-Yao Chan, Chung-Ching Wang, and Wei-
Sheng Wu. “Missing value imputation for microarray data: a
comprehensive comparison study and a web tool”. BMC Systems
Biology 7, no. Suppl 6 (2013): S12.
8. Farhangfar, Alireza, Lukasz A. Kurgan, and Witold Pedrycz. “A novel
framework for imputation of missing values in databases”. IEEE
Transactions on Systems, Man, and Cybernetics-Part A: Systems and
Humans 37, no. 5 (2007): 692–709.
9. Molenberghs, Geert, and Geert Verbeke. “Multiple imputation and the
expectation-maximization algorithm”. Models for Discrete
Longitudinal Data (2005): 511–529.
10. Zhang, Shichao. “Nearest neighbor selection for iteratively kNN
imputation”. Journal of Systems and Software 85, no. 11 (2012): 2541–
2552.
11. Rubin, Donald B. Multiple Imputation for Nonresponse in Surveys.
Vol. 81. John Wiley & Sons, 2004.
12. Karimpour, Abolfazl, Amin Ariannezhad, and Yao-Jan Wu. “Hybrid
data-driven approach for truck travel time imputation”. IET Intelligent
Transport Systems 13, no. 10 (2019): 1518–1524.
13. Haykin, Simon S. Adaptive-Filter Theory. Pearson Education India,
India, 2008.
14. Goodwin, Graham C., and Kwai Sang Sin. Adaptive Filtering
Prediction and Control. Courier Corporation, Mineola, New York,
2014.
15. Ganjewar, Pramod, Selvaraj Barani, and Sanjeev J. Wagh. “A
hierarchical fractional LMS prediction method for data reduction in a
wireless sensor network”. Ad Hoc Networks 87 (2019): 113–127.
16. Stojkoska, Biljana, Dimitar Solev, and Danco Davcev. “Data
prediction in WSN using variable step size LMS algorithm”. In
Proceedings of the 5th International Conference on Sensor
Technologies and Applications, 2011.
17. Dias, Gabriel Martins, Boris Bellalta, and Simon Oechsner. “A survey
about prediction-based data reduction in wireless sensor networks”.
ACM Computing Surveys (CSUR) 49, no. 3 (2016): 1–35.
18. Wesen, J. E., V. Vermehren, and H. M. de Oliveira. “Adaptive filter
design for stock market prediction using a correlation-based criterion”.
arXiv preprint arXiv:1501.07504 (2015).
19. Huang, Shian-Chang, Chei-Chang Chiou, Jui-Te Chiang, and Cheng-
Feng Wu. “A novel intelligent option price forecasting and trading
system by multiple kernel adaptive filters”. Journal of Computational
and Applied Mathematics 369 (2020): 112560.
20. Garcia-Vega, Sergio, Xiao-Jun Zeng, and John Keane. “Stock returns
prediction using kernel adaptive filtering within a stock market
interdependence approach”. Expert Systems with Applications 160
(2020): 113668.
21. Ma, Dongchao, Chenlei Zhang, and Li Ma. “A C-LMS prediction
algorithm for rechargeable sensor networks”. IEEE Access 8 (2020):
69997–70004.
22. Dua, D. and Graff, C. UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
School of Information and Computer Science, 2019.
23. Alcalá-Fdez, Jesús, Alberto Fernández, Julián Luengo, Joaquín Derrac,
Salvador García, Luciano Sánchez, and Francisco Herrera. “Keel data-
mining software tool: dataset repository, integration of algorithms and
experimental analysis framework”. Journal of Multiple-Valued Logic
& Soft Computing 17 (2011): 255–287.
10
An Analysis of Derivative-Based Optimizers on Deep Neural
Network Models
CONTENTS
10.1 Introduction
10.2 Methodology
10.2.1 SGD
10.2.2 SGD with Momentum
10.2.3 RMSprop
10.2.4 Adagrad
10.2.5 Adadelta
10.2.6 Adam
10.2.7 AdaMax
10.2.8 NADAM
10.3 Result and Analysis
10.4 Conclusion
References
10.1 Introduction
Deep learning, a subset of machine learning, enables computer systems to solve problems without being
explicitly programmed. With the exponential growth in image data, convolutional neural networks have become one
of the most commonly used techniques, and most research using them has shown remarkable results in the computer
vision domain. Training a deep neural network on high-dimensional, non-linear data is a challenging
optimization problem. Most applications are designed with the goal of building a model in mind. Optimization
techniques play a part in improving many deep learning architectures such as VggNet [1], ResNet [2], DenseNet [3],
EfficientNet [4], GoogleNet [5], and many more of increasing complexity and depth. The convolutional neural
network is one class of neural networks; the role of optimization is to search for the global optimum while
training the model and to converge quickly using derivative-based optimization techniques such as the gradient
descent algorithm [6].
Traditional machine learning algorithms face many problems, such as computational bottlenecks, the need for
domain knowledge and expert understanding, and the curse of dimensionality, so most domains make use of
neural networks to design various applications. Training a deep neural network on high-dimensional data is very
challenging for nonconvex problems, and if samples are few it leads to overfitting.
Depending upon the problem, experts have to design a model that gives faster and better results by reducing the
error loss function with the help of hyperparameter tuning, which improves model prediction. To tune
the model, it is necessary to adjust the weights according to the error loss and to move in the correct direction
using optimization techniques, with the aim of finding the global minimum of the loss function. Selecting a suitable
learning rate is also challenging: if the learning rate is too small, convergence takes too long, and if it is too
large, the neural network model will not converge. Selecting a suitable optimization technique
improves model performance, whereas an improper technique can leave the network stuck in a local
minimum during training, so that the model does not improve. Even if a model reaches the global minimum, there is
no assurance that it will remain there. Hence, to deal with these challenges and to improve the
model, an analysis of the performance and a better understanding of the behavior of optimization techniques are
necessary. Many methods are available to deal with these challenges. This work therefore contributes an
experimental comparative analysis of the eight most commonly used derivative-based optimization
techniques using convolutional neural networks on four different network architectures, VGG [1], ResNet [2],
DenseNet [3], and EfficientNetB5 [4], using the COVID-19 dataset
(https://www.kaggle.com/bachrr/detecting-covid-19-in-x-ray-images-with-tensorflow/), and tries to determine how
fast, accurate, and stable each algorithm is in reaching a relevant, optimal minimum during the training of the
model. The performance of each technique is evaluated using the loss function, accuracy, and convergence speed.
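The effect of the learning rate described above can be seen in a minimal sketch. It assumes a one-dimensional objective J(w) = w², chosen only for illustration; the function name and the three learning rates are hypothetical and are not values used in this chapter.

```python
def gradient_descent(alpha, steps=50, w0=1.0):
    """Minimize J(w) = w**2 with a fixed learning rate and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w        # dJ/dw for J(w) = w**2
        w = w - alpha * grad  # plain gradient descent step
    return w

too_small = gradient_descent(alpha=0.001)  # moves very slowly away from w0
moderate = gradient_descent(alpha=0.1)     # approaches the minimum at w = 0
too_large = gradient_descent(alpha=1.1)    # overshoots on every step and diverges
```

With the small rate the weight has barely moved after 50 steps, with the moderate rate it is close to the minimum, and with the large rate its magnitude grows at every step.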
The proposed work is organized as follows: Section 10.2 presents the methodology of the proposed system with a
detailed explanation of each optimizer, and Section 10.3 gives a brief description of the results and analysis of
each technique. Table 10.1 lists the parameters used throughout this work.
TABLE 10.1
Notations/Parameters and Their Meaning, Which Are Used in the Methodology
α Learning rate
W Weight vector
∆W New weight vector
J(θ) Objective function
∇ θJ(θ) Gradient of objective function
(x(i):y(i)) Training examples input: corresponding labels
b Batch size
L Loss function
∊ Very small value added to escape the divide by zero/numerical stability constant
β1,β2 Hyperparameters that are to be tuned
gt = df(x,w)/dw Cost gradient with respect to current layer
10.2 Methodology
When building deep neural networks, many things must be defined before training the model. Building the model
starts with the input layer, followed by different dense layers and the final output layer. Once the model is
built, it is compiled with an optimizer, a loss function, and the metrics used to measure the
performance of the model. The role of the loss function (cost function) is to compute the error between the desired
value and the actual value generated by the model and to direct the optimizer to move in the correct direction, as
mentioned in Equation (10.1), whereas the optimizer helps the model reduce the error, improve its
performance, and give better results. There are several error loss functions [7,8] and optimizers [9] that help the
model improve performance in various situations. Optimization algorithms differ in convergence speed,
complexity, and number of parameters, and their performance depends on the problem to be solved.
Selecting the best method requires analyzing each parameter and its different values for the problem at hand,
which is time consuming. Selecting the correct optimizer for deep neural network models is critical to getting the
best results in a limited time.
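As an illustration of how a loss function measures the error between desired and predicted values, the sketch below computes binary cross-entropy. This is an assumption for illustration: the chapter does not state which loss function was used in the experiments, and the labels and predictions are hypothetical.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0]                                        # hypothetical labels
good_loss = binary_cross_entropy(labels, [0.9, 0.1])   # confident and correct
bad_loss = binary_cross_entropy(labels, [0.1, 0.9])    # confident and wrong
```

The confident, correct predictions give a small loss, while the confident, wrong predictions give a much larger one; this difference is the signal the optimizer uses to adjust the weights.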
10.2.1 SGD
SGD is simple and effective but requires careful tuning of hyperparameters. Here, learning is complicated because
the inputs to each layer depend on the parameters of all preceding layers [10]. In each iteration, the algorithm
randomly selects a limited number of samples and calculates the gradients for those samples only, rather than
for the whole dataset. Using the whole dataset for training helps reach the minimum
without noise, but this is impractical for a very large dataset. SGD selects random samples and updates the
parameters for each training example (x(i):y(i)), which increases the convergence speed and also saves
memory [11].
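3. Compute gradient estimate $g_t = \frac{1}{n} \nabla_W \sum_{i=1}^{n} L(f(x^{(i)}, w), y^{(i)})$
4. Update $w = w - \alpha_k g_t$
5. End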
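The SGD update can be sketched in a few lines of plain Python. This is an illustrative example under stated assumptions rather than the implementation used in the experiments: the model is a one-parameter linear fit, the data are synthetic, and the function name and default values are hypothetical.

```python
import random

def sgd_linear_fit(xs, ys, alpha=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x with mini-batch SGD on the squared-error loss."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # random sample selection in each pass
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # g: average gradient of (w*x - y)**2 over the sampled batch
            g = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w = w - alpha * g  # weight update w = w - alpha * g
    return w

xs = [0.1 * i for i in range(1, 21)]  # synthetic inputs
ys = [2.0 * x for x in xs]            # noiseless targets, true w = 2
w_hat = sgd_linear_fit(xs, ys)
```

Each update uses only a small batch, so the memory cost per step does not depend on the size of the dataset.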
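10.2.2 SGD with Momentum
3. Compute gradient estimate $g_t = \frac{1}{n} \nabla_W \sum_{i=1}^{n} L(f(x^{(i)}, w), y^{(i)})$
4. Compute velocity $v = \gamma v - \alpha_k g_t$
5. Update $w \leftarrow w + v$
6. End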
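The velocity and weight updates of steps 4 and 5 can be sketched as follows, using the standard momentum form in which the previous velocity is scaled by γ. The quadratic objective, the function name, and the default values are assumptions made for illustration.

```python
def sgd_momentum(grad_fn, w0, alpha=0.1, gamma=0.9, steps=300):
    """Gradient descent with momentum on a scalar parameter."""
    w, v = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        v = gamma * v - alpha * g  # velocity: decayed past velocity minus scaled gradient
        w = w + v                  # weight update
    return w

# Hypothetical objective J(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
w_min = sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

The velocity accumulates past gradients, so consecutive steps in the same direction reinforce one another.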
10.2.3 RMSprop
RMSprop [18,19] is identical to Adaprop, which helps to solve the limitations of the Adagrad optimizer. It is based
on the concept of adaptive learning rate optimization, where the learning rate changes over time. First, an
exponentially weighted average of the squared gradient is computed; this average then scales the learning rate
applied to the gradient. The suggested learning rate is 0.0001, and epsilon is set to 1e-7. The third step is the
update of the step size, which is set initially but needs to be tuned.
5. Calculate $\Delta\theta = -\frac{\alpha}{\sqrt{q_t + \epsilon}} \ast g_t$
6. Return $\theta = \theta + \Delta\theta$
7. End
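A minimal sketch of the RMSprop update is given below. It assumes the usual exponentially weighted average of squared gradients with a decay factor named rho here; rho, the objective, and the learning rate of 0.01 (larger than the 0.0001 suggested above, so that the toy problem converges in few steps) are illustrative assumptions.

```python
import math

def rmsprop(grad_fn, theta0, alpha=0.01, rho=0.9, eps=1e-7, steps=2000):
    """RMSprop on a scalar parameter."""
    theta, q = theta0, 0.0
    for _ in range(steps):
        g = grad_fn(theta)
        q = rho * q + (1.0 - rho) * g * g        # running average of squared gradients
        delta = -alpha / math.sqrt(q + eps) * g  # scaled step
        theta = theta + delta                    # parameter update
    return theta

# Hypothetical objective J(theta) = (theta - 3)**2.
theta_min = rmsprop(lambda th: 2.0 * (th - 3.0), theta0=0.0)
```

Because the step is normalized by the recent gradient magnitude, its size stays on the order of the learning rate, and with a constant learning rate the iterate settles into a small neighborhood of the minimum rather than converging exactly.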
10.2.4 Adagrad
Adagrad [20] is a derivative-based optimization technique that, like the other algorithms discussed here, relies
on calculating the gradient [21]. Adagrad works well on sparse datasets and learns individual features by adapting
the learning rate to each parameter. If a parameter has larger gradients, its learning rate is reduced, whereas if
a parameter has smaller gradients, its learning rate is increased and the model learns quickly. The learning rate
(step size) is inversely proportional to the square root of the sum of the squares of all past gradients of the
parameter; that is, the algorithm works on an aggregation of squared gradients. As this aggregated sum grows
during training, the step size shrinks until it becomes imperceptibly small, and the model stops learning
additional information. This problem is solved by Adadelta. Dean et al. [19] showed that Adagrad overcame the
limitations of SGD and was more robust than SGD.
4. Calculate $\Delta\theta = -\frac{\alpha}{\sqrt{q_t + \epsilon}} \ast g_t$
5. Return $\theta = \theta + \Delta\theta$
6. End
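The accumulation of squared gradients and the resulting shrinkage of the step size can be sketched as follows. The objective, the function name, and the default values are assumptions chosen for illustration only.

```python
import math

def adagrad(grad_fn, theta0, alpha=0.5, eps=1e-8, steps=500):
    """Adagrad on a scalar parameter; returns the parameter and the accumulated sum."""
    theta, q = theta0, 0.0
    for _ in range(steps):
        g = grad_fn(theta)
        q = q + g * g                                    # sum of all past squared gradients
        theta = theta - alpha / math.sqrt(q + eps) * g   # step scaled by 1 / sqrt(q)
    return theta, q

# Hypothetical objective J(theta) = (theta - 3)**2.
theta_min, q_final = adagrad(lambda th: 2.0 * (th - 3.0), theta0=0.0)
effective_lr = 0.5 / math.sqrt(q_final)  # effective step size at the end of training
```

The accumulated sum never decreases, so the effective learning rate at the end of training is smaller than the initial value of alpha, which is the behavior that Adadelta is designed to address.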
10.2.5 Adadelta
This algorithm [22] is similar to SGD and is an extension of the Adagrad algorithm. Here, the term delta refers to
the difference between the present weights and the updated weights, and the algorithm continues to learn the
weights even after many updates. This algorithm is computationally expensive.
[Algorithm listing not recoverable; the surviving update term is $g_t / \sqrt{E[g^2]_t + \epsilon}$.]
10.2.6 Adam
Adam, or Adaptive Moment Estimation [20,23], is one of the most commonly used optimizers. This algorithm gives
better results than other optimizers, but it has recently been found [23] that it may take search
directions that differ from the original direction. The Adam optimizer is derived from RMSProp and AdaGrad [23].
It applies a momentum factor to the current gradient, computed with the help of previous gradients.
This optimizer is the most widely adopted for training neural networks. It includes two steps:
1. A momentum factor is added to RMSprop to rescale the gradients; the exponential moving averages of the
gradient and of the squared gradient are computed.
2. The biases are corrected by calculating the first and second moment estimates. The decay rates (β1, β2) are
chosen close to 1.
d. Calculate the bias-corrected first moment estimate $\hat{p}_t = p_t / (1 - \beta_1^t)$
e. Calculate the bias-corrected second moment estimate $\hat{q}_t = q_t / (1 - \beta_2^t)$
f. Update parameter $\Delta W_t = \Delta W_{t-1} - \alpha \, \hat{p}_t / (\sqrt{\hat{q}_t} + \epsilon)$
3. Return ΔW
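The moment estimates and bias corrections can be sketched in plain Python as follows. The variable names p and q follow the notation of the listing; the objective, the function name, and the learning rate of 0.01 are assumptions for illustration and not the settings used in the experiments.

```python
import math

def adam(grad_fn, w0, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    """Adam on a scalar parameter."""
    w, p, q = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        p = beta1 * p + (1 - beta1) * g      # first moment: average of gradients
        q = beta2 * q + (1 - beta2) * g * g  # second moment: average of squared gradients
        p_hat = p / (1 - beta1 ** t)         # bias-corrected first moment
        q_hat = q / (1 - beta2 ** t)         # bias-corrected second moment
        w = w - alpha * p_hat / (math.sqrt(q_hat) + eps)
    return w

# Hypothetical objective J(w) = (w - 3)**2.
w_min = adam(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

The bias corrections matter mainly in the first iterations, when p and q are still close to their zero initial values.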
10.2.7 AdaMax
AdaMax [18] is a variant of Adam based on the infinity norm. Norms with a high value of p are mostly
numerically unstable, so the ℓ1 and ℓ2 norms are usually preferred; the ℓ∞ norm, however, behaves stably, so the
authors proposed AdaMax, which simplifies and stabilizes the algorithm. In this algorithm, the infinity-norm term
is computed with a max operation and needs no correction for bias toward zero.
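A minimal sketch of the AdaMax update is shown below, following the formulation in the original Adam paper [18], in which the second moment is replaced by a running maximum of gradient magnitudes. The objective, the function name, and the learning rate of 0.01 are illustrative assumptions.

```python
def adamax(grad_fn, theta0, alpha=0.01, beta1=0.9, beta2=0.999, steps=3000):
    """AdaMax on a scalar parameter."""
    theta, p, u = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        p = beta1 * p + (1 - beta1) * g  # first moment, as in Adam
        u = max(beta2 * u, abs(g))       # infinity-norm term via a max operation
        if u == 0.0:
            break                        # every gradient so far was zero; nothing to update
        theta = theta - (alpha / (1 - beta1 ** t)) * p / u
    return theta

# Hypothetical objective J(theta) = (theta - 3)**2.
theta_min = adamax(lambda th: 2.0 * (th - 3.0), theta0=0.0)
```

Only the first moment is bias-corrected; the max-based term u needs no correction.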
Algorithm 7 Adamax
Default setting: α = 0.0001, β1 = 0.9, β2 = 0.999, and ∊ = 10^-8
9. $\theta_t = \theta_{t-1} - \alpha_t \, \hat{p}_t / (\sqrt{\hat{q}_t} + \epsilon)$
TABLE 10.2
Performance Evaluation (for Epochs=10) of Optimizers Using Training Accuracy and Validation Accuracy under CO
Train Val Train Val Train Val Train Val Train Val Train Val Tr
Acc Acc Acc Acc Acc Acc Acc Acc Acc Acc Acc Acc A
VGG16 0.55 0.36 0.53 0.53 0.55 0.49 0.45 0.50 0.36 0.50 0.56 0.49 0.5
ResNet50 0.56 0.48 0.60 0.47 0.60 0.48 0.55 0.54 0.39 0.50 0.62 0.48 0.5
DenseNet201 0.51 0.49 0.78 0.95 0.87 0.95 0.66 0.80 0.54 0.42 0.84 0.93 0.7
Efficient NetB5 0.51 0.50 0.61 0.49 0.60 0.52 0.56 0.51 0.42 0.53 0.61 0.49 0.6
The Adam optimizer performed well in both training and validation accuracy on the DenseNet201
architecture, and RMSprop gave the second-best result for the same architecture, whereas the performance of the
SGD optimizer was steady across all four models.
Table 10.3 presents the model loss during training and validation for 10 epochs. There are
distinguishable differences between the performance of the optimizers. For training loss
and validation loss, RMSProp and Adam again performed consistently, as they did for accuracy. These
observations suggest that Adam and RMSProp are reliable optimizers.
TABLE 10.3
Performance Evaluation (for Epochs=10) of Optimizers Using Training Loss and Validation Loss under COVID-19
Train Val Train Val Train Val Train Val Train Val Train Val T
Loss Loss Loss Loss Loss Loss Loss Loss Loss Loss Loss Loss
VGG16 0.70 0.71 0.78 0.74 0.55 0.49 1.17 0.97 1.10 0.86 0.76 0.69 0
ResNet50 0.72 0.70 0.67 0.72 0.60 0.48 0.70 0.74 1.47 1.14 0.65 0.69 0
DenseNet201 0.79 0.76 0.45 0.23 0.87 0.95 0.62 0.53 0.85 0.85 0.33 0.23 0
Efficient NetB5 0.70 0.69 0.67 0.72 0.60 0.52 0.68 0.70 0.42 0.53 0.67 0.71 0
In this work, the step size and the learning rate of the models were kept constant. Figure 10.2 shows a combined
plot of training and validation accuracy and training and validation loss for all the models, compiled for
10 epochs with steps_per_epoch = 20, epochs = 10, and learning rate α = 0.0001.
FIGURE 10.2 The 16 plots represent training, validation accuracies and training, validation loss on the COVID-19 dataset by applying eight different
optimization algorithms: SGD, SGDM, RMSProp, Adagrad, Adadelta, Adam, Adamax, Nadam, and four different deep neural architectures: VGG16,
DenseNet201, ResNet50, and EfficientNetB5, respectively. Parameter setting is considered during the training model as steps_per_epoch = 20, epochs =
10, and learning rate α = 0.0001.
The performance of the model was evaluated after 100 epochs and the results show DenseNet201 architecture
performed well for SGD, SGDM, RMSProp, Adam, Adamax, and Nadam optimizers with the highest accuracy of
98% using the Adam optimizer, as shown in Table 10.4.
TABLE 10.4
Performance Evaluation (for Epochs=100) of Optimizers Using Training Accuracy and Validation Accuracy under C
Optimizer/Architectures SGD SGDM RMSProp AdaGrad AdaDelta Adam A
Train Val Train Val Train Val Train Val Train Val Train Val Tra
Acc Acc Acc Acc Acc Acc Acc Acc Acc Acc Acc Acc Ac
VGG16 0.77 0.76 0.77 0.76 0.88 0.90 0.52 0.50 0.46 0.50 0.88 0.88 0.82
ResNet50 0.61 0.43 0.61 0.43 0.80 0.85 0.71 0.88 0.56 0.47 0.99 0.95 0.92
DenseNet201 0.90 0.94 0.90 0.94 0.98 0.94 0.54 0.51 0.50 0.46 0.99 0.96 0.7
Efficient NetB5 0.62 0.50 0.62 0.50 0.61 0.49 0.49 0.46 0.52 0.50 0.78 0.87 0.60
Table 10.5 shows the training and validation loss observed for all the models. The minimum loss, 0.05, was
achieved using the Adam and RMSProp optimizers with the model compiled
for 100 epochs. Although the step size and the learning rate were kept constant in this work, in principle the
learning rate or step size should change with the number of iterations. As observed in Figure 10.3, for
Adam and RMSProp the gradient descent curve flattens as it approaches the minimum, and the closer a model gets to
the minimum, the closer it is to the optimum solution. The analysis indicates that the original learning rate may
not be sufficient for the solution to converge. Hence, it is suggested that the learning rate be dynamic
instead of static throughout training, either by changing it manually or by choosing
a scheduler policy.
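Two common scheduler policies, exponential decay and cosine annealing, can be sketched as simple functions of the epoch number. The function names and the decay rate of 0.96 are hypothetical; the initial learning rate of 0.0001 and the 100 epochs match the settings reported in this chapter.

```python
import math

def exponential_decay(alpha0, decay_rate, epoch):
    """Learning rate multiplied by decay_rate once per epoch."""
    return alpha0 * decay_rate ** epoch

def cosine_annealing(alpha_max, alpha_min, epoch, total_epochs):
    """Learning rate following half a cosine wave from alpha_max down to alpha_min."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return alpha_min + 0.5 * (alpha_max - alpha_min) * cos_term

# Schedules over 100 epochs, starting from a learning rate of 0.0001.
exp_schedule = [exponential_decay(1e-4, 0.96, e) for e in range(100)]
cos_schedule = [cosine_annealing(1e-4, 0.0, e, 100) for e in range(101)]
```

Both schedules start at the initial learning rate and decrease it monotonically; the cosine schedule reaches its minimum value exactly at the final epoch.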
TABLE 10.5
Performance Evaluation (for Epochs=100) of Optimizers Using Training Loss and Validation Loss under COVID-19
Train Val Train Val Train Val Train Val Train Val Train Val T
Loss Loss Loss Loss Loss Loss Loss Loss Loss Loss Loss Loss
VGG16 0.50 0.86 0.50 0.53 0.29 0.28 0.75 0.69 0.98 0.72 0.31 0.30 0
ResNet50 0.64 0.69 0.64 0.69 0.47 0.42 0.58 0.71 0.71 0.74 0.05 0.14 0
DenseNet201 0.23 0.18 0.23 0.18 0.05 0.16 0.72 0.67 0.88 0.80 0.05 0.10 0
Efficient NetB5 0.66 0.71 0.66 0.71 0.66 0.71 0.69 0.70 0.69 0.69 0.48 0.41 0
FIGURE 10.3 The 16 plots represent training, validation accuracies and training, and validation loss on COVID19 dataset by applying eight different
optimization algorithms: SGD, SGDM, RMSProp, Adadelta, Adam, Adamax, Adagrad, Nadam, and four different deep neural architectures: VGG16,
DenseNet201, ResNet50, and EfficientNetB5, respectively. Parameter setting is considered during training model as steps_per_epoch = 20, epochs =
100, and learning rate α = 0.0001.
10.4 Conclusion
In this work, we have attempted to determine how the choice of gradient-based optimizer changes the
performance of the models by applying different optimizers to the COVID-19 dataset. The selection of the optimizer
to train the model depends on various parameters such as the input data, the learning rate, and so on. As per our
observations, the Adam optimizer is the best choice across different types of neural network architectures, as it
helps models train fast. Adam takes optimal steps during initial training but still fails to match the convergence
of SGD. To train network models with a larger number of layers, such as the complex DenseNet201 model, adaptive
learning rate methods are a good choice. Future research will consider changing the learning rate
dynamically during training, either manually or by using schedulers such as an exponential scheduler or a cosine
annealing scheduler.
REFERENCES
1. Simonyan, Karen and Andrew Zisserman, “Very deep convolutional networks for large-scale image
recognition”, Conference paper at (ICLR 2015), pp. 1–14.
2. He, K., X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition”, in Proceedings of 2016
IEEE Conference on Computer Vision and Pattern Recognition (Las Vegas, NV, USA, 2016), pp. 770–778.
3. Huang, G., Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks”, in
Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, (Honolulu, HI, USA,
2017), pp. 2261–2269.
4. Tan, Mingxing and Quoc V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks”,
in Proceedings of the 36th International Conference on Machine Learning (Long Beach, California, PMLR
97, 2019).
5. Szegedy, C., W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov et al., “Going deeper with convolutions”,
in Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (Boston, MA, USA,
2015), pp. 1–9.
6. Ruder, S., “An overview of gradient descent optimization algorithms”, 2016. Retrieved from
http://arxiv.org/abs/1609.04747 (29 October 2020).
7. Czarnecki, Wojciech Marian, “On loss functions for deep neural networks in classification”,
arXiv:1702.05659v1 [cs.LG] (18 February 2017).
8. Nie, Feiping, Zhang Xuan Hu, and Xuelong Li, “An investigation for loss functions widely used in machine
learning”, Communications in Information and Systems, 18(1): 37–52 (2018).
9. Le, Quoc V., Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, and Andrew Y. Ng, “On
optimization methods for deep learning”, in Proceedings of the 28th International Conference on
Machine Learning (Bellevue, WA, USA, 2011).
10. Ioffe, Sergey and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing
internal covariate shift”, Retrieved arXiv:1502.03167 [cs.LG] (23rd March 2015).
11. Ruder, Sebastian, “An overview of gradient descent optimization algorithms”, arXiv:1609.04747v2 [cs.LG]
(15 June 2017).
12. Polyak, B. T. and A. B. Juditsky, “Acceleration of stochastic approximation by averaging”, SIAM Journal on
Control and Optimization, 30(4): 838–855 (1992).
13. Lan, Guanghui, “An optimal method for stochastic composite optimization”, Mathematical Programming,
133(1–2): 365–397 (2012).
14. Roux, Nicolas L., Mark Schmidt, and Francis Bach, “A stochastic gradient method with an exponential
convergence rate for finite training sets”, In Advances in Neural Information Processing Systems (NIPS)
(2012), pp. 2663–2671.
15. Johnson, Rie and Tong Zhang, “Accelerating stochastic gradient descent using predictive variance reduction”,
In Advances in Neural Information Processing Systems (2013), pp. 315–323.
16. Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien, “SAGA: A fast incremental gradient method with
support for non-strongly convex composite objectives”, In Advances in Neural Information Processing
Systems (2014), pp. 1646–1657.
17. Bottou, Léon, Frank E. Curtis, and Jorge Nocedal, “Optimization methods for large-scale machine learning”,
SIAM Review, 60(2): 223–311 (2018).
18. Kingma, Diederik P. and Jimmy Lei Ba, “Adam: A method for stochastic optimization”, arXiv:1412.6980v9
[cs.LG] (30 January 2017).
19. Dean, J., G. S. Corrado, R. Monga, A. Y. Chen et al., “Large scale distributed deep networks”. NIPS 2012:
Neural Information Processing Systems, 1–11.
http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf (2012).
20. Kingma, Diederik and Jimmy Ba, “Adam: A method for stochastic optimization”, Published as a conference
paper at the 3rd International Conference for Learning Representations (San Diego, pp. 1–13, 2015),
arXiv:1412.6980 [cs.LG].
21. Duchi, J., E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic
optimization”, Journal of Machine Learning Research, 12, 2121–2159 (2011). Retrieved from
http://jmlr.org/papers/v12/duchi11a.html.
22. Zeiler, M. D., “ADADELTA: An adaptive learning rate method”, Retrieved from
http://arxiv.org/abs/1212.5701 (22 December 2012).
23. Reddi, Sashank J., Satyen Kale, and Sanjiv Kumar, “On the convergence of Adam and beyond”. In
International Conference on Learning Representations (2018). arXiv abs/1904.09237.
24. Dozat, T., “Incorporating Nesterov Momentum into Adam”, ICLR Workshop 1, 2013–2016 (2016).
Section III
Applications of Data Science and
Data Analytics
11
Wheat Rust Disease Detection Using Deep Learning
CONTENTS
11.1 Introduction
11.2 Literature Review
11.3 Proposed Model
11.4 Experiment and Results
11.5 Conclusion
References
11.1 Introduction
Nowadays, technology gives humankind the ability to produce enough food for billions of
people all over the world. Even though the world produces huge amounts of crops to ensure food
security, many factors threaten it. Threats to crops can come from climate change, pollinators, and
plant diseases. Plant diseases are not only a threat to global food security; they also have
devastating consequences for smallholder families in Ethiopia, who are responsible for
supporting many people in one family. In Ethiopia, wheat is the second most used crop [1] for ensuring
food security and covers an estimated 17% of Ethiopia’s total agricultural land use. It is not only a
critical crop for smallholder farmers’ livelihoods but also a means of ensuring food security for
millions of people in Ethiopia.
The outbreak of wheat rust disease is threatening crop production gains in the
wheat-producing regions of Ethiopia, following a long period of El Niño-induced drought across
the country. Wheat rust was first spotted in the Southern Nations (SNNP) and Oromia regions.
The crop-damaging fungus has since spread to other regional states, especially in the northern
part of the country, notably the Amhara and Tigray regions. These diseases are caused by fungi and
are known to cause stunting in plants and pre-harvest losses of between 50% and, in severe cases,
100% [2]. The diseases have been detected in nearly 2,200 communities and
on close to 300,000 hectares of land across the country.
The application of machine learning, particularly deep learning, in agriculture is showing
promising results. Deep learning can be summarized as a sub-field of machine learning [3] that
studies statistical models called deep neural networks. Early work on deep learning, or rather
Cybernetics, as it was called at the time, was done between the 1940s and 1960s and describes
biologically inspired models such as the Perceptron, Adaline, and the Multi-layer Perceptron. Then,
a second wave called Connectionism came in the years 1960–1980 with the invention of
“backpropagation.” This algorithm persists today and is currently
the algorithm of choice for optimizing deep neural networks. The ambition to create a feasible
neural network for image detection and recognition with automatic feature extraction and
translation invariance gave rise to a new category of neural networks called the Convolutional
Neural Network (ConvNet). These networks have special types of processing layers in the extractor
module that learn to extract discriminative features from the given input image.
TABLE 11.1
Some Selected Work Relevant to This Study
Ref. | Technique | Accuracy | Gaps
Ramcharan et al. (2017) [4] | Applied fine-tuning using pre-trained deep learning models. | 99.35% | The study constrained to single leaf images with homogenous background
Mohanty et al. (2016) [5] | Applied different pre-trained networks to train it on laboratory images | 99.48% | Accuracy degrades when the model tested on images from real cultivation field
Ferentinos et al. (2018) [6] | Fine-tuned Resnet50 model | 87% | Images are segmented manually by expert technicians
Practically, deep learning [9,10] is a subset of machine learning with more power and flexibility
than traditional machine learning approaches. The biggest advantage of deep learning algorithms
is that they try to learn high-level features from data in an incremental manner. This removes the
need for manual feature extraction, which requires human intervention in traditional machine
learning. Deep learning emulates the human brain [11], and it works with unstructured data,
especially for applications like computer vision (image processing) and speech recognition.
FIGURE 11.2 Sample original image before augmentation and augmented images using original image.
There are two kinds of images: the first row contains four normal RGB
images (before segmentation), and the four images below them are the segmented
versions of the respective originals above. By segmenting all the images in
the training set before feeding them to the model, the model can easily classify them by identifying
the patterns produced by the segmentation. In this case, the healthy images are solid black without
any pattern, whereas the infected images form a specific pattern that is totally different from the
healthy images; by comparing images with and without patterns, the model can classify them into
healthy or infected crops.
After evaluating the model on the grayscale and color-segmented datasets,
the experiments with the top results are discussed below. The model is evaluated using different
parameters that can affect its efficiency: the number of epochs, the
train-test split ratio, the learning rate, and the dropout ratio. Different values of these parameters
are used for the evaluation.
Table 11.2 summarizes the top results of the model on the three dataset types
used for comparison: grayscale images, which have only one color channel; RGB images, which
can contain more than 16 million colors in one image; and the RGB segmented
images proposed in this study.
TABLE 11.2
Summary Table for Model
TABLE 11.3
Result Summary of the Model on Grayscale Images
No.    Epochs    Learning Rate    Test Ratio    Dropout    Training Time (minutes)    Accuracy    Error
As the learning rate decreases, the accuracy of the model also decreases, which suggests that
lowering the learning rate slows the model's learning. The model starts to degrade at the 300th
epoch, which shows that for grayscale images it has nothing more to extract and learn because there
is only one channel; with a single channel, features are lost from the data, and this degrades the
efficiency of the model after the 200th epoch. This result leads us to use RGB images, which have
three channels and can represent more than 16 million colors. The color information helps the model
extract and learn additional features and prevents the loss of information carried by color, which
matters in this study because wheat rusts are distinguished from healthy wheat by their color.
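The link between a smaller learning rate and slower learning can be illustrated with a toy example. The sketch below is not the chapter's CNN: it assumes plain gradient descent on the one-parameter function f(w) = w², with a fixed budget of 300 updates standing in for a fixed number of epochs. The function name `remaining_error` is hypothetical.

```python
def remaining_error(learning_rate, steps, start=1.0):
    """Run gradient descent on f(w) = w**2 and return |w| after `steps` updates."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w              # derivative of w**2
        w -= learning_rate * gradient   # one update, scaled by the learning rate
    return abs(w)                       # distance from the optimum at w = 0


# Same update budget, three learning rates (toy setting, not the wheat model).
for lr in (0.001, 0.0001, 0.00001):
    print(lr, remaining_error(lr, steps=300))
```

With the budget held fixed, the run with the smallest learning rate ends farthest from the optimum. Given more updates, a smaller rate could still catch up, so this only illustrates the fixed-budget comparison.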
11.4.6 Result Comparison of the Model on RGB Images Based on Learning Rate
From Table 11.4, we take the results in rows 4, 5, and 6, which share the same parameters except
for their learning rates: row 4 uses an LR of 0.001, row 5 an LR of 0.0001, and row 6 an LR of
0.00001. By comparing the three results, we can determine the effect of the learning rate on the
model. Table 11.4, row 4: the results in row 4 are obtained using 300 epochs with a test ratio of
25%, a dropout ratio of 50%, and an LR of 0.001, giving an accuracy of 99.27% with a training time
of 196.24 minutes.
TABLE 11.4
Result Summary of the Model on RGB Images
No.    Epochs    Learning Rate    Test Ratio    Dropout    Training Time (minutes)    Accuracy    Error
As the classification report in Figure 11.6 shows, the model achieves 100% training accuracy and
misclassifies a few images, as detailed in the confusion matrix. The confusion matrix in Figure
11.6 shows that the model works perfectly when tested on the images it was trained on. On the
validation data, however, it misclassifies three of the 411 images: two healthy wheat images are
classified as infected and one infected wheat image is classified as healthy.
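The reported validation accuracy can be checked directly from these counts. The following is a minimal sketch that uses only the figures quoted above (411 validation images, two healthy images classified as infected, one infected image classified as healthy); the function name `accuracy_from_errors` is hypothetical.

```python
def accuracy_from_errors(total, misclassified):
    """Return accuracy as a fraction, given the number of wrong predictions."""
    if total <= 0 or not 0 <= misclassified <= total:
        raise ValueError("counts must satisfy 0 <= misclassified <= total")
    return (total - misclassified) / total


# Validation figures quoted in the text for Table 11.4, row 4.
healthy_as_infected = 2    # healthy images classified as infected
infected_as_healthy = 1    # infected images classified as healthy
validation_total = 411

row4_accuracy = accuracy_from_errors(
    validation_total, healthy_as_infected + infected_as_healthy
)
print(f"{row4_accuracy:.2%}")
```

Since 408 of 411 images are classified correctly, the result rounds to the 99.27% reported for row 4.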
FIGURE 11.6 Classification report of validation and training data for Table 11.4, row 4 and confusion matrix of validation and
training data for Table 11.4, row 4.
FIGURE 11.7 Classification report of result in Table 11.4, row 5.
The classification report in Figure 11.7 shows that decreasing the learning rate from 0.001 (row
4) to 0.0001 slightly reduces the accuracy of the model, from 99.27% to 97.57%. The model still
achieves a training accuracy of 100%, which is not a point of discussion, since the model has
already seen the training data during fitting. What matters is achieving high accuracy on the
validation data. As the accuracy of the model decreases, the number of misclassified images
increases, so minimizing misclassifications requires increasing the accuracy of the model. Figure
11.8 shows that as the accuracy in row 5 decreased, the number of misclassified images increased
relative to the previous result.
In Figure 11.8, there are 10 misclassified images in total, up from only three in the previous
result (Table 11.4, row 4). In the results of row 5, four healthy wheat images are classified as
infected and six infected wheat images are classified as healthy. There are no misclassified
images in the training data, which means that the training accuracy is 100%. The results in row 6
are obtained using 300 epochs with a test ratio of 25%, a dropout ratio of 50%, and an LR of
0.00001, giving an accuracy of 96.84% with a training time of 165.61 minutes.
FIGURE 11.9 Classification report of result in Table 11.4, row 6.
Having discussed the three evaluations in detail, we can draw a conclusion about the effect of the
learning rate on the model. In row 4, the model is trained with an LR of 0.001 and achieves an
accuracy of 99.27%. In row 5, it is trained with a lower LR of 0.0001 and achieves a lower
accuracy of 97.57%. Finally, in row 6, decreasing the LR further to 0.00001 reduces the accuracy
again, to 96.84%. This shows that decreasing the LR while the other parameters are held fixed also
decreases the efficiency of the model. Therefore, an LR of 0.001 gives the most efficient model.
Bear in mind that all three models have the same parameters and differ only in their learning rate
value:
TABLE 11.5
Comparison Table for Results with Different Dropout Rates
Dropout helps to overcome overfitting by setting the outputs of some neurons in the previous layer
to zero, so that the model cannot rely too heavily on particular features, which would drive it to
overfit. Dropout drops some neurons from the preceding hidden layer, but that does not mean we
should drop all of them, since that would prevent the model from learning features from its
preceding layers. In this model, a 30% (0.3) dropout rate performed best, but the best rate depends
on the amount of data available, so a dropout rate that performs best here will not necessarily
perform the same on another dataset or model.
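The mechanism can be sketched in a few lines. The example below assumes the common "inverted dropout" variant, in which each activation is zeroed with probability equal to the dropout rate and the surviving activations are scaled by 1 / (1 - rate). Whether the chapter's model uses exactly this variant is an assumption, and the function name `dropout` is hypothetical.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate`
    and scale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]


rng = random.Random(0)             # fixed seed so the sketch is repeatable
layer_output = [1.0] * 10_000      # stand-in for one layer's activations
dropped = dropout(layer_output, rate=0.3, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
print(f"fraction of activations dropped: {zero_fraction:.3f}")
```

Roughly 30% of the activations are zeroed on each training pass. At inference time dropout is switched off, and the scaling applied during training keeps the expected activation level unchanged.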
In Figure 11.11, we can see that the model with a dropout rate of 0.3 (30%) has an f1-score of
almost 1.00, which means it performs better than the model with a dropout rate of 0.5 (50%); both
models achieve an accuracy of 100% when tested on the training data.
As Figure 11.12 shows, the dropout ratio of 0.3 (30%) worked well, with only two misclassified
images, whereas the confusion matrix for a dropout rate of 0.5 (50%) shows three misclassified
images.
FIGURE 11.12 Confusion matrix comparison based on dropout rate.
There is no common standard for the train/test split ratio because it depends on the type of data
used, the ability of the model to filter information quickly, and the amount of data available for
training. For this model, the ratio that performs best is 75% training and 25% test data. After
training on grayscale images, the model could not achieve a satisfactory result, so it was then
trained on RGB images, from which it can extract more features. As discussed in the section above,
after trying different settings, the model achieved its highest accuracy on RGB images, 99.51%.
Segmentation keeps only the important features of the image, which gives the model more refined
information. Therefore, the wheat dataset is color segmented, which increases the accuracy to
99.76%.
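A 75%/25% split of the kind described above can be sketched as follows. The dataset size of 1,644 is hypothetical: it is chosen only because 25% of it gives the 411 validation images mentioned earlier, and the chapter does not state the total. The function name `train_test_split` is also just an illustrative helper written here, not a library call.

```python
import random


def train_test_split(items, test_ratio, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)    # shuffle before splitting
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


image_ids = list(range(1644))    # hypothetical image identifiers
train_ids, test_ids = train_test_split(image_ids, test_ratio=0.25)
print(len(train_ids), len(test_ids))
```

Shuffling before the split keeps each subset representative when the images are stored grouped by class.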
11.5 Conclusion
Three dataset types are used in the study to conduct the experiments for the model: a grayscale
image dataset, which contains only one-channel images; an RGB dataset, which contains
three-channel images; and an RGB color-segmented dataset, which is RGB and segmented with the
disease color code. On the grayscale dataset, the model achieved an accuracy of 86.62% with 200
epochs, a learning rate of 0.001, and a 50% dropout rate. This result improved when the model was
trained on the RGB image dataset, where the accuracy climbed to 99.51%. Finally, after the images
were segmented using the color of the infected regions, the model extracted better information
than before and achieved an accuracy of 99.76% with 300 training epochs, a learning rate of 0.001,
and a dropout rate of 30%. In the future, collecting a larger and more varied dataset will help to
take this study further in support of the agricultural system.
REFERENCES
1. Tefera, Nigussie. “Technical efficiency of fertilizer-use and farm management practices:
evidence from four regions in Ethiopia”. (2020).
https://www.researchgate.net/publication/344200750_Technical_Efficiency_of_Fertilizer-use_and_Farm_Management_Practices
2. Sticklen, Mariam. “Transgenic, cisgenic, intragenic and subgenic crops”. Advances in Crop
Science and Technology 3, no. 2 (2015): e123.
3. Miceli, P. A., W. Dale Blair, and M. M. Brown. “Isolating random and bias covariances in
tracks”. In 2018 21st International Conference on Information Fusion (FUSION), pp. 2437–
2444. IEEE, Cambridge, UK, 2018.
4. Ramcharan, Amanda, Kelsee Baranowski, Peter McCloskey, Babuali Ahmed, James Legg,
and David P. Hughes. “Deep learning for image-based cassava disease detection”. Frontiers
in Plant Science 8 (2017): 1852.
5. Mohanty, Sharada Prasanna, David Hughes, and Marcel Salathe. “Using deep learning for
image-based plant disease detection”. arXiv (2016): arXiv- 1604.
6. Ferentinos, Konstantinos P. “Deep learning models for plant disease detection and
diagnosis”. Computers and Electronics in Agriculture 145 (2018): 311–318.
7. Too, Edna Chebet, Li Yujian, Sam Njuki, and Liu Yingchun. “A comparative study of fine-
tuning deep learning models for plant disease identification”. Computers and Electronics in
Agriculture 161 (2019): 272–279.
8. Picon, Artzai, Aitor Alvarez-Gila, Maximiliam Seitz, Amaia Ortiz-Barredo, Jone Echazarra,
and Alexander Johannes. “Deep convolutional neural networks for mobile capture device-
based crop disease classification in the wild”. Computers and Electronics in Agriculture 161
(2019): 280–290.
9. Mahapatra, Sambit. “Why deep learning over traditional machine learning”. Towards Data
Science (2018).
10. Singh, Asheesh Kumar, Baskar Ganapathysubramanian, Soumik Sarkar, and Arti Singh.
“Deep learning for plant stress phenotyping: trends and future perspectives”. Trends in Plant
Science 23, no. 10 (2018): 883–898.
11. Hurwitz, J., and D. Kirsch. Machine Learning for Dummies. J. Wiley, June 2018.
12
A Novel Data Analytics and Machine
Learning Model Towards Prediction and
Classification of Chronic Obstructive
Pulmonary Disease
Sridevi U.K., Sophia S., Boselin Prabhu S.R., Zubair Baig, and P.
Thamaraiselvi
CONTENTS
12.1 Introduction
12.2 Literature Review
12.3 Research Methodology
12.3.1 Logistical Regression Model for Disease Classification
12.3.2 Random Forest (RF) for Disease Classification
12.3.3 SVM for Disease Classification
12.3.4 Decision Tree Analyses for Disease Classification
12.3.5 KNN Algorithm for Disease Classification
12.4 Experiment Results
12.5 Concluding Remarks and Future Scope
12.6 Declarations
References
12.1 Introduction
COPD is a chronic inflammatory lung disease that obstructs the flow of
air from the lungs. It is characterized by recurrent respiratory
problems and breathing difficulty, due to airway abnormalities typically
caused by prolonged exposure to toxic pollutants or chemicals. The
common symptoms of COPD include difficulty in breathing, prolonged
cough, mucus formation, and wheezing. COPD is normally caused by
continuous contact with poisonous gases, most often cigarette smoke.
Individuals with COPD are at greater risk of developing heart disease,
lung cancer, etc. Emphysema and chronic bronchitis are the two most
prevalent conditions associated with COPD. These two conditions
typically occur together and may vary in severity among patients with
COPD. Chronic bronchitis is inflammation of the lining of the bronchial
tubes, which carry air to and from the lungs. Its major symptoms include
prolonged cough and the formation of sputum. Emphysema is the condition
in which the alveoli at the end of the bronchioles of the lungs are
damaged owing to exposure to cigarette smoke, other poisonous gases, and
particulate matter.
Even though COPD is a progressive disorder that gets worse over time,
it is treatable. With appropriate management, many patients with COPD
achieve better symptom control and improved quality of life, and have a
reduced risk of other related disorders. In today’s world, people are
facing diseases because of the state of the climate and their changing
lifestyles, so early-stage prediction of disease has become an
essential task.
Yet it is often difficult for doctors to determine a disease precisely
from diagnosis alone, and one of the hardest objectives is to determine
the type of disease correctly. To overcome these difficulties, data
mining plays a vital part in predicting disease. The medical sciences
produce a growing quantity of medical data every year. Owing to this
growth of data in the medical and healthcare field, accurate prediction
from medical information that supports the early treatment of patients
is a challenging task. The growing accumulation of big data has raised
several new issues and considerations in interpretation and information
extraction. In an age where everything is tangible, the basic
statistical analysis of past decades is possibly not enough. In this
setting, the healthcare industry will benefit from the recognition of
similar patterns in different patients.
Data mining discovers hidden patterns in vast amounts of patient data
by using disease data. COPD has been identified as a major global
public health problem because of its associated impairments and
increased death rate. Despite the increasing usefulness and
effectiveness of supervised ML algorithms in disease prediction
modeling, there still seems to be room for improvement in the scope of
study. First, we identified a few research papers that used various
supervised learning methods for predicting diseases and analyzed them
in detail. This work therefore seeks to identify current findings
across different supervised ML techniques, the accuracy of their
output, and the severe diseases under study. Furthermore, the benefits
and drawbacks of various supervised machine learning algorithms are
summarized.
The research findings will allow analysts to identify recent trends and
reasons in disease forecasting using supervised machine learning
algorithms and thereby refine their study goals. Recent findings have
shown that many patients have acquired COPD without having smoked.
Additional factors such as air pollution and lifestyle also affect
patients.
The chapter has been structured as follows. A detailed introduction on
chronic obstructive pulmonary disease and how data mining impacts COPD
is discussed in Section 12.1. A well-organized and state-of-the-art literature
review encompassing machine learning models, Bayesian networks,
classification methodologies, binary logistic regression algorithms, k-
nearest neighbor algorithm, convolutional neural network method, deep
learning methods, classification and regression tree approach, etc., and how
these methods impact COPD disease prediction are vividly discussed in
Section 12.2. The research methodology featuring training/testing data,
feature extraction, and ML-based classification algorithm applicable
towards COPD patients are elaborated in Section 12.3. Moreover, various
disease classification models including logistic regression method, random
forest method, etc., are also explained in this section. The experimental
results and the derived outcomes from this present investigation are
enumerated in Section 12.4. The concluding remarks, along with future
scope, are discussed in Section 12.5.
(12.2)
12.3.2 Random Forest (RF) for Disease Classification
The random forest method is an ensemble model that makes predictions by
combining the decisions of a sequence of base models. It reduces
variance and thereby helps to prevent overfitting. The class of base
models can be expressed mathematically as given in Equation (12.3),
where the final model g is the combination of the base classifier
models fi. Here, every base classifier is a simple decision tree. It is
therefore a valuable approach that combines multiple learners to obtain
the best prediction model for the treatment of COPD patients.
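The idea of combining base classifiers can be sketched with a majority vote. This is an illustration only: the three rules below are hypothetical one-split "trees", the feature names and thresholds are invented for the example and have no clinical meaning, and a real random forest would also train each tree on a bootstrap sample with random feature subsets.

```python
from collections import Counter


def majority_vote(base_classifiers, record):
    """Combine base classifiers by returning the most common predicted label."""
    votes = [classify(record) for classify in base_classifiers]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical one-split base classifiers (illustrative thresholds only).
def rule_smoking(record):
    return "COPD" if record["pack_years"] > 20 else "non-COPD"


def rule_lung_function(record):
    return "COPD" if record["fev1_percent"] < 70 else "non-COPD"


def rule_age(record):
    return "COPD" if record["age"] > 60 else "non-COPD"


forest = [rule_smoking, rule_lung_function, rule_age]
patient = {"pack_years": 30, "fev1_percent": 65, "age": 50}
print(majority_vote(forest, patient))
```

Using an odd number of base classifiers avoids ties in a two-class problem.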
Step 1: From the dataset, separate the records with COPD disease from
the records with non-COPD disease.
Step 2: For the new unlabeled record, calculate the Euclidean
distance to each labeled record.
Step 3: Find the smallest distance within each of the two patient
classes.
Step 4: Compare the two smallest distances and assign the query record
to the class that has the smaller one.
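The steps above can be sketched as follows. The labeled records and their two numeric features are hypothetical, and the function name `classify_by_nearest` is invented for the example. As written, the steps amount to a nearest-neighbor rule with k = 1; a k-nearest-neighbor classifier with larger k would instead take a vote among the k closest records.

```python
import math


def classify_by_nearest(labeled_records, query):
    """Assign `query` to the class that contains its closest labeled record."""
    # Step 1: separate the records by class label.
    by_class = {}
    for features, label in labeled_records:
        by_class.setdefault(label, []).append(features)

    # Steps 2 and 3: smallest Euclidean distance from the query to each class.
    nearest = {
        label: min(math.dist(query, features) for features in members)
        for label, members in by_class.items()
    }

    # Step 4: pick the class with the smallest of those distances.
    return min(nearest, key=nearest.get)


# Hypothetical records: (feature vector, class label).
records = [
    ((1.0, 2.0), "COPD"),
    ((2.0, 1.0), "COPD"),
    ((8.0, 9.0), "non-COPD"),
    ((9.0, 8.0), "non-COPD"),
]
print(classify_by_nearest(records, (2.0, 2.0)))
```

In practice the features would usually be scaled first, since Euclidean distance is dominated by whichever feature has the largest numeric range.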
TABLE 12.1
Correlation between the Features
We divided the dataset into 80% for training and 20% for testing, so
that the model is evaluated on records it has not seen, as in an
unpredictable real-time situation. The logistic regression accuracy
score based on the 80% training data and 20% test data is 82%. The
model accuracy measured by the Jaccard similarity value is 0.832.
Figure 12.7 displays the confusion matrix for the actual and predicted
values. The results in Table 12.2 allow us to infer that for training
purposes it is not always necessary to use large quantities of records,
and that a minimal collection of usable data records can often be
sufficient.
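For reference, the Jaccard similarity of a binary classifier compares the set of records predicted positive with the set that is actually positive: it is the number of true positives divided by the number of records that are positive in either set. The sketch below uses made-up labels, not the chapter's data, and the function name `jaccard_index` is hypothetical.

```python
def jaccard_index(y_true, y_pred, positive=1):
    """Jaccard similarity for the positive class: TP / (TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denominator = tp + fp + fn
    return tp / denominator if denominator else 0.0


# Made-up labels: 1 = COPD, 0 = non-COPD.
actual = [1, 1, 1, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 1]
print(jaccard_index(actual, predicted))
```

Here there are three true positives, one false positive, and one false negative, so the index is 3 / 5. Unlike accuracy, the index ignores true negatives.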
TABLE 12.2
Comparison of the Considered Machine Learning Models
A collection of 441 data records offers sufficient detail to make the
classifications and forecasts as accurate as possible. The support
vector machine delivered an accuracy of 73%, logistic regression 72%,
random forest 74%, decision tree 74%, and KNN the highest accuracy of
75%.
For the classification of COPD severity level using the K-Nearest
Neighbor algorithm, the attained accuracy is 75% and the corresponding
F1 score is 81%, which is higher than the results obtained using the
other algorithms.
12.6 Declarations
Source of funding
This research work received no specific grants from government,
commercial, or non-profit funding agencies.
REFERENCES
1. Ma, X., et al “Comparison and development of machine learning tools
for the prediction of chronic obstructive pulmonary disease in the
Chinese population”. Journal of Translational Medicine 18 no. 146
(2020).
2. Swaminathan, S., et al “A machine learning approach to triaging
patients with chronic obstructive pulmonary disease”. PLoS One 12
no. 11 (2017).
3. Wytrychiewicz, Kinga, et al “Smoking status, body mass index,
health-related quality of life, and acceptance of life with illness in
stable outpatients with COPD”. Frontiers in Psychology 10 (2019).
4. Himes, B. E., et al “Prediction of chronic obstructive pulmonary
disease (COPD) in asthma patients using electronic medical records”.
Journal of the American Medical Informatics Association: JAMIA, 16
no. 3 (2009): 371–379.
5. Matheson, M. C., et al “Prediction models for the development of
COPD: a systematic review”. International Journal of Chronic
Obstructive Pulmonary Disease 13 (2018): 1927–1935.
6. Macaulay, Dendy, et al “Development and validation of a claims-
based prediction model for COPD severity”. Respiratory Medicine 107
(2013): 1568–1577.
7. Aramburu, A., et al “COPD classification models and mortality
prediction capacity”. International Journal of Chronic Obstructive
Pulmonary Disease 14 (2019): 605–613.
8. Manian, P. “Chronic obstructive pulmonary disease classification,
phenotypes and risk assessment”. Journal of Thoracic Disease, 11 no.
14 (2019): S1761–S1766.
9. Luo, L., et al “Using machine learning approaches to predict high-cost
chronic obstructive pulmonary disease patients in China”. Health
Informatics Journal, 26 no. 3 (2019): 1577–1598.
10. Xu, W., et al “Differential analysis of disease risk assessment using
binary logistic regression with different analysis strategies”. The
Journal of International Medical Research 46 no. 9 (2018): 3656–
3664.
11. Bellou, V., et al “Prognostic models for outcome prediction in patients
with chronic obstructive pulmonary disease: systematic review and
critical appraisal”. The BMJ 367 (2019).
12. D., Dahiwade, G., Patle, and E., Meshram. “Designing disease
prediction model using machine learning approach”. 3rd International
Conference on Computing Methodologies and Communication
(ICCMC) (2019): pp. 1211–1215.
13. Yogesh, Thorat, et al “Diagnostic accuracy of COPD severity grading
using machine learning features and lung sounds”. European
Respiratory Journal 54 (2019): PA3992.
14. Peng, J., et al “A machine-learning approach to forecast aggravation
risk in patients with acute exacerbation of chronic obstructive
pulmonary disease with clinical indicators”. Scientific Reports 10 no.
3118 (2020).
15. González, G., et al “Disease staging and prognosis in smokers using
deep learning in chest computed tomography”. American Journal of
Respiratory and Critical Care Medicine 197 no. 2 (2018): 193–203.
16. Mariani, M. C., Tweneboah, O. K., and Bhuiyan, M. “Supervised
machine learning models applied to disease diagnosis and prognosis”.
AIMS Public Health 6 no. 4 (2019): 405–423.
17. Esteban, C., et al “Development of a decision tree to assess the
severity and prognosis of stable COPD”. European Respiratory
Journal 38 no. 6 (2011): 1294–1300.
18. Uddin, S., et al “Comparing different supervised machine learning
algorithms for disease prediction”. BMC Medical Informatics and
Decision Making 19 no. 281 (2019).
19. James, G., et al “An introduction to statistical learning”. Springer Texts
in Statistics (2013): 1–419.
13
A Novel Multimodal Risk Disease Prediction
of Coronavirus by Using Hierarchical LSTM
Methods
CONTENTS
13.1 Introduction
13.2 Related Works
13.3 About Multimodality
13.3.1 Risk Factors
13.4 Methodology
13.4.1 Naïve Bayes (NB)
13.4.2 RNN-Multimodal
13.4.3 LSTM Model
13.4.4 Support Vector Machine (SVM)
13.4.5 Performance Evaluation
13.4.5.1 Accuracy
13.4.5.2 Specificity
13.4.5.3 Sensitivity
13.4.5.4 Precision
13.4.5.5 F1-Score
13.5 Experimental Analysis
13.6 Discussion
13.7 Conclusion
13.8 Future Enhancement
References
13.1 Introduction
People are currently suffering from the pandemic of coronavirus, an
infectious disease whose outbreak began in Wuhan. The disease spreads
from person to person, and health authorities found that many early
cases were reported in Wuhan. The virus then spread outside China
through travel, reaching 216 countries and territories throughout the
world as of May 11, 2020. More than 4.8 crore (48 million) people have
been infected, more than 3.4 crore (34 million) have recovered, and
about 1.2 million patients have died due to this virus.
The novel coronavirus was identified on January 7, 2020, and named
coronavirus by the World Health Organization (WHO). As the epidemic was
observed and tested in Chinese laboratories, testing performed on
January 5, 2020, allowed the Chinese authorities to rule out the
suspected SARS virus. After the epidemic was detected in places beyond
China, the Peter Doherty Institute for Infection and Immunity (PDIII)
in Melbourne confirmed that its virus diagnosis laboratory had
successfully grown the virus in cell culture on January 5, 2020 [1].
The rapidly increasing statistics for the pandemic [2] clearly show that
coronavirus is spreading rapidly worldwide. Efforts to control the
spread are complicated by the scarcity of diagnostic tests and by the
asymptomatic cases of this novel infection. The rising number of cases
may inevitably overburden the healthcare system, as physicians and
hospital staff, particularly in the ICU, are overloaded with new cases.
Studies have focused on medications and vaccines to control the
epidemic, but clinical trials take at least one year (Figure 13.1).
FIGURE 13.1 Total cases of coronavirus throughout the world (from Worldometer).
The other approach adopts an adapted Long Short-Term Memory (LSTM)
model to classify the appropriate treatment. LSTM networks have proved
effective for classifying, analyzing, and forecasting data, since lags
of unknown duration can occur between significant events. LSTMs are
commonly applied to sequences such as sentences, where the context
depends on the input sequence. They deal effectively with the exploding
and vanishing gradient problems that may arise in the training of
conventional RNNs.
The complications of a long-term illness can result in death for
individuals who contract coronavirus and have previously established
underlying conditions such as cardiac arrest, renal disease, diabetes
mellitus, and malignancy. Numerous fatalities have involved these
medical conditions. A machine learning approach was used to assess the
risk for chronically ill patients affected by coronavirus, including
the prevalence and intensity of the epidemic. Controlled-pandemic and
stochastic models indicate several possible outcomes. Analysis can also
account for demographic changes and uncertainty in travel patterns, and
incorporate insights for epidemic prevention. Even if coronavirus has
little effect on most of the population, the high morbidity and
mortality among individuals with associated impairments require
additional care [3].
The rest of the chapter is divided into various sections. The proposed
work is discussed in Section 13.1, and literature studies are addressed in
Section 13.2. The chapter’s theme is explored in Section 13.3, and the
methodology of the proposed work is explained in Section 13.4, followed
by the outcomes of the relevant experiment described in Section 13.5.
Section 13.6 discusses the proposed method and its significance, followed
by concluding remarks in Section 13.7, and the future direction in Section
13.8.
13.4 Methodology
The framework of the proposed multimodal coronavirus risk prediction
incorporates three components, which are described in this section: the
multimodal extractor, the attention-based bidirectional recurrent
neural network, and the diagnosis prediction node. A multimodal
convolutional layer integrates several categories of information based
on three components: feature extraction for detection and treatment,
extraction and classification of the spread of infections, and an
insightful model combination.
13.4.2 RNN-Multimodal
The framework integrates normalized text features and diagnostic
information in a simple RNN instead of using a complicated feature
combination, and then uses the standardized RNN outputs directly to
predict the final diagnosis.
Recurrent networks are built on the fundamental concept of loops: each
iteration can use knowledge from previous passes through the network
[47]. How long that knowledge persists depends on many variables, but it
is not indefinite. The information can be regarded as decaying, so that
older information contributes less than current information [48].
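The way a recurrent network carries information from earlier passes, and how that information fades, can be sketched with a single-unit recurrent step. This is a toy scalar version with hand-picked weights, not the chapter's model; the update rule h_t = tanh(w_x · x_t + w_h · h_{t-1} + b) is the standard simple-RNN form, and the function name `rnn_forward` is hypothetical.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a one-unit simple RNN over `inputs` and return every hidden state."""
    h = h0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# A single pulse at the first time step, followed by silence.
pulse = [1.0, 0.0, 0.0, 0.0, 0.0]
states = rnn_forward(pulse, w_x=1.0, w_h=0.5, b=0.0)
print(states)
```

The hidden state stays non-zero after the pulse, which shows the loop carrying information forward, but it shrinks at every step, which shows the decay of older information.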
13.4.5.1 Accuracy
Accuracy is the proportion of all predictions that the algorithm gets
right. The total number of correct predictions (TP + TN) is divided by
the total number of samples (P + N) in the chronic disease dataset:
\[ \text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{All Samples}} \]
13.4.5.2 Specificity
Specificity is the metric that tells us the proportion of patients
without chronic diseases who are predicted as not having chronic
diseases (also known as the true negative rate):
\[ \text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}} \]
13.4.5.3 Sensitivity
Sensitivity is the metric that tells us the proportion of patients who
actually have chronic illnesses and are identified as such by the
classification algorithm (also known as the true positive rate, recall,
or probability of detection):
\[ \text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]
13.4.5.4 Precision
Precision is the metric that tells us the proportion of patients
diagnosed as chronically ill who actually are chronically ill. It is
also known as the positive predictive value (PPV):
\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]
13.4.5.5 F1-Score
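The F1 score (also called the F-score or F-measure) is the harmonic
mean of precision and recall, where recall is the same quantity as
sensitivity:
\[ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]
\[ F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]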
Accuracy: 99%
Precision: 1.0
Recall: 0.9
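The metrics defined in Section 13.4.5 can all be derived from the four confusion-matrix counts. In the sketch below, the counts are hypothetical: they are chosen only because they reproduce the accuracy, precision, and recall values listed above, and the chapter does not report the underlying counts. The function name `classification_metrics` is also invented for the example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the Section 13.4.5 metrics from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Guard against empty classes instead of dividing by zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # same quantity as sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "specificity": ratio(tn, tn + fp),
        "sensitivity": recall,
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts for 100 test samples.
metrics = classification_metrics(tp=9, tn=90, fp=0, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a precision of 1.0 and a recall of 0.9, the harmonic mean gives an F1 score of 18/19, or about 0.947, which sits between the two values and closer to the smaller one.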
The models for diagnosing COVID-19 were based on RNN and LSTM. Each
model was designed to detect COVID-19 using the COVID dataset. The
COVID-19 detector was trained and tested on the collected dataset, with
70% used for training and the remaining 30% for testing. To reduce the
imbalance of the data, the class weight technique was applied, which
adjusts weights inversely proportional to class frequencies in the
input data. Each model was able to distinguish between the classes with
high classification accuracy.
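One common way to make weights inversely proportional to class frequencies is the "balanced" heuristic, weight = n_samples / (n_classes × class_count). The sketch below assumes that heuristic; the chapter does not state which formula was used, and the label counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical imbalanced training labels.
training_labels = ["covid"] * 20 + ["normal"] * 80
weights = balanced_class_weights(training_labels)
print(weights)
```

The minority class receives the larger weight (2.5 against 0.625 here), so errors on it contribute more to the training loss.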
The results of the study are very encouraging for COVID-19 diagnostic
methods. Multiple epochs were used to train the RNN and LSTM models,
and the machine learning technique proved effective. The RNN prediction
model achieved the higher accuracy of 96.6% and an F-measure of 94%,
while the LSTM attained an accuracy of 94.8% and an F-measure of 92%.
Figures 13.8 and 13.9 show the performance of both neural network
models in terms of classification accuracy.
13.6 Discussion
This analysis shows in many different ways the usefulness of multimodal
models for evaluating therapeutic cases of coronavirus. Initially, the
observed results appear better than those in textbooks, and no other
way of working appears better. Nevertheless, this could not be
thoroughly tested, since there were too few training samples and actual
results. More systematically recorded classification results are
recommended before drawing conclusions about other training strategies.
The framework is straightforward and can be used effectively in an
assessment for a strategic plan. No specific optimal solution method
was identified.
Both proposed models obtain good results. Comparing the accuracy and
precision of our models with those of other models, the RNN is best for
COVID-19 prediction, reaching 96.6% accuracy and 96.4% precision.
Training was best when the gap between the training and validation
curves narrowed. The result is a robust COVID-19 detector, with the
F-measure improving to 0.97. The AUC was also impressive, with the
model achieving 0.9, as shown in Figure 13.10. Thus, the COVID-19
diagnosis model trained on the x-ray data provides superior performance
metrics.
13.7 Conclusion
The study indicates that a high risk of mortality from chronic disease
is associated with more than 70% of patients’ total expenditure on
chronic illness medication. In developed countries worldwide, treatment
is a significant crisis. According to clinical reports, serious illness
is the primary cause of mortality. Early diagnosis of long-term illness
is therefore especially important for reducing the risk of death. The
results are compared across analytical techniques including Naïve
Bayes, LSTM, Support Vector Machine, and Recurrent Neural Network. The
recurrent neural network achieves the highest precision, and its
recurrent input loop makes it suitable, together with LSTM classifiers,
as a computational model for earlier diagnosis of heart disease. LSTM
models showed better outcomes than conventional methods (with and
without window duration) in predicting the risk of chronic diseases
affected by the COVID-19 diagnosis.
REFERENCES
1. Nature. “Coronavirus latest: Australian lab first to grow virus outside
China”. Available online: https://www.nature.com/articles/d41586-020-00154-w.
2. Website: https://www.teradata.com/Blogs/Advanced-Analytics-for-coronavirus-Trends-Patterns-Predictions.
3. Kakulapati, V., et al, “Risk analysis of coronavirus caused death by the
probability of patients suffering from chronic diseases – a machine
learning perspective”. JCR 2020; 7(14): 2626–2633. doi:
10.31838/jcr.07.14.499.
4. Mukherjee, H., et al, “Shallow convolutional neural network for
coronavirus outbreak screening using chest X-rays”. (2020).
https://doi.org/10.36227/techrxiv.12156522.v1.
5. Rajinikanth, V., et al “Harmony-search and Otsu based system for
coronavirus disease (COVID19) detection using lung CT scan
images”. arXiv preprint arXiv:2004.03431, 2020.
6. Das, D., et al, “Truncated inception net: coronavirus outbreak
screening using chest X-rays”. doi: https://doi.org/10.21203/rs.3.rs-20795/v1.
7. Fong, S., et al, “Finding an accurate early forecasting model from
small dataset: a case of 2019-ncov novel coronavirus outbreak”. arXiv
preprint arXiv:2003.10776, 2020.
8. Fong, S., et al, “Composite Monte Carlo decision making under high
uncertainty of novel coronavirus epidemic using hybridized deep
learning and fuzzy rule induction”. Applied Soft Computing 2020; 93:
106282.
9. Website: https://www.newscientist.com/article/2236846-coronavirus-risk-of-death-rises-with-age-diabetes-and-heart-disease/#ixzz6J2szlsdL
10. Li, B., et al, “Prevalence and impact of cardiovascular metabolic
diseases on coronavirus in China”. Clinical Research in Cardiology
2020; 109(3).
11. Porcheddu, R., et al, “Similarity in case fatality rates (CFR) of
COVID-19/SARS-COV-2 in Italy and China”. Journal of Infection in
Developing Countries 2020; 14: 125–128.
12. The Novel coronavirus Pneumonia Emergency Response
Epidemiology Team. “The epidemiological characteristics of an
outbreak of 2019 novel coronavirus disease (COVID-19) - China”.
China CDC Weekly 2020; 41(2).
13. Website: https://www.kaggle.com/virosky/novel-coronavirus-covid-19-italy-dataset
14. Gupta, R., et al, “Clinical considerations for patients with diabetes in
times of coronavirus epidemic”. Diabetology & Metabolic Syndrome
2020 Mar 10; 14(3): 211–212.
15. Liu, J., et al, “Deep EHR: chronic disease prediction using medical
notes”. 2018, https://arxiv.org/abs/1808.04928
16. Brisimi, T.S., et al, “Predicting chronic disease hospitalizations from
electronic health records: an interpretable classification approach”.
Proceedings of the IEEE 2018; 106(4): 690–707.
17. Zhang, X., et al, “A novel deep neural network model for multi-label
chronic disease prediction”. Frontiers in Genetics 2019; 10: 351.
18. Kriplani, H., et al, “Prediction of chronic kidney diseases using deep
artificial neural network technique”. Computer Aided Intervention and
Diagnostics in Clinical and Medical Images, Springer, Berlin,
Germany, pp. 179–187, 2019.
19. Deepika, K., et al, “Predictive analytics to prevent and control chronic
disease”, in Proceedings of the International Conference on Applied
and Theoretical Computing and Communication Technology
(iCATccT), pp. 381–386, IEEE, Bangalore, India, July 2016.
20. Kim, C., et al, “Chronic disease prediction using character-recurrent
neural network in the presence of missing information”. Applied
Sciences 2019; 9(10): 2170.
21. Gautam, R., et al, “A comprehensive review on nature inspired
computing algorithms for the diagnosis of chronic disorders in human
beings”. Progress in Artificial Intelligence 2019; 8(4): 401–424.
22. Rojas, E.M., et al, “Contributions of machine learning in the health
area as support in the diagnosis and care of chronic diseases”.
Innovation in Medicine and Healthcare Systems, and Multimedia,
Springer, Berlin, Germany, vol. 145, pp. 261–269, 2019.
23. Jia, W., et al, “Predicting the outbreak of the hand-foot-mouth diseases
in China using recurrent neural network”. 2019 IEEE International
Conference on Healthcare Informatics (ICHI). IEEE, pp. 1–4, 2019.
24. Hamer, W.B., et al, “Spatio-Temporal prediction of the epidemic
spread of dangerous pathogens using machine learning methods”.
ISPRS International Journal of Geo-Information 2020; 9: 44.
25. Mezzatesta, et al, “A machine learning-based approach for predicting
the outbreak of cardiovascular diseases in patients on dialysis”.
Computer Methods and Programs in Biomedicine 2019; 177: 9–15.
26. Philemon, M.D., et al, “A review of epidemic forecasting using
artificial neural networks”. International Journal of Epidemiologic
Research 2019; 6: 132–143.
27. Abdulkareem, et al, “Risk perception and behavioral change during
epidemics: comparing models of individual and collective learning”.
PLoS One 2020; 15(1): e0226483.
28. Jiménez, F., et al, “Feature selection based multivariate time series
forecasting: an application to antibiotic resistance outbreaks
prediction”. Artificial Intelligence in Medicine 2020; 104: 101818.
29. Ochodek, M., et al, “Deep learning model for end-to-end
approximation of COSMIC functional size based on use-case names”.
Information and Software Technology 2020; 103: 106310.
30. Wen, S., et al, “Real-time identification of power fluctuations based on
LSTM recurrent neural network: a case study on Singapore power
system”. IEEE Transactions on Industrial Informatics 2019; 15: 5266–
5275.
31. Yuan, J., et al, “A novel GRU-RNN network model for dynamic path
planning of mobile robot”. IEEE Access 2019; 7: 15140–15151.
32. Guan, W.J., et al, “Clinical characteristics of 2019 novel coronavirus
infection in China”. medRxiv Feb 9, 2020.
https://doi.org/10.1101/2020.02.06.20020974.
33. Chen, N., et al, “Epidemiological and clinical characteristics of 99
cases of 2019 novel coronavirus pneumonia in Wuhan, China: a
descriptive study”. Lancet 2020; 395: 507–513.
34. Huang, C., et al, “Clinical features of patients infected with 2019 novel
coronavirus in Wuhan, China”. Lancet 2020; 395: 497–506.
https://doi.org/10.1016/S0140-6736(20)30183-5.
35. Liu, K., et al, “Clinical characteristics of novel coronavirus cases in
tertiary hospitals in Hubei Province”. Chinese Medical Journal (Engl)
2020; 133(9): 1025–1031.
36. Wang, D., et al, “Clinical characteristics of 138 hospitalized patients
with 2019 novel coronavirus-infected pneumonia in Wuhan, China”.
JAMA 2020; 323(11): 1061–1069.
37. Wu, Z., et al, “Characteristics of and important lessons from the
coronavirus disease 2019 (COVID-19) outbreak in China. Summary of
a report of 72 314 cases from the Chinese Center for Disease Control
and Prevention”. JAMA Feb 24, 2020; 323(13): 1239–1242. doi:
10.1001/jama.2020.2648.
38. Wu, H., et al, “Secular trends in all-cause and cause-specific mortality
rates in people with diabetes in Hong Kong, 2001–2016: a
retrospective cohort study”. Diabetologia 2020 Apr; 63(4): 757–766.
39. Huang, Y.T., et al, “Hospitalization for ambulatory-care-sensitive
conditions in Taiwan following the SARS outbreak: a population-
based interrupted time series study”. Journal of the Formosan Medical
Association 2009; 108: 386–394.
40. Morra, M.E., et al, “Clinical outcomes of current medical approaches
for Middle East respiratory syndrome: a systematic review and meta-
analysis”. Reviews in Medical Virology 2018; 28: e1977. doi:
10.1002/rmv.1977.
41. Haddad, L.B., et al, “Pregnant women and the Ebola crisis”. New
England Journal of Medicine 2018; 379: 2492–2493.
42. Li, Q., et al, “Early transmission dynamics in Wuhan, China, of novel
coronavirus-infected pneumonia”. New England Journal of Medicine
2020; 382(13): 1199–1207.
43. Zhou, Z., et al, “Effect of gastrointestinal symptoms on patients
infected with COVID-19”, Gastroenterology 2020; 158(8): 2294–2297.
doi: https://doi.org/10.1053/j.gastro.2020.03.020.
44. Assmann, G., et al, “Simple scoring scheme for calculating the risk of
acute coronary events based on the 10-year follow-up of the
prospective cardiovascular Münster (PROCAM) study”. Circulation
2002; 105(3): 310–315.
45. Gandhi, R. “Naive Bayes classifier, towards data science”. 2018.
https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c
accessed 25th April, 2020
46. Mehta, M., et al, “SLIQ: A fast scalable classifier for data mining”. In:
Apers, P., Bouzeghoub, M., Gardarin, G. (eds.), Proceedings of the 5th
International Conference on Extending Database Technology,
Springer-Verlag, Berlin, pp. 18–32, 1996.
47. Sak, Haim, et al, “Long short-term memory based recurrent neural
network architectures for large vocabulary speech recognition”, 2014,
arXiv:1402.1128
48. Zhang, Y.-D., et al, “Fractal dimension estimation for developing
pathological brain detection system based on Minkowski-Bouligand
method”. IEEE Access 2016; 4: 5937–5947.
49. Gurwitz, J., et al, “Contemporary prevalence and correlates of incident
heart failure with preserved ejection fraction”. The American Journal
of Medicine 2013; 126(5): 393–400. 59. Clinical Classifications
Software (CCS).
50. Islam, M.M., et al, “Prediction of breast cancer using support vector
machine and K-Nearest neighbors”. In: 2017 IEEE Region 10
Humanitarian Technology Conference (R10-HTC). IEEE, pp. 226–229,
2017.
51. Mavroforakis, M.E., et al, “A geometric approach to Support Vector
Machine (SVM) classification”. IEEE Transactions on Neural
Networks 2006; 17(3): 671–682.
14
A Tier-based Educational Analytics
Framework
CONTENTS
14.1 Introduction
14.2 Related Works
14.3 The Three-Tiered Education Analysis Framework
14.3.1 Structured Data Analysis
14.3.1.1 Techniques for Structured Data Analysis
14.3.1.2 Challenges in Structured Data Analysis
14.3.2 Analysis of Semi-Structured Data and Text Analysis
14.3.2.1 Use Cases for Analysis of Semi-Structured and
Text Content
14.3.2.2 Challenges of Semi-Structured/Textual Data
Analysis
14.3.3 Analysis of Unstructured Data
14.3.3.1 Analysis of Unstructured Data: Study and Use
Cases
14.3.3.2 Challenges in Unstructured and Multimodal
Educational Data Analysis
14.4 Implementation of the Three-Tiered Framework
14.5 Scope and Boundaries of the Framework
14.6 Conclusion and Scope of Future Research
Note
References
14.1 Introduction
The digitization of education in recent years has resulted in large volumes
of educational data. This voluminous data with high velocity and variety
can be harnessed, processed, and analyzed to obtain valuable insights
leading to better learning outcomes. The potential influence of data mining
analytics on the students’ learning processes and outcomes in higher
education has been recognized [1]. Many educational institutions have
educational databases that are underutilized and could be potentially used
for data mining. Educational data mining (EDM) is an important tool that
analyzes the data collected from learning and teaching and applies
techniques from machine learning to predict students’ future behavior
from detailed information such as students’ grades, knowledge,
achievements, motivation, and attitude [2].
Online learning and digital content, which received a fillip during the
pandemic-induced lockdown, are today at the cusp of transformative changes,
and educational analytics has a major role to play in this transformation.
The multimodal, multidisciplinary, and non-sequential nature of learning
today has generated large volumes of structured, semi-structured, and
unstructured data requiring multifarious techniques of data analysis. This
paper proposes a three-tiered framework for academic data analysis. The
first tier of our model proposes structured data analysis for numerical
evaluation and prediction of student performance. Statistical techniques like
aggregation, correlations, and regression analysis can be used for data
summarization, data correlation, and predictive modeling. Machine learning
techniques like association mining, classification, and prediction can reveal
patterns, categorize data, and forecast outcomes. Unsupervised learning
techniques like clustering yield performance-based clusters and discover
outliers which could be students with exceptional learning capabilities or
students with learning challenges.
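The first-tier techniques named above (aggregation, correlation, regression, and outlier discovery) can be illustrated on a small table of student records. This is a sketch under stated assumptions rather than part of the proposed framework: the attendance and mark values are invented, the function names are hypothetical, and a simple distance-from-the-mean rule stands in for the clustering-based outlier discovery mentioned in the text.

```python
from statistics import mean, pstdev

# Hypothetical records: (attendance %, final mark) for six students.
records = [(60, 52), (70, 61), (80, 68), (90, 79), (95, 83), (85, 30)]
attendance = [a for a, _ in records]
marks = [m for _, m in records]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def outliers(values, threshold=2.0):
    """Indices of values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]


average_mark = mean(marks)                       # aggregation
r = pearson(attendance, marks)                   # correlation
slope, intercept = fit_line(attendance, marks)   # regression for prediction
flagged = outliers(marks, threshold=1.5)         # candidate exceptional students
```

A flagged index only marks a record for closer attention; whether it represents an exceptional learner or a student facing learning challenges still has to be judged by the faculty.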
The second tier of our framework is based on semi-structured content.
This tier examines the semi-structured data and textual content to perform
qualitative analysis. Though textual content is typically considered to be
unstructured data, our three-tiered framework includes textual analysis in
the second tier. We propose the use of text mining techniques, NLP, and
computational linguistics for mining tasks such as student feedback
analysis, automated assignment valuation, and valuation of subjective
answer scripts. The analysis can be further extended to include the creation
of intelligent linguistic models that can auto-generate answers to FAQs.
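As a simple illustration of the student feedback analysis mentioned above, free-text comments can be scored against a polarity lexicon. The sketch below is an assumption-laden toy and not the method proposed in this chapter: the lexicon is hand-made and far too small for real use, the function names are hypothetical, and a practical system would rely on an established lexical resource or a trained model.

```python
import re

# Tiny hand-made polarity lexicon (an assumption, for illustration only).
LEXICON = {"clear": 1, "helpful": 1, "engaging": 1, "good": 1,
           "confusing": -1, "boring": -1, "fast": -1, "unclear": -1}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def feedback_score(text):
    """Sum the polarities of all known words; unknown words count as 0."""
    return sum(LEXICON.get(token, 0) for token in tokenize(text))


def label(text):
    """Map the numeric score to a coarse sentiment label."""
    score = feedback_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


comments = [
    "The lectures were clear and the examples were helpful.",
    "Too fast and often confusing.",
    "Classes were held on Mondays.",
]
labels = [label(comment) for comment in comments]
```

Word counting of this kind ignores negation and context, which is one reason the text proposes NLP and computational linguistics techniques for this tier.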
Online classes, proctored examinations, online tests, and assignments
have become a norm today. Still images, audio clips, and video recordings
of online classes generate multi-modal content which can be analyzed using
multi-modal analysis techniques. The third tier of our framework analyzes
multi-modal data to obtain deeper insights relating to student/teacher
involvement, instructional strategy, and class engagement. Deep
learning networks such as Convolutional Neural Networks (CNN), Recurrent
Neural Networks (RNN), and Deep Belief Networks (DBN), which have
exhibited enhanced machine learning outputs, can be leveraged for
discovering student interest, participation, the effectiveness of the
teaching-learning process, and overall impact.
This work proposes a framework for educational analysis specifying
inputs, techniques, and deliverables at each tier. It follows a use-case based
approach depicting the different users, the techniques used, and the
resulting outputs. This framework provides implementation guidelines for
educational data analysis and advocates a model for phased implementation
in the higher education institutes (HEIs) in India. It focuses on analysis for
enhanced learning outcomes in the context of higher education and does not
include in its scope the analysis of administrative, financial, marketing, or
promotional data. It uses the term educational data analysis to mean
academic data analysis and uses both these terms interchangeably.
To the best of our knowledge, there are no previous works in educational
data analysis that propose a modular multi-tiered approach specifying the
inputs and techniques at each tier, and this is the distinct contribution of our
paper.
This paper is structured as follows. Section 14.2 examines the related works
in the area of educational analytics. Section 14.3 describes the three-tiered
framework with the use cases and the possible challenges. Section 14.4
discusses the implementation guidelines and challenges with a focus on the
Indian context. Section 14.5 examines the scope and boundaries of the
framework, and Section 14.6 concludes the paper by listing possible areas of
future research.
14.2 Related Works
The usage of analytics in the higher educational sector holds both promise
and challenges. The main applications of learning analytics are tracking and
predicting learners’ performance as well as identifying potential
problematic issues and students at risk [3]. The same review also discusses
how analysis of big data can help identify weaknesses in student learning and
comprehension, and thus determine whether improvements to the curriculum are
necessary. Learning analytics (LA) has evolved from a mere score aggregation
and prediction platform to one that also helps identify
outliers like underperforming/high-performing students and helps in the
process of personalization of content [4]. Although LA is maturing, its
overall potential is yet to be realized. This poses the question of how we can
facilitate the transfer of this potential into learning and teaching practice
[5]. The same work also points out the importance of data as the bedrock of
analysis. Because data plays a pivotal role, a careful understanding and
evaluation of the data required and of the analytic techniques to be adopted
is essential. The potential of learning analytics, which is yet to be
optimally realized, is also discussed in [6]. Although vast volumes of data
are available in Learning Management Systems (LMS), it is only recently that
educational institutions have begun to dip into the deep waters of data
analysis and machine learning to gather insights into teaching quality and
student learning experiences. That work also notes the critical role of data
for analytics programs and discusses how the data sources employed in
learning analytics have evolved from a single source of student learning data
(e.g., an LMS) to the integration of multiple data sources.
Educational data analysis techniques need to go beyond mere counting
and aggregation. We recognize that an analysis program is not just about
simple quantitative measures, but requires consideration of multiple
variables and their interactions. The significance of going beyond mere
counting is also noted in [7]. There is a need for holistic measures that
combine variables from multiple sources to arrive at more accurate
conclusions. In our paper, we have proposed a three-tiered approach to data
analysis which enables multiple measures to be obtained from different and
diverse data sources.
Learning analytics is a multidisciplinary subject, which combines
advanced statistical techniques, machine learning algorithms, epistemology,
and educational policies. Although it may seem promising to automate
many measurements and predictions about learning and teaching, a sole
focus on outcomes as the primary target of learning analytics, without
consideration of teaching-learning processes, can have detrimental
consequences. It is imperative that all concerned stakeholders are a part of
the analytics journey [7]. Learning analysis that does not promote effective
learning and teaching is susceptible to the use of trivial measures, such as
an increased number of log-ins to an LMS, as a way to evaluate
learning progression. In order to avoid such undesirable practices, the
involvement of the relevant stakeholders like learners, instructors,
instructional designers, information technology support, and institutional
administrators is necessary at all stages of development of appropriate
measures, their implementation, and evaluation of learning.
Predictive analytic models should not be interpreted in isolation. It is
necessary to consider and correlate all relevant factors and parameters that
come into play. A study of the cause, effect, and interplay of the parameters
can help us benefit from educational analysis. For example, course
completion and retention rates of underperforming students should be
related to effective intervention strategies so as to help at-risk students
succeed [8].
Implementing a learning analytics program is both a complex and a
resource-intensive effort. There are a range of implementation
considerations and potential barriers to adopting educational data mining
and learning analytics, including technical challenges, institutional capacity,
legal, and ethical issues [9]. Successful application of educational data
mining and learning analytics will not come without effort, cost, and a focus
on promoting a data-centric culture in educational institutions.
The various techniques relating to data and learning analytics are
discussed in [4], and the use of multimodal data in learning analysis is
discussed in [10,11]. The privacy and ethical issues relating to the usage of data for
educational analytics are important [11] and although we have not discussed
this in detail in our work, its significance in the overall scheme is
acknowledged.
14.3 The Three-Tiered Education Analysis Framework
The education sector is experiencing a transformation in the method of
imparting education, mode of delivery, and the method of assessment. This
transformation has generated volumes of structured, semi-structured, and
unstructured data. The three-tiered framework proposed by us is based on
these three categories of data. Structured data is tabular, has a schema, can
be organized as rows and columns, and typically includes numeric,
alphanumeric, date/time, and Boolean values. Relational databases are
generally used to store this data and SQL is used for manipulating/querying
this data. Semi-structured data does not have a rigid structure or a schema.
However, it contains tags that identify separate semantic elements and
enforce hierarchies of records and fields within the data. XML data and
JSON are some examples of semi-structured data. Student feedback
containing ordinal values as well as textual content is an example of semi-
structured data. Unstructured data does not conform to any data model and
cannot be stored as rows and columns. It includes streaming data on the
WWW such as web pages, audio, video, and social media content. NoSQL
databases are generally used to store this data and Application
Programming Interface (API) functions are used for manipulation. Machine
learning algorithms including deep learning architectures have shown
promising results in image processing and multi-modal analysis for better
classification and predictive modeling. Figure 14.1 gives an overview of the
use cases for educational data analysis. This figure depicts the actors or
primary stakeholders contributing to the generation of educational data. The
student, faculty, placement officer/cell, management, and the
administrator/system (admin) are the main actors in an academic institution,
and the actions taken by them or their interactions with the system result in
the generation of data. The data analysis modules use this data as input and
generate various aggregative, predictive, and diagnostic models that can be
leveraged by faculty, management, and students themselves for informed
decision making. The following subsections discuss in detail the three-
tiered educational analysis model using a use case approach.
FIGURE 14.1 Educational data analysis.
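The distinction drawn above between structured and semi-structured data can be made concrete with a student feedback record that mixes ordinal ratings with free text. The sketch below is only an illustration: the field names (student_id, course, ratings, comment) and the values are assumptions and do not come from any dataset described in this chapter.

```python
import json

# A hypothetical semi-structured feedback record: tagged fields, a nested
# block of ordinal ratings, and a free-text comment.
raw = """
{
  "student_id": "S001",
  "course": "CS101",
  "ratings": {"clarity": 4, "pace": 2},
  "comment": "Examples were useful but the pace was too fast."
}
"""

record = json.loads(raw)

# The ordinal fields can be flattened into a single row, which is the
# tabular form that structured (tier-one) analysis expects.
structured_row = {
    "student_id": record["student_id"],
    "course": record["course"],
    **{f"rating_{name}": value for name, value in record["ratings"].items()},
}

# The free text is kept aside for textual (tier-two) analysis.
text_part = record.get("comment", "")
```

Splitting the record in this way lets each part be handled by the tier whose techniques suit it, without forcing the text into a fixed schema.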
a. Classification
Note
1. Course plans are subject-wise and topic-wise plans devised by
faculty members with the objective of spacing out the topics that need
to be taught and ensuring timely completion of the syllabus. Course plans
are typically semi-structured in nature, as they contain structured
content like the planned_date, completion_date, module, etc. and
textual content like topics and remarks.
REFERENCES
1. Aldowah, Hanan, Hosam Al-Samarraie, and Wan Mohamad Fauzy.
“Educational data mining and learning analytics for 21st century
higher education: A review and synthesis”. Telematics and Informatics
37 (2019): 13–49.
2. Ahuja, Ravinder, Animesh Jha, Rahul Maurya, and Rishabh
Srivastava. “Analysis of educational data mining”. In Harmony Search
and Nature Inspired Optimization Algorithms, pp. 897–907. Springer,
Singapore, 2019.
3. Avella, John T., Mansureh Kebritchi, Sandra G. Nunn, and Therese
Kanai. “Learning analytics methods, benefits, and challenges in higher
education: A systematic literature review”. Online Learning 20, no. 2
(2016): 13–29.
4. Bienkowski, Marie, Mingyu Feng, and Barbara Means. “Enhancing
teaching and learning through educational data mining and learning
analytics: An issue brief”. US Department of Education, Office of
Educational Technology 1 (2012): 1–57.
5. Viberg, Olga, Mathias Hatakka, Olof Bälter, and Anna Mavroudi. “The
current landscape of learning analytics in higher education”.
Computers in Human Behavior 89 (2018): 98–110.
6. Joksimović, Srećko, Vitomir Kovanović, and Shane Dawson. “The
journey of learning analytics”. HERDSA Review of Higher Education
6 (2019): 27–63.
7. Gašević, Dragan, Shane Dawson, and George Siemens. “Let’s not
forget: Learning analytics are about learning”. TechTrends 59, no. 1
(2015): 64–71.
8. Jayaprakash, Sandeep M., Erik W. Moody, Eitel JM Lauría, James R.
Regan, and Joshua D. Baron. “Early alert of academically at-risk
students: An open source analytics initiative”. Journal of Learning
Analytics 1, no. 1 (2014): 6–47.
9. İnan, Ebru, and Martin Ebner. “Learning analytics and MOOCs”. In
International Conference on Human-Computer Interaction, pp. 241–
254. Springer, Cham, 2020.
10. Giannakos, Michail N., Kshitij Sharma, Ilias O. Pappas, Vassilis
Kostakos, and Eduardo Velloso. “Multimodal data as a means to
understand the learning experience”. International Journal of
Information Management 48 (2019): 108–119.
11. Hoel, Tore, Dai Griffiths, and Weiqin Chen. “The influence of data
protection and privacy frameworks on the design of learning analytics
systems”. In Proceedings of the Seventh International Learning
Analytics & Knowledge Conference, pp. 243–252, 2017.
12. Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. “Classification:
Basic concepts, decision trees, and model evaluation”. Introduction to
Data Mining 1 (2006): 145–205.
13. Han, Jiawei, Jian Pei, and Micheline Kamber. Data Mining: Concepts
and Techniques. Elsevier, 2011.
14. Miller, George A. “WordNet: A lexical database for English”.
Communications of the ACM 38, no. 11 (1995): 39–41.
15. Baccianella, Stefano, Andrea Esuli, and Fabrizio Sebastiani.
“Sentiwordnet 3.0: An enhanced lexical resource for sentiment
analysis and opinion mining”. In LREC 10, no. 2010 (2010): 2200–
2204.
16. Abdulkarim M. N. “Classification and Retrieval of Research
Papers: A Semantic Hierarchical Approach.” PhD diss., Christ University,
2010.
17. Winkler, Rainer, and Matthias Soellner. “Unleashing the potential of
chatbots in education: A state-of-the-art analysis”. 2018.
18. Javed, Nazura, and B. L. Muralidhara. “Automating corpora
generation with semantic cleaning and tagging of tweets for multi-
dimensional social media analytics”. International Journal of
Computer Applications 127, no. 12 (2015): 11–16.
19. Pennington, Jeffrey, Richard Socher, and Christopher D. Manning.
“Glove: Global vectors for word representation”. In Proceedings of the
2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pp. 1532–1543, 2014.
20. Andrade, Alejandro, Ginette Delandshere, and Joshua A. Danish.
“Using multimodal learning analytics to model student behaviour: A
systematic analysis of behavioural framing”. Journal of Learning
Analytics 3, no. 2 (2016): 282–306.
21. Poria, Soujanya, Erik Cambria, Newton Howard, Guang-Bin Huang,
and Amir Hussain. “Fusing audio, visual and textual clues for
sentiment analysis from multimodal content”. Neurocomputing 174
(2016): 50–59.
22. Nassauer, Anne, and Nicolas M. Legewie. “Analyzing 21st century
video data on situational dynamics—Issues and challenges in video
data analysis”. Social Sciences 8, no. 3 (2019): 100.
23. Nassauer, Anne, and Nicolas M. Legewie. “Video data analysis: A
methodological frame for a novel research trend”. Sociological
Methods & Research 50 (2018): 0049124118769093. doi:
https://doi.org/10.1177/0049124118769093.
24. Hannafin, Michael, Arthur Recesso, Drew Polly, and J. W. Jung.
“Video analysis and teacher assessment: Research, practice and
implications”. Digital Video for Teacher Education: Research and
Practice, pp. 164–180. Routledge, Abingdon, 2014.
25. Abbas, Qaisar, Mostafa EA Ibrahim, and M. Arfan Jaffar. “Video
scene analysis: An overview and challenges on deep learning
algorithms”. Multimedia Tools and Applications 77, no. 16 (2018):
20415–20453.
26. Lee, Lap-Kei, and Simon KS Cheung. “Learning analytics: Current
trends and innovative practices”. Journal of Computers in Education 7,
no. 1 (2020): 1–6.
15
Breast Invasive Ductal Carcinoma Classification Based on
Deep Transfer Learning Models with Histopathology Images
Saikat Islam Khan, Pulak Kanti Bhowmick, Nazrul Islam, Mostofa Kamal Nasir, and Jia Uddin
CONTENTS
15.1 Introduction
15.2 Background Study
15.2.1 Breast Cancer Detection Based on Machine Learning Approach
15.2.2 Breast Cancer Detection Based on Deep Convolutional Neural Network Approach
15.2.3 Breast Cancer Detection Based on Deep Transfer Learning Approach
15.3 Methodology
15.3.1 Data Acquisition
15.3.2 Data Preprocessing Stage
15.3.3 Transfer Learning Model
15.3.3.1 Visual Geometry Group Network (VGGNet)
15.3.3.2 Residual Neural Network (ResNet)
15.3.3.3 Dense Convolutional Networks (DenseNet)
15.4 Experimental Setup and Results
15.4.1 Performance Evaluation Metrics
15.4.2 Training Phase
15.4.3 Result Analysis
15.4.4 Comparison with Other State of Art Models
15.5 Discussion with Advantages and Future Work
15.5.1 Discussion
15.5.2 Advantages
15.5.3 Future Works
15.6 Conclusion
References
15.1 Introduction
Breast cancer is one of the leading causes of death worldwide. In the United States, cancer is
the second leading cause of death. According to a study [1], breast cancer is the most commonly diagnosed
cancer among women in the United States, excluding skin cancer. The same study states that breast cancer is the
second leading cause of cancer-related death in women. About 12% of women in America are
diagnosed with breast cancer during their lifetime. Women of all races and ethnicities
are diagnosed with breast cancer and die from it [1].
As the mortality rate of breast cancer is high, it is essential to detect the cancer type as early as
possible. One of the most reliable methods for detecting breast cancer is the inspection of histopathological
images. Over the last few years, machine learning has helped reduce human error in making critical decisions.
Deep learning models in particular are used for early detection of breast cancer from
histopathological images. Popular machine learning techniques such as random forest, support
vector machine, and Bayesian networks are also used for early detection and diagnosis of breast cancer,
but deep learning models have made it easier to classify breast cancer from histopathological images
[2].
Machine learning techniques for histopathological image diagnosis have come a long way, and different
types of techniques are in use. Machine learning applications in modern pathology include
computer-aided diagnosis, content-based image retrieval, and the discovery of new clinicopathological relationships.
Machine-learning-based methods for diagnosing digital histopathological images comprise
feature extraction, feature selection, and classification. In the feature extraction stage,
discriminative features are extracted from the input images, and these features lead to better
classification performance.
Recently, deep learning models have gained much popularity owing to the wide availability of pre-trained
models, which are trained on huge datasets and make it easier to train a model for new tasks.
Deep learning has a serious data dependency: it usually needs massive data to train a model from
scratch. Transfer learning is a machine learning technique in which a model trained for one task
is reused as the starting point for a model on a different task. Transfer learning addresses this data
dependency problem because it can work with less data. It has also gained
popularity for its lower computation time [3].
Among the pre-trained models, ResNet50 pre-trained on the ImageNet dataset is one of the most
popular, and it is widely used for classification and segmentation tasks. ResNet50 has
performed well in malaria classification on microscopic cell images [4]. In one study [5],
a pre-trained ResNet-50 architecture was used to classify malware samples from byteplot
grayscale images with 98.62% accuracy. In another study [6], only the last layer of the ResNet-50
model was trained to perform the classification task. The ResNet-50 DNN architecture has also been
used for various tasks, including face recognition, measuring gender and ethnicity bias, and skin lesion
classification [6,7].
VGG19 and DenseNet201 are two other well-known pre-trained models that are often used for
transfer learning. DenseNet201 has been used for the diagnosis of multiple skin lesions. These pre-trained
deep learning models are used for early detection of diseases, for different classification tasks, and
for segmentation of defect areas in medical images. In one study [8], a transfer learning-based approach
with the pre-trained VGG19 model was used to detect computer-generated images from the eye region. In
another study [9], a pre-trained VGG19 model with a transfer learning approach was used for fault
diagnosis with very promising accuracy.
This study proposes a transfer learning-based approach with three popular pre-trained models,
ResNet50, VGG19, and DenseNet201, to detect IDC, which is the most deadly breast cancer type.
To detect IDC from images, we used the ResNet50, VGG19, and DenseNet201 pre-trained models
individually. To extract features from the images, we used the pre-trained weights of the models
and trained only the dense layers on the dataset images. The DenseNet201 pre-trained model achieved
the highest accuracy, 96.55%, while the VGG19 and ResNet50 pre-trained architectures achieved
87.27% and 79.61% accuracy, respectively.
15.3 Methodology
Figure 15.1 shows the model architecture used in this study. The pipeline starts with image
extraction and label loading from the dataset. Several pre-processing techniques are then
applied before the dataset is split into training, validation, and testing sets. Data augmentation
is performed to increase the number of dataset images. Finally, the transfer learning model is
built and trained to classify the IDC label. The following subsections present a detailed
description of this model.
FIGURE 15.1 Model architecture.
TABLE 15.1
Supplementary Description for BreakHis Dataset
TABLE 15.2
Data Augmentation Strategies with Parameter Value
No.  Parameter           Value
1    zoom_range          2
2    rotation_range      90
3    shear_range         0.2
4    width_shift_range   0.1
5    height_shift_range  0.1
6    horizontal_flip     True
7    vertical_flip       True
The transfer learning models are deeper than a conventional CNN and exploit richer
relations among the alternating layers.
They require less computational power than training a traditional CNN, because the models are
already trained on the ImageNet database, which contains more than a million images. Only the
last fully connected layers need to be trained.
A CNN trained from scratch performs poorly on a small image dataset and tends to overfit, whereas
a transfer learning model limits this issue by using the pre-trained weights.
This study uses three transfer learning models, including VGG19, ResNet50, and DenseNet201, to find
the IDC type in the BreakHis dataset. We use the pre-trained weights from those models, but alter the
fully connected layers to integrate such models into our classification method. The following
subsections present the description of building the transfer learning models.
Precision (P): the ratio of correctly identified IDC (+) cases to all cases predicted as IDC (+).
P = \frac{TP}{TP + FP}
F1-Score: the harmonic mean of precision (P) and recall (R), which balances the two and serves as a
measure of test accuracy.
F1\text{-}Score = 2 \cdot \frac{P \cdot R}{P + R}
TNR: the probability that an actual IDC (−) case will test as IDC (−).
TNR = \frac{TN}{TN + FP}
FPR: the probability that an actual IDC (−) case will test as IDC (+).
FPR = \frac{FP}{FP + TN}
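The metrics above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch, not the authors' code: the function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration, and recall is taken as TP / (TP + FN). It assumes every denominator is non-zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1-score, TNR, and FPR from confusion-matrix counts.

    Assumes every denominator is non-zero (each class is present and predicted).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    tnr = tn / (tn + fp)  # true negative rate (specificity)
    fpr = fp / (fp + tn)  # false positive rate, equal to 1 - TNR
    return {"precision": precision, "recall": recall, "f1": f1, "tnr": tnr, "fpr": fpr}


# Hypothetical counts, not taken from the study.
metrics = classification_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because TNR and FPR share the denominator TN + FP, they always sum to one, which is a quick sanity check on any reported pair of values.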
TABLE 15.3
Hyper-Parameter Values Used for Training
Parameter            Value
Models               VGG19, ResNet50, DenseNet201
Loss Function        Binary Cross Entropy
Optimizer Function   Adam
Metrics              Accuracy
Epochs               25
Batch Size           32
Learning Rate        0.0001
15.4.3 Result Analysis
Figure 15.7 presents the confusion matrices and ROC curves obtained from the VGG19, ResNet50, and
DenseNet201 models. To test the models, a total of 145 microscopic images from the dataset were
used: 75 images of the IDC (−) class and 70 images of the IDC (+) class. The VGG19 pre-trained model
classified 127 images correctly and misclassified 18 images. Of the 127 images, 67 were classified
as IDC (−) (True Positive) and 60 were classified as IDC (+) (True Negative). The ResNet50 model
classified 116 images correctly and misclassified 29 images, so its classification performance is
poorer than that of the VGG19 model. Its performance is considered inferior mainly because it
confused a total of 22 IDC (+) images as IDC (−), a clinically dangerous error because malignant tissue
is reported as benign. The DenseNet201 pre-trained model outperformed both the VGG19 and ResNet50
models by classifying 140 images correctly and misclassifying only five. In this work, the ROC curve
is used to assess the capability of the binary classifier. The best area under the curve (AUC), 0.973,
was obtained by the DenseNet201 pre-trained model, indicating strong and stable discrimination between
the two classes. VGG19 also performed well with an AUC of 0.875, whereas the ROC curve of ResNet50
shows that this model is less stable.
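To make the AUC figures concrete, the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive image receives a higher score than a randomly chosen negative one, with ties counted as one half. This is an illustrative assumption about how AUC can be computed, not the procedure used in the study; the function name `roc_auc` and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for three IDC (+) and three IDC (-) images.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 1.0 means every positive is ranked above every negative, while 0.5 corresponds to random ranking.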
FIGURE 15.7 Confusion matrix and ROC curve for the three transfer learning models. (a) VGG19 model, (b) ResNet50 model, and (c)
DenseNet201 model.
The performance of the models is compared by calculating precision, recall, F1-Score, FPR, and FNR.
Table 15.4 presents these metrics in detail. The VGG19 pre-trained model achieved
an average precision of 87.62%, a recall of 86.93%, and an accuracy of 87.27%. The ResNet50 pre-trained
model performed well on IDC (−) images but poorly on IDC (+) images; it achieved an average
precision of 81.41%, a recall of 79.62%, and an accuracy of 79.61%. The DenseNet201 pre-trained
model outperformed the other two pre-trained models with higher values of these metrics, achieving
an average precision of 96.58%, a recall of 96.52%, and an accuracy of 96.55%. Its FPR is almost
zero and its TNR is close to one, which indicates the best performance among the three models.
A sample result of the DenseNet201 model classifying IDC images is presented in
Figure 15.8, where all the images are classified correctly.
TABLE 15.4
Classification Result Obtained from the Three Transfer Learning Models
TABLE 15.5
Comparison of Our Proposed Model with Other State-of-the-Art Models Using BreakHis Dataset
15.5.1 Discussion
In this study, the BreakHis dataset is used to detect whether breast tissues are of the IDC type. The
dataset contains a total of 7,907 microscopic breast tissue images. Of these, 3,832
microscopic images are used, comprising IDC (−) (benign tissue) and IDC (+) (malignant tissue)
breast images. Three transfer learning models, VGG19, ResNet50, and DenseNet201, are used in this
study for the early detection of the IDC type. The transfer learning models require less
computational power because they are already trained on the ImageNet database, which
contains millions of images. This allows researchers to use the pre-trained weights in their own
studies. In this study, we use the pre-trained weights of the transfer learning models but replace the
fully connected layers with a GlobalAveragePooling2D layer, two BatchNormalization layers, and three
Dense layers. We train only the replaced fully connected layers.
Several data augmentation techniques, including zooming, rotation, shearing, height and width
shifting, and horizontal and vertical flipping, are applied to increase the number of microscopic images.
Managing overfitting is a critical part of designing such models. Overfitting makes the model
unstable and less effective at classifying the test images, so an overfitted model will predict the
breast tissue type wrongly. Solving this issue requires proper tuning of the model. One solution is to
introduce dropout layers into the pre-trained model. A dropout layer randomly drops some of the
neurons during training, so that the model cannot rely on any single neuron. Figure 15.9 presents an
example of the dropout process. In the proposed fine-tuned pre-trained models, we use two dropout
layers to handle this issue. We can observe from Figure 15.6 that overfitting did not occur while
training the model. After comparing the models, the DenseNet201 pre-trained model achieved 96.55%
accuracy, which outperformed the other state-of-the-art models compared in Table 15.5.
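The dropout mechanism described above can be sketched in a few lines. This is a minimal illustration of the common "inverted dropout" formulation, in which surviving activations are rescaled so that their expected value is unchanged; it is not the layer implementation used in the study. The function name `dropout`, the rate of 0.5, and the activation values are hypothetical.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate`, rescale the rest.

    At inference time (training=False) the activations pass through unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)                     # fixed seed so the sketch is repeatable
hidden = [0.5, 1.2, 0.3, 0.8, 1.0, 0.1]    # hypothetical layer outputs
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)
```

Each training step draws a different random mask, so the network is effectively trained as an ensemble of thinned sub-networks, which is what discourages overfitting.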
15.5.2 Advantages
The key benefits of this research are summarized as follows:
Most of the existing literature is based on patch-wise breast cancer classification, whereas our study
is based on subject-wise classification into IDC (+) and IDC (−).
The transfer learning models provide segmentation-free feature extraction and, unlike
conventional machine learning methods, do not require any handcrafted feature
extraction.
A total of 3,832 microscopic images were used for IDC detection.
Less computational power and time are required to train the model.
Overfitting did not occur during the training of the model.
Only 25 epochs are required to achieve more than 95% accuracy.
The fine-tuned DenseNet201 pre-trained model outperformed the existing approaches reported in
the literature.
We trained our model for only 25 epochs. Further training may help the model achieve higher
accuracy.
More data augmentation strategies, including cropping, translation, and changes in color,
brightness, and saturation, could be used. Such strategies would increase the dataset size and may
help achieve higher accuracy.
More layers could be added during the fine-tuning of the pre-trained model, which would increase the
number of trainable parameters.
15.6 Conclusion
Over the past decades, medical imaging, e.g., magnetic
resonance (MR), computed tomography (CT), positron emission tomography (PET), mammography,
ultrasound, and X-ray, has become essential for the early identification, analysis, and treatment of
diseases. Most of these methods need human experts such as radiologists and pathologists. However,
because of the large variations in pathology and the probable fatigue of human experts, researchers and
doctors have recently started to benefit from computer-assisted diagnosis (CAD). CAD systems have in
turn been upgraded using machine learning and deep learning methods. However, machine
learning techniques need a handcrafted feature extractor, which is costly and time-consuming to design.
A deep-learning-based approach limits this issue by providing a segmentation-free approach. In this
study, automated invasive ductal carcinoma (IDC) detection is performed by extracting features from
microscopic images using deep transfer learning with the ResNet50, VGG19, and DenseNet201
pre-trained models. To classify IDC, the features extracted by the pre-trained models are fed
into fully connected layers. In the experiment, the DenseNet201 model achieved the highest accuracy,
96.55%, outperforming the other state-of-the-art models, whereas the ResNet50 and VGG19 pre-trained
models achieved 79.61% and 87.27% accuracy, respectively. Finally, the proposed fine-tuned
DenseNet201 pre-trained model could be a strong candidate for other biomedical tasks involving
biological samples.
REFERENCES
1. DeSantis CE, Fedewa SA, Goding Sauer A, Kramer JL, Smith RA, Jemal A. Breast cancer
statistics, 2015: Convergence of incidence rates between black and white women. CA: a cancer
journal for clinicians. 2016 Jan;66(1):31–42
2. Al Bataineh, Ali. “A comparative analysis of nonlinear machine learning algorithms for breast
cancer detection”. International Journal of Machine Learning and Computing 9, no. 3 (2019):
248–254.
3. Tan, Chuanqi, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. “A
survey on deep transfer learning”. In International Conference on Artificial Neural Networks, pp.
270–279. Springer, Cham, 2018.
4. Reddy, A. Sai Bharadwaj, and D. Sujitha Juliet. “Transfer learning with ResNet-50 for malaria
cell-image classification”. In 2019 International Conference on Communication and Signal
Processing (ICCSP), pp. 0945–0949. IEEE, 2019.
5. Rezende, Edmar, Guilherme Ruppert, Tiago Carvalho, Fabio Ramos, and Paulo De Geus.
“Malicious software classification using transfer learning of ResNet-50 deep neural network”. In
2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp.
1011–1014. IEEE, 2017.
6. Acien, Alejandro, Aythami Morales, Ruben Vera-Rodriguez, Ivan Bartolome, and Julian Fierrez.
“Measuring the gender and ethnicity bias in deep models for face recognition”. In Iberoamerican
Congress on Pattern Recognition, pp. 584–593. Springer, Cham, 2018.
7. Khan, Muhammad Attique, Muhammad Younus Javed, Muhammad Sharif, Tanzila Saba, and
Amjad Rehman. “Multi-model deep neural network based features extraction and optimal selection
approach for skin lesion classification”. In 2019 International Conference on Computer and
Information Sciences (ICCIS), pp. 1–7. IEEE, 2019.
8. Carvalho, Tiago, Edmar RS De Rezende, Matheus TP Alves, Fernanda KC Balieiro, and Ricardo
B. Sovat. “Exposing computer generated images by eye’s region classification via transfer learning
of VGG19 CNN”. In 2017 16th IEEE International Conference on Machine Learning and
Applications (ICMLA), pp. 866–870. IEEE, 2017.
9. Wen, Long, X. Li, Xinyu Li, and Liang Gao. “A new transfer learning based on VGG-19 network
for fault diagnosis”. In 2019 IEEE 23rd International Conference on Computer Supported
Cooperative Work in Design (CSCWD), pp. 205–209. IEEE, 2019.
10. Komura, Daisuke, and Shumpei Ishikawa. “Machine learning methods for histopathological image
analysis”. Computational and Structural Biotechnology Journal 16 (2018): 34–42.
11. Veta, Mitko, Paul J. Van Diest, Robert Kornegoor, André Huisman, Max A. Viergever, and Josien
PW Pluim. “Automatic nuclei segmentation in H&E stained breast cancer histopathology images”.
PLoS One 8, no. 7 (2013): e70221.
12. Basavanhally, Ajay, Shridar Ganesan, Michael Feldman, Natalie Shih, Carolyn Mies, John
Tomaszewski, and Anant Madabhushi. “Multi-field-of-view framework for distinguishing tumor
grade in ER+ breast cancer from entire histopathology slides”. IEEE Transactions on Biomedical
Engineering 60, no. 8 (2013): 2089–2099.
13. Punitha, S., A. Amuthan, and K. Suresh Joseph. “Benign and malignant breast cancer segmentation
using optimized region growing technique”. Future Computing and Informatics Journal 3, no. 2
(2018): 348–358.
14. Zeebaree, Diyar Qader, Habibollah Haron, Adnan Mohsin Abdulazeez, and Dilovan Asaad Zebari.
“Machine learning and region growing for breast cancer segmentation”. In 2019 International
Conference on Advanced Science and Engineering (ICOASE), pp. 88–93. IEEE, 2019.
15. Badawy, Samir M., Alaa A. Hefnawy, Hassan E. Zidan, and Mohammed T. GadAllah. “Breast
cancer detection with mammogram segmentation: A qualitative study”. International Journal of
Advanced Computer Science and Application 8, no. 10 (2017).
16. Yildirim, Ozal, Ru San Tan, and U. Rajendra Acharya. “An efficient compression of ECG signals
using deep convolutional autoencoders”. Cognitive Systems Research 52 (2018): 198–211.
17. Araújo, Teresa, Guilherme Aresta, Eduardo Castro, José Rouco, Paulo Aguiar, Catarina Eloy,
António Polónia, and Aurélio Campilho. “Classification of breast cancer histology images using
convolutional neural networks”. PLoS One 12, no. 6 (2017): e0177544.
18. Kandel, Ibrahem, and Mauro Castelli. “A novel architecture to classify histopathology images
using convolutional neural networks”. Applied Sciences 10, no. 8 (2020): 2929.
19. Rahman, Md Jamil-Ur, Rafi Ibn Sultan, Firoz Mahmud, Sazid Al Ahsan, and Abdul Matin.
“Automatic system for detecting invasive ductal carcinoma using convolutional neural networks”.
In TENCON 2018-2018 IEEE Region 10 Conference, pp. 0673–0678. IEEE, 2018.
20. Cruz-Roa, Angel, Hannah Gilmore, Ajay Basavanhally, Michael Feldman, Shridar Ganesan,
Natalie Shih, John Tomaszewski, Anant Madabhushi, and Fabio González. “High-throughput
adaptive sampling for whole-slide histopathology image analysis (HASHI) via convolutional
neural networks: Application to invasive breast cancer detection”. PLoS One 13, no. 5 (2018):
e0196828.
21. Guan, Shuyue, and Murray Loew. “Breast cancer detection using synthetic mammograms from
generative adversarial networks in convolutional neural networks”. Journal of Medical Imaging 6,
no. 3 (2019): 031411.
22. Talo, Muhammed. “Automated classification of histopathology images using transfer learning”.
Artificial Intelligence in Medicine 101 (2019): 101743.
23. Celik Y, Talo M, Yildirim O, Karabatak M, Acharya UR. Automated invasive ductal carcinoma
detection based using deep transfer learning with whole-slide images. Pattern Recognition Letters.
2020 May 1;133:232–239.
24. Khan, SanaUllah, Naveed Islam, Zahoor Jan, Ikram Ud Din, and Joel JPC. Rodrigues. “A novel
deep learning based framework for the detection and classification of breast cancer using transfer
learning”. Pattern Recognition Letters 125 (2019): 1–6.
25. Vesal, Sulaiman, Nishant Ravikumar, AmirAbbas Davari, Stephan Ellmann, and Andreas Maier.
“Classification of breast cancer histology images using transfer learning”. In International
Conference Image Analysis and Recognition, pp. 812–819. Springer, Cham, 2018.
26. Kassani, Sara Hosseinzadeh, Peyman Hosseinzadeh Kassani, Michal J. Wesolowski, Kevin A.
Schneider, and Ralph Deters. “Breast cancer diagnosis with transfer learning and global pooling”.
arXiv preprint arXiv:1909.11839, 2019.
27. BreakHis dataset. Available at:
https://web.inf.ufpr.br/vri/databases/breast-cancer-histopathological-database-breakhis/
[Online Accessed: 30 September, 2020].
28. Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale
image recognition”. arXiv preprint arXiv:1409.1556, 2014.
29. He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image
recognition”. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 770–778, 2016.
30. Huang, Gao, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. “Densely
connected convolutional networks”. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 4700–4708, 2017.
31. Hao, Wangli, and Zhaoxiang Zhang. “Spatiotemporal distilled dense-connectivity network for
video action recognition”. Pattern Recognition 92 (2019): 13–24.
32. Nejad, Elaheh Mahraban, Lilly Suriani Affendey, Rohaya Binti Latip, and Iskandar Bin Ishak.
“Classification of histopathology images of breast into benign and malignant using a single-layer
convolutional neural network”. In Proceedings of the International Conference on Imaging, Signal
Processing and Communication, pp. 50–53, 2017.
33. Zhi, Weiming, Henry Wing Fung Yueng, Zhenghao Chen, Seid Miad Zandavi, Zhicheng Lu, and
Yuk Ying Chung. “Using transfer learning with convolutional neural networks to diagnose breast
cancer from histopathological images”. In International Conference on Neural Information
Processing, pp. 669–676. Springer, Cham, 2017.
34. Song, Yang, Ju Jia Zou, Hang Chang, and Weidong Cai. “Adapting fisher vectors for
histopathology image classification”. In 2017 IEEE 14th International Symposium on Biomedical
Imaging (ISBI 2017), pp. 600–603. IEEE, 2017.
35. Nahid, Abdullah-Al, Mohamad Ali Mehrabi, and Yinan Kong. “Histopathological breast cancer
image classification by deep neural network techniques guided by local clustering”. BioMed
Research International (2018): 1–20.
36. Mehra, Rajesh. “Breast cancer histology images classification: Training from scratch or transfer
learning?” ICT Express 4, no. 4 (2018): 247–254.
16
Prediction of Acoustic Performance Using
Machine Learning Techniques
CONTENTS
16.1 Introduction
16.2 Materials and Methods
16.3 Proposed Methodology
16.3.1 Step 1: Data Preprocessing
16.3.2 Step 2: Fitting Regression Model
16.3.3 Building a Backward Elimination Model
16.3.4 Building the Model Using Forward Selection Model
16.3.5 Step 3: Optimizing the Regressor Model—Mean Squared
Error
16.3.6 Step 4: Understanding the Results and Cross Validation
16.3.7 Step 5: Deployment and Optimization
16.3.7.1 Structural Parameters of Each Layer Material Is
Shown in Table 16.1
16.4 Results and Discussions
16.4.1 Error Analysis and Validating Model Performance for All
Test Samples
16.5 Conclusion
References
16.1 Introduction
Noise and vibration are detrimental in various mechanical systems, such as
home appliances, automobiles, and different infrastructures. These harmful
effects can be effectively controlled using sound-absorbing materials. Such
absorptive materials play a vital role in converting the impinging acoustic
energy into heat. Since the energy contained in sound waves is typically
small, the amount of heat produced is negligible. The use of sandwich
panels therefore plays a key role in the reduction of noise in different
applications. Acoustic properties can be strengthened along with mechanical
properties by using natural fiber reinforcements. Overall, natural fibers
have a relatively higher sound absorption capability than
petrochemical-based fibers. The existing literature [1] shows that acoustic
properties can be enhanced by employing various macro-sized reinforcement
fibers, which affect the pore size, pore number, and porosity [2].
Conventional absorbent materials such as polyurethane foam, mineral fiber
composites, and glass fiber felts can be used for soundproofing [3].
Polyurethane (PU) foams display high sound and energy absorption, which
motivated us to prepare a multi-layer sandwich foam for noise control
applications [4]. Because open-celled materials are poor absorbers at low
sound frequencies, fibrous, foam, and granular materials are preferred as
good sound-absorbing materials [5]. Porosity, pore size, pore opening,
thickness, and static flow resistivity are some of the parameters that
affect the sound absorption coefficient [6]. Furthermore, researchers have
made composite materials with natural or artificial fibers [7]. The
impedance tube-based transfer function method is used to evaluate the sound
absorption coefficients of a polyurethane foam sample, and the ASTM E 1050
standard is followed for the test setup [8].
Although analytical models have bridged the gap between experimental
and theoretical prediction, a number of studies [9] have shown that soft
computing techniques play a prominent role in predicting the sound
absorption coefficients of different materials. The fields of materials
synthesis and characterization have also benefited to a large extent from
machine learning techniques and AI methods for composite materials [10].
These methods help solve complex problems efficiently because of their
optimization capability and their specialized algorithms, which open up
many possibilities for researchers [11].
In this study, the sound absorption coefficient is predicted using machine
learning regression models that take thickness, area density, porosity, and
pore size as inputs and estimate the sound absorption coefficient of each
combination. The layers of material consist of polyurethane foam and wood
powder, whose percentage varies between 5% and 15%.
This makes it possible to assess the acoustic performance of a material for
various arrangements without the need to perform acoustic measurements [12].
Furthermore, numerous regression analyses have been performed on
composites. Dong [13] studied dimension variation prediction for
composites with finite element analysis and regression modeling. Multiple
linear regression (MLR) is based on a linear least-squares fitting process,
which requires a trace element or property for determination in each source
or source category [14]. Phusavat and Aneksitthisin [15] applied MLR to
establish the interrelationship among productivity, price recovery, and
profitability, and the multiple linear regression model demonstrated these
interrelationships explicitly. On the basis of the results obtained from
both ANN and regression analysis, the various properties of fabric can be
predicted accurately [16]. The most commonly used criterion for evaluating
model performance is the coefficient of determination (R²), which is
not only robust but also an easy and reliable measure of the
performance of a model.
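As a concrete illustration of the coefficient of determination, the sketch below computes R² as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the measured and predicted values are hypothetical; they are not data from this study.

```python
def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(measured) != len(predicted):
        raise ValueError("inputs must have the same length")
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("measured values must not all be identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical sound absorption coefficients for four samples.
measured = [0.10, 0.20, 0.40, 0.70]
predicted = [0.12, 0.18, 0.43, 0.67]
r2 = r_squared(measured, predicted)
print(f"R^2 = {r2:.4f}")
```

A value close to one means the model explains most of the variance in the measured values; a value near zero means it does no better than predicting the mean.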
The objective of our work is to estimate the sound absorption coefficient
of the composite material by using the most optimal regression model.
\phi = 1 - \frac{RT}{V_t}\left[\frac{m_2 - m_1}{P_2 - P_1} - \frac{m_4 - m_3}{P_4 - P_3}\right]
\alpha = 1 - |R|^2
H12 is the acoustic transfer function, and p1 and p2 are the acoustic
pressures measured by the two microphones (see Figure 16.1).
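The relation between the sound absorption coefficient and the reflection coefficient, α = 1 − |R|², can be evaluated as follows. This is a minimal sketch that assumes the complex reflection coefficient R has already been obtained from the measured transfer function; the function name `absorption_coefficient` and the value of R are hypothetical.

```python
def absorption_coefficient(reflection):
    """Sound absorption coefficient from a complex pressure reflection coefficient R."""
    return 1.0 - abs(reflection) ** 2


# Hypothetical reflection coefficient with magnitude 0.5.
reflection = complex(0.3, -0.4)
alpha = absorption_coefficient(reflection)
print(f"alpha = {alpha:.2f}")
```

A fully reflecting surface (|R| = 1) gives α = 0, and a surface with no reflection (R = 0) gives α = 1.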
The process can be carried out with the following steps: data pre-
processing, building the model, and fine-tuning and deployment as shown in
Figure 16.2.
TABLE 16.1
Structural Parameter of Each Layer Material
TABLE 16.2
Sound Absorption Values
Index  Sample Code  α125  α250  α500  α1000  α2000  α4000  ᾱ
FIGURE 16.3 Mean square error (MSE) at various frequencies for multiple linear regression—“all-
in-one approach”.
FIGURE 16.4 Measured and predicted sound absorption coefficients for a random test sample—
ADBD – “all-in-one approach”.
The significance level assumed for the study is 10%. We first kept all
16 independent variables and then eliminated the unimportant features
based on their p-values. The number of relevant features differed across
frequencies: for α125, α250, α500, α1000, α2000, and α4000, it was reduced
to 4, 3, 7, 4, 6, and 6, respectively, after removing the features whose
p-values exceeded the significance level. We then re-trained the regression
model using the relevant features found by this backward elimination
method; the results are summarized below.
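The backward elimination loop described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `fit_p_values` is a hypothetical callable that refits the regression on the given features and returns a p-value per feature, and here it is replaced by a stand-in that returns fixed, made-up p-values so the example runs without a statistics library.

```python
def backward_elimination(features, fit_p_values, significance=0.10):
    """Repeatedly drop the feature with the largest p-value until all are significant."""
    selected = list(features)
    while selected:
        p_values = fit_p_values(selected)          # refit on the current feature set
        worst = max(selected, key=lambda f: p_values[f])
        if p_values[worst] <= significance:
            break                                  # every remaining feature is significant
        selected.remove(worst)
    return selected


# Stand-in for a real regression fit: fixed, hypothetical p-values.
FIXED_P = {"thickness": 0.01, "area_density": 0.04, "porosity": 0.30, "pore_size": 0.08}


def fake_fit(selected):
    return {name: FIXED_P[name] for name in selected}


kept = backward_elimination(list(FIXED_P), fake_fit, significance=0.10)
print(kept)
```

In a real fit the p-values change every time a feature is removed, which is why the model is refitted inside the loop rather than once at the start.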
Improvements in prediction and reduction in mean square error (MSE)
were observed. Figure 16.5 depicts the improvement in prediction of the
sound absorption coefficients for our random sample—ADBD.
FIGURE 16.5 Measured and predicted sound absorption coefficients for a random test sample—
ADBD “Backward Elimination approach”.
FIGURE 16.9 The measured and predicted sound absorption coefficient at 125 Hz for forward
selection.
TABLE 16.3
Average Mean Square Errors of Test Samples at Each Central Frequency
Sound Absorption Coefficient  Average Mean Square Error (MSE)
α125   0.0000675
α250   0.00098
α500   0.0037
α1000  0.0048
α2000  0.000807
α4000  0.0028
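The averages in a table of this kind come from the usual mean square error between measured and predicted coefficients. The sketch below shows the calculation for one central frequency; the function name `mean_squared_error` and the three sample values are hypothetical and are not taken from the study.

```python
def mean_squared_error(measured, predicted):
    """Average of the squared differences between measured and predicted values."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)


# Hypothetical coefficients at one frequency for three test samples.
measured = [0.25, 0.31, 0.28]
predicted = [0.26, 0.29, 0.28]
mse = mean_squared_error(measured, predicted)
print(f"MSE = {mse:.6f}")
```

Because the absorption coefficient lies between 0 and 1, an MSE of the order of 0.001 corresponds to typical errors of a few hundredths in the coefficient itself.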
16.5 Conclusion
This research work studied the ability of different models to predict
sound absorption coefficient values. Linear regression-based models were
found to be well suited to this task. The detailed analysis shows that the
predicted values closely match the experimental values measured in the lab
setup. The mean square error is also small, staying below 0.005 at every
central frequency (Table 16.3). This study could be extended to composites
with different layers, and other techniques will be explored in the future.
REFERENCES
1. Sung, G., Kim, J. W., & Kim, J. H. (2016). “Fabrication of
polyurethane composite foams with magnesium hydroxide filler for
improved sound absorption”. Journal of Industrial and Engineering
Chemistry, 44, 99–104.
2. Agrawal, A., Kaur, R., & Walia, R. S. (2017). “PU foam derived from
renewable sources: Perspective on properties enhancement: An
overview”. European Polymer Journal, 95, 255–274.
3. Liu, J., Bao, W., Shi, L., Zuo, B., & Gao, W. (2014). “General
regression neural network for prediction of sound absorption
coefficients of sandwich structure nonwoven absorbers”. Applied
Acoustics, 76, 128–137.
4. Yuvaraj, L., Jeyanthi, S., Thomas, N. S., & Rajeev, V. (2020). “An
experimental investigation on the mechanical and acoustic properties of
silica gel reinforced sustainable foam”. Materials Today: Proceedings,
27, 2293–2296.
5. Lv, Z., Li, X., & Yu, X. (2012). “The effect of chain extension method
on the properties of polyurethane/SiO2 composites”. Materials &
Design, 35, 358–362.
6. Guan, D., Wu, J. H., Wu, J., Li, J., & Zhao, W. (2015). “Acoustic
performance of aluminum foams with semiopen cells”. Applied
Acoustics, 87, 103–108.
7. Sekar, Vignesh, Fouladi, Mohammad, Namasivayam, Satesh, &
Sivanesan, Sivakumar. (2019). “Additive manufacturing: A novel
method for developing an acoustic panel made of natural fiber-
reinforced composites with enhanced mechanical and acoustical
properties”. Journal of Engineering, 2019, 1–19.
8. Seybert, A. (2010). “Notes on absorption and impedance
measurements”. ASTM E1050, 1–6.
9. Buratti, C., Barelli, L., & Moretti, E. (2013). “Wooden windows: Sound
insulation evaluation by means of artificial neural networks”. Applied
Acoustics, 74(5), 740–745.
10. Kumar, S., Batish, A., Singh, R., & Singh, T. P. (2014). “A hybrid
Taguchi-artificial neural network approach to predict surface roughness
during electric discharge machining of titanium alloys”. Journal of
Mechanical Science and Technology, 28(7), 2831–2844.
11. Iannace, G., Ciaburro, G., & Trematerra, A. (2019). “Fault diagnosis
for UAV blades using artificial neural network”. Robotics, 8(3), 59.
12. Iannace, G., Ciaburro, G., & Trematerra, A. (2020). “Modelling sound
absorption properties of broom fibers using artificial neural networks”.
Applied Acoustics, 163, 107239.
13. Dong, Chensong, Zhang, Chuck, Liang, Zhiyong, & Wang, Ben.
(2004). “Dimension variation prediction for composites with finite
element analysis and regression modeling”. Composites Part A:
Applied Science and Manufacturing, 35(6), 735–746.
14. Henry, R., Lewis, C., Hopke, P., & Williamson, H. (1984). “Review of
receptor model fundamentals”. Atmospheric Environment (1967),
18(8), 1507–1515.
15. Phusavat, K., & Aneksitthisin, E. (2000). “Interrelationship among
profitability, productivity and price recovery: Lessons learned from a
wood-furniture company”. Proceedings of industrial engineering
network, Petchaburi, Thailand.
16. Ogulata, S., Sahin, C., Ogulata, T., & Balci, O. (2006).
“The prediction of elongation and recovery of woven bi-stretch fabric
using artificial neural network and linear regression models”. Fibres
and Textiles in Eastern Europe, 14.
Section IV
Issue and Challenges in Data
Science and Data Analytics
17
Feedforward Multi-Layer Perceptron
Training by Hybridized Method between
Genetic Algorithm and Artificial Bee Colony
CONTENTS
17.1 Introduction
17.2 Nature-Inspired Metaheuristics
17.3 Genetic Algorithm Overview
17.4 Proposed Hybridized GA Metaheuristic
17.5 MLP Training by GGEABC
17.6 Simulation Setup and Results
17.7 Conclusion
References
17.1 Introduction
Artificial neural networks (ANNs) are among the most advanced methods in
the machine learning domain. Machine learning (ML) is an area of artificial
intelligence (AI) that enables computer systems to learn from data. The
field has grown rapidly in the last decade because of the increase in
training speed brought by GPUs and deep learning. ANNs are biologically
inspired algorithms that mimic the learning process of the human brain.
ANNs are widely used in different areas, such as regression,
classification, pattern recognition, and different forecasting
problems [1–5].
An ANN’s architecture starts with the input layer and ends with the output
layer, which in classification predicts the class. One or more hidden
layers lie between the input and output layers. The input layer contains as
many nodes (units) as there are input features, the number of nodes in each
hidden layer is a hyperparameter that should be tuned, and the output layer
has as many nodes as there are classes to predict. The units in adjacent
layers are connected by modifiable connection weights. The bias is an
additional parameter that adjusts the weighted sum of the inputs and shifts
the activation function result toward the negative or positive side. The
output value of one neuron is calculated by applying an activation function
to the weighted sum plus the bias, as follows:
z = a(W x + b) (17.1)
where z expresses the output value, a is the activation function, x denotes
the input, W denotes the weights, and b is the bias term.
A widely used activation function is the sigmoid; other common activation
functions are tanh and the Rectified Linear Unit (ReLU) [6].
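As an illustration of eq. (17.1), the following minimal Python sketch computes the output of a single neuron with a sigmoid, tanh, or ReLU activation. The function names and the sample weights, inputs, and biases are hypothetical and chosen only for the example.

```python
import math


def sigmoid(v):
    """Logistic sigmoid, maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def relu(v):
    """Rectified Linear Unit, max(0, v)."""
    return max(0.0, v)


def neuron_output(weights, inputs, bias, activation=sigmoid):
    """Eq. (17.1) for one neuron: z = a(W x + b)."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Illustrative values only: the weighted sum is 0.5*1.0 - 0.25*2.0 = 0.0
z_sig = neuron_output([0.5, -0.25], [1.0, 2.0], 0.0)
z_tanh = neuron_output([0.5, -0.25], [1.0, 2.0], 0.0, activation=math.tanh)
z_relu = neuron_output([0.5, -0.25], [1.0, 2.0], -1.0, activation=relu)
```

With a weighted sum of zero, the sigmoid returns 0.5 and tanh returns 0, which shows how the bias shifts the activation toward the positive or negative side.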
The values of the weights and biases must be determined, which is an
optimization problem. For multi-layer perceptron (MLP) neural networks,
there are two widely used approaches to optimizing the connection weights:
gradient-based training and stochastic optimization algorithms.
Backpropagation [7] is a gradient-based algorithm whose main disadvantage
is that it can get trapped in local minima. Other optimizers based on
stochastic gradient descent (SGD) include momentum, RMSprop, Adam,
AdaDelta, Adamax, and AdaGrad [8, 9, 10].
To address the issue of getting trapped in local minima, in this study we
introduce a model that optimizes the connection weights of an MLP with a
hybridized metaheuristic algorithm. Connection weight optimization belongs
to the NP-hard class, for which computing the optimal solution with exact
optimization techniques is intractable. Metaheuristic approaches give
satisfactory results on NP-hard optimization problems. Thus, in this paper,
we develop and present an improved metaheuristic algorithm that hybridizes
the artificial bee colony (ABC) algorithm and the genetic algorithm (GA).
17.2 Nature-Inspired Metaheuristics
Metaheuristic algorithms are inspired by nature and represent a field of
stochastic optimization. Stochastic optimization methods employ some
level of randomness and search for optimal or near-optimal solutions.
Metaheuristic algorithms are powerful in optimization problems where the
search space is so large that exact, brute-force search methods would fail.
Brute-force methods exhaustively explore all the possibilities in the
search space, which takes too much time; metaheuristic techniques instead
use randomization to explore the search space. Each metaheuristic algorithm
has two main phases: intensification and diversification. Diversification
(also called exploration) is responsible for exploring the search space
globally, while intensification (also called exploitation) performs the
local search around the current fittest solution. To achieve good results,
it is very important to strike a good balance between intensification and
diversification. The two main categories of metaheuristic algorithms are
swarm intelligence (SI) and evolutionary algorithms (EAs).
Swarm intelligence algorithms are motivated by the group (collective)
intelligence found in nature, where the behavior of the overall population
leads to intelligent global action. Numerous swarm-intelligence-based
algorithms have been introduced with various designs, but all of them share
certain characteristics. First, the population is generated randomly in the
initialization phase. Randomization is used to explore the search space
and to avoid becoming stuck in local optima. The location of each solution
is changed in each iteration, and when the termination condition is met,
the algorithm outputs the best solution. The position update allows
the system to evolve and to converge toward the global optimum.
Swarm-intelligence-based metaheuristic approaches have been successfully
applied in various areas, such as convolutional neural network
architecture design in deep learning [11], image clustering [12–14], and
task (user request) scheduling in cloud computing [15].
EAs are motivated by the biological evolution process and have three
sub-categories, namely, Evolutionary Programming [16], Evolutionary
Strategies [17], and Genetic Algorithms [18].
Each evolutionary algorithm incorporates the following elements:
solution definition, random initialization of the population, fitness function
evaluation, reproduction, selection, replacement strategy, and termination
criteria. Metaheuristic hybridization is a very successful category of
metaheuristic study. A hybrid algorithm is a parallel or distributed
implementation of two or more algorithms. It blends the benefits of various
metaheuristic algorithms by combining algorithmic ingredients from distinct
optimization methods, or by mixing metaheuristics with various methods
from artificial intelligence.
There are many successful implementations of hybridized metaheuristics
in different fields. In deep learning, hybrid metaheuristic variations have
been applied to convolutional neural network architecture optimization
[19–21] and to dropout probability estimation in convolutional neural
networks [22]. They have also been applied to different problems in cloud
computing [23,24] and to wireless sensor network optimization [25].
The genetic algorithm has some disadvantages: its performance depends on
the selection of the mutation and crossover rates, and fast convergence can
lead to premature convergence.
p_i = \frac{f(x_i)}{\sum_{j=1}^{n} f(x_j)}, \qquad f(x_i) = \frac{1}{1 + f_i} (17.4)
where p_i denotes the probability, and f(x_i) indicates the fitness value
of the ith individual.
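The selection probability in eq. (17.4) can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's implementation: it assumes that f_i is a non-negative objective value (for example an error to be minimized), and the roulette-wheel helper `roulette_select` is a hypothetical name added to show how the probabilities might be used.

```python
import random


def fitness(objective_value):
    """Second part of eq. (17.4): 1 / (1 + f_i).

    Assumes objective_value >= 0, so a lower objective gives a higher fitness.
    """
    return 1.0 / (1.0 + objective_value)


def selection_probabilities(objective_values):
    """First part of eq. (17.4): each fitness divided by the total fitness."""
    fits = [fitness(v) for v in objective_values]
    total = sum(fits)
    return [f / total for f in fits]


def roulette_select(probabilities, rng=random):
    """Pick an index with chance proportional to its probability."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if r < cumulative:
            return index
    return len(probabilities) - 1  # guard against floating-point round-off


# Illustrative objective values: fitness values are 1, 0.5, and 0.25
probs = selection_probabilities([0.0, 1.0, 3.0])
chosen = roulette_select(probs)
```

Solutions with a lower objective value receive a larger share of the wheel, so they are selected more often without excluding weaker solutions entirely.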
After the position of each solution is updated, the greedy selection
mechanism is applied between the old and new solutions: the fitness value
of the new solution is evaluated and compared with the fitness of the old
solution. The new individual replaces the old one if its fitness value is
better; otherwise, the old individual is retained.
If the solution of an employed bee shows no improvement over the course of
the iterations and reaches the defined number of trials, the scout bee
generates a new random solution.
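The greedy selection and the scout step described above can be sketched as follows. The sketch makes several assumptions that are not taken from the chapter: a lower objective value counts as better, the trial counter is reset after a replacement, and the two steps are merged into one hypothetical helper, `greedy_select`, for brevity. The bounds of -1 and 1 follow the initialization range given in the simulation setup.

```python
import random


def greedy_select(old, new, objective, trials, limit,
                  lower=-1.0, upper=1.0, rng=random):
    """Return (solution, trials) after greedy selection and the scout check.

    old, new  : candidate solutions as lists of floats
    objective : callable returning a value to minimize (assumption)
    trials    : number of consecutive non-improving attempts so far
    limit     : trial limit that triggers the scout bee (hypothetical value)
    """
    if objective(new) < objective(old):
        return new, 0  # the better solution replaces the old one
    trials += 1
    if trials >= limit:
        # Scout bee: abandon the solution and draw a new random one
        scout = [rng.uniform(lower, upper) for _ in old]
        return scout, 0
    return old, trials  # keep the old solution and remember the failure


def sphere(solution):
    """Toy objective used only for this illustration."""
    return sum(v * v for v in solution)


kept, count = greedy_select([0.5, 0.5], [1.0, 1.0], sphere, trials=0, limit=5)
```

In the call above the new solution is worse, so the old one is kept and its trial counter increases to 1.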
The proposed algorithm is named genetically guided enhanced artificial bee
colony, GGEABC for short, and Algorithm 17.1 presents the details of the
pseudocode.
n_w = (n_x \times n_h + n_h) + (n_h \times n_o + n_o) (17.5)
where n_w denotes the length of a solution, n_x denotes the input feature
size, n_h indicates the number of hidden units, and n_o denotes the output
layer size.
The solution encoding is presented in Figure 17.3.
FIGURE 17.3 Solution encoding.
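The solution length in eq. (17.5) and one possible decoding of a flat solution vector can be sketched as follows. The ordering used here (hidden weights, hidden biases, output weights, output biases) is an assumption for illustration; the actual layout is the one shown in Figure 17.3. The function names are hypothetical.

```python
def solution_length(n_x, n_h, n_o):
    """Eq. (17.5): weights and biases of one hidden layer plus the output layer."""
    return (n_x * n_h + n_h) + (n_h * n_o + n_o)


def decode(solution, n_x, n_h, n_o):
    """Split a flat solution into (w_hidden, b_hidden, w_out, b_out).

    The segment order is an assumed layout, not necessarily the chapter's.
    """
    if len(solution) != solution_length(n_x, n_h, n_o):
        raise ValueError("solution length does not match the network size")
    pos = 0
    # One row of n_x input weights per hidden unit
    w_hidden = [solution[pos + j * n_x: pos + (j + 1) * n_x] for j in range(n_h)]
    pos += n_x * n_h
    b_hidden = solution[pos: pos + n_h]
    pos += n_h
    # One row of n_h hidden weights per output unit
    w_out = [solution[pos + k * n_h: pos + (k + 1) * n_h] for k in range(n_o)]
    pos += n_h * n_o
    b_out = solution[pos: pos + n_o]
    return w_hidden, b_hidden, w_out, b_out


# Illustrative network: 2 inputs, 3 hidden units, 1 output
n_w = solution_length(2, 3, 1)
parts = decode(list(range(n_w)), 2, 3, 1)
```

For this small network the solution has 13 entries: 6 hidden weights, 3 hidden biases, 3 output weights, and 1 output bias.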
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 (17.6)
where n denotes the number of instances, y_i denotes the actual value, and
\hat{y}_i denotes the predicted value.
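The mean squared error of eq. (17.6) is straightforward to compute. The following Python sketch uses a hypothetical function name and made-up target and prediction values.

```python
def mean_squared_error(y_true, y_pred):
    """Eq. (17.6): average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Illustrative values: squared errors are 0.01, 0.04, 0.04, and 0.16
mse = mean_squared_error([1, 0, 1, 1], [0.9, 0.2, 0.8, 0.6])
```

A lower MSE means the network outputs are closer to the targets, so a metaheuristic that minimizes this value is searching for better connection weights.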
Steps of the algorithm’s procedure:
X_{norm} = \frac{X_i - X_{min}}{X_{max} - X_{min}} (17.7)
where X_{norm} denotes the normalized value, X_i denotes the ith input
feature, and X_{max} and X_{min} denote the maximum and minimum values of
the corresponding feature, respectively.
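Min-max normalization as in eq. (17.7) can be sketched for one feature column as follows. Returning zeros for a constant feature is an assumption made here to avoid division by zero; the chapter does not state how that case is handled.

```python
def min_max_normalize(column):
    """Eq. (17.7): rescale the values of one feature to the range [0, 1]."""
    x_min, x_max = min(column), max(column)
    if x_max == x_min:
        # Constant feature: every value maps to 0.0 (assumption)
        return [0.0 for _ in column]
    span = x_max - x_min
    return [(x - x_min) / span for x in column]


# Illustrative feature values
normalized = min_max_normalize([10, 20, 30])
```

The smallest value maps to 0, the largest to 1, and the remaining values are spread linearly in between.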
Two-thirds of the entire dataset is used for training, while the remaining
one-third is used for testing the model. We used five different metrics to
evaluate the proposed approach: accuracy (acc.), specificity (spec.),
sensitivity (sens.), geometric mean (g-mean), and area under the curve
(AUC). These metrics are popular in the field of medicine, and the
equations are formulated as:
sens = \frac{TP}{TP + FN} (17.10)
AUC = \frac{1}{(TP + FP)(TN + FN)} \int_{0}^{1} TP \, dFP (17.12)
where TP denotes the true positive, TN the true negative, FP the false
positive, and FN the false negative values from the confusion matrix.
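The confusion-matrix metrics can be computed with a short Python sketch. It uses the standard definitions of accuracy, sensitivity, specificity, and geometric mean, which are assumed here to match the chapter's usage; AUC is omitted because it requires the full ranking of predictions rather than a single confusion matrix. The function name and the counts are hypothetical.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "acc": accuracy,
        "sens": sensitivity,
        "spec": specificity,
        "gmean": g_mean,
    }


# Illustrative counts for 100 test instances
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

The geometric mean is useful on imbalanced medical datasets because it stays low whenever either sensitivity or specificity is low.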
The configuration setup follows the work in [35], and the simulation
results are compared with the results of the algorithms reported in [35].
The size of the population is set to 50, and each solution is initialized
between –1 and 1. To show consistency, the proposed GGEABC is run 30 times,
with 250 generations in each run. Table 17.1 lists the control
parameters and the corresponding values.
TABLE 17.1
GGEABC's Control Parameters
Tables 17.2–17.4 present the best, mean, and worst statistical
results and comparisons with other metaheuristic approaches on the same
datasets.
TABLE 17.2
Results of the Diabetes Dataset
TABLE 17.3
Breast Cancer Dataset Results
TABLE 17.4
Results of the SAheart Dataset
Figure 17.4 compares the best results of the algorithms on the five
metrics.
FIGURE 17.4 Algorithm comparison of best results on five metrics.
17.7 Conclusion
This study proposes an approach for feedforward MLP training that
hybridizes the GA and ABC algorithms. The approach is named GGEABC, and it
optimizes the connection weights in the MLP. The objective is to find
connection weight values that result in a low test error rate. For
performance evaluation, we used three well-known medical benchmark datasets
(diabetes, breast cancer, and SAheart) and five different metrics, namely,
accuracy, specificity, sensitivity, geometric mean (g-mean), and AUC. The
obtained results are compared with those of similar metaheuristic-based
methods. The simulation results indicate the robustness and efficiency of
GGEABC, which outperformed the other nine approaches. Hybridizing the
genetic algorithm with the swarm-intelligence-based ABC algorithm strikes a
good trade-off between the diversification and intensification phases,
which helps the search avoid getting trapped in local optima, converge
quickly, and achieve a high accuracy rate.
Based on the findings of this work, we can conclude that GGEABC is very
competitive with current approaches.
In future work, we plan to include other large-scale datasets, to optimize
other hyperparameters of the network, and to apply the method to image
datasets and to architecture optimization.
Acknowledgment
The paper is supported by the Ministry of Education, Science and
Technological Development of the Republic of Serbia, Grant No. III-44006,
and the Science Fund of the Republic of Serbia, Grant No. 6524745,
AI-DECIDE.
REFERENCES
1. Jürgen Schmidhuber. “Deep learning in neural networks: An
overview”. Neural Networks, 61: 85–117, 2015.
2. M. Braik, A. Sheta, and Amani Arieqat. “A comparison between GAs
and PSO in training ANN to model the TE chemical process reactor”.
Proceedings of the AISB 2008 symposium on swarm intelligence
algorithms and applications, Vol. 11, 2008.
3. C. Nightingale, D. J. Myers, and R. Linggard. “Introduction”. In Neural
Networks for Vision, Speech and Natural Language, pages 1–4.
Springer Netherlands, Dordrecht, 1992.
4. Sankhadeep Chatterjee, Sarbartha Sarkar, Sirshendu Hore, Nilanjan
Dey, Amira S Ashour, and Valentina E Balas. “Particle swarm
optimization trained neural network for structural failure prediction of
multistoried RC buildings”. Neural Computing and Applications,
28(8): 2005–2016, 2017.
5. Amir Mosavi, Pinar Ozturk, and Kwokwing Chau. “Flood prediction
using machine learning models: Literature review”. Water, 10(11):
1536, 2018.
6. Vinod Nair and Geoffrey E. Hinton. “Rectified linear units improve
restricted Boltzmann machines”. In Proceedings of the 27th
International Conference on Machine Learning, ICML’10, pages 807–
814, USA, 2010. Omnipress.
7. David E. Rumelhart, Geoffrey E Hinton, and Ronald J Williams.
“Learning representations by back-propagating errors”. Nature,
323(6088): 533–536, 1986.
8. John C. Duchi, Elad Hazan, and Yoram Singer. “Adaptive subgradient
methods for online learning and stochastic optimization”. Journal of
Machine Learning Research, 12: 2121–2159, 2011.
9. Matthew D. Zeiler. “Adadelta: An adaptive learning rate method”.
arXiv preprint arXiv:1212.5701, 2012.
10. Diederik P. Kingma and Jimmy Ba. “Adam: A method for stochastic
optimization”. arXiv preprint arXiv:1412.6980, 2014.
11. Timea Bezdan, Eva Tuba, Ivana Strumberger, Nebojsa Bacanin, and
Milan Tuba. “Automatically designing convolutional neural network
architecture with artificial flora algorithm”. In Milan Tuba, Shyam
Akashe, and Amit Joshi, editors, ICT Systems and Sustainability, pages
371–378. Springer Singapore, Singapore, 2020.
12. Eva Tuba, Ivana Strumberger, Nebojsa Bacanin, Timea Bezdan, and
Milan Tuba. “Image clustering by generative adversarial optimization
and advanced clustering criteria”. In Ying Tan, Yuhui Shi, and
Milan Tuba, editors, Advances in Swarm Intelligence, pages 465–475.
Springer International Publishing, Cham, 2020.
13. Eva Tuba, Ivana Strumberger, Timea Bezdan, Nebojsa Bacanin, and
Milan Tuba. “Classification and feature selection method for medical
datasets by brain storm optimization algorithm and support vector
machine”. Procedia Computer Science, 162: 307–315, 2019. 7th
International Conference on Information Technology and Quantitative
Management (ITQM 2019): Information Technology and Quantitative
Management Based on Artificial Intelligence.
14. Ivana Strumberger, Eva Tuba, Nebojsa Bacanin, Miodrag Zivkovic,
Marko Beko, and Milan Tuba. “Designing convolutional neural
network architecture by the firefly algorithm”. In Proceedings of the
2019 International Young Engineers Forum (YEF-ECE), Costa da
Caparica, Portugal, pages 59–65, 2019.
15. Nebojsa Bacanin, Timea Bezdan, Eva Tuba, Ivana Strumberger, Milan
Tuba, and Miodrag Zivkovic. “Task scheduling in cloud computing
environment by grey wolf optimizer”. In 2019 27th
Telecommunications Forum (TELFOR), pages 1–4. IEEE, 2019.
16. D.B. Fogel and IEEE Computational Intelligence Society.
Evolutionary Computation: Toward a New Philosophy of Machine
Intelligence. IEEE Series on Computational Intelligence. Wiley, 2006.
17. Hans-Georg Beyer and Hans-Paul Schwefel. “Evolution strategies – A
comprehensive introduction”. Natural Computing, 1(1): 3–52, Mar
2002.
18. David E. Goldberg. Genetic Algorithms in Search, Optimization and
Machine Learning. Addison-Wesley Longman Publishing Co., Inc.,
Boston, MA, USA, 1st edition, 1989.
19. Eva Tuba, Ivana Strumberger, Nebojsa Bacanin, Timea Bezdan, and
Milan Tuba. “Optimizing convolutional neural network
hyperparameters by enhanced swarm intelligence metaheuristics”.
Algorithms, 13(3): 67, 2020.
20. Nebojsa Bacanin, Timea Bezdan, Eva Tuba, Ivana Strumberger, and
Milan Tuba. “Monarch butterfly optimization based convolutional
neural network design”. Mathematics, 8(6): 936, 2020.
21. Timea Bezdan, Miodrag Zivkovic, Eva Tuba, Ivana Strumberger,
Nebojsa Bacanin, and Milan Tuba. “Glioma brain tumor grade
classification from MRI using convolutional neural networks designed
by modified FA”. In International Conference on Intelligent and Fuzzy
Systems, pages 955–963. Springer, 2020.
22. N. Bacanin, E. Tuba, T. Bezdan, I. Strumberger, R. Jovanovic, and M.
Tuba. “Dropout probability estimation in convolutional neural
networks by the enhanced bat algorithm”. In 2020 International Joint
Conference on Neural Networks (IJCNN), pages 1–7, 2020.
23. Timea Bezdan, Miodrag Zivkovic, Milos Antonijevic, Tamara
Zivkovic, and Nebojsa Bacanin. “Enhanced flower pollination
algorithm for task scheduling in cloud computing environment”. In
Amit Joshi, Mahdi Khosravy, and Neeraj Gupta, editors, Machine
Learning for Predictive Analysis, pages 163–171. Springer Singapore,
Singapore, 2021.
24. Timea Bezdan, Miodrag Zivkovic, Eva Tuba, Ivana Strumberger,
Nebojsa Bacanin, and Milan Tuba. “Multi-objective task scheduling in
cloud computing environment by hybridized bat algorithm”. In
International Conference on Intelligent and Fuzzy Systems, pages 718–
725. Springer, 2020.
25. Miodrag Zivkovic, Nebojsa Bacanin, Eva Tuba, Ivana Strumberger,
Timea Bezdan, and Milan Tuba. “Wireless sensor networks life time
optimization based on the improved firefly algorithm”. In 2020
International Wireless Communications and Mobile Computing
(IWCMC), pages 1176–1181. IEEE, 2020.
26. John Henry Holland et al. Adaptation in Natural and Artificial
Systems: An Introductory Analysis with Applications to Biology,
Control, and Artificial Intelligence. MIT Press, 1992.
27. Dervis Karaboga and Bahriye Basturk. “On the performance of
artificial bee colony (ABC) algorithm”. Applied Soft Computing, 8(1):
687–697, 2008.
28. Milan Tuba and Nebojsa Bacanin. “Artificial bee colony algorithm
hybridized with firefly algorithm for cardinality constrained mean-
variance portfolio selection problem”. Applied Mathematics &
Information Sciences, 8(6): 2831, 2014.
29. Nadezda Stanarevic, Milan Tuba, and Nebojsa Bacanin. “Modified
artificial bee colony algorithm for constrained problems
optimization”. Int J Math Models Methods Applied Sci, 5(3): 644–
651, 2011.
30. Milan Tuba, Nebojsa Bacanin, and Nadezda Stanarevic. “Adjusted
artificial bee colony (ABC) algorithm for engineering problems”.
WSEAS Transaction on Computers, 11(4): 111–120, 2012.
31. Nebojsa Bacanin, Milan Tuba, and Ivona Brajevic. “Performance of
object-oriented software system for improved artificial bee colony
optimization”. Int J Math Comput Simul, 5(2): 154–162, 2011.
32. Olvi L Mangasarian and William H Wolberg. “Cancer diagnosis via
linear programming”. Technical report, University of Wisconsin-
Madison Department of Computer Sciences, 1990.
33. William H Wolberg and Olvi L Mangasarian. “Multisurface method of
pattern separation for medical diagnosis applied to breast cytology”.
Proceedings of the National Academy of Sciences, 87(23): 9193–9196,
1990.
34. JE Rossouw, JP Du Plessis, AJ Benadé, PC Jordaan, JP Kotze, PL
Jooste, and JJ Ferreira. “Coronary risk factor screening in three rural
communities. The CORIS baseline study”. South African Medical
Journal = Suid-Afrikaanse tydskrif vir geneeskunde, 64(12): 430,
1983.
35. Ali Asghar Heidari, Hossam Faris, Ibrahim Aljarah, and Seyedali
Mirjalili. “An efficient hybrid multilayer perceptron neural network
with grasshopper optimization”. Soft Computing, 23(17): 7941–7958,
2019.
18
Algorithmic Trading Using Trend Following
Strategy: Evidence from Indian Information
Technology Stocks
CONTENTS
18.1 Introduction
18.2 Literature Survey
18.2.1 Data and Period of Study
18.3 Methodology
18.4 Results and Discussions
18.5 Conclusions
18.5.1 Future Scope
References
18.1 Introduction
The growth of big data stems from rapid progress in today's digitalized
world, which has resulted in an immeasurable increase in the data generated
and shared in every domain of work, including public administration,
non-government organizations, business, and academic research. Such data,
with its complex characteristics, has become difficult to process with
traditional methods and techniques. Big data is characterized by seven Vs,
namely Volume, Velocity, Variety, Veracity, Valence, Value, and
Variability. Volume refers to the large size of data, which is growing
exponentially. As per Dobre and Xhafa [1], 2.5 quintillion bytes of data
are produced globally, whereas Gantz and Reinsel [2] predicted that 40
zettabytes of data would exist worldwide by 2020. Velocity is defined as
the speed at which information is processed or retrieved. Sophisticated
data processing techniques capable of processing big data at high speed are
therefore necessary. Variety is defined as the type of data, which varies
from textual to multimedia content. Textual data can be structured,
semi-structured, or unstructured. Most data, about 90%, is considered to be
unstructured. Multimedia content may be in the form of video, audio, or
images. Variety indicates data complexity: because data exists in different
forms, it is challenging to process. Veracity is defined as the quality of
the data, the lack of which affects accuracy [3]. Poor veracity arises from
data uncertainty caused by inconsistency in large datasets [4]. Valence
refers to data connectivity in the form of a graph [5]. Value refers to the
process of transforming data into useful information capable of generating
revenue in business or of providing better insights to marketers so that
they can better understand customers. Data may change continuously and may
be inconsistent over time, which is termed Variability [5]. Variability
reduces the meaning of data, and such inconsistency is commonly seen in
stock prices [4]. Traditionally, big data is characterized by three Vs, and
therefore Gartner [6] defines big data as “high-volume, high-velocity, and
high-variety information assets that demand cost-effective, innovative
forms of information processing for enhanced insight and decision-making.”
Developments in data analytics have made it possible to capture, measure,
and analyze large amounts of unstructured data. This makes it possible to
process any kind of data generated by any device and obtain valuable
output, which is termed datafication [7]. Saggi and Jain [4] mention
various analytical techniques for processing data, namely classification,
regression, clustering, graph analytics, association analysis, and decision
trees. Big data analytical tools, in turn, include machine learning,
artificial neural networks, data mining, deep learning, and natural
language processing.
In the field of banking and finance, a massive technological
transformation has taken place, resulting in the generation of petabytes of
data in both structured and unstructured form and giving birth to big data
in finance. In this domain, digital evolution has shifted traditional
operations to digital ones in order to handle customers [8]. In the banking
sector, big data analytics helps in preventing financial fraud and in the
judicious sanction of big-ticket corporate loans. While disbursing loans,
data analytics helps bankers analyze credit risk by segmenting customer
profiles. This is done by examining all possible financial networks of
prospective customers, which makes it possible to examine the financial
nature and solvency of the customer and helps the banker make a sound,
judicious decision. It also helps in monitoring sanctioned loans, thereby
reducing non-performing loans in banks. Data analytics is essential in
assessing different kinds of financial risk, such as credit risk and
operational risk, with robust models. In the stock market, data analytics
is applied to predicting stock prices. This helps in making smart
investment decisions by relying on historical stock returns and analyzing
real-time sentiment, economic and political scenarios, and even micro-level
company fundamentals.
However, Sun et al. [9] pointed out significant challenges in financial big
data analytics. One difficulty lies in organizing heterogeneous financial
data effectively in order to develop efficient business models.
Complications also arise in applying financial data analytics to critical
topics such as risk modeling, which therefore requires expertise. Further,
it is challenging to ensure the complete safety, security, and privacy of
such massive and confidential financial data.
High-Frequency Trading (HFT) has become an essential subject of study
in the stock market, as thousands of orders need to be executed in seconds.
HFT generates a massive amount of data, which is processed, and orders
are executed swiftly and accurately using algorithms. This has created the
need for algorithmic trading, which involves different kinds of trading
strategies. This paper adopts the Trend Following Strategy to study Indian
Information Technology (IT) stocks featured in the NIFTY 50 during FY
2019-20. Moving averages at 11 and 22 days are computed for each of the
companies, and the pairwise correlation coefficient is estimated.
The efficiency of the Trend Following Strategy is assessed for each of the
IT stocks overall, and individually for “BUY-SELL” and “SELL-BUY” trade
executions.
The remainder of the paper is organized as follows. Section 18.2 covers the
literature on data analytics in various fields of finance. Section 18.3
discusses the model with a flowchart. Section 18.4 discusses the results.
Section 18.5 presents the conclusions and future scope of the study.
18.2 Literature Survey
Research has been carried out in recent times in various domains of finance
like auditing [10], accounting [11], banking [12,13], and stock markets
[14,15,16] using data analytics.
Earley [10] has listed applications of data analytics in auditing.
With analytical tools, auditors can examine a large number of financial
transactions, which improves the quality of the audit. Analytics leverages
technology for scrutiny, thereby reducing financial fraud. Auditors also
have the advantage of providing additional services and serving their
clients efficiently. Appelbaum et al. [11] proposed the Managerial
Accounting Data Analytics (MADA) framework, which uses a balanced
scorecard. MADA helps in assessing three types of business analytics,
namely descriptive, prescriptive, and normative, implemented in the
business from the financial, customer, internal process, and learning and
growth perspectives. This raises an important question: will accountants
and auditors struggle to survive in this era of big data? To answer this
question, Richins et al. [17] provided a conceptual framework and explained
that, because accounting professionals have expertise in problem solving,
they can work together with data scientists, which indicates that big data
is a complement to the job of accountants.
Data analytics is widely used in the banking sector. In banks, big data is
used in supply chain finance to obtain credit reports and to execute
e-wiring transactions. Hung et al. [12] also report that big data analytics
offers the advantage of better marketing without compromising on risk
management. Srivastava et al. [18] claim that analytics will help banks in
India obtain better accounting information and thereby gain a competitive
advantage by making appropriate decisions. Data analytics is even used to
analyze banking stability and to answer the fundamental question “Why do
banks fail?” Viviani and Hanh [13] used text analytics to explain that
loan and bank management are critical factors responsible for bank
failures in the United States.
Data analytics has grown in importance in the stock market for
predicting real-time stock prices based on the arrival of stock-specific and
macro news. With High-Frequency Trading gaining importance, data
analytics finds application in algorithmic trading. Lee et al. [14] used big
data analytics to analyze stock market reactions to 54 investment
announcements of NASDAQ- and NYSE-listed companies between 2010
and 2015. The study indicated that investment announcements have a
positive impact on the stock market. It was also observed that investors’
valuations of big data analytics are higher for larger companies than for
small companies. The arrival of stock-specific news can impact stock prices.
Groß-Klußmann and Hautsch [15] used a high-frequency VAR model to
study stock returns, volatility, trading volumes, and bid-ask spreads. The
study considers intraday high-frequency data for stocks on the London Stock
Exchange, exploiting information and news to identify their effects. It is
also observed that sentiment influences stock prices; however,
profitability is reduced as bid-ask spreads increase [15].
A data analytic technique based on fuzzy logic was proposed to predict the
effect of a hurricane on the stock market [19]. Nann et al. [20]
aggregated messages from a microblogging platform and segregated these
messages by individual stock. Sentiment analysis was performed on the
text to gather information for generating buy and sell calls. This
technique, when applied to S&P 500 stocks, was observed to outperform
the index, securing a positive return of 0.49% per trade and 0.24% when
adjusted to the market. Pang et al. [16] proposed two models for
forecasting stock prices, namely a deep long short-term memory neural
network with an embedded layer and a long short-term memory neural network
with an automatic encoder. Accuracy was higher for the embedded-layer deep
long short-term memory neural network, at 57.2% compared with 56.9%, in
experiments on the Shanghai A-shares composite index. Sigo [21] indicated
that machine learning techniques such as artificial neural networks can
predict stock prices accurately if the information arising from stocks is
efficiently preprocessed, thereby resulting in long-term capital gain.
Algorithmic trading is important for improving an investor’s stock
return and information efficiency. However, it has a negative impact on
slow traders by increasing the selection cost. A high correlation is
observed among the strategies followed by algorithmic traders, though it
does not deteriorate the quality of the market [22].
Major algorithmic trading strategies include Mean Reversion, Momentum
Strategy, Statistical Arbitrage, and Trend Following Strategy. The mean
reversion strategy assumes that the asset price will revert to the asset’s
long-term mean, and hence capitalizes on extreme asset price changes when
the asset is overbought or oversold. As the name suggests, the momentum
strategy is to buy assets when they are rising and sell them when they lose
momentum. Statistical Arbitrage is a group of trading strategies applied to
a diversified portfolio constructed from various securities in order to
minimize risk. The Trend Following Strategy follows market movement,
buying when prices go up and selling when prices go down.
Various indicators are used in the Trend Following Strategy, of which
the Moving Average is one of the most prominent.
The extant literature has described different analytical techniques and
their applications in various domains of finance. It has also outlined
the importance of algorithmic trading. However, the literature has not
studied the efficiency of the Trend Following Strategy, specifically for
“BUY-SELL” and “SELL-BUY,” particularly in an emerging economy like India.
Following a study of trading algorithms and a detailed literature survey,
the chapter implements the Trend Following Strategy for Indian
Information Technology stocks. In this context, the following objectives
are proposed:
1. To implement the Trend Following Strategy among Indian IT Stocks.
2. To assess the Trend Following Strategy’s efficiency for “BUY-SELL,”
“SELL-BUY,” and overall stock.
TABLE 18.1
Descriptive of Sample Data
18.3 Methodology
The Trend Following Strategy is used to analyze the probability of success
of BUY and SELL calls for five Information Technology stocks listed in the
NIFTY 50. This assesses the efficiency of the Trend Following Strategy in
algorithmic trading.
Stock return is estimated from the daily closing price, as formulated in eq.
(18.1).
Ret_t = \ln\left(\frac{Price_t}{Price_{t-1}}\right) (18.1)
Simple moving average at 11 and 22 days is estimated as exhibited in eq.
(18.2).
SMA_t = \frac{1}{m} \sum_{i=t-m+1}^{t} Ret_i (18.2)
where t is the trading day, i runs over the last m trading days, and m is
the order of the moving average. Here, m is 11 and 22.
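Equations (18.1) and (18.2) can be sketched in Python as follows. The function names and the closing prices are hypothetical; returning None until m observations are available is an assumption about how the first days of the sample are treated.

```python
import math


def log_returns(prices):
    """Eq. (18.1): Ret_t = ln(Price_t / Price_{t-1}) for consecutive closing prices."""
    return [math.log(prices[t] / prices[t - 1]) for t in range(1, len(prices))]


def simple_moving_average(values, m):
    """Eq. (18.2): mean of the last m values, or None while fewer than m exist."""
    averages = []
    for t in range(len(values)):
        if t + 1 < m:
            averages.append(None)  # not enough history yet (assumption)
        else:
            window = values[t - m + 1: t + 1]
            averages.append(sum(window) / m)
    return averages


# Illustrative closing prices; the chapter uses m = 11 and m = 22
closing_prices = [100.0, 110.0, 121.0]
returns = log_returns(closing_prices)
sma_2 = simple_moving_average([1.0, 2.0, 3.0, 4.0], 2)
```

On real data the same two calls would be made with m = 11 and m = 22 on the daily return series of each stock.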
If the simple moving average at 11 days is greater than that at 22 days, it
is a “BUY” call; otherwise, it is a “SELL” call. Buy and sell calls, or
sell and buy calls, are paired on consecutive trading days, which means
that positions in the market are squared off on successive trading days. A
call is considered a success if profit is generated on squaring off.
The individual probability of success for BUY-SELL, SELL-BUY, and the
overall stock is the ratio of the number of successful calls to the total
number of calls, and is estimated using eq. (18.3). The workbook with
sample data is shown in Figure 18.1.
P = \frac{\text{Number of Successful Calls}}{\text{Total Calls}}    (18.3)
FIGURE 18.1 Workbook of Trend Following Strategy.
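The calculations in eqs. (18.1) to (18.3) can be illustrated with a short
Python sketch. It is not the authors' workbook: the function names, the
sample prices, and the short window lengths in the usage lines are
assumptions made for illustration, and the pairing of calls on consecutive
trading days is left to the caller, who passes in a list of success flags.

```python
import math


def log_returns(prices):
    """Daily log returns, eq. (18.1): ln(Price_t / Price_{t-1})."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


def sma(values, m):
    """Simple moving average of order m, eq. (18.2).

    Entries are None until m observations are available.
    """
    out = []
    for t in range(len(values)):
        if t + 1 < m:
            out.append(None)
        else:
            out.append(sum(values[t - m + 1 : t + 1]) / m)
    return out


def calls(returns, short=11, long=22):
    """'BUY' when the short SMA exceeds the long SMA, otherwise 'SELL'.

    Assumes short < long, so the short SMA exists wherever the long one does.
    """
    short_sma, long_sma = sma(returns, short), sma(returns, long)
    return [
        None if slow is None else ("BUY" if fast > slow else "SELL")
        for fast, slow in zip(short_sma, long_sma)
    ]


def success_probability(outcomes):
    """Eq. (18.3): successful calls divided by total calls."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical closing prices; short windows keep the example small.
prices = [100.0, 110.0, 121.0, 108.9]
signals = calls(log_returns(prices), short=2, long=3)
```

With the chapter's settings the call would be `calls(returns, short=11,
long=22)`, and `success_probability` would receive one True/False flag per
squared-off pair.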
\rho_{x-y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}    (18.4)
TABLE 18.2
Stock Return along with SMA at 11 Days and 22 Days
Table 18.3 depicts pairwise Pearson correlation between stock return and
SMA at 11 days and 22 days. Weak correlation is observed between the
return of stock and SMA at 11 days and 22 days. This indicates it is difficult
for an investor to make a decision by computing correlation with stock
return. However, a strong positive correlation is found between SMA at 11
days and 22 days, thus becoming a parameter for an investor to make a
decision.
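The pairwise Pearson correlation of eq. (18.4) can be computed as in the
following minimal sketch. The function name and the sample series are
illustrative assumptions; the values in Table 18.3 come from the authors'
own data.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences, eq. (18.4)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the squared deviations.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical series: the second is an exact multiple of the first.
perfect = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```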
TABLE 18.3
Correlation between Return and Simple Moving Average at 11 and 22 Days
for IT Companies
Table 18.4 depicts the total number of possible trade executions for
consecutive trading days with the Trend Following Strategy, for both
“SELL-BUY” and “BUY-SELL.” Infosys has the highest number of trade
executions, while Wipro has the lowest. In terms of overall efficiency in
booking profit, TCS ranks highest, whereas HCL Technologies finds its place
at the bottom. For “SELL-BUY,” TCS exhibits the highest efficiency at
64.29%, and the lowest is observed for Infosys at 35%. For “BUY-SELL,”
Infosys has the highest efficiency in booking profit at 60%, whereas Wipro
exhibits the lowest at 33.33%.
TABLE 18.4
Efficiency of Trend Following Strategy along with the Number of Trade
Executions
Figure 18.4 depicts the decision to “BUY” and “SELL” stock for each
trading day during FY 2019-20.
FIGURE 18.4 Decision to “BUY” or “SELL” IT stocks.
18.5 Conclusions
Data analytics has become an essential aspect of research in this age of
datafication. Finance is no exception, especially in the field of stock
markets. The big data generated by High-Frequency Trading needs to be
processed so that orders can be executed in the market with effective
trading algorithms. This study examined the Trend Following Strategy in
order to assess its efficiency. The work was conducted on five major IT
stocks in the NIFTY 50.
As the study was conducted during the bearish phase of the stock market,
all companies in the sample resulted in negative returns, with Infosys
having the least negative returns followed by TCS, Tech Mahindra, Wipro,
and HCL Technologies. The study also indicated that during the bearish
market, moving average for 22 days yields better results as compared to that
of moving average at 11 days. However, returns and both moving averages
are consistent with their respective companies. As a strong positive
correlation is observed between moving average at 11 and 22 days, it
becomes an important parameter for an investor to make a decision. TCS
exhibits the highest overall efficiency in booking profit, whereas HCL
Technologies has the lowest. The methodology used in this study is
significant in assessing the efficiency of the Trend Following Strategy,
and contributes to the literature.
CONTENTS
19.1 Introduction
19.2 Review of Literature
19.3 Proposed Methodology
19.3.1 Sentiment Score
19.3.2 Labeling
19.3.3 Feature Matrix
19.3.4 Probabilistic Neural Network
19.4 Numerical Results and Discussion
19.4.1 Data Description
19.4.2 Statistical Measure
19.5 Simulation Results and Validation
19.5.1 Comparative Analysis over Existing and Proposed Decision-
Making Methods
19.6 Conclusion and Future Enhancement
References
19.1 Introduction
Stock market prediction is a thriving topic in the financial world and has
received considerable attention from stockholders, experts, and researchers.
Prediction is the process of estimating the future price or movement of the
stock market. It involves taking historical stock prices and technical
indicators and capturing the patterns in the data through certain techniques.
The resulting model is then employed to make predictions about the future
direction or price of the stock market. A precise forecast of the future
movement or price of a stock market may yield a high profit.
Stock market research encapsulates three elemental trading philosophies: (i)
fundamental analysis, (ii) technical analysis, and (iii) time series
prediction. Fundamental analysis is a method of estimating the share value of
a company by analyzing factors such as sales, profits, earnings, and other
economic factors [1]. Technical analysis uses historical stock prices to
identify the potential movement or price. Fundamental analysis is a long-term
investment approach, whereas the short-term approach is better known as
quantitative analysis. Time series forecasting involves two basic kinds of
models: (i) linear models and (ii) non-linear models. Autoregressive moving
average (ARMA) and autoregressive integrated moving average (ARIMA) are
widely accepted linear models used to make predictions about the stock
market. Linear models rely on predefined assumptions, such as normality, to
fit a mathematical model [2].
These models need a large number of historical prices to meet these
assumptions, and as a result they cannot fully establish the trends present
in the stock data. Since stock markets are considered non-linear systems,
more versatile methods that can learn the hidden information in stock data
are needed to increase prediction accuracy. Artificial Neural Networks (ANNs)
have strong advantages in this regard because they can derive non-linear
relationships in the data through the training process, without prior
knowledge of the input data. Several studies have shown that the ANN
technique outperforms linear models [3].
Sentiment analysis is a contextual text mining technique aimed at detecting
and extracting subjective information from source content. It helps a company
understand the social sentiment around its brand, product, or service by
analyzing online conversations. In addition, sentiment analysis helps
businesses evaluate what customers like or dislike and therefore take action
to enhance their service, thus enhancing their credibility. Businesses must
also be aware of what is written about them in the public domain, as it can
have a positive or negative effect on them and a direct impact on their stock
market values, leading to either gains or losses in their investment value
[4].
For both traders and analysts, predicting stock market movements by
analyzing historical stock data has always been a fascinating subject.
Researchers have used numerous machine learning models, such as ANNs and deep
learning models trained on historical data, to forecast stock market
movements. Many methods have been developed, and ANN-based prediction models
are popular and widely used because they can capture, through the training
process, the underlying patterns in massive amounts of stock data and so
identify stock movement. Twitter data and financial news are external
factors that can also influence the movement of the stock market.
In the literature, the use of news and Twitter data for forecasting stock
movements is rare. Combining social media data and news data with historical
data is relevant because unusual events reported on Twitter and in the news
can also influence the stock market. Twitter is a modern social media
platform for web content. A significant aspect of social media data is the
ready availability of original data and the quick interaction between users.
These interactions are seen as an indicator of users’ interest in many
topics, including stock market issues. However, Twitter data alone does not
affect stock market movement.
For individuals or investors who try to invest in capital markets, the trend
of equity markets is unclear. They cannot identify exactly which shares to
buy and which to sell in order to get high returns on their invested money.
Such investors recognize that the behavior of the stock market depends on
financial news. They therefore need precise and relevant data on listed
stocks, so that their trading decisions are made with timely and reliable
information. However, a trading approach based only on traders’ expectations
from financial news may not be adequate. Numerous methods have been developed
for predicting the stock market trend on the basis of historical data alone,
a combination of historical data and social media data, or historical data
and news data. However, little research has investigated the impact of
combining social media and financial news on prediction accuracy. The
sentiment score of a stock may be positive or negative, making investors
bullish or bearish about that stock. Using a single type of information might
not provide high prediction accuracy. Social media information and news data
together can affect investors’ decisions, so both sources and historical data
should be considered when designing a model to forecast the behavior of stock
markets. Taking into account all three types of data (historical data, social
media data, and financial news) should increase the accuracy of the
prediction model.
Based on the above analysis, this chapter proposes a new machine learning
approach to forecast the direction of the stock market price by analyzing
Twitter data, financial news data, and historical data.
For this research, historical data are collected from Yahoo Finance and
financial news is collected from the Bloomberg terminal
(https://www.bloomberg.com).
The main focus of the analysis is to examine the influence of the Twitter
sentiment score and the news sentiment score in enhancing the accuracy of
stock market movement prediction by means of an artificial neural network.
Major contributions in this research are as follows:
Proposed a blend of historical data, Twitter data, and financial news data to
predict stock market movements.
Designed a machine learning approach, Probabilistic Neural Network
(PNN) for predicting stock market trends.
Evaluated performance of the developed model by computing prediction
accuracy.
Analyzed the efficacy of the designed model by using only historical data
and combined data (historical and sentiment score).
Compared the performance of the established model with previous models
to prove its superiority with respect to prediction accuracy.
TABLE 19.1
A Summary of Reviewed Chapters
y_i = \frac{x_{i-1} + x_{i+1}}{2}    (19.1)
where y_i is the missing value on the ith day, and x_{i-1} and x_{i+1}
represent the previous value and next value, respectively.
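Equation (19.1) can be sketched in Python as follows. The function name
and the use of None to mark a missing day are assumptions made for this
example. The sketch only fills an isolated gap whose two neighbors are both
present; how the authors handled runs of consecutive missing days (such as
a weekend) is not stated, so those entries are left unchanged here.

```python
def fill_missing(prices):
    """Fill isolated None entries with the mean of their neighbors, eq. (19.1)."""
    filled = list(prices)
    for i in range(1, len(prices) - 1):
        prev_value, next_value = prices[i - 1], prices[i + 1]
        if prices[i] is None and prev_value is not None and next_value is not None:
            filled[i] = (prev_value + next_value) / 2
    return filled


# Hypothetical closing prices with one missing day.
series = fill_missing([10.0, None, 14.0, 15.0])
```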
The calculated values were then applied to the PNN as additional feature vectors
to test for any improvement in the accuracy.
19.3.2 Labeling
After collecting the stock data for the intended period, all the daily closing prices
are labeled as up or down using Equation (19.4)
(19.4)
The proposed model uses “up and down” stock price changes as training
patterns.
The pattern layer receives the data from the input layer and calculates the
distance between each pattern center and the input data. The resulting value
is passed to the next layer. The summation layer calculates the weighted
average of the pattern-layer neuron nodes that belong to the same class.
After every neuron node in the summation layer has been evaluated, the node
with the maximum likelihood is selected as the result node and passed to the
output layer. The PNN is implemented in this study by means of the MATLAB
Neural Network Toolbox, with the network architecture set by the default
settings. Based on the features in the test samples, the trained network was
employed for forecasting the stock market trends. The detailed procedure of
the proposed method is given in Table 19.2.
TABLE 19.2
Pseudo Code of the Proposed Method
Step 1: Collect stock market data from social media, financial news, and
Yahoo Finance.
Step 2: Preprocess the Twitter data and financial news data to remove
unwanted information, e.g., stop-word removal, punctuation removal, and
tokenization.
Step 3: Process the historical data by using Equation (19.1) to fill missing
values for weekends and holidays
Step 4: Calculate the day sentiment score by using Equation (19.2)
Step 5: Compute the day sentiment score by using Equation (19.3)
Step 6: Manually label the closing price as up or down by using Equation
(19.4)
Step 7: Create two feature matrices by concatenating historical data and
sentiment score. One consists of only historical data whereas another
one has both historical data and sentiment score
Step 8: Design a PNN using MATLAB neural network toolbox
Step 9: Divide the feature matrices into two groups: training set and testing set
Step 10: Train the network using training sample and save the trained PNN for
future use
Step 11: Apply test sample to the trained network to make prediction about
stock movement
Step 12: Analyze the efficacy of the developed model using only historical
data and combined data
Step 13: Compute the prediction accuracy by comparing predicted values with
the actual values
Step 14: Compare the outcomes with the previous methods with respect to
prediction accuracy
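The pattern, summation, and output layers described for the PNN can be
sketched in a few lines of Python. This is a generic Gaussian-kernel PNN
written for illustration, not the MATLAB Neural Network Toolbox
implementation used in the study; the function name, the smoothing
parameter sigma, and the toy feature vectors are all assumptions.

```python
import math


def pnn_predict(train_x, train_y, x, sigma=1.0):
    """Classify x with a Gaussian-kernel PNN (sigma is an assumed smoothing value)."""
    sums, counts = {}, {}
    for sample, label in zip(train_x, train_y):
        # Pattern layer: kernel activation from the squared distance
        # between the stored training pattern and the input.
        sq_dist = sum((a - b) ** 2 for a, b in zip(sample, x))
        activation = math.exp(-sq_dist / (2 * sigma ** 2))
        # Summation layer: accumulate activations per class.
        sums[label] = sums.get(label, 0.0) + activation
        counts[label] = counts.get(label, 0) + 1
    # Average within each class; the output layer picks the largest score.
    scores = {label: sums[label] / counts[label] for label in sums}
    return max(scores, key=scores.get)


# Hypothetical two-feature training patterns labeled "up" or "down".
train_x = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
train_y = ["down", "down", "up", "up"]
prediction = pnn_predict(train_x, train_y, [1.0, 0.9])
```

In the chapter's setting each training pattern would be a row of the
feature matrix (historical data, optionally with sentiment scores) and the
label would be the up or down movement.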
TABLE 19.3
List of Companies
FIGURE 19.4 Prediction accuracy before and after addition of sentiment scores as an input to PNN.
As shown in Figure 19.4, for the selected stocks, prediction accuracy was
higher when the news and tweet sentiment scores were supplied as additional
inputs to the PNN. This strongly indicates that the PNN is more reliable in
determining the future movement of stocks when investor sentiment is added
as an input.
REFERENCES
1. Chandar, S.K. “Fusion model of wavelet transform and adaptive neuro
fuzzy interference system for stock market prediction”. Journal of Ambient
Intelligence and Humanized Computing (2019): 1–9.
https://doi.org/10.1007/s12652-019-01224-2
2. Dang, M., and Duong, D. “Improvement methods for stock market
prediction using financial news articles”. Proceedings of the 3rd National
Foundation for Science and Technology Development Conference on
Information and Computer Science (NICS), Danang, Vietnam (2016).
3. Sheta, A.F., Ahmed, S.E., and Faris, H. “A comparison between regression,
artificial neural networks and support vector machines for predicting stock
market index”. International Journal of Advanced Research in Artificial
Intelligence 4, no. 7 (2015): 1–7.
4. Ondieki, A.R., Keyo, G.O., and Kibe, A. “Stock price prediction using
neural network models based on tweets sentiment scores”. Journal of
Computer Sciences and Applications 5, no. 2 (2017): 64–75.
5. Schumaker, R.P., Zhang, Y., Huang, C., and Chen, H. “Evaluating sentiment
in financial news articles”. Decision Support Systems 53, (2012): 458–464.
6. Kordonis, J., Symenonidis, S., and Arampatzis, A. “Stock price forecasting
via sentiment analysis on Twitter”. Proceedings of the 20th Panhellenic
Conference on Informatics (PCI ’16), Greece (2016).
7. Ho, K.Y., and Wang, W. “Predicting stock price movements with news
sentiment: An artificial neural network approach”. Studies in Computational
Intelligence (2016): 395–403. doi: 10.1007/978-3-319-28495-8_18.
8. Das, A., Behera, R.K., Kumar, M., and Rath, S.K. “Real time sentiment
analysis of Twitter data for stock prediction”. Proceedings of the
International Conference on Computational Intelligence and Data Science
(2018).
9. Shastri, M., Roy, S., and Mittal, M. “Stock price prediction using artificial
neural network: An application of big data”. EAI Endorsed Transactions on
Scalable Information Systems 6, no. 20 (2019): 1–8.
10. Khan, W., Ghazanfar, M.A., Azam, M.A., Karami, A., Alyoubi, K.H., and
Alfakeeh, A.S. “Stock market prediction using machine learning classifiers
and social media, news”. Journal of Ambient Intelligence and Humanized
Computing (2020).
20
Churn Prediction in the Banking Sector
CONTENTS
20.1 Introduction
20.1.1 Problem Statement
20.1.2 Current Scenario
20.1.3 Motivation
20.1.4 Objective
20.2 Related Work
20.3 Methodology
20.3.1 Dataset
20.3.2 Proposed System for Customer Churn Prediction
20.4 Results
20.4.1 Analysis of Clustering of Churned Customers
20.5 Conclusion
20.6 Future Work
References
20.1 Introduction
20.1.3 Motivation
The banking industry is enormous, which makes customer retention an
essential concern for its survival and long-term profitability. Significant
research in the field of churn prediction has been carried out over the
past decade using various statistical and data mining techniques. Very few
studies, however, have addressed churn prediction in the banking sector
with Artificial Neural Network models. This work aims to predict customer
churn using an Artificial Neural Network trained with backpropagation and
optimized with the stochastic gradient descent algorithm. During the
process of customer churn prediction, bank operators often need to analyze
the steps to figure out the probable cause and rationale instigating
customers to churn. Artificial Neural Network models can support this
analysis because they generate accurate results.
20.1.4 Objective
To build the customer churn model which helps the bank to:
20.3 Methodology
An ANN (Artificial Neural Network) typically consists of thousands of
artificial neurons, called units, that work in coordination to process data
mathematically and derive relevant and purposeful conclusions from it. Some
of the advantages of Artificial Neural Networks over other algorithms are
that they are fault tolerant and have parallel processing capability.
Missing a few pieces of information does not affect the network, as
information is stored across the entire network and not just in a database.
The hidden layer filters some of the important patterns. Neural Networks
can learn from events by themselves and apply what they have learned when a
similar event arises. In the proposed system, the Keras, Sklearn, and
Pandas libraries were used, and analysis was done in Tableau software.
20.3.1 Dataset
Figure 20.1 represents the dataset used in the model, which was taken from
the Kaggle website (https://www.kaggle.com/santoshd3/bank-customers,
2018). It consists of 13 parameters and a final class column which shows
the supervised outcome of the given parameters. The RowNumber, CustomerID,
and Surname parameters were ignored in the model as they have no impact on
the output.
Data preprocessing was done on the dataset to convert raw data into well-
structured data, which will be more useful for the model. It is an important
step as it directly impacts the accuracy of the model. The following steps
were done in data preprocessing:
1. Handling missing values: If many values were missing in a particular
row, the entire row was deleted. If only a single value was missing, it
was filled with the mean of the entire column in some cases and with
the mode in others. In some cases the mean was taken over only those
rows that matched the row in question on some other values.
2. Handling categorical columns: Label encoding was applied to the
Gender column, which consists of two values, female and male. One-hot
encoding was then applied to the Geography column, which consists of
three values: France, Spain, and Germany.
3. Feature scaling: It normalizes the data within a particular range and
also helps speed up the calculations in an algorithm.
Figure 20.2 shows the dataset before and after applying data preprocessing.
FIGURE 20.2 (Left) Before data preprocessing; (right) after data preprocessing.
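The three preprocessing steps can be sketched in plain Python so that the
logic is visible without any library. This is an illustration under
assumptions, not the authors' code: the sample rows are invented, mean
imputation is shown for a single numeric column only, and standardization
(zero mean, unit standard deviation) is assumed as the scaling method. In
the chapter's toolchain the same steps would normally be done with Pandas
and Sklearn.

```python
from copy import deepcopy
from statistics import mean, pstdev


def impute_mean(rows, column):
    """Replace None in a numeric column with the mean of the known values."""
    fill = mean([r[column] for r in rows if r[column] is not None])
    for r in rows:
        if r[column] is None:
            r[column] = fill


def label_encode(rows, column):
    """Map each category to an integer code (alphabetical order)."""
    classes = sorted({r[column] for r in rows})
    for r in rows:
        r[column] = classes.index(r[column])


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    classes = sorted({r[column] for r in rows})
    for r in rows:
        value = r.pop(column)
        for c in classes:
            r[f"{column}_{c}"] = int(value == c)


def standardize(rows, column):
    """Rescale a numeric column to zero mean and unit standard deviation."""
    values = [r[column] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    for r in rows:
        r[column] = (r[column] - mu) / sigma if sigma else 0.0


def preprocess(raw_rows):
    """Apply the steps in order on a copy, leaving the raw rows untouched."""
    rows = deepcopy(raw_rows)
    impute_mean(rows, "Age")
    label_encode(rows, "Gender")
    one_hot(rows, "Geography")
    standardize(rows, "Age")
    return rows


# Hypothetical sample rows, not taken from the chapter's dataset.
sample = [
    {"Gender": "Female", "Geography": "France", "Age": 42},
    {"Gender": "Male", "Geography": "Spain", "Age": None},
    {"Gender": "Female", "Geography": "Germany", "Age": 39},
]
clean = preprocess(sample)
```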
Weights = \theta_{11}, \theta_{12}, \theta_{13}
a_1^{(2)} = g(\theta_{10}^{(1)} x_0 + \theta_{11}^{(1)} x_1 + \theta_{12}^{(1)} x_2 + \cdots + \theta_{1,11}^{(1)} x_{11})    (20.1)
a_2^{(2)} = g(\theta_{20}^{(1)} x_0 + \theta_{21}^{(1)} x_1 + \theta_{22}^{(1)} x_2 + \cdots + \theta_{2,11}^{(1)} x_{11})    (20.2)
a_3^{(2)} = g(\theta_{30}^{(1)} x_0 + \theta_{31}^{(1)} x_1 + \theta_{32}^{(1)} x_2 + \cdots + \theta_{3,11}^{(1)} x_{11})    (20.3)
h_\theta(x) = a_1^{(3)} = g(\theta_{10}^{(2)} a_0^{(2)} + \theta_{11}^{(2)} a_1^{(2)} + \theta_{12}^{(2)} a_2^{(2)} + \theta_{13}^{(2)} a_3^{(2)})    (20.4)
Actual value = y_i
Value found by model = a^{(4)}
\delta_1^{(4)} = \frac{1}{2} (y_i - a_i^{(4)})^2    (20.5)
\delta^{(l)} = ((\theta^{(l)})^T \delta^{(l+1)}) \odot a^{(l)} \odot (1 - a^{(l)})    (20.7)
\delta_2^{(3)} = \delta_1^{(4)} \cdot \theta_{12}^{(3)}    (20.8)
\delta_2^{(2)} = \delta_2^{(3)} \cdot \theta_{22}^{(2)} + \delta_1^{(3)} \cdot \theta_{12}^{(2)}    (20.9)
W^{*} := W - \alpha \cdot J'(W)    (20.10)
Where W is the weight at hand, W^{*} is the new weight, \alpha is the
learning rate (i.e., 0.1 in our example), and J'(W) is the partial
derivative of the cost function J(W) with respect to W.
J'(W) = a_j^{(l)} \cdot \delta_j^{(l+1)}    (20.11)
where \delta_j^{(l+1)} is the loss at the unit on the other end of the
weighted link:
\theta_{22}^{(1)} = \theta_{22}^{(1)} - \alpha \cdot x_2 \cdot \delta_2^{(2)}    (20.12)
\theta_{22}^{(2)} = \theta_{22}^{(2)} - \alpha \cdot a_2^{(2)} \cdot \delta_2^{(3)}    (20.13)
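The forward pass of eqs. (20.1) to (20.4) and the weight update of eqs.
(20.10) and (20.11) can be illustrated with the sketch below. Several
things are assumptions rather than details from the chapter: g is taken to
be the sigmoid function (its derivative a(1 - a) matches the factor in eq.
(20.7)), the bias inputs x_0 and a_0 are taken to be 1, and the network
size and weight values are made up to keep the example short.

```python
import math


def g(z):
    """Sigmoid activation (assumed choice for g)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(theta, inputs):
    """Compute one layer's activations.

    theta[i][j] is the weight from input j (j = 0 is the bias) to unit i + 1.
    """
    with_bias = [1.0] + list(inputs)
    return [g(sum(w * v for w, v in zip(row, with_bias))) for row in theta]


def update_weight(w, alpha, activation, delta):
    """Eqs. (20.10)-(20.11): new weight = W - alpha * a * delta."""
    return w - alpha * activation * delta


# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output.
theta1 = [[0.1, 0.2, -0.3], [0.0, 0.5, 0.5], [-0.2, 0.1, 0.4]]
theta2 = [[0.3, -0.1, 0.2, 0.6]]
hidden = layer_forward(theta1, [1.0, 2.0])   # a^(2), as in eqs. (20.1)-(20.3)
output = layer_forward(theta2, hidden)[0]    # h_theta(x), as in eq. (20.4)
```

A single update with the chapter's learning rate would look like
`update_weight(w, 0.1, activation, delta)`, where `delta` is the error term
of the unit at the far end of the link.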
Mini-batch gradient descent was used in the model with a batch size of 10.
One pass of the whole training set through the ANN makes an epoch. The
model was trained for 100 epochs.
Step 7: Evaluate the Model
The model was evaluated using the k-fold cross-validation method, which
addresses the variance problem of a single train/test split.
Cross-validation is a statistical method to see how a model performs on
unseen data; k refers to the number of groups (folds) into which a given
data sample is split. In the proposed model, the dataset is split into 10
folds. Ten accuracies are obtained for the 10 folds, and the mean of those
values is taken, which gives a model accuracy of 83.5% with a variance of
0.9%.
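The k-fold procedure can be sketched as follows. The helper names are
illustrative, and `score_fold` is a hypothetical callback standing in for
"train the ANN on the training indices and return its accuracy on the test
indices"; the sketch only shows how the folds are formed and how the
per-fold accuracies are summarized.

```python
from statistics import mean, pvariance


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(n_samples, k, score_fold):
    """Return the mean and variance of the k per-fold scores."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    return mean(scores), pvariance(scores)


# Hypothetical scorer that returns the same accuracy for every fold.
avg, var = cross_validate(10, 5, lambda train, test: 0.8)
```

With the chapter's settings the call would use k = 10 on the full dataset.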
Step 8: Improving the Model
Even though the model had a low variance of 0.9%, dropout regularization
was used to guard against overfitting. Dropout randomly disables neurons
during training, which prevents them from becoming too dependent on each
other while they learn correlations. Because the configuration of active
neurons is different at each iteration, the neurons learn several
independent correlations in the data, which keeps the network from fitting
the training data too closely. To apply this, the Dropout class was
imported in the code. Dropout was applied at each hidden layer, and 0.1% of
neurons were disabled.
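The effect of dropout on one layer's activations can be sketched in plain
Python. This is the commonly used "inverted dropout" formulation, in which
surviving activations are scaled up so that their expected value is
unchanged; the function name, the seeded random generator, and the rate in
the usage line are assumptions for illustration and do not reproduce the
chapter's Keras code.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


# Hypothetical hidden-layer activations and an illustrative rate.
rng = random.Random(0)
dropped = dropout([0.2, 0.9, 0.5, 0.7], rate=0.5, rng=rng)
```

Dropout is applied only during training; at prediction time the
activations are passed through unchanged.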
Step 9: Tuning the Model
Parameter tuning is used to increase the accuracy. There are two types of
parameters: those that the model learns during training, e.g., weights,
and those that are fixed beforehand, called hyperparameters, e.g., the
number of epochs, the number of neurons, and the batch size. When the model
was trained, fixed values were used for these hyperparameters, but accuracy
might have been higher with other values. Hyperparameter tuning finds the
best values for these parameters. It was done with a technique called grid
search, which tests several combinations of hyperparameter values, computes
the accuracy of each with k-fold cross-validation, and returns the
combination that gives the maximum accuracy. In the code, the model used a
dictionary containing the candidate values for each hyperparameter: the
batch size ranged from 25 to 32, the number of epochs ranged from 100 to
500, and adam or rmsprop was used as the optimizer. After running the code,
the best set of parameters was a batch size of 25, 500 epochs, and the
rmsprop optimizer, and the model obtained an accuracy of 86.4% with these
parameters.
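The grid search described above can be sketched with the standard library.
The dictionary below uses the end points of the ranges quoted in the text
as candidate values, which is an assumption about how the grid was set up,
and `toy_evaluate` is a made-up stand-in for the expensive step of training
the network and returning its mean k-fold accuracy.

```python
from itertools import product

# Candidate values per hyperparameter (assumed from the quoted ranges).
param_grid = {
    "batch_size": [25, 32],
    "epochs": [100, 500],
    "optimizer": ["adam", "rmsprop"],
}


def grid_search(param_grid, evaluate):
    """Try every combination and return the best parameters and score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Hypothetical scoring function used only to exercise the search."""
    bonus = 0.01 if params["optimizer"] == "rmsprop" else 0.0
    return params["epochs"] / 1000 - params["batch_size"] / 1000 + bonus


best_params, best_score = grid_search(param_grid, toy_evaluate)
```

This grid has 2 x 2 x 2 = 8 combinations, and with 10-fold
cross-validation each one requires ten training runs, which is why grid
search is slow for large grids.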
20.4 Results
Figure 20.6 represents the confusion matrix. The system is evaluated using
accuracy, precision, recall, and F1 score, as given below:
1. Accuracy = (TP + TN) / (TP + FP + FN + TN) = 86.4%
2. Precision = TP / (TP + FP) = 95.48%
Of all the customers that the model predicted would stay, 95.48%
actually stayed.
3. Recall = TP / (TP + FN) = 88.39%
Of all the customers who stayed, the model predicted 88.39% correctly.
4. F1 Score = (2 × P × R) / (P + R) = 91.8%
FIGURE 20.6 Confusion matrix.
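The four measures can be computed from the confusion-matrix counts as in
the sketch below. The counts in the usage line are hypothetical, since the
cell values of Figure 20.6 are not reproduced in the text; the second
helper recomputes the F1 score from the reported precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, used only to exercise the function.
example = classification_metrics(tp=8, fp=2, fn=2, tn=8)

# F1 recomputed from the chapter's reported precision and recall.
reported_f1 = f1_from(0.9548, 0.8839)
```

Rounded to three decimal places, `reported_f1` is 0.918, which agrees with
the 91.8% reported above.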
After running the model, analysis was done on the entire output, i.e.,
churn and loyal customers. The aim of this analysis was to find the
features that had a major impact on churn. The dataset consists of 10,000
rows, of which 2,000 are churn customers and 8,000 are loyal customers. It
has 5,500 male customers and 4,500 female customers. The analysis found
that 25% of female customers churned, whereas only 16.5% of male customers
churned. Hence, female customers are roughly 50% more likely to churn than
male customers (25% / 16.5% ≈ 1.5) (Figure 20.7).
FIGURE 20.7 Analysis of churn and loyal customers.
20.5 Conclusion
The issue of customer churn is becoming more pressing by the day, and the
proposed model helps to control it. The multi-layered ANN (Artificial
Neural Network) model designed to solve this problem produced an accuracy
of 86.4%. The proposed model will help the bank identify which customers
are likely to leave, so that the bank can retain those customers. This
saves the money that would otherwise have been spent on replacing churned
customers, as well as the money that would have been spent on retaining
customers who are already loyal. The proposed model not only provides high
accuracy but also provides insights that help prevent churn behavior. With
the help of the influencing factors obtained through clustering and
analysis, it will be easier to design retention policies, as these methods
provide the reasons for churn along with the list of customers with a high
probability of churning.
20.6 Future Work
In the future, clustering will be implemented on the customers based on
their likelihood of churning: customers who are most likely to churn
(i.e., with a probability of leaving greater than 0.85) can be kept in one
cluster, those with a probability of leaving in the range 0.65 < x < 0.85
in a second cluster, and those who are less likely to leave in a third
cluster, and analysis can be done accordingly. Those who are more likely to
leave will be placed on a high-priority list, and attempts will be made to
retain them first by taking the necessary steps. The model can be deployed
on the cloud (Azure or GCP) with a frontend for interaction, by
implementing an API that pipelines the data into the cloud datastore,
followed by the model implementation on the cloud. Using the pipelined
data, the application can be scaled through the API to serve different
banks across the country, analyzing their data and generating monthly or
quarterly churn values for their customers. Cloud deployment using MLOps
would be beneficial for continuous monitoring of the ML model and analysis
of its performance. The frontend can be built with current design
approaches, such as material design and trending design patterns, which
will help the manager, admin, or appointed authority to easily navigate the
website so that they can check the details of customers, their chances of
leaving, and the analysis and graphs that highlight the reasons for their
churn. Additionally, a chatbot could be added to make the website more
efficient.
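The proposed grouping by churn probability can be sketched as a simple
thresholding step. The cluster names and the sample probabilities are
assumptions, and because the text does not say where probabilities of
exactly 0.65 or 0.85 belong, this sketch places 0.85 in the middle group
and 0.65 in the low group.

```python
def churn_cluster(probability):
    """Assign a customer to a priority cluster from the churn probability."""
    if probability > 0.85:
        return "high"
    if probability > 0.65:
        return "medium"
    return "low"


def priority_list(probabilities):
    """Return customer ids in the 'high' cluster, most likely to churn first."""
    high = [(p, cid) for cid, p in probabilities.items() if churn_cluster(p) == "high"]
    return [cid for p, cid in sorted(high, reverse=True)]


# Hypothetical model outputs keyed by customer id.
scores = {"C1": 0.91, "C2": 0.70, "C3": 0.20, "C4": 0.97}
urgent = priority_list(scores)
```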
REFERENCES
1. Agrawal, S., A. Das, A. Gaikwad, and S. Dhage, “Customer Churn
Prediction Modelling Based on Behavioural Patterns Analysis using
Deep Learning”, 2018 International Conference on Smart Computing
and Electronic Enterprise (ICSCEE), (Shah Alam, 2018), pp. 1–6.
2. Cao, S., W. Liu, Y. Chen, and X. Zhu, “Deep Learning Based
Customer Churn Analysis”, 2019 11th International Conference on
Wireless Communication and Signal Processing (WCSP), (Xi’an,
China, 2019), pp. 1–6.
3. Dalvi, P. K., S. K. Khandge, A. Deomore, A. Bankar, and V. A.
Kanade, “Analysis of Customer Churn Prediction in Telecom Industry
Using Decision Trees and Logistic Regression”, 2016 Symposium on
Colossal Data Analysis and Networking (CDAN), (Indore, 2016), pp.
1–4.
4. Hegde, Sandeep Kumar, and Monica Mundada, “Enhanced Deep Feed
Forward Neural Network Model for the Customer Attrition Analysis in
Banking Sector”,” International Journal of Intelligent Systems and
Applications, vol. 11, no. 7, pp. 10–19 (2019).
5. Hemalatha, Putta, and Geetha Mary Amalanathan, “A Hybrid
Classification Approach for Customer Churn Prediction using
Supervised Learning Methods Banking Sector”, International
Conference on Vision Towards Emerging Trends in Communication
and Networking (ViTECoN) (2019).
6. Karvana, K. G. M., S. Yazid, A. Syalim, and P. Mursanto, “Customer
Churn Analysis and Prediction Using Data Mining Models in Banking
Industry”, 2019 International Workshop on Big Data and Information
Security (IWBIS), (Bali, Indonesia, 2019), pp. 33–38.
7. Sai, B.N. Krishna and T. Sasikala, “Predictive Analysis and Modeling
of Customer Churn in Telecom using Machine Learning”, 3rd
International Conference on Trends in Electronics and Information
(ICOEI). ISBN: 978-1-5386–9439-8 (2019).
8. Spider, M., and G. Azzopardi, “Customer Churn Prediction for a
Motor Insurance Company”, 2018 Thirteenth International Conference
on Digital Information Management (ICDIM), (Berlin, Germany,
2018), pp. 172–178.
9. Ullah, I., B. Raza, A. K. Malik, et al “A Churn Prediction Model
Using Random Forest: Analysis of Machine Learning Techniques for
Churn Prediction and Factor Identification in Telecom Sector”, in
IEEE Access, vol. 7, pp. 60134–60149 (2019).
10. Ullah, I., H. Hussain, I. Ali, and A. Liaquat, “Churn Prediction in
Banking System using K-Means, LOF, and CBLOF”, 2019
International Conference on Electrical, Communication, and
Computer Engineering (ICECCE), (Swat, Pakistan, 2019), pp. 1–6.
11. Wadikar, Deepshika, “Customer Churn Analysis”,” Masters
Dissertation. Technological University, Dublin. DOI: 10.21427/KPSZ-
X829, Corpus ID: 211547276 (2020).
21
Machine and Deep Learning Techniques for
Internet of Things Based Cloud Systems
CONTENTS
21.1 Introduction
21.1.1 Power of Remote Computing
21.1.2 Security and Privacy Policies
21.1.3 Integration of Data
21.1.4 For Hosting, Providers Remove Entry Barrier
21.1.5 Improves Business Continuity
21.1.6 Facilitates Inter-device Communication
21.1.7 Pairing with Edge Computing
21.1.8 How IoT and Cloud Complement Each Other?
21.1.9 Cloud and IoT: Which Is Better?
21.1.10 The Challenges Posed by the Cloud and IoT Together?
21.1.10.1 Handling an Outsized Amount of Knowledge
21.1.10.2 Networking and Communication Protocols
21.1.10.3 Sensor Networks
21.1.10.4 Security Challenges
21.2 Security Issues in IoT-Based Cloud Systems
21.2.1 Attacks in IoT
21.2.1.1 Active Attack
21.2.1.2 Passive Attack
21.3 Machine Learning and Deep Learning: A Solution to Cyber Security
Challenges in IoT-Based Cloud Systems
21.3.1 Machine Learning and Deep Learning Techniques
Introduction
21.3.1.1 A Tour of Machine Learning Algorithms
21.3.2 Machine Learning and Deep Learning Techniques Used in
IoT Security
21.3.2.1 Supervised Machine Learning
21.3.2.2 Unsupervised ML
21.3.2.3 Deep Learning (DL) Methods for IoT Security
21.3.2.4 Unsupervised DL (Generative Learning)
21.3.2.5 Semi-Supervised or Hybrid DL
21.4 Conclusion
References
21.1 Introduction
Nowadays IoT plays an important role in making everyone’s life easier when
carrying out complex tasks. As technology grows, the world around us has
become more connected than ever before. The Internet of Things has produced
a network of interconnected devices and sensors that handle everyday tasks
with ease [1]. IoT applications such as smart cities, smart health care,
smart automobiles, smart retail, and smart homes show how connected devices
are disrupting the norm and prompting the creation of an efficient and
automated planet. Over the years, these devices have become very smart and
handle vast amounts of data transferred between different machines. Over
time, IoT devices have become capable of handling real-time data such as
sensor data, images, audio, and video [1]. The IoT generates large
quantities of data and in turn puts a huge strain on network
infrastructure, which makes it very difficult for clients to transfer big
data from one place to another. To address this situation, cloud computing
has entered information technology and provides different services to end
users according to their needs. IoT devices alone will not benefit the
progression of infrastructure; instead, combining them with other
technologies such as cloud services gives massive advantages to clients
[1]. Cloud computing services facilitate instant, on-demand delivery of the
computing infrastructure, databases, storage, and applications needed for
the processing and analysis of the data points generated by numerous IoT
devices. At present, 96% of organizations have adopted cloud services, and
with the rise of Amazon Web Services, Google Cloud Platform, Microsoft
Azure, and IBM Cloud, the expanding possibilities of the Internet of Things
appear even brighter. In light of its capacity and availability, the cloud
is hailed as a revolutionary advancement across the planet. Here are some
key reasons the cloud is essential to the success of the Internet of Things
[1].
TABLE 21.1
Comparison between IoT and Cloud Computing
IoT acts as a source of large data sets, whereas the cloud manages
large amounts of data.
The reach of IoT devices is very limited, whereas cloud systems can
be spread widely across the network.
The storage capacity of IoT devices is limited or effectively
absent, whereas cloud systems offer large, virtually unlimited
capacity.
The computing capability of IoT devices is very limited, whereas
cloud systems have far more computing power.
IoT devices run on hardware, whereas cloud systems run on virtual
machines.
Deep Learning
Supervised Models
Classic Neural Networks (Multilayer Perceptrons): Classic
neural networks are a class of artificial neural networks.
An MLP comprises at least three layers: an input layer, a
hidden layer, and an output layer [23].
Ex: Fitness approximation, speech recognition, image
recognition, machine translation, etc…
Convolutional Neural Networks (CNNs): In DL, a
convolutional neural network (CNN or ConvNet) is a class of
deep neural networks most commonly used for visual
perception tasks. They are also known as shift invariant or
space invariant artificial neural networks (SIANN) [24].
Ex: Decoding Facial Recognition, Historic and
Environmental Collections, Analyzing Documents,
Grey Areas, Advertising, etc…
Recurrent Neural Networks (RNNs): An RNN is a class of
ANN in which the connections between nodes form a directed
graph along a temporal sequence, which lets it exhibit
temporal dynamic behavior. Derived from feed-forward neural
networks, RNNs can use their internal state (memory) to
process variable-length sequences of inputs. The term
“recurrent neural network” is used loosely to refer to two
broad classes of networks with a similar general structure,
where one is finite impulse and the other is infinite
impulse. Both classes of networks exhibit temporal dynamic
behavior [6]. A finite impulse recurrent network is a
directed acyclic graph that can be unrolled and replaced
with a strictly feed-forward neural network, while an
infinite impulse recurrent network is a directed cyclic
graph that cannot be unrolled. Both finite impulse and
infinite impulse recurrent networks can have additional
stored states [25].
Ex: Language Modeling and Generating Text, Machine
Translation, Video Tagging, Text Summarization, etc…
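The role of the internal state in an RNN can be illustrated with a minimal sketch. The single-unit network below, its weights (`w_in`, `w_rec`, `bias`), and the function name `rnn_final_state` are all hypothetical choices made for illustration; they are not taken from the chapter or from any library.

```python
import math

def rnn_final_state(sequence, w_in=0.5, w_rec=0.8, bias=0.0):
    """Run a one-unit recurrent network over a sequence of any length
    and return the last hidden state (illustrative weights only)."""
    h = 0.0  # internal state (memory), empty before the first input
    for x in sequence:
        # The new state depends on the current input AND the previous state.
        h = math.tanh(w_in * x + w_rec * h + bias)
    return h

# The same function handles sequences of different lengths.
short_state = rnn_final_state([1.0, 0.0])
long_state = rnn_final_state([1.0, 0.0, 0.0, 0.0, 0.0])
print(short_state, long_state)
```

Because the state is fed back at every step, the effect of the first input is still present, although weaker, after several zero inputs.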
Unsupervised Models
Self-Organizing Maps (SOMs): A SOM is a type of
artificial neural network that is trained to produce a
low-dimensional (typically two-dimensional)
representation of the input space of the training
samples, called a map, and is therefore a method for
dimensionality reduction [26].
Ex: Project prioritization and selection, Seismic
facies analysis for oil and gas exploration, Failure
mode and effects analysis, Creation of artwork,
etc…
Boltzmann Machines: A Boltzmann machine (also called
a stochastic Hopfield network with hidden units, a
Sherrington–Kirkpatrick model with external field, or
a stochastic Ising-Lenz-Little model) is a type of
stochastic recurrent neural network. It is based on the
spin-glass model. Boltzmann machines are divided into
the following kinds:
a. Restricted Boltzmann machine: A restricted
Boltzmann machine (RBM) does not allow
intralayer connections among hidden units or
among visible units; that is, there are no
visible-to-visible or hidden-to-hidden
connections. After training one RBM, the
activities of its hidden units can be treated as
data for training a higher-level RBM. This method
of stacking RBMs makes it possible to train many
layers of hidden units efficiently and is one of
the most common deep learning strategies. As each
new layer is added, the generative model
improves.
b. Deep Boltzmann machine: A deep Boltzmann
machine (DBM) is a type of binary pairwise
Markov random field (undirected probabilistic
graphical model) with multiple layers of hidden
random variables. Like deep belief networks
(DBNs), DBMs can learn complex and abstract
internal representations of the input in tasks
such as object or speech recognition, using
limited labeled data to fine-tune the
representations built from a large set of
unlabeled sensory input data. However, unlike
DBNs and deep convolutional neural networks,
they pursue the inference and training procedure
in both directions, bottom-up and top-down,
which allows the DBM to better unveil the
representations of the input structures [27].
Ex: linguistics, robotics, computer vision,
etc…
Auto Encoders: An autoencoder is a type of ANN used
to learn efficient data codings in an unsupervised
manner. The aim of an autoencoder is to learn a
representation (encoding) for a set of data, typically
for dimensionality reduction, by training the network
to ignore signal “noise.” Along with the reduction
side, a reconstructing side is learned, where the
autoencoder tries to generate from the reduced encoding
a representation as close as possible to its original
input, hence its name. Several variants of the basic
model exist, with the aim of forcing the learned
representations of the input to assume useful
properties. Examples are the regularized autoencoders
(sparse, denoising, and contractive autoencoders),
which have proven effective in learning representations
for subsequent classification tasks, and variational
autoencoders, with their recent applications as
generative models. Autoencoders are used effectively
for many applied problems, from face recognition to
acquiring the semantic meaning of words [28].
Ex: Image Denoising, Dimensionality Reduction,
Sequence to sequence prediction,
Recommendation system, etc…
Machine Learning
Supervised Learning
Regression: Regression is used to predict a
continuous dependent variable from a number of
independent variables. If the dependent variable is
dichotomous, logistic regression should be used
instead. The independent variables used in
regression can be either continuous or
dichotomous. One point to keep in mind is that
regression analysis cannot establish causal
relationships among the variables. While the
terminology is such that we say that X “predicts”
Y, we cannot say that X “causes” Y [17].
Ex: Prediction, Forecasting, estimating
life expectancy, etc…
Classification: Classification can be described as
the process of predicting a category or class from
observed values or given data points. The
classified output can take a form such as “Black”
or “White”, or “spam” or “no spam.”
Mathematically, classification is the task of
approximating a mapping function (f) from input
variables (X) to output variables (Y). It belongs
to supervised machine learning, in which the
targets are provided along with the input data
set. An example of a classification problem is
spam detection in emails. There are only two
classes of output, “spam” and “no spam”; therefore
this is binary classification [18].
Ex: Classification of images, Customer
Retention, Fraud Detection, etc…
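The difference between the two supervised tasks above can be sketched in a few lines. The example below is a minimal illustration, not the chapter's method: the helper names (`fit_line`, `fit_logistic`, `predict_label`), the learning rate, the epoch count, and the toy data are all assumptions. It fits an ordinary least-squares line for a continuous target and a one-feature logistic regression for a dichotomous "spam" / "no spam" target.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (continuous target)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def fit_logistic(xs, labels, lr=0.1, epochs=2000):
    """Logistic regression for a 0/1 target, trained by gradient descent."""
    w, c = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_c = 0.0
        for x, y in zip(xs, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + c)))  # predicted P(y = 1)
            grad_w += (p - y) * x
            grad_c += (p - y)
        w -= lr * grad_w / n
        c -= lr * grad_c / n
    return w, c

def predict_label(x, w, c):
    """Binary decision: positive score means 'spam'."""
    return "spam" if w * x + c > 0 else "no spam"

# Continuous target: made-up points lying on y = 1 + 2x.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# Dichotomous target: hypothetical count of suspicious words per email.
counts = [0, 1, 2, 7, 8, 9]
is_spam = [0, 0, 0, 1, 1, 1]
w, c = fit_logistic(counts, is_spam)
print(a, b, predict_label(1, w, c), predict_label(8, w, c))
```

As the regression entry notes, a good fit here says only that the input predicts the output, not that it causes it.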
Unsupervised Learning
Clustering: Clustering is used to group data
according to similar characteristics. Assume you
are the head of a rental store and wish to
understand the preferences of your customers in
order to scale up your business. Is it feasible
for you to look at the details of every customer
and devise a unique business strategy for each one
of them? Certainly not. What you can do, however,
is cluster all of your customers into, say, 10
groups based on their purchasing habits and use a
separate strategy for the customers in each of
these 10 groups [19].
Ex: Segmentation, Targeted Marketing, etc…
Dimensionality Reduction: Dimensionality
reduction refers to techniques for reducing the
number of input variables in training data. High
dimensionality may mean hundreds, thousands, or
even millions of input variables. Fewer input
dimensions often mean correspondingly fewer
parameters or a simpler structure in the machine
learning model, referred to as degrees of freedom.
A model with too many degrees of freedom is likely
to overfit the training dataset and therefore may
not perform well on new data. It is desirable to
have simple models that generalize well and, in
turn, input data with few input variables. This is
particularly true for linear models, where the
number of inputs and the degrees of freedom of the
model are often closely related [20].
Ex: Structure Discovery, Compression,
Visualization, etc…
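The customer-grouping idea described under Clustering can be sketched with k-means (Lloyd's algorithm). This is one common clustering technique chosen for illustration; the chapter does not name a specific algorithm. The function name `kmeans_1d`, the starting centroids, and the spending figures are made-up assumptions.

```python
def kmeans_1d(values, centroids, iterations=20):
    """Lloyd's algorithm on one-dimensional data.

    `centroids` are caller-supplied starting points, which keeps the
    sketch deterministic. Returns the final centroids and the groups.
    """
    centroids = list(centroids)
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda k: abs(v - centroids[k]))
            groups[nearest].append(v)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its old centroid).
        updated = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centroids)]
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return centroids, groups

# Hypothetical monthly spend per customer, grouped into three segments.
monthly_spend = [12, 15, 14, 80, 85, 90, 300, 310]
centers, segments = kmeans_1d(monthly_spend, centroids=[0, 100, 200])
print(centers)
print(segments)
```

Each resulting segment can then receive its own strategy, instead of one strategy per customer.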
Semi-Supervised Learning
Semi-Supervised Clustering: In the real world,
there is more unlabeled data than labeled data,
but there is still some labeled data. We want to
use all available information to build the most
robust and best-performing models, and
semi-supervised clustering helps here. Generating
labels may not be easy or cheap, so, because of
limited resources, we may have labels for only a
few observations. For instance, investigating
fraud is expensive, so we may know about confirmed
fraud or confirmed non-fraud for only a limited
number of insurance claims. Not knowing, however,
does not mean that the remaining claims cannot be
fraudulent. One example of semi-supervised
clustering is news category classification, as
seen on Google News [21]. There may be some
information about a news item being related to
“politics” or “sports”, but nobody can sift
through a huge number of items every day to create
fully labeled data. Similarly, image recognition
uses a comparable technique, as you may
experience on Google Photos.
Ex: Speech analysis, DNA Sequence
Classification, etc…
Semi-Supervised Classification: The
semi-supervised algorithm based on collaborative
training (co-training) exploits unlabeled data by
using multiple classifiers. In the learning
process, the unlabeled data serve as a platform
for information exchange between the classifiers.
The differences between the classifiers are
critical to the effectiveness of such learning,
which is called semi-supervised learning based on
disagreement. A semi-supervised classification
algorithm based on multi-classifier collaboration
uses only labeled samples to improve the diversity
of the classifiers and does not use the abundant
information in the large number of unlabeled
samples to enhance that diversity [22].
Ex: Speech analysis, DNA Sequence
Classification, etc…
Reinforcement Learning
Positive Reinforcement Learning: Positive
reinforcement is defined as an event that occurs
because of a specific behavior. It increases the
strength and frequency of the behavior and has a
positive effect on the action taken by the agent.
This kind of reinforcement helps to maximize
performance and sustain change for a more
extended period. However, too much reinforcement
may lead to over-optimization of state, which can
affect the results.
Ex: Resource Management, Traffic light
control, robotics, etc…
Negative Reinforcement Learning: Negative
reinforcement is defined as the strengthening of
behavior that occurs because of a negative
condition that should have been stopped or
avoided. It helps to define the minimum standard
of performance. However, the drawback of this
method is that it provides only enough to meet the
minimum behavior.
Ex: Resource Management, Traffic light
control, robotics, etc…
21.3.2.2 Unsupervised ML
In this section we discuss various unsupervised learning techniques.
21.4 Conclusion
Here, I examined the basics of Internet of Things devices and the role of
cloud computing in the IoT environment, and then presented the essential
AI and ML algorithms. The overview of these algorithms explained the role
of AI and machine learning algorithms in IoT security; they play a
significant part in securing IoT-based cloud systems. This chapter
described IoT security and how machine learning and deep learning
algorithms are used to identify different attacks, such as distributed
denial-of-service attacks, to detect malware and other malicious attacks,
and to provide security to IoT devices in IoT-based cloud systems. Future
work will concentrate on attacks that cause greater damage to IoT
devices.
REFERENCES
1. Aaron Chichioco, “What-is-the-role-of-cloud-computing-in-IoT”,
https://blog.resellerclub.com/what-is-the-role-of-cloud-computing-in-
iot/.
2. K. Sravanthi, Kavitha Agarwal, A. K. Tyagi, “Beyond things: A
systematic study of internet of everything”, In: Abraham A., Panda
M., Pradhan S., Garcia-Hernandez L., Ma K. (eds) Innovations in Bio-
Inspired Computing and Applications. IBICA 2019. Advances in
Intelligent Systems and Computing, vol. 1180. Springer, Cham.
https://doi.org/10.1007/978-3-030-49339-4_23, 2020.
3. D. Evans, “The internet of things: How the next evolution of the
internet is changing everything”, CISCO White Paper, vol. 1, no. 2011,
pp. 1–11, 2011.
4. S. Ray, Y. Jin, and A. Ray Chowdhury, “The changing computing
paradigm with internet of things: A tutorial introduction”, IEEE
Design & Test, vol. 33, no. 2, pp. 76–96, 2016.
5. A. R. Sfar, E. Natalizio, Y. Challal, and Z. Chtourou, “A roadmap for
security challenges in the Internet of things”, Digital Communications
and Networks, vol. 4, no. 2018, 118–137, 2018.
6. M. K. Saggi and S. Jain, “A survey towards an integration of big data
analytics to big insights for value-creation”, Information Processing &
Management, 54(5) DOI: 10.1016/j.ipm.2018.01.010, 2018.
7. D. Li, Z. Cai, L. Deng, X. Yao, and H. H. Wang, “Information security
model of block chain based on intrusion sensing in the IoT
environment”. Cluster Comput, vol. 22, pp. 451–468. doi:
10.1007/s10586-018-2516-1.
8. F. Restuccia, S. D. Oro, and T. Melodia, “Securing the internet of
things in the age of machine learning and software-defined
networking”, IEEE Internet of Things Journal, vol. 5, pp. 4829–4842,
Dec. 2018.
9. A. Thakkar and R. Lohiya, “A review on machine learning and deep
learning perspectives of IDS for IoT: Recent updates, security issues,
and challenges”, Arch Computat Methods Eng, vol. 4, pp. 234-237,
2020. https://doi.org/10.1007/s11831-020-09496-0.
10. T. Hothorn. “CRAN task view: Machine learning and statistical
learning”, Version:2017-01-06 URL: https://CRAN.R-
project.org/view=MachineLearning, 2019.
11. R. Bandi and G. Anitha, “Machine learning based Oozie workflow for
hive query schedule mechanism”, 2018 International Conference on
Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India,
2018, pp. 513–517, doi: 10.1109/ICSSIT.2018.8748711.
12. R. Bandi and J. Amudhavel, “Object recognition using Keras with
backend tensor flow”, International Journal of Engineering and
Technology (UAE), 7, no. 3.6, pp. 229–233, 2018.
13. Raswitha Bandi, J. Amudhavel, and R. Karthik, “Machine learning
with PySpark–Review”, Indonesian Journal of Electrical Engineering
and Computer Science, vol. 12, no. 1, pp. 102–106, 2018.
14. K. A. da Costa, J. P. Papa, C. O. Lisboa, R. Munoz, and V. H. C. de
Albuquerque, “Internet of things: A survey on machine learning-based
intrusion detection approaches”, Computer Networks, vol. 151, pp.
147–157, 2019.
15. J. Hou, L. Qu, and W. Shi, “A survey on internet of things security
from data perspectives”, Computer Networks, vol. 148, pp. 295–306,
2019.
16. M. binti Mohamad Noor and W. H. Hassan, “Current research on
internet of things (IoT) security: A survey”, Computer Networks, vol.
148, pp. 283–294, 2019.
17. S. Zhang, L. Yao, A. Sun, and Y. Tay, “Deep learning based
recommender system: A survey and new perspectives”, ACM
Computing Surveys vol. 52, no. 1, p. 5, 2019.
18. S.-G. Leem, I.-C. Yoo, and D. Yook, “Multitask learning of deep
neural network-based keyword spotting for IoT devices”, IEEE Trans.
Consum. Electron., vol. 65, no. 2, pp. 188–194, May 2019.
19. W. Z. Khan, M. Y. Aalsalem, and M. K. Khan, “Communal acts of IoT
consumers: A potential threat to security and privacy”, IEEE Trans.
Consum. Electron., vol. 65, no. 1, pp. 64–72, Feb. 2019.
20. H. H. Pajouh, R. Javidan, R. Khayami, D. Ali, and K.-K. R. Choo, “A
two-layer dimension reduction and two-tier classification model for
anomaly-based intrusion detection in IoT backbone networks”, IEEE
Trans. Emerg. Topics Comput., vol. 7, no. 2, pp. 314–323, Apr–Jun
2019.
21. G. Dulac-Arnold, D. Mankowitz, and T. Hester, “Challenges of real-
world reinforcement learning”, 2019. [Online]. Available: arXiv:
1904.12901.
22. P. Asghari, A. M. Rahmani, and H. H. S. Javadi, “Internet of things
applications: A systematic review”, Computer Networks, vol. 148, pp.
241–261, 2019.
23. L. Xiao, X. Wan, and Z. Han, “Phy-layer authentication with multiple
landmarks with reduced overhead”, IEEE Transactions on Wireless
Communications, vol. 17, pp. 1676–1687, Mar. 2018.
24. D. Li, Z. Cai, L. Deng, X. Yao, and H. H. Wang, “Information security
model of block chain based on intrusion sensing in the iot
environment”, Cluster Computing, Mar. Special Issue 1, 2018.
25. J. F. Colom, D. Gil, H. Mora, B. Volckaert, and A. M. Jimeno,
“Scheduling framework for distributed intrusion detection systems
over heterogeneous network architectures”, Journal of Network and
Computer Applications, vol. 108, pp. 76 –86, 2018.
26. I. Makhdoom, M. Abolhasan, J. Lipman, R. P. Liu, and W. Ni,
“Anatomy of threats to the internet of things”, IEEE Communications
Surveys Tutorials, vol. 21, no. 2, pp. 1636–1675, 2018.
27. D. E. Kouicem, A. Bouabdallah, and H. Lakhlef, “Internet of things
security: A top-down survey”, Computer Networks, vol. 141, pp. 199–
221, Aug. 2018.
28. I. Yaqoob et al, “The rise of ransom ware and emerging security
challenges in the Internet of Things”, Computer Networks, vol. 129,
pp. 444–458, Dec. 2017.
29. L. Xiao, X. Wan, X. Lu, Y. Zhang, and D. Wu, “IoT security
techniques based on machine learning”, 2018. [Online]. Available:
arXiv:1801.06275.
30. Jason Brownlee, “A tour of Machine learning algorithms”,
https://machinelearningmastery.com/a-tour-of-machine-learning-
algorithms/.
31. A. Oussous, F.-Z. Benjelloun, A. A. Lahcen, and S. Belfkih, “Big data
technologies: A survey”, J. King Saud Univ. Comput. Inf. Sci., vol. 30,
no. 4, pp. 431–448, 2018.
Section V
Future Research Opportunities
towards Data Science and Data
Analytics
22
Dialect Identification of the Bengali
Language
CONTENTS
22.1 Introduction
22.2 Previous Works
22.3 Proposed Methodology
22.3.1 Computation of Features
22.3.1.1 Feature Selection
22.3.2 Formation of Feature Vector and Classification
22.4 Experimental Results
22.4.1 Relative Analysis
22.5 Conclusion
References
22.1 Introduction
India is a vast country with a huge diversity of languages. The
Constitution of India recognizes many of them, while several others,
spoken by different tribes, are not recognized. Some Indian languages
have no written form at all and exist only in speech. The pronunciation
style of a language varies from one geographic region to another, even
though the people of both regions speak the same language. This type of
variation within a language is termed a dialect of that language. A
single language can have many dialects because its speakers are not
confined within one geographical boundary but are spread across
different regions.
Classification of Indian languages is a fascinating area of
investigation because of their enormous variety. Since India has many
languages as well as considerable regional variation, every Indian
language contains several dialects. A dialect is not a new language but
a sub-language: a distinguishable form of a single language that is
specific to a geographic region or social group. Geography has a deep
impact on language because regional differences bring cultural
differences, which in turn give rise to dialects. The pronunciation
style of a language changes with the region. People of Tripura,
Jharkhand, or other localities do not speak Bengali the way people of
West Bengal do; the pronunciation changes although they all speak the
same language. Although a dialect is mainly a pronunciation variant, it
can deviate from the main language so far that it becomes hard to
recognize it as a variation of that language rather than a separate
language.
Dialect identification plays an important role in automated speaker
recognition. If a system is well trained on different dialects of a
language, its performance improves: the more dialects the dataset
covers, the more versatile the dataset becomes and the better the system
performs. Dialect identification thus supports the language
identification task.
Among the languages of India, Bengali is also one of the most widely
spoken languages in the world. It is spoken in neighboring Bangladesh as
well, where it is the national language. Bengali is the seventh most
spoken language in the world [1, 2]; 3.4% [1] of the world's population
speaks Bengali. In India, Bengali is largely spoken by Bengalis residing
in West Bengal, Tripura, Jharkhand, and Chhattisgarh, along with other
states, and it is widely spoken in the Andaman and Nicobar Islands.
Moreover, the national anthems of both Bangladesh and India were written
in Bengali by the Bengali Nobel laureate Rabindranath Tagore. The
national song of India, written by Bankim Chandra Chatterjee, is also
written and sung in Bengali. Among the Indian languages, Bengali is
globally acknowledged, and the Bengali language is itself a well-known
research domain with a large body of ongoing work. On the question of
dialect identification, however, most work has been done either on
European languages or on other Indian languages such as Hindi. The few
works identified on Bengali dialects are based on the Bengali dialects
of Bangladesh, and research on the Bengali dialects of India remains
largely unexplored. This has motivated the present effort to identify
dialects of the Bengali language as a whole.
The Bengali language has many dialects, of which three major ones are
considered in this work: Bangali (pronounced “Bongaali”), Manbhumi
(Jharkhandi), and Rarhi. These dialects are major because they are
prominent and are followed by a large share of Bengali speakers. They
are spoken in the western, southern, and eastern regions of West Bengal.
Previous efforts on dialect identification and related work are
discussed in Section 22.2. The proposed approach for dialect
identification of the Bengali language is presented in Section 22.3.
Section 22.4 describes the experimental results along with a comparative
analysis with other work. Section 22.5 concludes the work.
This work aims to put forward an acoustic feature set that is
computationally simple as well as low dimensional.
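\[ \sum_{m=1} \operatorname{sign}\left[ x_i(m-1) \ast x_i(m) \right] \qquad (22.1) \]
\[ \operatorname{sign}[v] = \begin{cases} 1, & \text{if } v > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (22.2) \]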
If Figures 22.3, 22.4, and 22.5 are examined closely, it can be seen that
the rate of crossing the zero axis differs considerably across the three
dialects of the Bengali language.
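The zero-crossing computation and the mean and standard deviation taken over frames can be sketched as follows. This is a minimal illustration under stated assumptions: the sketch counts adjacent sample pairs whose product is negative, which is the usual meaning of a zero crossing, whereas Eq. (22.2) as printed assigns 1 to a positive argument. The framing scheme (non-overlapping frames), the normalization by the number of sample pairs, and the function names are likewise hypothetical.

```python
import math

def zero_crossings(frame):
    """Count adjacent sample pairs whose product is negative (a sign change)."""
    return sum(1 for prev, cur in zip(frame, frame[1:]) if prev * cur < 0)

def mean_and_std(values):
    """Mean and population standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def zcr_features(signal, frame_length):
    """Split the signal into non-overlapping frames, compute the
    zero-crossing rate of each frame, and summarize the rates."""
    if frame_length < 2 or len(signal) < frame_length:
        raise ValueError("need at least one frame of two or more samples")
    frames = [signal[i:i + frame_length]
              for i in range(0, len(signal) - frame_length + 1, frame_length)]
    rates = [zero_crossings(f) / (len(f) - 1) for f in frames]
    return mean_and_std(rates)

# Toy signal: the first frame alternates in sign, the second never crosses zero.
toy_signal = [1, -1, 1, -1, 1, 1, 1, 1]
zcr_mean, zcr_std = zcr_features(toy_signal, frame_length=4)
print(zcr_mean, zcr_std)
```

The two summary values give a fixed-length description of the whole signal regardless of how many frames it contains.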
\[ f_m = 2595 \ast \log_{10}\left(1 + \frac{f}{700}\right) \qquad (22.3) \]
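The frequency-to-mel mapping can be checked numerically with a short sketch. The constants 2595 and 700 are those of the widely used mel-scale formula; treat them as an assumption about what Eq. (22.3) intends. The function names `hz_to_mel` and `mel_to_hz` are hypothetical.

```python
import math

def hz_to_mel(frequency_hz):
    """Map a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, from mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

for hz in (0, 500, 1000, 4000):
    print(hz, round(hz_to_mel(hz), 2))
```

With these constants the scale is close to linear at low frequencies and compresses higher ones, so 1000 Hz maps to roughly 1000 mel.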
where t indicates the total number of frames and spec_f represents the
spectral flux for the Yth spectrum of the signal. If s(Y, j) is the value
of the jth bin of the Yth spectrum, then s(Y−1, j) is the value of the
jth bin of the (Y−1)th spectrum. As with ZCR, only the mean and standard
deviation of spectral flux are considered.
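A minimal sketch of a spectral flux computation follows. The defining equation is not reproduced in the surviving text, so the sketch assumes one common form, the sum over bins j of the squared difference between s(Y, j) and s(Y−1, j); the function names and the toy spectra are hypothetical.

```python
import math

def spectral_flux(spectra):
    """One flux value per pair of consecutive spectra.

    Assumed definition: sum over bins of (s(Y, j) - s(Y-1, j)) ** 2.
    """
    flux = []
    for previous, current in zip(spectra, spectra[1:]):
        flux.append(sum((c - p) ** 2 for p, c in zip(previous, current)))
    return flux

def mean_and_std(values):
    """Mean and population standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

# Three toy magnitude spectra with three bins each.
toy_spectra = [[1, 2, 3], [1, 2, 3], [2, 4, 3]]
flux_values = spectral_flux(toy_spectra)
flux_mean, flux_std = mean_and_std(flux_values)
print(flux_values, flux_mean, flux_std)
```

An unchanged spectrum contributes zero flux, so the values grow only where the spectral content changes between frames.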
Spectral flux plots of Bangali, Jharkhandi, and Rarhi are depicted in
Figures 22.10, 22.11, and 22.12, respectively.
FIGURE 22.10 Spectral flux plot of Bangali dialect.
FIGURE 22.11 Spectral flux plot of Jharkhandi dialect.
FIGURE 22.12 Spectral flux plot of Rarhi dialect.
TABLE 22.1
Accuracy of Dialect Identification of Bengali Language (Using F1)
Classifier Discrimination Exactness (in %) Meant for
Suggested Feature Set
TABLE 22.2
Accuracy of Dialect Identification of Bengali Language (Using F2)
TABLE 22.3
Accuracy of Dialect Identification of Bengali Language (Using F3)
TABLE 22.4
Accuracy of Dialect Identification of Bengali Language (Using F4)
TABLE 22.5
Relative Analysis of Intended Facet Set with Other Work
22.5 Conclusion
This work indicates that the suggested feature set is able to identify
three major categories of Bengali dialects (Bangali, Jharkhandi, and
Rarhi) with an 86.67% success rate using the Naïve Bayes classifier. It
also establishes that MFCC, together with the mean and standard deviation
of ZCR and skewness, is capable of identifying dialects of the Bengali
language.
The Sequential Floating Forward Search (SFFS) feature selection
technique is applied in this work to find the best possible feature set
for identifying dialects of the Bengali language. With this technique
the most significant feature is added to the candidate feature set first
and the other features are added gradually. If a feature is found to
discriminate poorly, it is discarded in the subsequent phase. In this
work, MFCC is added to the candidate feature set first, as MFCC is the
most significant feature for speech processing related activities. Next,
the mean and standard deviation of ZCR are added, and after that the
skewness-based feature is included. Classification accuracy is observed
to increase each time one of these features is added to the candidate
set. When the spectral flux based feature is added, however, the
classification accuracy of the system is observed to decrease. From this
observation it is concluded that spectral flux based features do not
identify dialects of the Bengali language very well, whereas the
combination of MFCC, the mean and standard deviation of ZCR, and
skewness produces better classification accuracy.
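The selection procedure described above can be sketched as a small Sequential Floating Forward Search. This is an illustrative sketch, not the authors' implementation: the function name `sffs` is hypothetical, and the scoring function uses made-up integer weights in arbitrary units that merely mimic the qualitative pattern reported here (spectral flux lowers the score). They are not the chapter's measured accuracies.

```python
def sffs(features, score):
    """Sequential Floating Forward Search over a list of feature names.

    `score` maps a list of feature names to a number (higher is better).
    Returns the best-scoring subset seen at any size.
    """
    selected = []
    best = {}  # subset size -> (score, subset): best subset of that size so far

    def record(subset):
        """Store the subset if it beats the best of its size; report success."""
        value = score(subset)
        size = len(subset)
        if size not in best or value > best[size][0]:
            best[size] = (value, list(subset))
            return True
        return False

    while len(selected) < len(features):
        # Forward step: add the feature that helps the current subset most.
        remaining = [f for f in features if f not in selected]
        addition = max(remaining, key=lambda f: score(selected + [f]))
        selected = selected + [addition]
        record(selected)

        # Floating step: drop the least useful feature for as long as the
        # reduced subset beats the best subset of that smaller size.
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            if record(reduced):
                selected = reduced
            else:
                break

    return max(best.values(), key=lambda pair: pair[0])[1]

# Made-up usefulness weights (arbitrary units), for illustration only.
TOY_WEIGHTS = {"mfcc": 50, "zcr": 20, "skewness": 10, "spectral_flux": -10}

def toy_score(subset):
    return sum(TOY_WEIGHTS[name] for name in subset)

chosen = sffs(list(TOY_WEIGHTS), toy_score)
print(chosen)
```

In practice `score` would train and evaluate a classifier on the given feature subset; the toy function only stands in for that step.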
In the future, other features may be explored to improve the
classification accuracy. Other, non-major dialects of the Bengali
language and sub-classification of the dialects studied here may also be
explored. The comparative analysis with previous efforts also makes it
clear that the Mel frequency cepstral coefficient (MFCC) alone is not
sufficient to identify different dialects of the Bengali language. Zero
Crossing Rate (ZCR) and skewness-based acoustic features also need to be
considered along with MFCC, as these dialects diverge in both the time
domain and the frequency domain.
REFERENCES
1. “The World Factbook” (https://www.cia.gov/library/publications/the-
world-factbook/geos/xx.html). www.cia.gov. Central Intelligence
Agency. Archived (https://web.archive.org/web/20080213004843/
https://www.cia.gov/library/publications/the-world-
factbook/geos/xx.html) from the original on 13 February 2008.
Retrieved 21 February 2018.
2. “Summary by language size”
(https://www.ethnologue.com/statistics/size). Ethnologue. 3 October
2018. Archived (https://web.archive.org/web/20130911104311/
http://www.ethnologue.com/statistics/size) from the original on 11
September 2013. Retrieved 21 February 2019.
3. Torres-Carrasquillo, P. A., Gleason, T. P., & Reynolds, D. A. (2004).
“Dialect identification using Gaussian mixture models”. In
ODYSSEY04-The Speaker and Language Recognition Workshop (pp.
297–300).
4. Hossain, S. A., Rahman, M. L., & Ahmed, F. (2005, October). “A
review on Bangla phoneme production and perception for
computational approaches”. In 7th WSEAS International Conference
on Mathematical Methods and Computational Techniques in Electrical
Engineering (pp. 69–89).
5. Baker, W., Eddington, D., & Nay, L. (2009). “Dialect identification:
The effects of region of origin and amount of experience”. American
Speech, 84(1), 48–71.
6. Muhammad, G., Alotaibi, Y. A., & Huda, M. N. (2009, December).
“Automatic speech recognition for Bangla digits”. In 2009 12th
International Conference on Computers and Information Technology
(pp. 379–383). IEEE.
7. Mandal, S. D., Warsi, A. H., Basu, T., Hirose, K., & Fujisaki, H.
(2010, November). “Analysis and synthesis of F0 contours for Bangla
readout speech”. In Proc. of Oriental COCOSDA.
8. Rao, K. S. (2011). “Role of neural network models for developing
speech systems”. Sadhana, 36(5), 783–836.
9. Rashel, M. M. (2011). “Phonological analysis of Chatkhil Dialect in
Noakhali District, Bangladesh”. Theory and Practice in Language
Studies, 1(9), 1051–1061.
10. Warsi, A. H., Basu, T., Hirose, K., & Fujisaki, H. (2011, October).
“Prosodic comparison of declarative and interrogative utterances in
Standard Colloquial Bangla”. In 2011 International Conference on
Speech Database and Assessments (Oriental COCOSDA) (pp. 56–61).
IEEE.
11. Rao, K. S., & Koolagudi, S. G. (2011). “Identification of Hindi
dialects and emotions using spectral and prosodic features of speech”.
IJSCI: International Journal of Systemics, Cybernetics and
Informatics, 9(4), 24–33.
12. Saxena, A., & Borin, L. (2011, May). “Dialect classification in the
Himalayas: A computational approach”. In Proceedings of the 18th
Nordic Conference of Computational Linguistics (NODALIDA 2011)
(pp. 307–310).
13. Faquire, A. B. M. R. K. (2012). “On the classification of varieties of
Bangla spoken in Bangladesh”. Bup Journal, 1(1), 136.
14. Warsi, A. H., Basu, T., Hirose, K., & Fujisaki, H. (2012, December).
“Analysis and synthesis of F0 contours of declarative, interrogative,
and imperative utterances of Bangla”. In 2012 International
Conference on Speech Database and Assessments (pp. 56–61). IEEE.
15. Das, B., Mandal, S., Mitra, P., & Basu, A. (2013). “Effect of aging on
speech features and phoneme recognition: A study on Bengali voicing
vowels”. International Journal of Speech Technology, 16(1), 19–31.
16. Etman, A., & Beex, A. L. (2015, November). “Language and dialect
identification: A survey”. In 2015 SAI Intelligent Systems Conference
(IntelliSys) (pp. 220–231). IEEE.
17. Mehrabani, M., & Hansen, J. H. (2015). “Automatic analysis of
dialect/language sets”. International Journal of Speech Technology,
18(3), 277–286.
18. Das, P. P., Allayear, S. M., Amin, R., & Rahman, Z. (2016, February).
“Bangladeshi dialect recognition using Mel frequency cepstral
coefficient, delta, delta-delta and Gaussian mixture model”. In 2016
Eighth International Conference on Advanced Computational
Intelligence (ICACI) (pp. 359–364). IEEE.
19. Sarma, M., & Sarma, K. K. (2016, February). “Dialect identification
from Assamese speech using prosodic features and a neuro fuzzy
classifier”. In 2016 3rd International Conference on Signal Processing
and Integrated Networks (SPIN) (pp. 127–132). IEEE.
20. Bhowmik, T., Chowdhury, A., & Mandal, S. K. D. (2018). “Deep
neural network based place and manner of articulation detection and
classification for Bengali continuous speech”. Procedia Computer
Science, 125, 895–901.
21. Ruch, H. (2018). “The role of acoustic distance and sociolinguistic
knowledge in dialect identification”. Frontiers in Psychology, 9, 818.
22. Khan, M. E. I. (2019). “Exploring Bhairab dialect vis-à-vis standard
Bangla”. Journal of ELT and Education, 2(1), 14–18.
23. Ismail, T. A Survey of Language and Dialect Identification Systems.
24. Mamun, R. K., Abujar, S., Islam, R., Badruzzaman, K. B. M., &
Hasan, M. (2020). “Bangla speaker accent variation detection by
MFCC using recurrent neural network algorithm: A distinct approach”.
In Innovations in Computer Science and Engineering (pp. 545–553).
Springer, Singapore.
23
Real-Time Security Using Computer Vision
CONTENTS
23.1 Introduction
23.1.1 Biometric
23.1.2 Computer Vision
23.1.3 OpenCV Library
23.2 Data Security
23.3 Technology
23.3.1 Face Detection
23.3.2 Face Recognition
23.3.3 Haar Cascade Classifier
23.4 Algorithm
23.4.1 Algorithm to Capture the Image for Database
23.4.2 Algorithm to Recognize the Face
23.4.3 Algorithm to Train the Face Recognizer
23.4.4 Algorithm for Security
23.5 Result
23.6 Conclusion
23.7 Future Scope
Reference
23.1 Introduction
The real-time security system described in this chapter is developed for
personal computers. It can recognize its owner, which prevents anyone else
from misusing the computer. The system uses face recognition [1,2] to
identify the owner and grants access only to that person. The main user,
referred to as the primary user, can also limit the access that other users
have to the computer. The system improves on other security systems in that
it keeps watching for its owner and gives full access only to him or her.
Other systems rely on an authentication process based on rules or
protocols, which provides security only once, at the moment the protocol is
checked. Someone else may have other means to check [3] and validate those
protocols and rules, and the protocols and rules do not verify who is
actually using the data or the system afterwards. In that situation, a
computer vision based security system proves to be an effective means of
protection.
23.1.1 Biometric
Biometric login systems use face recognition [3,4], fingerprint matching,
and other means, but only for logging into the system; once a user has
somehow logged in, there is no further check until the system or the
computer is logged out. This is the gap in which a computer vision based
security system proves to be effective. Biometrics refers to the
measurement and calculation of human body characteristics. Examples of
biometrics are as follows:
Fingerprint
Palm Veins
Face Recognition
DNA
Palm Print
Hand Geometry
Iris Recognition
Retina
Face Detection
Face Recognition
Depth Map
Image Processing
Image Repainting
3D Construction and a lot more
A. Security Goal
Confidentiality is the most common aspect of information security. It
allows only authorized users to access sensitive and protected data. Data
sent over the network should not be accessible to unauthorized users.
Attackers will try to capture data; to counter this, various encryption
techniques are used to safeguard it, so that even if an attacker gains
access, they cannot decrypt it to the original message. Integrity is the
second goal: in a bank, when we deposit and withdraw money, the balance
needs to be maintained correctly. Changes must be made only by authorized
persons, and nobody else should modify the data. Availability is the third
goal: data must be available to authorized users, because information is
useless if we cannot use it.
B. Security Services
i. Data confidentiality
ii. Data integrity
iii. Authentication
iv. Non repudiation
v. Access control
C. Data Confidentiality
Data confidentiality means protecting data from access by unauthorized
users, so that it can be used only by authorized persons. Encryption and
decryption are part of cryptography, whose main purpose is data security,
and confidentiality is therefore the main purpose of this project. It is
achieved by using strong algorithms so that the data cannot be stolen or
broken by anyone. Encrypted data appears as a jumble of thousands of
characters, so even if someone steals it, they cannot read it. Only an
authorized user who supplies the password or passkey can decrypt the
jumbled data and read it. If anyone tries to break the passkey, the data is
deleted automatically and the machine is placed on high security alert
because of the unauthorized access.
D. Data Integrity
Data integrity concerns data held in digital format on the web or on
computers, whether online or offline, and means protecting that data from
unwanted change.
We protect the data by using strong passwords or encryption techniques
based on strong algorithms. Violations of integrity are of two types,
passive and active. A passive violation is an accidental change to the
data; the fault is not created intentionally. An active violation is a
change made deliberately by an unauthorized user who has gained access to
the data.
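As a minimal sketch of how a change to stored data can be detected, the following Python example attaches a keyed digest (HMAC-SHA256 from the standard library) to a record and checks it again later. The key, the record contents, and the function names are illustrative assumptions and are not part of the system described in this chapter.

```python
import hashlib
import hmac


def make_tag(key: bytes, data: bytes) -> str:
    """Return a keyed SHA-256 digest (HMAC) of the data."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def is_unchanged(key: bytes, data: bytes, tag: str) -> bool:
    """True if the data still matches the tag computed when it was stored."""
    return hmac.compare_digest(make_tag(key, data), tag)


# Illustrative values only (hypothetical key and record).
key = b"demo-secret-key"
record = b"balance=1000"
tag = make_tag(key, record)
```

Any accidental or deliberate modification of the record changes the digest, so `is_unchanged` returns False unless the checker holds the same key and the original bytes.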
E. Authentication
Authentication means that only authorized users can access the data, by
presenting proper credentials such as a strong password or a strongly
encrypted key. A user without proper credentials cannot use the data, and
any such attempt is treated as malicious access by an unauthorized user. In
authentication, we verify the user’s identity by verifying the information
that they supply to gain access. On a mobile phone, for example, we use a
password or PIN, or nowadays biometric authentication, so that the phone
cannot be used by anyone other than us. In the same way, we protect our
online accounts, laptops, and computers with a login based on a user ID and
password, which keeps our data protected from unauthorized users. The
essential point of authentication is that the data can be accessed only by
a verified user.
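Password-based authentication of this kind can be sketched as follows. The example stores a salted PBKDF2 digest instead of the password itself and compares digests at login. The iteration count and the function names are assumptions chosen for illustration, not values taken from the chapter.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor for this sketch


def register(password: str):
    """Create a (salt, digest) pair to store instead of the password."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify(password: str, salt: bytes, stored: bytes) -> bool:
    """Recompute the digest for the supplied password and compare safely."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, stored)
```

A login routine would call `verify` with the salt and digest saved at registration and grant access only when it returns True.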
F. Non-repudiation
Non-repudiation is the assurance that a party cannot later deny having
performed an action, because the action is tied to a credential that only
that party holds. It is used in many places in cyber security and
cryptography. Example: An area or an office is protected by key cards.
Anyone who wants to enter must present a card, and a person without their
own card cannot access the area. Because every entry is attributed to the
card holder, a lost card must be reported immediately for security
purposes.
G. Access Control
Access control means that you can access the data only if you have the
proper password or other credential; otherwise access is refused. Access
control is a vital part of data security. Example: Without the proper ID
and password we cannot log in to our Facebook account, while with them we
have full access to our account and data.
H. Security Attack
Passive attacks: A passive attack attempts to learn or make use of
information from the system but does not affect system resources.
Passive attacks are in the nature of eavesdropping on transmissions.
The goal of the opponent is to obtain the information being
transmitted [16].
Active attacks: Active attacks try to alter system information or
affect system operation. They involve some modification of the data or
the creation of false statements.
Masquerade: A masquerade attack occurs when one entity pretends to be
a different entity. It is one form of active attack.
iii. Python
Web Developer
Desktop Application Developer
Apps Developer
Software Developer
Data Analyst
Data Scientist
Editor: The term editor usually refers to a source code editor that
includes many special features for writing and editing source code.
Eclipse
PyCharm
Emacs
Notepad++
Visual Studio
Spyder
Thonny
vi. Eclipse
It is a popular IDE for Windows and Linux with many features. It supports
many programming languages, such as Java, C, C++, and Python.
vii. PyCharm
Many programmers use this IDE for development. It is available in both
paid and open-source editions. PyCharm installs quickly and works on
Windows, Mac, and Linux platforms.
viii. Spyder
23.3 Technology
The project is a real-time security system in which the computer’s camera
constantly watches the owner. Authentication is repeated at regular
intervals, which keeps the system protected. When the user moves away or a
new user comes in front of the computer, the system recognizes an invalid
user and stops that user’s access.
On finding an unknown person, the system repeatedly freezes or deactivates
the keyboard and displays, in sequence, a notification that the person is
not the user and that access is denied. It keeps doing this until a known
person uses the computer; the keyboard locks again each time it is
unlocked. Once a known person or the primary user of the computer is
detected, the locking stops and the valid person can use the computer
without interruption. The technology used in this project is computer
vision.
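The periodic re-authentication described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `is_owner_present`, `lock`, and `unlock` are hypothetical callables standing in for the face recognizer and the keyboard/screen locker, and the interval is an arbitrary choice.

```python
import time


def monitor(is_owner_present, lock, unlock, checks, interval_s=2.0, sleep=time.sleep):
    """Re-check the user `checks` times, locking whenever the owner is not seen.

    Returns the sequence of states ("unlocked" or "locked") for inspection.
    """
    history = []
    for _ in range(checks):
        if is_owner_present():
            unlock()                 # known person: allow uninterrupted use
            history.append("unlocked")
        else:
            lock()                   # unknown person: lock again, even if unlocked manually
            history.append("locked")
        sleep(interval_s)            # wait until the next authentication round
    return history
```

Passing the recognizer and locker in as arguments keeps the loop independent of any particular camera or operating system call.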
23.4 Algorithm
23.4.1 Algorithm to Capture the Image for Database
1. Create a connection to the camera using cap = cv2.VideoCapture()
2. Open the camera
3. Read each frame using cap.read()
4. Detect a face inside the frame using the Haar cascade classifier for faces
5. If a face is detected, store the frame in JPG format
6. Otherwise, display a message asking the user to come in front of the camera.
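The steps above can be sketched in Python with OpenCV as follows. This is a hedged sketch rather than the authors' implementation: the file naming, the frame and image limits, the camera index 0, and the choice of the bundled `haarcascade_frontalface_default.xml` file are assumptions. The capture loop takes the camera, detector, and save function as arguments so that it can be exercised without a camera.

```python
def capture_face_images(cap, detect_faces, save_image, max_images=20, max_frames=500):
    """Read frames from `cap` and save each frame in which a face is detected."""
    saved = 0
    for _ in range(max_frames):
        ok, frame = cap.read()                 # step 3: read a frame
        if not ok:
            break                              # camera closed or no more frames
        if len(detect_faces(frame)) > 0:       # step 4: Haar cascade detection
            saved += 1
            save_image(f"user_{saved:03d}.jpg", frame)   # step 5: store as JPG
            if saved >= max_images:
                break
        else:                                  # step 6: prompt the user
            print("No face detected: please come in front of the camera.")
    return saved


def main():
    import cv2  # OpenCV; assumed to be installed (for example as opencv-python)

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def detect_faces(frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    cap = cv2.VideoCapture(0)                  # steps 1 and 2: open the default camera
    try:
        return capture_face_images(cap, detect_faces, cv2.imwrite)
    finally:
        cap.release()
```

Calling `main()` on a machine with a camera writes the captured frames to the current directory.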
23.5 Result
The system is able to recognize the rightful owner of the computer (Figure
23.5).
When the key lock is manually unlocked by clicking the button, the screen
in Figure 23.7 is shown, which covers the view of the screen behind it and
prevents the unauthorized user from using the computer. The keyboard locker
and the screen blocker appear one after another, preventing the user from
using the computer effectively.
FIGURE 23.7 System locked due to unauthorized user.
23.6 Conclusion
A biometrics-based security system will prove effective in the long run in
the field of cyber security and will help to cope with many attacks. Using
the person as the security key makes the system less vulnerable: anyone can
mimic a key that is mathematically generated, but mimicking a person is
difficult, and having to do so repeatedly at regular intervals makes it
frustrating. Image spoofing can be effectively countered, as an attacker
would have to keep showing the image in front of the camera while the
program checks for the same person again after each interval of time.
23.7 Future Scope
At present, this program works only for laptop or desktop security; with
further development it will work for secure connections as well. The system
can be used on any device. Presently it can provide access to one person,
but with advancement of the technology it will provide different levels of
access to different persons on the same computer. This will act as a
barrier for those who are restricted to only some parts of the computer;
e.g., if someone is not given access to Notepad, the computer will not open
Notepad for that person and will keep blocking it as per the settings made.
REFERENCE
1. F. Crow, “Summed-area tables for texture mapping”, Proceedings of
SIGGRAPH, vol. 18, no. 3 (1984): 207–212.
2. Viola and Jones, “Rapid object detection using a boosted cascade of
simple features”, Computer Vision and Pattern Recognition, 2001.
3. Oren Papageorgiou and Poggio, “A general framework for object
detection”, International Conference on Computer Vision, 1998.
4. R. Lienhart and J. Maydt, “An extended set of Haar-like features for
rapid object detection”, ICIP02 (2002): 900–903.
5. C. H. Messom and A. L. C. Barczak, “Fast and efficient rotated haar-
like features using rotated integral images”, Australian Conference on
Robotics and Automation (ACRA 2006): 1–6.
6. D. C. He and L. Wang, “Texture unit, texture spectrum, and texture
analysis”, IEEE Transactions on Geoscience and Remote Sensing, vol.
28 (1990): 509–512.
7. L. Wang and D. C. He, “Texture Classification using texture
spectrum”, Pattern Recognition, vol. 23, no. 8 (1990): 905–910.
8. T. Ojala, M. Pietikäinen and D. Harwood, “Performance evaluation of
texture measures with classification based on Kullback discrimination
of distributions”, Proceedings of the 12th IAPR International
Conference on Pattern Recognition (ICPR 1994), vol. 1 (1994): 582–
585.
9. T. Ojala, M. Pietikäinen and D. Harwood, “A comparative study of
texture measures with classification based on feature distributions”,
Pattern Recognition, vol. 29 (1996): 51–59.
10. M. Heikkilä and M. Pietikäinen, “A texture-based method for
modeling the background and detecting moving objects”, IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 28,
no. 4 (2006): 657–662.
11. L. Scripcariu, A. Alistar and M. D. Frunza, “JAVA implemented
encryption algorithm”, Proceedings of the 8th Int. Conference
“Development and Application Systems”, Suceava, DAS 2006 (2006):
424–429.
12. H. Gilbert and M. Minier, “A collision attack on seven rounds of
Rijndael”, Proceedings of the 3rd AES Candidate Conference (2000):
230–241.
13. D. J. Wheeler and R. Needham, “TEA, A tiny encryption algorithm”,
Technical Report 355, Computer Laboratory, University of Cambridge
(1994): 1–3.
14. B. Schneier, “The GOST encryption algorithm”, Dr. Dobb’s Journal,
vol. 20, no. 1 (1995): 123–124.
15. L. Scripcariu and S. Ciornei, “Improving the encryption algorithms
using multidimensional data structures”, Proceedings of the Third
European Conference on the Use of Modern Information and
Communication Technologies, ECUMICT 2008, Gent (Belgium)
(2008): 375–384.
16. M. E. Hellman and W. Diffie, “New directions in cryptography”, IEEE
Transactions on Information Theory, vol. 22, no. 6 (1976): 644–654.
17. F. Ballardin, “A calculus for the analysis of wireless network security
protocols”, Formal Aspects of Security and Trust. Springer: Berlin
Heidelberg; (2011): 206–222.
18. Guoying Zhao and Pietikainen Matti, “Dynamic texture recognition
using local binary patterns with an application to facial expressions”,
IEEE Transactions on Pattern Analysis and Machine Intelligence vol.
29, no. 6 (2007): 915–928.
19. Kertész, “Texture-based foreground detection”, International Journal
of Signal Processing, Image Processing and Pattern Recognition
(IJSIP), vol. 4, no. 4 (2011).
20. Sally Cole, “U.S. Army’s AI facial recognition works in the dark”,
Military Embedded Systems (2018): 1–8.
21. Joy Buolamwini and Timnit Gebru, “Gender shades: intersectional
accuracy disparities in commercial gender classification”, Proceedings
of Machine Learning Research, vol. 81 (2018): 1–15.
22. L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, et al, “Deeplab:
semantic image segmentation with deep convolutional nets, atrous
convolution, and fully connected CRFs”, IEEE Transactions on Pattern
Analysis and Machine Intelligence vol. 40, no. 4 (2018): 834–848.
23. Kaushal M., Khehra B. and Sharma A., “Soft computing based object
detection and tracking approaches: state-of-the-art survey”, Applied
Soft Computing, vol. 70 (2018): 423–464.
24. Z. H. Peric and J. Nikolic, “An adaptive waveform coding algorithm
and its application in speech coding”, Digital Signal Processing, vol.
22, no. 1 (2012): 199–209.
25. R. K. Moore, “Cognitive informatics: the future of spoken language
processing?” in Proceedings of the 10th International Conference on
Speech and Computer (SPECOM), Patras, Greece, October 2005.
26. J. Nikolic and Z. H. Peric, “Lloyd-Max’s algorithm implementation in
speech coding algorithm based on forward adaptive technique”,
Informatics (Lithuanian Academy of Sciences), vol. 19, no. 2 (2008):
255–270.
24
Data Analytics for Detecting DDoS Attacks in Network Traffic
CONTENTS
24.1 Introduction
24.2 Background
24.3 Related Work
24.4 Methodology
24.4.1 Oversampling and Synthetic Sampling of Data
24.4.2 Detection of Stealthy DDoS attacks
24.4.3 Performance Evaluation by Ranking Machine Learning Algorithms
24.5 Result and Discussion
24.5.1 Datasets Used for Evaluation
24.5.2 Evaluation Metrics Used
24.5.3 Observations
24.6 Conclusion
Notes
References
24.1 Introduction
A Distributed Denial of Service (DDoS) attack is usually launched to make an online service, a network, or even a
website unavailable by flooding it with traffic from many sources. DDoS attacks are becoming bigger, more
frequent, and more sophisticated. They are very common, and their impact is so devastating that network
defenders and analysts must watch out for them. According to Help Net Security, a featured news blog, the
average bandwidth of DDoS attacks is increasing, and low and slow attacks targeting layer 7 are also increasing.
In Q1 2020 the biggest attack stopped was 406 Gbps, whereas in Q1 2019 the maximum bandwidth peaked at 224
Gbps. The share of multi-vector attacks rose to 64% in Q1 2020 from 47% in Q1 2019. Attack techniques are
also evolving fast. Low-rate, slow, multi-vector, and layer 7 attacks are prevalent today. Their common
characteristic is that they act as decoys and closely resemble benign traffic. A memcached DDoS attack is a
modern cyber attack that falls under the category of amplification attacks. The attacker spoofs requests to a
vulnerable UDP memcached server, which then floods a targeted victim server with Internet traffic, ultimately
saturating its resources. Memcached is a database caching system for speeding up websites and networks. Here
the main attack vector is UDP. Similarly, the sophisticated DDoS attacks differ only in their attacking
strategies; their attack vectors mainly fall under the categories of UDP flooding, TCP SYN flooding, HTTP low
and slow rate attacks, etc.
Intrusion Detection Systems (IDS) are usually used to detect network attacks, but traditional detection
strategies are incapable of identifying novel or evolving attacks. Hence, we need an intelligent IDS to detect
the more sophisticated attacks. Researchers are now using machine learning algorithms in this area.
Important consequences of the digitally connected world are the collection and management of raw data. Management
needs attention because the data is valuable and comes in different sizes and formats. Big data in the form of
network traffic is very powerful and affects all parts of society, from social aspects to almost all other areas.
As the amount of data increases, big data analytics based processing involving experimentation, data analysis, and
monitoring gains importance. Machine learning is one of the most important tools and works in supervised and
unsupervised modes. The power of machine learning analytics depends on the data input: the more exact and
representative the data input, the more effective the analytical performance. To gauge and improve the
effectiveness of detection algorithms, we need large and representative datasets of Internet traffic consisting of
a combination of benign and attack network traffic patterns.
The large variety of attacks demands a case-specific approach to hinder each attack. No single system performs
well in detecting and mitigating all attacks. Hence, we have to choose the appropriate method to handle the
attack. Certain factors need thorough analysis and attention. As thousands of packets pass a particular point at
any given instant on any computer network, it is very difficult to gather a set of balanced data instances.
Attack instances are far fewer than instances of normal packets. According to Thomas et al., this skewness in the
distribution of real-world network traffic data is known as the imbalanced dataset problem.1 This scenario affects
training and model design and finally causes misclassification. Moreover, it is very hard to capture all possible
situations of normal and attack traffic to train a machine learning algorithm.
The second important factor that affects the performance of a machine learning algorithm is the stealthy
nature of attack instances. The majority of supervised learning algorithms fail to detect a marginal percentage of
attacks even if their features are very distinguishable. According to Feinstein et al., stealthy traffic is a
traffic flood launched by sophisticated attackers that mimics the legitimate traffic expected, in order to evade
detection. Many advanced attacking tools available today can determine the typical entropy levels seen at the
detector end and tune the attacking parameters to match. Hence, an attacker equipped with this kind of knowledge
could produce attack traffic that causes not even a slight change in the entropy observed at the detector.2
Normally, the attack traffic arriving at the victim end of the network has very prominent features, unless the
attackers use guesswork, penetration, or trial and error to mimic the behavior of normal traffic.
The proposed system improves the performance of machine learning algorithms by considering the imbalanced
dataset problem and the stealthiness of DDoS attacks. It starts with dataset preparation by extracting features
from network traffic available in pcap format. The marginal percentage of attack misclassification is mainly due
to the disproportionate number of instances of attack and normal traffic. Based on this, we analyse the
performance of machine learning algorithms in attack detection after preprocessing such as oversampling and
synthetic oversampling of attack instances. Then, we address the misclassification of attacks that happens because
of their similarity to normal traffic. The features that can be used for attack detection are not sharp enough to
support a confident decision. Hence, we compute a similarity index based on the Hellinger distance (HD) and take
it as a measure of the stealthiness of an attack; this information can be exploited to detect stealthy layer 7
attacks.
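For reference, the standard Hellinger distance between two discrete distributions can be computed as in the sketch below. How the chapter converts this distance into its similarity index is not shown in this section, so the example covers only the distance itself; the histogram inputs and the function name are assumptions.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two histograms with the same bins.

    Both inputs are normalized to sum to 1. The result lies in [0, 1]:
    0 for identical distributions, 1 for distributions with no overlap.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")
    squared = sum((math.sqrt(a / total_p) - math.sqrt(b / total_q)) ** 2
                  for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)
```

A small distance between the feature histogram of an attack flow and that of benign traffic would indicate a stealthy attack in the sense used above.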
The contributions of this chapter are as follows.
1. We describe how an analysis of the data, in terms of its imbalanced distribution and the stealthy nature of
attacks, can be used to choose a proper preprocessing method, such as oversampling, synthetic oversampling,
or feature engineering based on the distributional similarity of attack and benign instances, so as to obtain
a model that correctly fits the data.
2. We rank machine learning algorithms for detecting DDoS attacks while addressing conflicting objectives, such
as maximising recall and precision and minimizing false positives and false negatives, using a multi-criteria
decision aid system called PROMETHEE.
3. We evaluate the performance of this framework on benchmark datasets (the LLS DDoS scenario-specific
dataset, the CAIDA dataset, and CICIDS 2017) and compare the proposed system with existing systems to
assess its effectiveness in detecting DDoS attacks.
The rest of this chapter is organized as follows. The background on DDoS detection is presented in Section 24.2.
Section 24.3 describes related work, especially covering the areas on machine learning based DDoS detection, the
importance of flow-based approach, and the various preprocessing experiments to prepare the data. The proposed
methodology is presented in Section 24.4 and results are discussed in Section 24.5. Finally, the conclusion and
future work are given in Section 24.6.
24.2 Background
DDoS attacks can be classified into volumetric attacks, protocol attacks, and application layer attacks. A
volumetric attack aims to congest the network by sending a large volume of data packets and consuming the
network bandwidth. This kind of attack is usually executed by botnets. A protocol attack targets actual
web/DNS/FTP servers, routers, switches, firewall devices, and load balancers to disrupt network services and
cause resource exhaustion. Application layer attacks, also termed layer 7 attacks, mainly target layer 7
protocols and exploit them to cause resource exhaustion. These attacks are very sophisticated and have a low
traffic rate, so they appear legitimate to the victim system. UDP flood, HTTP flood, and slowloris are some of
the sophisticated attacks worth mentioning and are considered in this work as well. These attack vectors are very
stealthy in nature and are part of multi-vector attacks. A UDP flood targets the open UDP ports on the victim
network and starts the flood by simply sending UDP packets. An HTTP flood targets web applications and uses
legitimate HTTP GET and POST requests to launch the attack. These are simply legitimate requests, and the main
aim is to bring down the server. Slowloris is entirely different from the attacks mentioned. Its traffic looks
like perfectly benign HTTP traffic and is launched using a software tool called slowloris.
The DDoS defence challenges are described in the work of Mirkovic et al.3 The seriousness of DDoS attacks
demands a distributed response from the network. Most defence systems are meant for a specific kind of DDoS
attack, and no single system can resist all attacking strategies. Hence, there is a need for a system that can
combine the approaches effectively to solve the problem.
TABLE 24.1
Summary of Literature Used for Comparison
Reference | Approach | Datasets and results
22 Singh et al. | Multilayer Perceptron with a Genetic Algorithm (MLP-GA). Layer 7 HTTP attack using slowloris attacking tool is addressed. | Accuracy of 98.04% for detecting the layer seven DDoS attacks and false positives are less compared to Naive Bayes, Radial Basis Function (RBF) Network, MLP, J48, and C45 (2017).
23 Lima et al. | Machine learning DoS detection system and makes inferences based on signatures previously extracted from samples of network traffic. | Four datasets, namely CIC-DoS, CICIDS2017, CSE-CIC-IDS2018, and a customized dataset. Acquired detection rate and precision higher than 93% with FAR less than 1.8%.
24 Shone et al. | Non-symmetric deep auto-encoder (NDAE) for unsupervised feature learning and deep learning classification model constructed using stacked NDAE. | KDD Cup’99 and NSL-KDD datasets used.
25 Ahmed et al. | New structures called application fingerprints are generated using transport layer packet-level and flow-level features. | Accuracy of over 97% is achieved with the misclassification rate of 2.5%.
26 Alsirhani et al. | A dynamic DDoS attack detection system framework uses fuzzy logic to dynamically select an algorithm from a set of prepared classification algorithms that detect different DDoS patterns. | Trade-off exists in the classification algorithm’s accuracy and its delay.
24.4 Methodology
The machine learning based DDoS detection system is depicted in Figure 24.1. The first stage of the system is the
collection of network traffic in pcap format. A specific application programming interface called packet
capture (pcap) is used to collect network traffic. Libpcap on Unix and WinPcap on Windows are the libraries
used to collect data in pcap format. Software of this kind can also provide additional statistical functions
related to packets. TCPdump and Wireshark are free software tools that support packet sniffing and monitoring
along with statistical analysis. Wireshark offers a graphical interface, whereas TCPdump is a command-line tool.
In high-speed connections with rates of up to hundreds of gigabits per second (Gbps), packet-level data analysis
is very costly and leads to poor performance of the detection system. Hence, it is more convenient and efficient
to aggregate packet-level data into flow-level data, which greatly supports big data analytics of network
traffic. According to RFC 3697A, a flow can be defined as a sequence of packets sent from a particular source to
a particular unicast, anycast, or multicast destination that the source desires to label as a flow. It is more
logical to consider a flow as an instance that can be malicious or benign. Hence, it is desirable to extract
features related to each flow. We have selected the features mentioned in the work of Karimazad and Faraahi,
which hold information specific enough to represent a DDoS attack.27 We have defined these features in relation
to flows. The features are Average packet size, Number of packets, Time interval variance, Packet size variance,
Number of bytes, Packet rate, and Bitrate. Average packet size is the sum of the sizes of the packets of a flow
averaged over the total number of packets and is given in (24.1), where N is the total number of packets in a
flow and P_i is the size of the ith packet. Time interval variance quantifies the variance in the inter-arrival
time of packets in a flow and is given in (24.2), where t_i is the arrival time of the ith packet. Packet size
variance quantifies the size difference of adjacent packets averaged over the total number of packets and is
given in (24.3). Packet rate is expressed in (24.4), where t_end is the arrival time of the last packet and
t_start is the arrival time of the first packet of a particular flow. Number of packets, Number of bytes, and
Bitrate can be computed directly from the packet features exported by Wireshark.
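$$\text{Average packet size} = \frac{1}{N}\sum_{i=0}^{N} P_i \tag{24.1}$$
$$\text{Time interval variance} = \frac{1}{N}\sum_{i=0}^{N} \left(t_{i+1} - t_i\right) \tag{24.2}$$
$$\text{Packet size variance} = \frac{1}{N}\sum_{i=0}^{N} \left(P_{i+1} - P_i\right) \tag{24.3}$$
$$\text{Packet\_rate} = \frac{N}{t_{end} - t_{start}} \tag{24.4}$$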
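A minimal sketch of how the flow features could be computed from the packet sizes and arrival times of one flow is given below. It follows (24.1) to (24.4) as printed, with the sums taken over the packets (or adjacent packet pairs) actually present in the flow. The bitrate definition (bits divided by flow duration), the units, and the function name are assumptions, since the text does not define them.

```python
def flow_features(sizes, times):
    """Compute per-flow features from packet sizes (bytes) and arrival times (seconds)."""
    n = len(sizes)
    if n < 2 or len(times) != n:
        raise ValueError("need at least two packets with one arrival time each")
    duration = times[-1] - times[0]            # t_end - t_start
    if duration <= 0:
        raise ValueError("arrival times must span a positive duration")
    n_bytes = sum(sizes)
    return {
        "avg_packet_size": n_bytes / n,                                         # (24.1)
        "time_interval_variance":
            sum(times[i + 1] - times[i] for i in range(n - 1)) / n,             # (24.2)
        "packet_size_variance":
            sum(sizes[i + 1] - sizes[i] for i in range(n - 1)) / n,             # (24.3)
        "packet_rate": n / duration,                                            # (24.4)
        "n_packets": n,
        "n_bytes": n_bytes,
        "bitrate": 8 * n_bytes / duration,     # assumed definition: bits per second
    }
```

Each flow in a capture would yield one such feature dictionary, which then forms one instance for the classifier.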
Feature normalization is an essential step that scales the features to a common range, which removes the
bias caused by differing feature scales without changing the statistical nature of the features. Min-max normalization is employed to
scale the feature values to the range [0, 1] and is given in (24.5). The normalized features are stacked to form
feature vectors, which are then shuffled so that the instances are properly distributed across the sample space.
\[ x_{norm\_i} = \frac{x_i - x_{min}}{x_{max} - x_{min}} \tag{24.5} \]
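As a small worked illustration of min-max scaling to [0, 1], the sketch below normalizes each feature column independently. The function names and the choice to map a constant column to 0.0 are assumptions made for this example only.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_features(rows):
    """Apply min-max scaling column by column to a list of feature vectors."""
    columns = list(zip(*rows))
    scaled = [min_max_normalize(col) for col in columns]
    return [list(row) for row in zip(*scaled)]


print(min_max_normalize([2, 4, 6, 10]))
print(normalize_features([[1, 10], [2, 20], [3, 40]]))
```

Scaling each column separately keeps a feature with large raw values, such as Number of bytes, from dominating a feature with small raw values, such as Packet rate.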
DDoS attack detection is then carried out by considering two situations that affect algorithm
performance. DDoS attack traffic is inherently imbalanced because attack instances
are far fewer than benign instances. Hence, random oversampling or synthetic
sampling is proposed to make the data suitable for building the model. The stealthy layer 7 attacks are handled by creating a
new feature that represents the similarity of attack traffic to benign traffic. The machine learning
algorithms are then analysed on multiple criteria: improving the True Positive (TP) rate and True Negative (TN)
rate while reducing the False Positive (FP) rate and False Negative (FN) rate.
Syn_sample_min and Syn_sample_max are the two synthetic samples created. FN_instance(i) represents the ith
misclassified attack sample, where i varies from 1 to N and N is the total number of misclassified attacks.
randn(0-1) is the function used to generate a number in the interval [0, 1]. For N misclassified attack instances,
only 2N synthetic samples are created. These synthetic samples are added to the original training data, which is then
shuffled to obtain an evenly distributed dataset.
\[ f(a, b) = \sum_{j=1}^{k} w_j P_j(a, b) \tag{24.9} \]
It involves the values taken by the preference functions associated with the criteria and not directly the evaluations of
the actions themselves. The advantage of PROMETHEE is that it performs pairwise comparisons of actions.
Two actions a and b are compared by computing the multicriteria preference index given in (24.9), where P_j(a,
b) is the preference function for criterion j and w_j are non-negative weights that represent the relative importance of the
criteria. The preference flows are then computed to consolidate the results of the pairwise
comparisons and to rank all the actions from best to worst. The mathematical expressions for
the preference flows are given in equations (24.10) and (24.11).
\[ \phi^{+}(a) = \frac{1}{N-1} \sum_{b \neq a} f(a, b) \tag{24.10} \]
\[ \phi^{-}(a) = \frac{1}{N-1} \sum_{b \neq a} f(b, a) \tag{24.11} \]
φ+ is the positive preference flow, which measures how much an action a is preferred to the other N − 1 actions. It
is a global measure of the strength of action a: the larger φ+, the better the action. Similarly, φ− measures how
much the other N − 1 actions are preferred to action a. It is a global measure of the weakness of action a: the smaller
φ−, the better the action. We use PROMETHEE Complete ranking because we are dealing with strongly
conflicting criteria; it is given by the net preference flow shown in (24.12).
\[ \phi(a) = \phi^{+}(a) - \phi^{-}(a) \tag{24.12} \]
It aggregates both the strengths and the weaknesses of the action into a single score. Hence, aPb (a is preferred to b)
holds if and only if φ(a) > φ(b); that is, action a is preferred over action b if and only if it has the larger
net preference flow.
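The ranking procedure in (24.9)–(24.12) can be sketched as follows. The chapter does not state which preference function P_j is used, so this sketch assumes the simplest ("usual") one, where P_j(a, b) is 1 if a is strictly better than b on criterion j and 0 otherwise. The function name, the equal weights, and the scores for the three placeholder algorithms are illustrative values, not results from the experiments.

```python
def promethee_net_flows(scores, weights, maximize):
    """Return {action: (phi_plus, phi_minus, phi_net)}.

    scores:   {action: [value for each criterion]}
    weights:  non-negative weight per criterion
    maximize: True if larger is better for that criterion, else False
    """
    actions = list(scores)
    n = len(actions)

    def preference_index(a, b):
        # f(a, b) = sum_j w_j * P_j(a, b), with the assumed usual criterion.
        total = 0.0
        for j, w in enumerate(weights):
            diff = scores[a][j] - scores[b][j]
            if not maximize[j]:
                diff = -diff
            total += w * (1.0 if diff > 0 else 0.0)
        return total

    flows = {}
    for a in actions:
        others = [b for b in actions if b != a]
        plus = sum(preference_index(a, b) for b in others) / (n - 1)
        minus = sum(preference_index(b, a) for b in others) / (n - 1)
        flows[a] = (plus, minus, plus - minus)
    return flows


# Criteria order: precision, recall (maximize), FP rate, FN rate (minimize).
example_scores = {
    "alg_A": [0.9, 0.8, 0.1, 0.2],
    "alg_B": [0.8, 0.9, 0.2, 0.1],
    "alg_C": [0.7, 0.7, 0.3, 0.3],
}
flows = promethee_net_flows(example_scores, [0.25] * 4, [True, True, False, False])
for name, (plus, minus, net) in sorted(flows.items(), key=lambda kv: -kv[1][2]):
    print(name, plus, minus, net)
```

An action with a positive net flow sits on the preferred side of the ranking and one with a negative net flow on the other side. The net flows always sum to zero across the actions.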
TABLE 24.2
Confusion Matrix
Actual class    Predicted Attack    Predicted Normal
Attack          TP                  FN
Normal          FP                  TN
As far as DDoS detection is concerned, our interest lies in the TP rate of the attack class rather than the overall accuracy,
because leaving attacks undetected is very costly. Moreover, performance on imbalanced data cannot be evaluated
using accuracy alone. The TP rate is given in (24.13), where TP is the number of
positives correctly predicted as positive and FN is the number of positives wrongly predicted as negative.
\[ \mathrm{TPR} = \frac{TP}{TP + FN} \tag{24.13} \]
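The following sketch shows why accuracy alone can mislead on imbalanced data. It computes the TP rate of (24.13) together with the standard FN rate, FP rate, precision, and accuracy from the four cells of the confusion matrix. The counts are made up for illustration and the function name is hypothetical.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard rates derived from a binary confusion matrix."""
    return {
        "tpr": tp / (tp + fn),        # TP rate (recall), as in (24.13)
        "fnr": fn / (tp + fn),        # FN rate = 1 - TPR
        "fpr": fp / (fp + tn),        # FP rate
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Illustrative counts: 50 attack flows among 1000, most of them missed.
metrics = confusion_metrics(tp=5, fn=45, fp=5, tn=945)
print(metrics)
```

With these counts the accuracy is 0.95 even though only one attack in ten is detected (TP rate 0.1), which is the situation the TP rate metric is meant to expose.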
24.5.3 Observations
The results based on the TP rate of the attack class are shown in Tables 24.3–24.5. Where there is a high
imbalance between attack and benign instances, we simply oversample the attack instances. We
experimented with random oversampling of attack instances, with the synthetic oversampling
proposed in the methodology, and with SMOTE. Random oversampling was conducted by varying the
percentage of attack instances selected for oversampling; the best algorithm performance was observed when
75% of the instances were selected. In synthetic sampling, however, the number of synthetic samples generated is double
the number of misclassified attack instances. The performance achieved is better than that of simple
random oversampling, and the results are shown in the column labeled “Synthetic sampling” in Tables 24.3–24.5.
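One possible reading of the random oversampling step is sketched below: a chosen fraction of the attack instances is drawn at random and appended to the attack set as duplicates. The function name, the fixed seed, and the interpretation of "75% of instances" as a sampling fraction are assumptions. The synthetic sampling method itself is not reproduced here.

```python
import random


def random_oversample(attacks, fraction=0.75, seed=0):
    """Return the attack instances plus randomly chosen duplicates.

    Assumption: `fraction` of the attack instances are selected without
    replacement and each selected instance is duplicated once.
    """
    rng = random.Random(seed)
    k = round(fraction * len(attacks))
    duplicates = rng.sample(attacks, k)
    return attacks + duplicates


attack_rows = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6],
               [0.5, 0.5], [0.6, 0.4], [0.7, 0.3], [0.8, 0.2]]
oversampled = random_oversample(attack_rows, fraction=0.75)
print(len(attack_rows), "->", len(oversampled))
```

The oversampled attack instances would then be merged with the benign instances and shuffled before training.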
TABLE 24.3
LLS-DDoS Scenario Specific Dataset
TABLE 24.4
CAIDA 2007 Dataset
TABLE 24.5
CICIDS 2017 Dataset
Stealthy attacks are the hard-to-detect attacks in network traffic. Attackers fabricate them so that their distribution
parameters are very similar to those of benign traffic, as a technique to evade
attack detection.
Even then, there is a marginal difference between the distributions of attack and benign traffic. The difference between the HD of
an instance from its true class and from the opposite class is represented as a new feature. A positive value therefore indicates
that the sample is similar to the opposite class, while a zero or negative value indicates similarity to the true-class
instances. The sim_index feature effectively captures this difference, and hence the TP rate obtained for the CICIDS
2017 data using this preprocessing step is better than that of oversampling- and SMOTE-based preprocessing. The best
TP rate obtained is 0.962 with oversampling (for Adaboost), 0.992 with SMOTE, and 0.998 with the similarity-based
approach. Slowhttptest and slowloris attack instances in CICIDS 2017 are hard to detect because of their similarity
to benign traffic and because their presence in the training data is comparatively very low. The TP rate obtained for these
attacks by the similarity-based Adaboost algorithm demonstrates the effectiveness of the approach in detecting
stealthy attacks.
PROMETHEE Complete ranking evaluates the performance of the algorithms in terms of conflicting criteria
derived from the performance metrics extracted from the confusion matrix; the rankings are shown in Figures 24.2–24.4. Two
pairs of conflicting criteria are used here. Precision and Recall form one pair, whose values
are to be maximized. FN rate and FP rate constitute the second pair, whose values are to be minimized. The results of
PROMETHEE Complete ranking show that the Adaboost and Random forest algorithms perform better
in detecting DDoS attacks than the KNN and J48 algorithms: KNN and J48 fall on the
negative side of the preference flow, whereas Adaboost and Random forest fall on the positive side of the
preference flow.
FIGURE 24.2 Ranking of machine learning algorithms on LLS-DDoS dataset.
FIGURE 24.3 Ranking of machine learning algorithms on CAIDA 2007 dataset.
FIGURE 24.4 Ranking of machine learning algorithms on CICIDS 2017.
The results can be compared with some of the existing research in this area. The literature summarized in
Table 24.1 is selected for comparison because it covers recent work. Singh et al.33 and Lima et al.34
used the modern benchmark dataset CICIDS 2017. These works report detection
accuracy as the performance metric. Our result on CICIDS 2017 gives a TP rate of 0.998 with the Adaboost algorithm.
The observed accuracy of the proposed method is 99.82% on the CICIDS 2017 dataset. The observed accuracy of the
various methods on the three datasets is given in Table 24.6.
TABLE 24.6
Comparison of Accuracy Obtained
Algorithm      Synthetic  Sim    SMOTE  Synthetic  Sim    SMOTE  Synthetic  Sim    SMOTE
               Sampling   Index         Sampling   Index         Sampling   Index
Random forest  99.43      98.75  99.13  99.88      99.32  98.79  97.14      99.82  99.71
J48            97.97      99.33  97.29  99.15      98.75  98.16  96.25      99.08  98.99
Adaboost       99.39      98.75  99.23  99.84      99.44  98.79  97.91      99.36  99.71
KNN            97.91      95.98  98.52  97.77      98.67  98.88  95.12      97.87  97.70
The proposed methodology is less complex, and it is evaluated using the default settings of the
algorithms. Moreover, the experiments are conducted by supplying test data separately rather than by using cross
validation. The works of Shone et al.,35 Ahmed et al.,36 Lima et al.,37 and Alsirhani et al.38 deal mainly with
modern attacks and the NSL-KDD benchmark dataset, and their results can be compared in terms of the complexity of the
methods and their effectiveness. These works report results obtained from
cross-validated data. Ahmed et al. experimented with five real-world datasets, one of which contains DDoS flooding using an IRC
botnet or Slowloris attack; their experimental results claim an accuracy of 97%. These
comparisons show that our model is very promising and can perform with an average accuracy of 99.8%.
The most important advantage of synthetic sampling is that the number of synthetic samples generated is
considerably lower than in random oversampling and SMOTE. The generation of spurious samples is also
reduced, since the new synthetic samples are produced along the line joining the misclassified attack instances and the
extreme samples selected from the correctly classified attack class distribution.
24.6 Conclusion
This chapter discussed methodologies that improve the performance of machine learning
algorithms when dealing with imbalanced datasets and stealthy layer 7 attacks. The solutions proposed for the
imbalanced dataset problem were random oversampling and synthetic sampling. The empirical results showed that
synthetic sampling is a better choice than random oversampling and SMOTE. The number of synthetic
samples generated was lower than with the 1:1 sampling in SMOTE: only double the number of attacks misclassified
in the prior learning phase was generated by synthetic sampling. The stealthy layer 7 attacks were handled by creating a new
feature that holds the similarity of attack instances to normal instances. The TP rate of
detection of stealthy attacks present in the CICIDS 2017 data was 0.998 with the Adaboost algorithm, and the proposed
model can ensure an average accuracy of 99.8%. The performance of the algorithms was evaluated using a
multi-criteria decision aid method called PROMETHEE Complete ranking, in which Random forest and Adaboost
proved to be the best algorithms for detecting DDoS attacks. In the future, the newly introduced
feature can be used to further reduce the dimension of the training data. This study can also be extended with more statistical
analysis to bring out the best performance of machine learning algorithms in detecting DDoS attacks.
Notes
1. Ciza Thomas, “Improving intrusion detection for imbalanced network traffic,” Security and Communication
Networks 6, no. 3 (2013): 309–324.
2. Laura Feinstein et al., “Statistical approaches to DDoS attack detection and response,” in Proceedings DARPA
information survivability conference and exposition, vol. 1 (IEEE, 2003), 303–314.
3. Jelena Mirkovic and Peter Reiher, “A taxonomy of DDoS attack and DDoS defense mechanisms,” ACM
SIGCOMM Computer Communication Review 34, no. 2 (2004): 39–53.
4. Peng Xiao et al., “Detecting DDoS attacks against data center with correlation analysis,” Computer
Communications 67 (2015): 66–74.
5. Cynthia Wagner, Jérôme François, Thomas Engel, et al., “Machine learning approach for ip-flow record
anomaly detection,” in International Conference on Research in Networking (Springer, 2011), 28–39.
6. Xi Qin, Tongge Xu, and Chao Wang, “DDoS attack detection using flow entropy and clustering technique,” in
2015 11th International Conference on Computational Intelligence and Security (CIS) (IEEE, 2015), 412–
415.
7. Robin Sommer and Vern Paxson, “Outside the closed world: On using machine learning for network intrusion
detection,” in 2010 IEEE Symposium on Security and Privacy (IEEE, 2010), 305–316.
8. Josep L Berral et al., “Adaptive distributed mechanism against flooding network attacks based on machine
learning,” in Proceedings of the 1st ACM Workshop on AISec (2008), 43–50.
9. RR Rejimol Robinson and Ciza Thomas, “Ranking of machine learning algorithms based on the performance
in classifying DDoS attacks,” in 2015 IEEE Recent Advances in Intelligent Computational Systems (RAICS)
(IEEE, 2015), 185–190.
10. Alan Saied, Richard E Overill, and Tomasz Radzik, “Detection of known and unknown DDoS attacks using
Artificial Neural Networks,” Neurocomputing 172 (2016): 385–393.
11. Rajagopalan Vijayasarathy, Serugudi Venkataraman Raghavan, and Balaraman Ravindran, “A system
approach to network modeling for DDoS detection using a Naive Bayesian classifier,” in 2011 Third
International Conference on Communication Systems and Networks (COMSNETS 2011) (IEEE, 2011), 1–10.
12. Ming-Yang Su, “Real-time anomaly detection systems for Denial-of-Service attacks by weighted k-nearest-
neighbor classifiers,” Expert Systems with Applications 38, no. 4 (2011): 3492–3498.
13. Bin Kong et al., “Distinguishing flooding distributed denial of service from flash crowds using four data
mining approaches,” Computer Science and Information Systems 14, no. 3 (2017): 839–856.
14. Keunsoo Lee et al., “DDoS attack detection method using cluster analysis,” Expert systems with applications
34, no. 3 (2008): 1659–1665.
15. Pedro Casas et al., “Network security and anomaly detection with Big-DAMA, a big data analytics
framework,” in 2017 IEEE 6th International Conference on Cloud Networking (CloudNet) (IEEE, 2017), 1–7.
16. Hui Han, Wen-yuan Wang, and Bing-huan Mao, “Borderline-SMOTE: A new over-sampling method in
imbalanced datasets learning,” 2005, 878–887, https://doi.org/10.1007/1153805991.
17. Nv Chawla and Kw Bowyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial
Intelligence Research 16 (2002): 321–357, issn: 10769757, https://doi.org/10.1613/jair.953, arXiv: 1106.1813,
http://arxiv.org/abs/1106.1813.
18. N Japkowicz, “Learning from imbalanced datasets,” Papers from AAAI Workshop 21, no. 9 (2000): 10–15,
issn: 1041–4347, https://doi.org/10.1109/TKDE.2008.239, arXiv: arXiv:1011.1669v3,
http://www.aaai.org/Papers/Workshops/2000/WS-00-05/WS00-05-003.pdf.
19. Taeho Jo and Nathalie Japkowicz, “Class imbalances versus small disjuncts,” ACM Sigkdd Explorations
Newsletter 6, no. 1 (2004): 40–49.
20. Opeyemi Osanaiye et al., “Ensemble-based multi-filter feature selection method for DDoS detection in cloud
computing,” EURASIP Journal on Wireless Communications and Networking 2016, no. 1 (2016): 130.
21. Muhammad Aamir and Syed Mustafa Ali Zaidi, “DDoS attack detection with feature engineering and
machine learning: the framework and performance evaluation,” International Journal of Information Security
18, no. 6 (2019): 761–785.
22. Khundrakpam Johnson Singh and Tanmay De, “MLP-GA based algorithm to detect application layer DDoS
attack,” Journal of information security and applications 36 (2017): 145–153.
23. Francisco Sales de Lima Filho et al., “Smart detection: an online approach for DoS/DDoS attack detection
using machine learning,” Security and Communication Networks 2019 (2019).
24. Nathan Shone et al., “A deep learning approach to network intrusion detection,” IEEE Transactions on
Emerging Topics in Computational Intelligence 2, no. 1 (2018): 41–50.
25. Muhammad Ejaz Ahmed, Saeed Ullah, and Hyoungshick Kim, “Statistical application fingerprinting for
DDoS attack mitigation,” IEEE Transactions on Information Forensics and Security 14, no. 6 (2018): 1471–
1484.
26. Amjad Alsirhani, Srinivas Sampalli, and Peter Bodorik, “DDoS detection system: using a set of classification
algorithms controlled by fuzzy logic system in apache spark,” IEEE Transactions on Network and Service
Management 16, no. 3 (2019): 936–949.
27. Reyhaneh Karimazad and Ahmad Faraahi, “An anomaly-based method for DDoS attacks detection using
RBF neural networks,” in Proceedings of the International Conference on Network and Electronics
Engineering, vol. 11 (2011).
28. Chawla and Bowyer, “SMOTE: Synthetic Minority Over-sampling Technique.”
29. John McHugh, “Testing intrusion detection systems: a critique of the 1998 and 1999 darpa intrusion detection
system evaluations as performed by Lincoln Laboratory,” ACM Transactions on Information and System
Security (TISSEC) 3, no. 4 (2000): 262–294.
30. Matthew V Mahoney and Philip K Chan, “An analysis of the 1999 DARPA/Lincoln Laboratory evaluation
data for network anomaly detection,” in International Workshop on Recent Advances in Intrusion Detection
(Springer, 2003), 220–237.
31. Ciza Thomas, Vishwas Sharma, and N Balakrishnan, “Usefulness of DARPA dataset for intrusion detection
system evaluation,” in Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security
2008, vol. 6973 (International Society for Optics and Photonics, 2008), 69730G.
32. Ciza Thomas and N Balakrishnan, “Improvement in intrusion detection with advances in sensor fusion,”
IEEE Transactions on Information Forensics and Security 4, no. 3 (2009): 542–551.
33. Khundrakpam Johnson Singh and Tanmay De, “MLP-GA based algorithm to detect application layer DDoS
attack,” Journal of Information Security and Applications 36 (2017): 145–153.
34. Francisco Sales de Lima Filho et al., “Smart detection: an online approach for DoS/DDoS attack detection
using machine learning,” Security and Communication Networks 2019 (2019).
35. Nathan Shone et al., “A deep learning approach to network intrusion detection,” IEEE Transactions on
Emerging Topics In Computational Intelligence 2, no. 1 (2018): 41–50.
36. Muhammad Ejaz Ahmed, Saeed Ullah, and Hyoungshick Kim, “Statistical application fingerprinting for
DDoS attack mitigation,” IEEE Transactions on Information Forensics and Security 14, no. 6 (2018): 1471–
1484.
37. Lima Filho et al., “Smart detection: an online approach for DoS/DDoS attack detection using machine
learning.”
38. Amjad Alsirhani, Srinivas Sampalli, and Peter Bodorik, “DDoS detection system: using a set of classification
algorithms controlled by fuzzy logic system in apache spark,” IEEE Transactions on Network and Service
Management 16, no. 3 (2019): 936–949.
REFERENCES
1. Aamir, Muhammad, and Syed Mustafa Ali Zaidi. “DDoS attack detection with feature engineering and
machine learning: the framework and performance evaluation”. International Journal of Information
Security 18, no. 6 (2019): 761–785.
2. Ahmed, Muhammad Ejaz, Saeed Ullah, and Hyoungshick Kim. “Statistical application fingerprinting for
DDoS attack mitigation”. IEEE Transactions on Information Forensics and Security 14, no. 6 (2018): 1471–
1484.
3. Alsirhani, Amjad, Srinivas Sampalli, and Peter Bodorik. “DDoS detection system: using a set of classification
algorithms controlled by fuzzy logic system in apache spark”. IEEE Transactions on Network and Service
Management 16, no. 3 (2019): 936–949.
4. Berral, Josep L, Nicolas Poggi, Javier Alonso, Ricard Gavalda, Jordi Torres, and Manish Parashar. “Adaptive
distributed mechanism against flooding network attacks based on machine learning”. In Proceedings of the
1st ACM Workshop on AISec, 43–50, 2008.
5. Casas, Pedro, Francesca Soro, Juan Vanerio, Giuseppe Settanni, and Alessandro D’Alconzo. “Network
security and anomaly detection with Big-DAMA, a big data analytics framework”. In 2017 IEEE 6th
International Conference on Cloud Networking (CloudNet), 1–7. IEEE, 2017.
6. Chawla, Nv, and Kw Bowyer. “SMOTE: synthetic minority over-sampling technique”. Journal of Artificial
Intelligence Research 16 (2002): 321–357. issn: 10769757. https://doi.org/10.1613/jair.953. arXiv:1106.1813.
http://arxiv.org/abs/1106.1813.
7. Feinstein, Laura, Dan Schnackenberg, Ravindra Balupari, and Darrell Kindred. “Statistical approaches to
DDoS attack detection and response”. In Proceedings DARPA Information Survivability Conference and
Exposition, 1: 303–314. IEEE, 2003.
8. Han, Hui, Wen-yuan Wang, and Bing-huan Mao. “Borderline-SMOTE: a new over-sampling method in
imbalanced datasets learning”. In Proceedings International Conference on Intelligent Computing, pp. 878–
887, 2005. https://doi.org/10.1007/1153805991.
9. Japkowicz, N. “Learning from imbalanced datasets”. Papers from AAAI Workshop 21, no. 9 (2000): 10–15.
issn: 1041-4347. https://doi.org/10.1109/TKDE.2008.239. arXiv: arXiv:1011.1669v3.
http://www.aaai.org/Papers/Workshops/2000/WS-00-05/WS00-05-003.pdf.
10. Jo, Taeho, and Nathalie Japkowicz. “Class imbalances versus small disjuncts”. ACM Sigkdd Explorations
Newsletter 6, no. 1 (2004): 40–49.
11. Karimazad, Reyhaneh, and Ahmad Faraahi. “An anomaly-based method for DDoS attacks detection using
RBF neural networks”. In Proceedings of the International Conference on Network and Electronics
Engineering, vol. 11, 2011.
12. Kong, Bin, Kun Yang, Degang Sun, Meimei Li, and Zhixin Shi. “Distinguishing flooding distributed denial of
service from flash crowds using four data mining approaches”. Computer Science and Information Systems
14, no. 3 (2017): 839–856.
13. Lee, Keunsoo, Juhyun Kim, Ki Hoon Kwon, Younggoo Han, and Sehun Kim. “DDoS attack detection
method using cluster analysis”. Expert Systems with Applications 34, no. 3 (2008): 1659–1665.
14. Lima Filho, Francisco Sales De, Frederico AF Silveira, Agostinho de Medeiros Brito Junior, Genoveva
Vargas-Solar, and Luiz F Silveira. “Smart detection: an online approach for DoS/DDoS attack detection using
machine learning”. Security and Communication Networks 2019 (2019).
15. Mahoney, Matthew V, and Philip K Chan. “An analysis of the 1999 DARPA/Lincoln Laboratory evaluation
data for network anomaly detection”. In International Workshop on Recent Advances in Intrusion Detection,
220–237. Springer, 2003.
16. McHugh, John. “Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion
detection system evaluations as performed by Lincoln Laboratory”. ACM Transactions on Information and
System Security (TISSEC) 3, no. 4 (2000): 262–294.
17. Mirkovic, Jelena, and Peter Reiher. “A taxonomy of DDoS attack and DDoS defense mechanisms”. ACM
SIGCOMM Computer Communication Review 34, no. 2 (2004): 39–53.
18. Osanaiye, Opeyemi, Haibin Cai, Kim-Kwang Raymond Choo, Ali Dehghantanha, Zheng Xu, and Mqhele
Dlodlo. “Ensemble-based multi-filter feature selection method for DDoS detection in cloud computing”.
EURASIP Journal on Wireless Communications and Networking 2016, no. 1 (2016): 130.
19. Qin, Xi, Tongge Xu, and Chao Wang. “DDoS attack detection using flow entropy and clustering technique”.
In 2015 11th International Conference on Computational Intelligence and Security (CIS), 412–415. IEEE,
2015.
20. Robinson, R. R. Rejimol, and Ciza Thomas. “Ranking of machine learning algorithms based on the
performance in classifying DDoS attacks”. In 2015 IEEE Recent Advances in Intelligent Computational
Systems (RAICS), 185–190. IEEE, 2015.
21. Saied, Alan, Richard E Overill, and Tomasz Radzik. “Detection of known and unknown DDoS attacks using
Artificial Neural Networks”. Neurocomputing 172 (2016): 385–393.
22. Shone, Nathan, Tran Nguyen Ngoc, Vu Dinh Phai, and Qi Shi. “A deep learning approach to network
intrusion detection”. IEEE Transactions on Emerging Topics in Computational Intelligence 2, no. 1 (2018):
41–50.
23. Singh, Khundrakpam Johnson, and Tanmay De. “MLP-GA based algorithm to detect application layer DDoS
attack”. Journal of Information Security and Applications 36 (2017): 145–153.
24. Sommer, Robin, and Vern Paxson. “Outside the closed world: on using machine learning for network
intrusion detection”. In 2010 IEEE Symposium on Security and Privacy, 305–316. IEEE, 2010.
25. Su, Ming-Yang. “Real-time anomaly detection systems for Denial-of-Service attacks by weighted k-nearest-
neighbor classifiers”. Expert Systems with Applications 38, no. 4 (2011): 3492–3498.
26. Thomas, Ciza. “Improving intrusion detection for imbalanced network traffic”. Security and Communication
Networks 6, no. 3 (2013): 309–324.
27. Thomas, Ciza, and N. Balakrishnan. “Improvement in intrusion detection with advances in sensor fusion”.
IEEE Transactions on Information Forensics and Security 4, no. 3 (2009): 542–551.
28. Thomas, Ciza, Vishwas Sharma, and N. Balakrishnan. “Usefulness of DARPA dataset for intrusion detection
system evaluation”. In Data Mining, Intrusion Detection, Information Assurance, and Data Networks
Security 2008, vol. 6973, 69730G. International Society for Optics and Photonics, 2008.
29. Vijayasarathy, Rajagopalan, Serugudi Venkataraman Raghavan, and Balaraman Ravindran. “A system
approach to network modeling for DDoS detection using a Naive Bayesian classifier”. In 2011 Third
International Conference on Communication Systems and Networks (COMSNETS 2011), 1–10. IEEE, 2011.
30. Wagner, Cynthia, Jérôme François, Thomas Engel, et al. “Machine learning approach for IP-flow record
anomaly detection”. In International Conference on Research in Networking, 28–39. Springer, 2011.
31. Xiao, Peng, Wenyu Qu, Heng Qi, and Zhiyang Li. “Detecting DDoS attacks against data center with
correlation analysis”. Computer Communications 67 (2015): 66–74.
25
Detection of Patterns in Attributed Graph
Using Graph Mining
Bapuji Rao
CONTENTS
25.1 Introduction
25.2 Research Background
25.3 Literature Survey
25.4 General Definitions
25.4.1 Multi-relational Edge-attributed Graph
25.4.2 Multi-layer Edge-attributed Graph
25.4.3 Attributed Graph
25.5 Problem Definition
25.6 Proposed Approach
25.6.1 Pattern Length of 4, 5, and 6
25.6.1.1 For Length = 4
25.6.1.2 For Length = 5
25.6.1.3 For Length = 6
25.6.2 Node-Pair Generations
25.6.2.1 Node-Pair Generation for Three Attributed Line
and Loop Patterns
25.6.2.2 Node-Pair Generation for Four Attributed Line
and Loop Patterns
25.6.2.3 Node-Pair Generation for Four Attributed Star
Patterns
25.6.2.4 Node-Pair Generation for Five Attributed
Elongated Star Patterns
25.6.3 Pattern Detections
25.6.3.1 Three-Attributed Line Pattern
25.6.3.2 Three-Attributed Loop Pattern
25.6.3.3 Four-Attributed Line Pattern
25.6.3.4 Four-Attributed Loop Pattern
25.6.3.5 Four-Attributed Star Pattern
25.6.3.6 Five-Attributed Elongated Star Pattern
25.7 Proposed Algorithm for Detection of Patterns – Line, Loop, Star,
and Elongated Star
25.7.1 Algorithm PDAGraph345()
25.7.2 Procedure for Node-Pair Assignment
25.7.3 Procedure to Create Three-Attributed Line and Loop
Patterns
25.7.4 Procedure to Display Three-Attributed Line and Loop
Patterns
25.7.5 Procedure to Create Four-Attributed Line and Loop
Patterns
25.7.6 Procedure to Display Four-Attributed Line and Loop
Patterns
25.7.7 Procedure to Create Four-Attributed Star Patterns
25.7.8 Procedure to Display Four-Attributed Star Patterns
25.7.9 Procedure to Create Five-Attributed Elongated Star
Patterns
25.7.10 Procedure to Assign Node IDs of Five-Attributed
Elongated Star Patterns
25.7.11 Procedure to Display Five-Attributed Elongated Star
Patterns
25.7.12 Procedure to Generate Node-Pairs
25.7.13 Explanation of PDAGraph345()
25.8 Experimental Results
25.8.1 Using C++ Programming Language
25.8.1.1 Three-Attributed Line Pattern (1-2-3)
25.8.1.2 Three-Attributed Loop Pattern (2-3-4-2)
25.8.1.3 Four-Attributed Line Pattern (1-3-2-4)
25.8.1.4 Four-Attributed Loop Pattern (1-3-4-2-1)
25.8.1.5 Four Attributed Star Pattern (1-3-2-3-4)
25.8.1.6 Five-Attributed Elongated Star Pattern (1-2-3-4-
3-2)
25.8.2 Using Python Programming Language
25.8.2.1 Three-Attributed Line Pattern (1-2-3)
25.8.2.2 Three-Attributed Loop Pattern (2-3-4-2)
25.8.2.3 Four-Attributed Line Pattern (1-3-2-4)
25.8.2.4 Four Attributed Loop Pattern (1-3-4-2-1)
25.8.2.5 Four-Attributed Star Pattern (1-3-2-3-4)
25.8.2.6 Five-Attributed Elongated Star Pattern (1-2-3-4-
3-2)
25.9 Analysis of Experimental Results
25.10 Conclusion
References
25.1 Introduction
A real-life graph has a finite number of nodes, or vertices. The
relationships among the nodes are represented by
edges. The attributes associated with a node represent its
properties. In a social graph, the node attributes are
used to model personal characteristics. Similarly, in a web graph, the
node attributes hold content such as keywords
and tags related to a particular page on the web. This type of extended
graph representation is called an attributed graph. It is possible to
detect hidden patterns in an attributed graph that provide
knowledge relevant to various applications. The University
Attributed Graph can be treated as one kind of social attributed graph.
The attribute of each node in the University Attributed Graph is a
“Job Title,” such as Vice-Chancellor, Dean, or
Associate Dean.
FIGURE 25.4 (i) Detected 3-attributed line patterns (ii) Detected 3-attributed loop patterns.
The following six patterns have been detected by the proposed algorithm.
25.8.1.1 Three-Attributed Line Pattern (1-2-3)
Nine line patterns have been detected successfully
in the University Attributed Graph. The detected line patterns are 1-4-11,
1-4-12, 1-6-16, 1-7-15, 2-6-16, 8-3-11, 8-5-14, 8-9-13, and 8-9-14,
and they are depicted in Figure 25.11. The graphical representation of the
detected patterns is depicted in Figure 25.4(i).
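To make the idea of a three-attributed line pattern concrete, the sketch below searches a small attributed graph for simple paths whose node attributes match a given sequence such as 1-2-3. It uses a plain depth-first path extension, not the node-pair generation of the proposed PDAGraph345 algorithm, and the toy graph, node IDs, and function name are hypothetical. The graph is not the University Attributed Graph.

```python
def find_line_patterns(edges, attribute, pattern):
    """Return simple paths (as node-ID tuples) whose attributes follow `pattern`.

    edges:     iterable of undirected (u, v) pairs
    attribute: {node_id: attribute_value}
    pattern:   sequence of attribute values, e.g. (1, 2, 3)
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    results = []

    def extend(path):
        if len(path) == len(pattern):
            results.append(tuple(path))
            return
        wanted = pattern[len(path)]
        for nxt in sorted(adjacency.get(path[-1], ())):
            # Keep the path simple and follow the attribute sequence.
            if nxt not in path and attribute[nxt] == wanted:
                extend(path + [nxt])

    for node in sorted(attribute):
        if attribute[node] == pattern[0]:
            extend([node])
    return results


# Toy graph: node 1 has attribute 1, nodes 2-3 have attribute 2, nodes 4-5 have attribute 3.
toy_attribute = {1: 1, 2: 2, 3: 2, 4: 3, 5: 3}
toy_edges = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)]
for path in find_line_patterns(toy_edges, toy_attribute, (1, 2, 3)):
    print("-".join(str(n) for n in path))
```

Each printed line is a node-ID sequence in the same style as the detected patterns listed above, for example 1-2-4.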
FIGURE 25.20 (i) 4-Attributed detected line patterns (ii) 4-Attributed detected loop patterns.
25.10 Conclusion
The author has extended the algorithm of Bapuji Rao et al. [12],
which detects only line and loop patterns with three
attributes. The proposed algorithm, named PDAGraph345, is also able to
detect patterns with four and five attributes. The proposed
algorithm has been implemented on the proposed University Attributed
Graph and successfully detected three-attributed and four-attributed line
patterns, three-attributed and four-attributed loop patterns, the four-attributed
star pattern, and the five-attributed elongated star pattern. The
experiment was carried out using the C++ and Python programming languages,
and the results were satisfactory.
REFERENCES
1. Pfeiffer III, Joseph J., Moreno, Sebastian, Fond, Timothy La, Neville,
Jennifer, and Gallagher, Brian. “Attributed graph models: Modeling
network structure with correlated attributes”. WWW’14, April 7–11,
Seoul, Korea. ACM, 2014.
2. Bothorel, C., Cruz, J. D., Magnani, M., and Micenkova, B.
“Clustering attributed graphs - Models, measures and methods”.
Network Science 3, no. 3 (2015): 408–444.
3. Magnani, Matteo, and Rossi, Luca. “The ML-model for multi-layer
network analysis”. In Proceedings of IEEE International Conference
on Advances in Social Network Analysis and Mining. IEEE Computer
Society, Los Alamitos, 2011.
4. Gao, Jianxi, Buldyrev, Sergey V., Stanley, H. Eugene, and Havlin,
Shlomo. “Networks formed from interdependent networks”. Nature
Physics 1, no. 8 (2012): 40–48.
5. Kivelä, Mikko, Arenas, Alexandre, Barthelemy, Marc, Gleeson, James
P., Moreno, Yamir, and Porter, Mason A. “Multilayer networks”.
Journal of Complex Networks 2 (2014): 1–59.
6. Wasserman, Stanley, and Faust, Katherine. “Social network analysis:
Methods and applications”. Structural Analysis in the Social Sciences.
Cambridge University Press, 1994.
7. Fionda, Valeria, and Pirrò, Giuseppe. “Querying graphs with
preferences”. In Proceedings of CIKM’13, Oct. 27–Nov. 1, San
Francisco, CA, USA, 2013.
8. Tong, H., Faloutsos, C., Gallagher, B., and Eliassi-Rad, T. “Fast best-
effort pattern matching in large attributed graphs”. In Proceedings of
the 13th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, August 12–15, 2007, ACM, San Jose,
California, USA (2007): 737–746.
9. Silva, A., Meira Jr, W., and Zaki, M. J. “Mining attribute-structure
correlated patterns in large attributed graphs”. In Proceedings of
VLDB Endowment, no. 5 (2012): 466–477.
10. Gomes, C. S., Amaral, J. N., Sander, J., Siu, J., and Ding, L.
“Heavyweight pattern mining in attributed flow graphs”. In
Proceedings of the IEEE International Conference on Data Mining
(ICDM), December 14–17, Shenzhen, China (2014): 827–832.
11. Zhang, Q., Song, X., Shao, X., Zhao, H., and Shibasaki, R. “Attributed
graph mining and matching: An attempt to define and extract soft
attributed patterns”. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, June 23–28, Columbus,
Ohio, USA (2014): 1394–1401.
12. Rao, Bapuji, Mishra, Sarojananda, and Kumar, T. Kartik. “An
Approach to detect patterns in a social attributed graph using graph
mining techniques”. Journal of Engineering and Applied Sciences
(JEAS) 5, no. 13 (2018): 4753–4760.
13. Jorgensen, Z., Yu, T., and Cormode, G. “Publishing attributed social
graphs with formal privacy guarantees”. In Proceedings of the 2016
International Conference on Management of Data, June 26-July 1,
ACM, San Francisco, California, USA (2016): 107–122.
14. Cook, D. J., and Holder, L. B. Mining Graph Data. John Wiley &
Sons, Hoboken, New Jersey, USA, 2007.
15. Rao, Bapuji, Mitra, A., and Narayana, U. “An approach to study
properties and behaviour of social network using graph mining
techniques”. In DIGNATE 2014: ETEECT 2014 (2014): 1–6.
26
Analysis and Prediction of the Update of
Mobile Android Version
CONTENTS
26.1 Introduction
26.1.1 Mobile Fragmentation
26.1.2 Treble – Google
26.1.3 Security Fix Support and Android Update
26.2 Systematic Literature Survey
26.2.1 API Compatibility Issues and Android Updates
26.2.2 Android Updates and Software Aging
26.2.3 Android Updates and Google Play Store
26.2.4 Security Standards Hardware Rooted in Mobile Phones
26.2.5 Security Fixes and Android Update
26.2.6 Machine Learning and Android Antivirus Updates
26.2.7 Smells Detection in Android Using Machine Learning
26.2.8 Android Malicious Classification Using Various ML
Algorithms
26.3 Existing Techniques
26.4 Methodology and Tools Used in Existing Techniques
26.5 Proposed System
26.5.1 Schematic Overview of Mobile Android Update Prediction
and Analysis
26.5.2 Flow Chart Depicting Mobile Android Update Prediction
and Analysis
26.5.3 Algorithm for the Prediction and Analysis
26.5.3.1 Algorithm for Linear Regression Model and R
Programming
26.5.3.2 Algorithm for Logistic Regression Model
26.5.3.3 Algorithm for Decision Tree Model
26.5.4 Methodology
26.5.5 Software Packages Used
26.5.6 Dataset Description
26.5.6.1 Attribute and Values Information
26.5.6.2 Missing Attribute Values: None
26.6 Experimental Results and Discussions
26.6.1 Graphical Representation
26.7 Conclusions and Future Work
References
Appendix: Datasets Sample Attachments
26.1 Introduction
Output: A graph with the version number along the abscissa and the Android
API level along the ordinate, together with the accuracy of the prediction.
Algorithm 2 logistics_function()
Split the data: xtrain, xtest, ytrain, ytest = train_test_split(x, y,
test_size = 0.25, random_state = 0), where xtrain and xtest are the input
variable (x) used for training and testing, and ytrain and ytest are the
target variable (y) used for training and testing, respectively.
Initialise logistic regression: classifier = LogisticRegression(random_state = 0)
Fit the logistic regression: classifier.fit(xtrain, ytrain)
Predict: y_pred = classifier.predict(xtest)
Get the confusion matrix: cm = confusion_matrix(ytest, y_pred)
Print cm (confusion matrix) and the accuracy.
Plot the predicted values with x and y labels. “0”: no security fixes for the
particular Android version; “1”: security fixes are available.
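The steps of Algorithm 2 can be sketched in Python with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the chapter's actual script: the arrays x and y are synthetic stand-ins for the API level and the security-fix label, and the plotting step is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data only (NOT the chapter's dataset):
# x is an API level, y is 1 when security fixes are available, else 0.
x = np.arange(1, 31).reshape(-1, 1)
y = (x.ravel() >= 16).astype(int)

# 75% of the rows are used for training and 25% for testing.
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Initialise and fit the logistic regression classifier.
classifier = LogisticRegression(random_state=0)
classifier.fit(xtrain, ytrain)

# Predict the binary label for the held-out rows.
y_pred = classifier.predict(xtest)

# Rows of cm are the true classes, columns are the predicted classes.
cm = confusion_matrix(ytest, y_pred, labels=[0, 1])
accuracy = accuracy_score(ytest, y_pred)

print(cm)
print(f"accuracy = {accuracy:.2%}")
```

With only two classes, cm[0, 0] holds the true negatives and cm[1, 1] the true positives, which is how the counts reported for the confusion matrix can be read off.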
Algorithm 3 decision_tree_function()
Import the dataset.
Separate the input and target variables: X =
data_set.iloc[:,1].values.reshape(-1,1); Y holds the corresponding target
values.
Separate the training and test datasets: X_train, X_test, y_train, y_test =
train_test_split(X, Y, test_size = 0.5, random_state = 100), where the test
size is 50% and random_state is 100.
clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100,
max_depth = 3, min_samples_leaf = 5) for training the dataset using the
Gini index.
Equation 3.4: $H(x) = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$, where x can
take N different values indexed from i = 1 to N, p(x_i) is the probability
value and H(x) is the entropy value.
clf_gini.fit(X_train, y_train) to fit the Gini decision tree model.
clf_entropy = DecisionTreeClassifier(criterion = "entropy", random_state =
100, max_depth = 3, min_samples_leaf = 5) for training the dataset using
entropy.
clf_entropy.fit(X_train, y_train) to fit the entropy decision tree model.
y_pred = clf_object.predict(X_test) to predict the values using the decision
tree model, where clf_object is the fitted classifier (clf_gini or
clf_entropy).
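Algorithm 3 can be sketched along the following lines. The data here is synthetic and purely illustrative, and the body of calc_accuracy is an assumption: the chapter names the function but does not show it in this section. The classifier parameters are the ones listed in the algorithm.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset (NOT the chapter's data):
# one input column and a binary target (1 = security fix available).
X = np.arange(1, 41).reshape(-1, 1)
Y = (X.ravel() >= 21).astype(int)

# Half of the rows are held out for testing, as in Algorithm 3.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.5, random_state=100
)


def train_tree(criterion):
    """Fit a depth-limited tree with the given split criterion."""
    clf = DecisionTreeClassifier(
        criterion=criterion,
        random_state=100,
        max_depth=3,
        min_samples_leaf=5,
    )
    return clf.fit(X_train, y_train)


def calc_accuracy(y_true, y_pred):
    """Hypothetical helper: accuracy as a percentage."""
    return accuracy_score(y_true, y_pred) * 100


clf_gini = train_tree("gini")
clf_entropy = train_tree("entropy")

results = {}
for name, clf_object in (("gini", clf_gini), ("entropy", clf_entropy)):
    y_pred = clf_object.predict(X_test)
    results[name] = calc_accuracy(y_test, y_pred)
    print(f"{name}: accuracy = {results[name]:.1f}%")
```

Training both criteria on the same split makes the two trees directly comparable, since any difference in accuracy then comes from the split criterion alone.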
26.5.4 Methodology
The process starts with collecting the dataset, which comprises the release
days between each version of Android. The input dataset contains data for
manufacturers such as Samsung, Huawei, OnePlus, HTC, Nokia, Sony Xperia,
Motorola, Xiaomi Mi and LG. Four versions of Android are considered in the
analysis and prediction: 7.0 Nougat, 8.0 Oreo, 9.0 Pie and Android 10. The
data is analyzed and predictions are made using a machine learning
regression technique and R programming. Linear regression is used for the
prediction, and the residual error is used to check its correctness. The
regression model is trained and tested on the dataset. The best-fit line is
found by computing its slope and y-intercept with the formula in equation
3.1. The prediction is done using the equation of the line shown in equation
3.3: once the weight and bias are found, these values are used to find the
predicted point for any x.
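The fitting step described above can be illustrated with a short sketch. It assumes that equation 3.1 is the usual ordinary least-squares formula for the slope and intercept and that equation 3.3 is the line y = weight * x + bias; neither equation appears in this section, so both are assumptions. The function names and the sample numbers are made up for illustration.

```python
import numpy as np


def fit_line(x, y):
    """Ordinary least-squares slope (weight) and intercept (bias)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict(x, slope, intercept):
    """Evaluate the fitted line at x."""
    return slope * np.asarray(x, dtype=float) + intercept


# Made-up points lying exactly on y = 2x + 1 (not the chapter's data).
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

slope, intercept = fit_line(x, y)
y_predicted = predict(x, slope, intercept)

# Residual error: difference between observed and predicted values.
residuals = np.asarray(y, dtype=float) - y_predicted

print(slope, intercept)
print(predict(5, slope, intercept))
print(residuals)
```

Because the sample points lie exactly on a line, the residuals are zero here; on real release-day data they would measure how far each observation falls from the best-fit line.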
Scatter is used to draw the points given in the x and y arrays, where x is
the list of the x-coordinates of all the points and y is the list of their
y-coordinates. The predicted values are drawn on the same plot as a
continuous line: since every y_predicted value is calculated from the
equation of the line at the corresponding x value, all the points
(x, y_predicted) are collinear and hence form a line. Using the prediction
function, the values are predicted and the respective graphs are plotted.
The same analytical approach is followed in R programming, and the predicted
values are printed, which gives the number of days until the next release.
The data is also analyzed statistically by obtaining the mean, median and
mode values. The rating of the users against the number of days is also
visualized as a graph.
Security assurance during the update of any version of Android is essential.
A lack of security results in several software problems and leaves the data
highly insecure. This means that if security fix support is not provided or
is unavailable with an Android update, the user’s data is prone to hacking:
any third-party software can easily obtain the confidential files and other
resources stored on the Android device. Logistic regression is helpful here.
It is generally used to predict binary values, which serves the purpose of
identifying whether or not the updated version of Android has security fix
support. The dataset used comprises the version history of Google Android
releases from Android version 1 to Android 11. The API security levels are
also considered. The model is then trained and tested. The availability or
non-availability of security fix support for a particular version of Android
is predicted and plotted as a graph. The model is also evaluated using a
confusion matrix to observe the true and false predictions, and its accuracy
score is displayed as a percentage.
The Decision Tree model is also used to predict the target variable. Its
structure is such that every internal node represents a test on a specific
input variable (feature), each leaf node represents a class label, and the
branches represent the conjunctions of features that lead to those class
labels. In the algorithms used, the process is divided into a building
phase, a training phase, a testing phase and a prediction phase. After the
Decision Tree classifier is initialised, the Gini index and entropy
functions are invoked. The Gini index and the entropy of equation 3.4 are
the split criteria used to train the trees, and the predictions of each tree
are then checked to find out which values are predicted correctly and which
falsely. The target variable here is the availability or non-availability of
a security fix in each updated version of Android. The test size for the
dataset split is set to 50%. The values are predicted, and the confusion
matrix, precision, recall, f1-score and support values are displayed. The
accuracy of the model is also found by using the calc_accuracy function.
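To make the two split criteria concrete, the sketch below computes the entropy of equation 3.4 and the Gini index for a list of class labels. The Gini formula used, one minus the sum of squared class probabilities, is the standard definition and is not stated in this section; the label lists are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """H(x) = -sum_i p(x_i) * log2 p(x_i) over the distinct labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 - sum_i p(x_i)^2 (standard definition)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


# Invented label lists: 1 = security fix available, 0 = not available.
mixed = [0, 0, 1, 1]   # evenly split node
pure = [1, 1, 1, 1]    # node containing a single class

print(entropy(mixed), gini(mixed))
print(entropy(pure), gini(pure))
```

Both measures are largest for an evenly split node and zero for a pure node, which is why a decision tree prefers the splits that reduce them the most.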
2. Dataset 2:
Manufacturer: Displays the name of each mobile manufacturer.
The following are the other attributes; their values are the
number of days taken by each manufacturer to make Android
compatible for the next release.
Android 10, 9.0 Pie, 8.0 Oreo, 7.0 Nougat
Figure 26.12 shows the results obtained from Lasso regression. The model is
evaluated using the confusion matrix shown there: the true negative count is
3 and the true positive count is 1. The accuracy of the model is 100%, as
all the values in the test dataset were predicted correctly. Figure 26.12
also plots the predicted result. The abscissa contains the Android version
number and the ordinate displays the API security level for each version,
from Android version 1 to the recently released Android 11. The Lasso
regression results give a better understanding of the security fixes that
come with each version of Android: “1” denotes that the version has security
fix support and “0” denotes that security fix support is not available.
The results obtained from the Decision Tree are shown in Figure 26.13. The
Decision Tree model gives an accuracy of 75%. With “1” categorised as
security fix availability and “0” as non-availability, the entropy results
also report, for each of the categories 1 and 0: the precision; the recall
score, which is the ratio of true positives to the sum of true positives and
false negatives; the f1 score, which is the harmonic mean of precision and
recall; and the support, which is the number of true instances in each
category. These results are obtained from the confusion matrix. Using the
results obtained from the Decision Tree, Lasso and Linear Regression models,
conclusions about the usage of a particular updated version of Android and
the availability of its security fixes are supported experimentally.
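The reported metrics follow directly from the four cells of a binary confusion matrix, as the sketch below shows. The function name is hypothetical. The example call uses the counts reported for the Lasso model (3 true negatives, 1 true positive), with the false positive and false negative counts taken to be zero because the stated accuracy is 100%.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall, F1 and support for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "support": tp + fn,  # number of true instances of class "1"
    }


# Counts for the Lasso model; fp = fn = 0 is inferred from 100% accuracy.
metrics = classification_metrics(tn=3, fp=0, fn=0, tp=1)
print(metrics)
```

The same function applied with the roles of the classes swapped gives the per-class values for category "0".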
REFERENCES
1. S. Scalabrino, G. Bavota, M. Linares-Vásquez, et al. “API
compatibility issues in Android: Causes and effectiveness of data-
driven detection techniques”. Empir. Softw. Eng. 25, 5006–5046
(2020).
2. S. Amann, H. A. Nguyen, S. Nadi, T. N. Nguyen, and M. Mezini, “A
systematic evaluation of static API-misuse detectors”. IEEE Trans.
Softw. Eng. 45(12), 1170–1188 (1 Dec. 2019), doi:
10.1109/TSE.2018.2827384.
3. D. Cotroneo, F. Fucci, A. K. Iannillo, R. Natella, and R. Pietrantuono,
“Software aging analysis of the Android mobile OS”. IEEE 27th Int.
Symp. Softw. Reliab. Eng. (ISSRE), Ottawa, ON, 2016, pp. 478–489,
doi: 10.1109/ISSRE.2016.25.
4. S. McIlroy, N. Ali, and A.E. Hassan, “Fresh apps: An empirical study
of frequently-updated mobile apps in the Google play store”. Empir.
Softw. Eng. 21, 1346–1370 (2016).
5. N. Ashraf, A. Masood, H. Abbas, et al. “Analytical study of
hardware-rooted security standards and their implementation techniques in
mobile”. Telecommun. Syst. 74, 379–403 (2020).
6. L. Singleton, R. Zhao, M. Song, and H. Siy, “FireBugs: Finding and
repairing bugs with security patterns”. IEEE/ACM 6th Int. Conf. on
Mob. Softw. Eng. and Syst. (MOBILESoft), Montreal, QC, Canada,
2019, pp. 30–34, doi: 10.1109/MOBILESoft.2019.00014.
7. N. Nissim, R. Moskovitch, O. BarAd, et al. “ALDROID: Efficient
update of Android anti-virus software using designated active learning
methods”. Knowl. Inf. Syst. 49, 795–833 (2016).
8. A. Gupta, B. Suri, and V. Bhat, “Android smells detection using ML
algorithms with static code metrics”. In: Batra U., Roy N., and Panda
B. (eds) Data Science and Analytics. REDSET 2019. Communications
in Computer and Information Science, vol. 1229, 2020. Springer,
Singapore.
9. M. Anshori, F. Mar'i, and F. A. Bachtiar, “Comparison of machine
learning methods for Android malicious software classification based
on system call”. Int. Conf. Sustainable Inf. Eng. and Technol. (SIET),
Lombok, Indonesia, 2019, pp. 343–348, doi:
10.1109/SIET48054.2019.8985998.
10. Z. Guo, Z. Lv, B. Zhou, and C. Chen, “Feature detection and security
evaluation of mobile phone based on decision tree”. 14th Int. Comput.
Conf. Wavelet Active Media Technol. Inf. Process. (ICCWAMTIP),
Chengdu, 2017, pp. 89–92, doi: 10.1109/ICCWAMTIP.2017.8301455.
11. M. Xu, C. Song, Y. Ji, M.-W. Shih, K. Lu, C. Zheng, R. Duan, Y. Jang,
B. Lee, C. Qian, S. Lee and T. Kim, “Toward engineering a secure
Android ecosystem: A survey of existing techniques”. ACM Comput.
Surv. 49, 2 (2016), Article 38 (November 2016).
12. N.-V. Long, J. Ahn, and S. Jung, “Android fragmentation in malware
detection”. Comput. Secur. 87, 101573 (2019).
13. G. Yang, J. Jones, A. Moninger, and M. Che, “How do Android
operating system updates impact apps?” IEEE/ACM 5th Int. Conf.
Mob. Softw. Eng. Syst. (MOBILESoft), Gothenburg, 2018, pp. 156–
160.
14. P. Kong, L. Li, J. Gao, K. Liu, T. F. Bissyandé, and J. Klein,
“Automated testing of Android apps: A systematic literature review”.
IEEE Trans. Reliab. 68(1), 45–66 (March 2019), doi:
10.1109/TR.2018.2865733.
15. J. DeLoach, D. Caragea, X. Ou, “Android malware detection with
weak ground truth data”. Proc. of the Thirty-First AAAI Conf. Artif.
Intell., AAAI Press, 2017, p. 4915.
16. K. Aggarwal, A. Hindle, and E. Stroulia, “GreenAdvisor: A tool for
analyzing the impact of software evolution on energy consumption”.
IEEE Int. Conf. on Softw. Maintenance and Evolution, ICSME 2015,
2015, pp. 311–320.
17. Q. Do, G. Yang, M. Che, D. Hui, and J. Ridgeway, “Redroid: A
regression test selection approach for Android applications”. The 28th
Int. Conf. Softw. Eng. Knowledge Eng. SEKE 2016, 2016.
18. Y. Aafer, G. Tao, J. Huang, X. Zhang, N. Li, “Precise Android API
protection mapping derivation and reasoning”. Proc. ACM SIGSAC
Conf. Comput. Commun. Secur., ACM, 2018, pp. 1151–1164.
19. R. Mahmood, N. Mirzaei, and S. Malek, “EvoDroid: Segmented
evolutionary testing of Android apps”. Proc. ACM SIGSOFT Int.
Symp. Found. Softw. Eng., 2014, pp. 599–609.
20. L. Li, “Mining androzoo: A retrospect”. Proc. Doctoral Symp. 33rd
Int. Conf. Softw. Maintenance Evolution, 2017, pp. 675–680.
21. L. Li, T. F. Bissyande, Y. Le Traon, and J. Klein, “Accessing
inaccessible Android APIs: An empirical study”. Proc. 32nd Int. Conf.
Softw. Maintenance Evolution, 2016, pp. 411–422.
22. N. Mirzaei, J. Garcia, H. Bagheri, A. Sadeghi, and S. Malek,
“Reducing combinatorics in GUI testing of Android applications”.
Proc. Int. Conf. Softw. Eng., 2016, pp. 559–570.
23. Y. Hu, I. Neamtiu, and A. Alavi, “Automatically verifying and
reproducing event-based races in Android apps”. Proc. Int. Symp.
Softw. Testing Anal., 2016, pp. 377–388.
24. L. Li, T. F. Bissyande, H. Wang, and J. Klein, “CiD: Automating the
detection of API-related compatibility issues in Android apps”. Proc.
ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2018, pp. 153–163.
25. L. Wei, Y. Liu, and S. Cheung, “Taming Android fragmentation:
Characterizing and detecting compatibility issues for Android apps”.
Proc. 31st IEEE/ACM Int. Conf. Automated Softw. Eng., 2016, pp.
226–237.
26. N. Mirzaei, S. Malek, C. S. Păsăreanu, N. Esfahani, and R. Mahmood,
“Testing Android apps through symbolic execution”. Proc. ACM
SIGSOFT Softw. Eng. Notes, 2012, pp. 1–5.
27. R. Hay, O. Tripp, and M. Pistoia, “Dynamic detection of inter-
application communication vulnerabilities in Android”. Proc. Int.
Symp. Softw. Testing Anal., 2015, pp. 118–128.
28. G. d. C. Farto and A. T. Endo, “Evaluating the model-based testing
approach in the context of mobile applications”, Electron. Notes Theor.
Comput. Sci. 314, 3–21 (2015).
29. P. Bielik, V. Raychev, and M. T. Vechev, “Scalable race detection for
Android applications”. Proc. ACM SIGPLAN Int. Conf. Object-
Oriented Program., Syst., Lang. Appl., 2015, pp. 332–348.
30. S. Packevicius, A. Usaniov, S. Stanskis, and E. Bareisa, “The testing
method based on image analysis for automated detection of UI defects
intended for mobile applications”. Proc. Int. Conf. Inf. Softw. Technol.,
2015, pp. 560–576.
31. H. Cai, N. Meng, B. Ryder, D. Yao, “Droidcat: Effective Android
malware detection and categorization via app-level profiling”. IEEE
Trans. Inf. Forensics Secur. 14 (6), 1455–1470 (2018).
TABLE 26.1
Prediction Model Sample Dataset1
TABLE 26.2
Prediction Model Sample Dataset2
Note: Italicized page numbers refer to figures, bold page numbers refer to
tables
A
accelerometers, 23
access control, 378–379
acompositional dimension, 411
acoustic performance prediction, 267–275
backward elimination model, 270
cross validation, 271
data preprocessing, 270
deployment and optimization, 271
error analysis, 273
forward selection model, 270, 275
overview, 267–268
regression model, 270
regressor model, 271
schematic diagram of impedance tube, 269
sound absorption values, 272, 273
structural parameters of each layer material, 271
ACRYL, 434
active attacks, 335, 379
Adadelta, 175, 177–178, 182–185
Adagrad, 175, 177, 182–185
Adaline, 191
Adam, 175, 178, 182–185
AdaMax, 179, 182–185
adaptive filter
general framework, 163
as linear predictor, 163
adjacency matrix, 426
ADNS-3080 sensor, 40
AFG Miner, 411
Airbnb, 334
ALDROID, 436
AlexNet, 192
AlexNetOWTBN, 192
algorithmic trading, 293–302
Amazon, 334
anaffiliation dimension, 411
Android updates, 433–449
algorithm for prediction and analysis, 441–443
decision tree model, 443, 449
linear regression model, 441, 447
logistic regression model, 442
R programming, 441
antivirus updates, 436
API compatibility issues, 434–435
attribute and values formation, 445
confusion matrix, 448
datasets, 445, 451
existing techniques, 436–437, 437
Google Play Store, 435
graphical representation, 445–447
Lasso regression plot, 448
malicious classification, 436
methodology, 443–444
mobile fragmentation, 434
proposed system for, 437–445
flow chart, 439, 439–441, 440, 441
schematic overview, 437–439, 438
security fix support, 434
security fixes, 435
security standards, 435
smells detection, 436
software aging, 434
software packages, 444
application layer attacks, 391
artificial intelligence (AI), 4
artificial neural network (ANN)
algorithms, 344, 345
architecture, 279–280
in customer churn prediction, 319–320, 327
in DDoS attack detection, 392
in diabetes prediction, 85, 219
in hyper-spectral imaging, 74
in stock market prediction, 306
association mining, 238
association rule learning algorithms, 344, 345
in IoT security, 348
astructural dimension, 411
attributed graphs, 410, 412
adjacency matrix, 426
audio content analysis, 243
authentication, 378
auto encoders, 340, 350–351; see also artificial neural network (ANN)
autoregressive integrated moving average (ARIMA), 305–306
autoregressive moving average (ARMA), 305–306
average packet size, 394–396
B
back propagation, 191
backward elimination model, 270, 273; see also forward selection
model
bacterial leaf blight, 49
banking sector, churn prediction in, 317–328
Bayesian algorithms, 343, 344
in IoT security, 347–348
Bengali language, 358
big data, 19
categories, 49
characteristics of, 293
defined, 293–294
features of, 49–50, 50
machine learning tools, 52
binary cellular automata, 148
binary data sequence, 151–152, 153
biometrics, 376
bitrate, 394
Bluetooth, 334
Boltzmann machines, 340
Boolean cellular automata, 148
BreakHis dataset, 252–253, 253, 254, 263
breast cancer, 124–125
breast cancer detection, 248–264
advantages, 263
BreakHis dataset, 252–253, 253, 254, 263
convolutional neural network approach, 251–252
data acquisition, 252–253
data preprocessing, 254
deep transfer learning approach, 252
future works, 264
invasive ductal carcinoma (IDC), 252–253, 258, 262
machine learning approach, 251
model architecture, 253
overview, 248–249
performance evaluation metrics, 257
transfer learning model, 254–257
confusion matrix, 261
hyper-parameter values, 260
performance, 259
brown spot, 49
Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, 7
C
C++ programming language, 421–428
cellular automata, 145–159
binary, 148
Boolean, 148
defined, 146
evolutions, 149–150
modeling of dynamical systems, 151–158
overview, 145–146
spatially hybrid, 148
temporally hybrid, 147
Chatkhil dialect, 359
China Laboratories, 217
chloroquine, 221
chronic obstructive pulmonary disease (COPD) prediction and
classification, 203–214
classification model, 208
classification of COPD disease, 207
decision tree analysis, 209
KNN algorithm for, 209–210
logistical regression model, 209
methodology, 206–207
overview, 203–204
pseudo code for feature selection, 207
random forest, 209
studies on, 204–206
churn prediction in banking sector, 317–328
analysis of churn and loyal customers, 326
block diagram, 322
churn rate, 317–318
clustering of churned customers, 327
confusion matrix, 325
data preprocessing, 319–320, 321
datasets, 319–320
methodology, 319–324
motivation for, 318
proposed system for, 320–324
back propagation, 320–322, 322
error calculation, 320, 322
feature values, 320
forward propagation, 320, 322
model evaluation, 324
model improvement, 324
model tuning, 324
weight initialization, 320
studies on, 318–319
classic neural networks (CNNs), 339–340
classification, 5, 53, 238, 341, 368–369
classification layer, 75
cloud computing
complementing features with IoT, 333
versus IoT, 333, 333
cloud computing, IoT-based, 331–353
for business continuity, 333
challenges in, 334
networking and communication protocols, 334
outsized amount of knowledge, 334
sensor networks, 334
data integration, 332
deep learnings in, 337–338
and edge computing, 333
for hosting, 332–333
inter-device communication, 333
machine learning in, 337–338
overview, 332
privacy, 332
remote computing, 333
security, 332
security challenges, 334–335
active attacks, 335
attacks in IoT, 335
passive attacks, 335
clustering, 238
algorithms, 343, 344
coefficients, 341
cognitive computing, 22
communication, 22
compositional dimension, 411
computational linguistics, 240
computer vision-based security system, 375–386
algorithms
capturing images for database, 382
face recognition, 382–383
for security, 384
for training the face recognizer, 383–384
biometrics, 376
computer vision, defined, 376
data security, 376–380
face detection, 381
face recognition, 381, 382, 384–385
future scope, 385–386
Haar cascade classifier, 381
OpenCV library, 376
overview, 375
confidentiality, 377–378
confusion matrix, 121, 121, 123
connectionism, 191
consumer privacy, 22
Conv2D, 193
conventional layer, 74
conventional neural networks (CNNs), 74–75
classification layer, 75
conventional layer, 74
fully connected layer, 74
pooling layer, 74
rectified linear unit layer, 74
softmax layer, 75
convolution layer, 9
convolutional neural networks (CNNs), 9–11, 339
1D, 77
2D, 77
3D, 77
in breast cancer detection, 251–252
for chronic disease prediction, 221
convolution layer, 9
in crop modeling, 74–75
fully connected layer, 11
general model of, 9, 10
hybrid spectral, 77
hyper-spectral image processing, 77–78
in IoT security, 350
last layer activation function, 11
nonlinear activation function, 11, 12
pooling layer, 11
in three-tiered educational analytics framework, 234
in wheat rust disease detection, 192
coronavirus multimodal risk disease prediction, 217–229
in coronavirus disease prediction
F1 score, 225
precision, 225
sensitivity, 225
specificity, 224
death rate by pre-existing medical condition, 218
LSTM model, 224
machine learning algorithms for, 218–219
multimodality, 221–223
naïve Bayes, 223
origin of coronavirus, 217–218
performance evaluation, 224–225
accuracy, 224
and pre-existing medical conditions, 220–221
risk factors in, 221–223
RNN-multimodal, 223
studies on, 219–221
symptoms of coronavirus, 218
total cases of coronavirus, 218
correlation analysis, 237–238
correlation matrix, 103–106, 104
limitations, 111
cosmic-ray (C-RAY) sensing, 36–37
cost function, 5
count plot, 127
covariance matrix
Boston housing datasets, 100–101, 101, 102
correlation matrix, 103–106, 104
data visualization, 101–103
eigenvalues of, 130
eigenvectors of, 130
extracting features and target, 108
feature scaling, 109
features selection using, 99–124
importing modules, 99–100
limitations of correlation matrix, 111
n-dimensional, 129–130
Pearson correlation coefficient, 103
regression model with reduced dataset, 109–111
3-dimensional data visualization, 106–108
training and testing data sets, 109
COVID-19 cases in India, 154–155, 156
crop illness and irritation, 54
crop infection, 53–54
crop infection recognition system, 47–63
classification, 56, 59
disease identification, 62–63
feature extraction, 56, 58–59
fuzzy C-means model, 57–58
image acquisition, 54
image enhancement, 54
image preprocessing, 57
image segmentation, 54, 55, 57–58
segmentation techniques, 61–62, 62
Spark-based versus Hadoop-based FCM approach, 59–61
scale-up performance, 60–61
speed-up performance, 59–60
crop modeling, 67–80
band selection, 73–74
convolutional neural network, 74–75
decision support system, 68–69, 71–72, 72
deep learning in, 72–73
defined, 70
hyper-spectral imaging in, 73
advanced processing, 76–78
crop data, 76
farm data, 75–76
machine learning in, 72
necessity for, 70
overview, 67–68
smart farming, 68
steps in, 70
trends in, 70–71
crop yield
decision support system for, 69
prediction, 54
cryptography, 377–378
curse of dimensionality, 361, 361
cyberattacks, 335
cybernetics, 191
D
data
access control, 378–379
authentication of, 378
confidentiality, 377–378
encryption, 377–378
erasure, 377
integrity, 378
masking, 377
non-repudiation, 378
resilience, 377
scalability, 24
storage and analysis, 24
structures, 24
visualization, 24, 101–103
principal component analysis, 125–127, 135
t-distributed stochastic embedding, 112
3-dimensional, 106–108
data analytics, 3
bio-inspired algorithms for, 15, 17
data imputation LMS (DI-LMS) algorithm, 162
data science
components of, 20, 20
defined, 19
growth of, 20
IoT applications in, 24, 25
method for, 21
step by step of, 21
sub-domain for IoT, 22–23
data security, 376–380
attacks, 379–380
data masking, 377
goal, 377
services, 377–379
types of, 377
data.table package, 83
datasets
testing, 109
training, 109
decision support system, 68–69
for crop models, 71–72, 72
for crop yield, 69
decision tree (DT)
algorithms, 343, 344
Android update prediction and analysis, 443
in COPD disease classification, 209
in crop disease analysis, 51, 51
in crop infection recognition system, 59
in diabetes mellitus prediction, 85–86, 86
in IoT security, 347
decryption, 377–378
deep auto encoders, 350–351
deep belief network (DBN), 234, 351
deep Boltzmann machine (DBM), 340
deep learning (DL), 4, 4, 8, 8–15, 173
algorithms, 345–346
dimensionality reduction, 345, 346
ensemble, 346, 346
applications, 337
classification of, 339–340
convolutional neural networks, 9–11
in crop modeling, 72–73
extreme learning machine, 13–14
hybrid learning, 351–353
and IoT, 22
for IoT security, 350–353, 352
in IoT-based cloud computing, 337–338
methods for, 8–15, 350, 352
semi-supervised learning, 351–353
supervised models, 339
transfer learning, 14–15
unsupervised learning, 350–351
unsupervised models, 339
deep neural networks, 173–186
Adadelta, 177–178, 182–185
Adagrad, 177, 182–185
Adam, 178, 182–185
AdaMax, 179, 182–185
NADAM, 179, 182–185
overview, 173–174
RMSprop, 176–177, 182–185
SGD, 175–176, 182–185
SGD with Momentum, 176, 182–185
deep reinforcement learning, 352–353
deep transfer learning, 252
degrees of freedom (DOF), 23, 28
DenseNet, 173, 174
DenseNet161, 252
DenseNet201, 175, 249, 258, 261, 262
Derrirer probabilities, 7
diabetes mellitus prediction, 83–96
correlation matrix, 90–91
data training and testing, 91–92
datasets, 85–87, 86, 91–92
experimental analysis, 92–93
implementation methods, 85–87
decision tree, 85–86, 86
naïve Bayesian algorithm, 87
random forest, 86
support vector machine, 87, 88
least-angle regression (LARS), 84
least-angle regression technique, 85
model fitting, 92
R language, 83–84
visualization, 87–90, 89–90
dialect identification of Bengali language, 357–372
accuracy of, 370–371
feature computation, 359–368
Mel frequency cepstral coefficients, 362–365, 365, 366–367
skewness-based, 365–367
spectral flux, 367–368, 368, 369
zero crossing rate, 361–362, 363–364
feature selection, 359–368
feature vector and classification, 368–369
overview, 357–358
previous studies on, 358–359
proposed methodology, 359–368, 360
relative analysis, 371, 371
use of Bengali language, 358
digital content, 233
dimensionality reduction algorithms, 97–143, 341, 345, 346
feature selection using covariance matrix, 99–124
Boston housing datasets, 100–101, 101, 102
correlation matrix, 103–106, 104
data visualization, 101–103
extracting features and target, 108
feature scaling, 109
importing modules, 99–100
limitations of correlation matrix, 111
Pearson correlation coefficient, 103
regression model with reduced dataset, 109–111
3-dimensional data visualization, 106–108
training and testing data sets, 109
overview, 98–99
principal component analysis (PCA), 124–143
2-D visualization, 135
3-D visualization, 135
captured variance and data lost, 134
classification modeling with support vector machines, 137–
142
constructing feature matrix, 131
data transformation, 131
data visualization, 125–127
dataframe preparation, 134, 135
derivation of dataset, 131
eigenvalues of covariance matrix, 130
eigenvectors of covariance matrix, 130
limitations of, 142–143
mean vector, 129
n-dimensional covariance matrix, 129–130
Scikit-Learn Library, 133
sorting the eigenvalues and eigenvectors, 131
splitting data into test and train sets, 135–137
standardization, 129
stepwise, 133
training and testing data sets, 128
transposing data for usage into Python, 129
UCI breast cancer dataset, 124–125, 126
verification of library, 133
visualizations, 134–135
t-distributed stochastic embedding (t-SNE), 111–124
2-D visualization, 115, 115–116, 116
3-D visualization, 116, 117, 117
confusion matrix, 121, 121, 123
data visualization, 112, 113
defined, 112
extracting features and target, 118
F1 score, 121
Jaccard index, 121
k-nearest neighbors, 118
limitations of, 123–124
MNIST handwritten digits dataset, 111–112
model accuracy, 121
probability and mathematics behind, 114–115
random sampling of large dataset, 112
training and testing data sets, 118–119
discrete cosine transformation (DCT), 264
distributed denial of service (DDoS) detection system, 389–404
accuracy comparison, 403
bandwidth of, 389
confusion matrix, 399
datasets, 398–399, 399–400
machine learning-based, 395
methodology, 394–398
detection of stealthy attacks, 397
oversampling, 396–397
ranking of machine learning algorithms, 397–398, 400–401,
402
synthetic sampling, 396–397
metrics, 399–401
multivector, 389
observations, 401–404
overview, 389–390
related studies, 391–393, 394
types of attacks, 389, 391
vectors, 389
DNA sequence, 152, 154
Docstrings, 105
dplyr package, 83
Dyn, 334
dynamical system, 145, 151–158
spatially and temporally hybrid CA models, 155–158
temporally hybrid cellular automata models, 152–155
E
Eclipse, 380
edge computing, 23
edge-attributed graphs, 410–411
multi-layer, 412
multi-relational, 411
educational data mining (EDM), 233
EfficientNet, 173
EfficientNetB5, 174, 175
eigenvalues
of covariance matrix, 130
sorting, 131
eigenvectors
choosing, 131
of covariance matrix, 130
with largest eigenvalues, 131
sorting, 131
electronic compass, 39
elongated star patterns, five-attributed
node-pair generations for, 415
pattern detections, 418, 418
with C++ programming language, 428, 428
with Python programming language, 430, 430, 431
emotion lexicons, 240
encryption, 377–378, 378
ensemble learning (EL)
algorithms, 346, 346
in IoT security, 348
ensemble of deep learning networks (EDLNs), 352
Ethiopia, 191
Euclidean distance, 5
evolutionary algorithms, 280
evolutionary programming, 280
evolutionary strategies, 280
expectation–maximization method, 162
extreme learning machine (ELM), 13–14; see also artificial neural
network (ANN)
structure, 13
Extrinsic 12-axis sensor platform, 28
eye movement, 243
F
F1 score, 121, 225, 257
face detection, 381; see also computer vision-based security system
face recognition, 381, 382; see also computer vision-based security
system
Facebook, 410
facial expressions, 243
false positive (FP), 257
false positive rate (FPR), 257
feature extraction, 53
feature-representation-transfer approach, 15
fertilizer spreading analysis
define phase, 41
experimental results, 44
experimental setup, 43
implementation phase, 43
methodology, 41
nutrient availability, 41
prototype development phase, 42
quick design phase, 42
soil pH value, 41
system architecture, 42
testing phase, 42
finite state semi-automaton (FSSA), 146
five-attributed elongated star patterns
node-pair generations for, 415
pattern detections, 418, 418
with C++ programming language, 428, 428
with Python programming language, 430, 430, 431
forward selection model, 270, 275; see also backward elimination
model
four-attributed line pattern
node-pair generations for, 414–415
pattern detections, 415, 417
with C++ programming language, 426–427, 427
with Python programming language, 428–431
four-attributed loop pattern
pattern detections, 417, 418
with C++ programming language, 427, 427
with Python programming language, 430
four-attributed star pattern
node-pair generations for, 415
pattern detections, 418, 418
with C++ programming language, 427, 428
with Python programming language, 430, 430
Freescale, 28
fully connected layer, 11, 74
fuzzy C-means (FCM), 51
in crop disease analysis, 51
Spark-based versus Hadoop-based, 59–61
G
Gaussian mixture models, 358–359
general model, 9, 10
generative adversarial network (GAN), 252, 351
generative learning, 350–351
genetic algorithms (GAs), 281
genetically guided enhanced artificial bee colony (GGEABC), 279–290
breast cancer dataset, 286–287, 288
control parameters, 285
convergence graphs, 289
diabetes dataset, 286, 288
flowchart, 283
metaheuristics, 280–281
MLP training, 282–284
SAheart dataset, 287–288, 288
simulation setup, 284–290
geographic indications, 22
geospatial data analysis, 22
global average pooling, 11
global configuration, 146
global transition function, 146
of spatially hybrid cellular automata, 150
Google, 434
Google Colab, 257
Google News, 341–342
Google Play Store, 435
GoogleNet, 173, 192, 252
GPS/GIS-based agricultural system, 34–44
fertilizer spreading analysis, 41–44
soil moisture and mineral content measurement, 35–39
cosmic-ray (C-RAY) sensing, 36–37
long duration optical fiber grating, 37
using sensor device, 37
tools, 34–35
data analysis, 35
GIS system, 35
information, 34
map, 34
system apps, 34
system development, 35, 36
graph matching, 410
graph mining, 409–431
adjacency matrix of attributed graph, 426
algorithm for pattern detection, 419–424
assigning node ID of five attributed elongated star patterns,
422–423
creating five attributed elongated star patterns, 422
creating four attributed line and loop patterns, 421
creating four attributed star patterns, 421–422
creating three-attributed line and loop patterns, 421
displaying five attributed elongated star patterns, 423
displaying four attributed line and loop patterns, 421
displaying four attributed star patterns, 422
displaying three-attributed line and loop patterns, 421
node-pair generations, 423
non-pair assignment, 420
PDAGraph345, 419–420, 423–424
attributed graphs, 412
C++ programming language in, 421–428
multi-layer edge-attributed graph, 412
multi-relational edge-attributed graph, 411–412
node-attributed adjacency matrix, 426
node-pair generations, 414–415
for five-attributed elongated star patterns, 415
for four-attributed line and loop patterns, 414–415
for four-attributed star patterns, 415
procedure for, 423
for three-attributed line and loop patterns, 414
overview, 410–411
pattern detections, 415–418
five-attributed elongated star pattern, 418, 418, 428, 428, 430,
430, 431
four-attributed line pattern, 415, 417, 426–427, 427, 428–431
four-attributed loop pattern, 417, 418, 427, 427, 430
four-attributed star pattern, 418, 418, 427, 428, 430, 430
three-attributed line pattern, 415, 416, 425, 427, 429, 430
three-attributed loop pattern, 415, 416, 425, 427, 429
problem definition, 412
proposed approach, 412–418
node-pair generations for, 414–415
pattern detections, 415–418
pattern lengths, 412–414
Python programming language, 428–431
G-Ray, 411
Gray, Jim, 19
GridSearchCV, 142
Grubhub, 334
H
Haar cascade classifier, 381
Hadoop, 52
Hadoop distributed file system (HDFS), 52
hand movement, 243
HBO, 334
heart disease dataset, 171
Hellinger distance (HD), 390, 397
Helpnetsecurity, 389
Hidden Markov model, 358
Hive ecosystem, 308
horse colic dataset, 171
HTTP flood, 391
hybrid learning, 351–353
hyperparameters, 142
hyper-plane, 137
hyper-spectral imaging (HSI)
advanced processing, 76–78
band selection, 73–74
crop data, 76
farm data, 75–76
methodologies, 73
supervised learning, 74
unsupervised learning, 73–74
I
image recognition/processing, 52
improved missing degree, 165
imputation method, 161–171
classification, 166
data collection, 165
data preprocessing, 165–166
evaluation, 167, 167
motivation, 162
overview, 161–162
proposed algorithm, 164–165, 166
inductive transfer learning, 15
inertial measurement unit (IMU), 23
inertial route frameworks (INS), 23
information protection, 22
information technology stocks, 293–302
instance-based algorithms, 342–343, 343
Internet of Things (IoT), 19–31
applications, 30
versus cloud computing, 333, 333
complementing features with cloud computing, 333
computational methodology, 26–28
pre-processing, 27–28
regression, 26–27
sensor fusion, 28
training sets, 27
difficulties to resolve, 21–22
distribution of algorithms in computer science to data, 24–25
effect of data science on, 25
offline architecture, 29, 29
online architecture, 29–30, 30
overview, 21–22
privacy, 28–30
relationship with data, 23, 23
security, 28–29
deep learning methods for, 350–353, 352
machine learning techniques in, 346–349, 349
security issues, 30
sub-domain of data science for, 22–23
intrusion detection systems (IDS), 390
invasive ductal carcinoma (IDC), 252–253, 262
K
k-celled repetitive block, 146
kernel, 74
kernel trick, 138
k-means, 50, 51
k-means clustering, 319, 348–349
k-nearest neighbors (KNN), 50
age-frequency histograms, 211
confusion matrix, 213
in COPD disease classification, 209–210
correlation matrix, 213
in crop disease analysis, 51
in data imputation, 162
for detection of DDoS attacks, 392
in dialect identification, 369
feature correlation, 211
feature distribution, 211
hyperparameters, 119–120
in IoT security, 348
in t-distributed stochastic embedding (t-SNE), 118
knowledgebase (KB), 240–241
L
last layer activation function, 11
layer 7 attacks, 391
leaf blast, 49
leaf scald, 49
learning algorithms, 53
learning analytics, 234–235
learning management system, 235
least mean square (LMS) algorithm, 162–164
least squares (LS) method, 162
least-angle regression (LARS), 84
least-squares approach, 5
Levenberg-Marquardt (LM) back propagation, 307
Libpcap, 394
line patterns
four-attributed
node-pair generations for, 414–415
pattern detections, 415, 417, 426–427, 427, 428–431
three-attributed
node-pair generations for, 414
pattern detections, 415, 416, 425, 427, 429, 430
linear regression, 5–6
linear support vector machine, 138
LinkedIn, 410
local binary patterns histogram (LBPH), 380, 381, 382
local transition function, 146, 148
of spatially hybrid cellular automata, 150
logistic regression, 6–7
for COPD disease classification, 209
model, 442
multi-class, 7
polytomous, 7
long duration optical fiber grating (LDOPG), 37
long short-term memory (LSTM) network, 219, 224
support vector machine (SVM), 224
loop patterns
four-attributed, pattern detections, 417, 418, 427, 427, 430
three-attributed
node-pair generations for, 414, 414–415
pattern detections, 415, 416, 425, 427, 429
LSTAT, 103, 106, 110
M
machine learning (ML), 4, 4–5
in acoustic performance prediction, 267–275
algorithms, 342–346
artificial neural network, 344, 345
association rule learning, 344, 345
Bayesian, 343, 344
clustering, 343, 344
decision tree, 343, 344
instance-based, 342–343, 343
regression, 342, 342
regularization, 343, 343
and Android antivirus updates, 436
applications, 337
in breast cancer detection, 251
categories, 4–5
classification of, 338, 340–342
in crop disease detection, 50–51
in crop modeling, 72
frameworks, 53
in IoT security, 346–349, 349
in IoT-based cloud computing, 337–338
reinforcement learning, 342
semi-supervised learning, 341–342
supervised learning, 340–341, 347–348
techniques, 50–51
unsupervised learning, 348–349
MapReduce, 52
market basket analysis, 238
Markov chain Monte Carlo (MCMC) algorithm, 162
masquerade, 379
max pooling, 11
maximum likelihood classification (MLC), 74
max-min approach, 360–361
MaxPooling2D, 193
mean absolute error (MAE), 111
mean square error (MSE), 271, 272, 273, 275
mean vector, 129
MEDV, 103, 105–106, 110
Mel frequency cepstral coefficients (MFCCs), 358–359, 361, 362–365,
365, 366–367
memcached DDoS attack, 389
metaheuristics, 280–281
microcontrollers, 40–41
missing at random (MAR), 165, 167, 168
missing completely at random (MCAR), 165, 169
missing not at random (MNAR), 165, 170
MNIST database (Modified National Institute of Standards and
Technology Database), 111–112
mobile fragmentation, 434
model accuracy, 121
modified Bloom–Richardson (MBR), 251
Montylingua, 242
motor driver, 40
mTPM (modifications in TPM), 435
multicollinearity, 105
multi criteria decision aid (MCDA), 397
multi-class logistic regression, 7
multi-layer edge-attributed graph, 412
multi-layer perceptron (MLP), 191, 280, 282–284, 308, 339–340, 368–369
multimodal educational data analysis, 244–245
multimodality, 221–223
multi-relational edge-attributed graph, 411
multivariate linear regression, 109–110
N
NADAM (Nesterov-accelerated adaptive moment estimation), 179,
182–185
naïve Bayes Bernoulli (NBB), 307
naïve Bayesian algorithm, 223
in crop infection recognition system, 59
in diabetes mellitus prediction, 87
National Stock Exchange (NSE), 296
natural language generation (NLG), 240
natural language processing (NLP), 240–242, 308
NDAM, 175
n-dimensional covariance matrix, 129–130
negative learning, 15
negative reinforcement learning, 342
negative sentiment score (NSS), 307
Netflix, 334
Netflow, 391
Newton-Raphson method, 6
NIFTY 50 stocks, 296
9-axis system, 28
node-attributed graphs, 410
nonlinear activation function, 11, 12
Non-Linear Autoregressive Neural Network using Exogenous Inputs
(NARX), 307
nonlinear support vector machine, 138
non-repudiation, 378
non-social networks, 411
NumPy, 444
O
online learning, 233
optical flow sensor, 40
optimal hyper-plane, 137
outlier analysis, 238
OverFeat, 192
oversampling, 396–397
Overstock.com, 334
P
packet rate, 394–396
packet size variance, 394–396
paddy crop diseases, 47–63
classification of, 53
feature extraction, 53
overview, 47–48
types of infections, 49, 49
pair plot, 125
pairwise Pearson correlation coefficient, 297
Pandas, 444
parameter estimation, 5
passive attacks, 335, 379
PayPal, 334
Pearson correlation coefficient, 103
Perceptron, 191
Play Store, 435
Plus l-take away r selection, 360
polytomous logistic regression, 7
pooling layer, 11, 74
positive reinforcement learning, 342
positive sentiment score (PSS), 307
precision (P), 257
prediction
of acoustic performance, 267–275
of chronic obstructive pulmonary disease, 203–214
of coronavirus risk d, 217–229
of customer churn in banking sector, 317–328
of diabetes mellitus, 83–96
of stock market, 305–315
predictive analytics (PA), 3–4, 4
predictive modeling, 238–239
classification, 238
clustering, 238
outlier analysis, 238
regression analysis, 238
time series analysis, 238
principal component analysis (PCA), 124–143
2-D visualization, 135
3-D visualization, 135
captured variance and data lost, 134
classification modeling with support vector machines, 137–142
constructing feature matrix, 131
data transformation, 131
data visualization, 125–127
dataframe preparation, 134, 135
derivation of dataset, 131
eigenvalues of covariance matrix, 130
eigenvectors of covariance matrix, 130
in hyper-spectral imaging, 74
in IoT security, 34
limitations of, 142–143
mean vector, 129
n-dimensional covariance matrix, 129–130
overview, 128–129
Scikit-Learn library, 133
sorting the eigenvalues and eigenvectors, 131
splitting data into test and train sets, 135–137
standardization, 129
steps in, 128–129
stepwise, 133
training and testing data sets, 128
transposing data for usage into Python, 129
UCI breast cancer dataset, 124–125, 126
verification of library, 133
visualizations, 134–135
vs. t-SNE, 143
probabilistic neural network (PNN), 305–315
comparative analysis, 313–314
feature matrix, 310–311
flowchart, 309
future enhancement, 315
in MATLAB environment, 313
sentiment score, 309–310
statistical measure, 312–313
processing engines, 53
profound Q networks, 353
profound reinforcement learning, 353
PROMETHEE, 397–398, 402
protocol attacks, 391
push services, 28
PyCharm, 380
Python, 379
editors, 380
job applications of, 380
uses of, 379–380
Python NLTK (Natural Language Toolkit), 242
Python programming language, 428–431
R
R language, 83–84
R programming, 441
RAD, 105
random forest (RF), 74
in COPD disease classification, 209
in crop infection recognition system, 59
in diabetes mellitus prediction, 86
in dialect identification, 369
in IoT security, 348
RandomizedSearchCV, 142
readr package, 83
real-time collection, 23
real-time security, computer vision-based, 375–386
algorithms
capturing images for database, 382
face recognition, 382–383
for security, 384
for training the face recognizer, 383–384
biometrics, 376
computer vision, defined, 376
data security, 376–380
face detection, 381
face recognition, 381, 382, 384–385
future scope, 385–386
Haar cascade classifier, 381
OpenCV library, 376
overview, 375
real-time transient connection, 23
recall (R), 257
rectified linear unit (ReLU), 11, 74, 280
recurrent neural network (RNN), 307, 339; see also artificial neural
network (ANN)
in IoT security, 350
recursive least square (RLS) algorithm, 163
recursive neural networks (RNN), 234
region of interest (ROI), 251
regression, 5–7, 340–341
algorithms, 342, 342
and IoT, 26–27
linear, 5–6
logistic, 6–7
regression analysis, 238
regression matrix, 111
regression model with reduced dataset, 109–111
regularization algorithms, 343, 343
regularized auto encoders, 340
reinforcement learning, 342; see also machine learning (ML)
ReLU, 193
remote computing, 333
ResNet, 173, 174
ResNet-50, 249
ResNet50, 175, 252, 258, 261
restricted Boltzmann machine (RBM), 340, 351
risk factors, 221–223
RM, 103, 106, 110
RMSprop, 175, 176–177, 182–185
RNN, 221
root mean square error (RMSE), 111, 167
R-squared value, 111
RStudio, 125, 444
S
seed spacing, 39–44
correct spacing, 39
GIS/GPS-based, 40
system components, 39–41
electronic compass, 39
microcontrollers, 40–41
motor driver, 40
optical flow sensor, 40
self-organizing maps (SOMs), 339; see also artificial neural network
(ANN)
semi-structured data, 49, 240–242; see also big data
semi-supervised classification, 341–342
semi-supervised clustering, 341
semi-supervised learning, 341, 341–342, 351–353; see also machine
learning (ML)
sensor combination, 22–23
sensor fusion, 28
sensory information, 27–28, 28
sentiment lexicons, 240
SentiWordNet, 240
sequential backward selection (SBS), 360
sequential floating backward search (SFBS), 360–361
sequential floating forward search (SFFS), 360–361, 368
sequential forward selection (SFS), 360
SGD (Stochastic Gradient Descent), 175–176, 182–185
SGD (Stochastic Gradient Descent) with Momentum, 176, 182–185
sheath blight, 49
Sherrington–Kirkpatrick model, 340
sigmoid, 280
singular value decomposition (SVD) method, 162
skewness, 365–367
Scikit-Learn library, 133
sklearn, 444
slowloris, 391
smart farming, 68
current and future scope, 78–79
grouped dataset in, 69
SMOTE (Synthetic Minority Oversampling Technique), 393, 397,
401–404
social networks, 411–412
softmax layer, 75
software aging, 434
soil moisture and mineral content measurement, 35–39
cosmic-ray (C-RAY) sensing, 36–37
datasets, 37–39
experiment, 37–39
experimental setup, 38
long duration optical fiber grating, 37
seed spacing, 39–44
using sensor device, 37
Spark, 57–58
sparse autoencoder (SAE), 318
spatially hybrid cellular automata, 148
global transition function, 150
local transition function, 150
spectral flux, 367–368, 369
speech recognition, 358
Spyder, 380
Stanford NLP tools, 242
star pattern, four-attributed
node-pair generations for, 415
pattern detections, 418, 418
with C++ programming language, 427, 428
with Python programming language, 430, 430
stealthy DDoS attacks, 397
stochastic gradient descent (SGD), 280
stochastic Hopfield network, 340
stochastic Ising-Lenz-Little model, 340
stock market
algorithmic trading, 293–302
prediction using Twitter data and news sentiments, 305–315
comparative analysis, 313–314
feature matrix, 310–311
flowchart, 309
probabilistic neural network, 311
sentiment score, 309–310
statistical measure, 312–313
trend following strategy, 293–302
structured data, 49; see also big data
structured data analysis, 237–240
association mining, 238
challenges in, 240
correlation analysis, 237–238
predictive modeling, 238–239
supervised learning, 5, 340–341; see also machine learning (ML)
categories, 5
in hyper-spectral imaging, 74
in IoT security, 347–348
regression, 5–7
support vector machine (SVM), 8, 50, 74, 307
in churn prediction, 318
classification modeling with, 137–142
in COPD disease classification, 209
in coronavirus disease prediction, 224–225
in crop disease analysis, 51
in crop infection recognition system, 59
in diabetes mellitus prediction, 87, 88
evaluation scores, 139
in IoT security, 347
linear, 138
metrics report, 141
nonlinear, 138
obtaining highest test accuracy model, 139
optimal model hyperparameters, 142
training and evaluating models, 138
types of, 138–142
swarm intelligence, 280
synthetic sampling, 396–397
T
tanh, 280
TCPdump, 394
t-distributed stochastic embedding (t-SNE), 111–124
2-D visualization, 115, 115–116, 116
3-D visualization, 116, 117, 117
confusion matrix, 121, 121
data visualization, 112, 113
defined, 112
extracting features and target, 118
F1 score, 121
Jaccard index, 121
k-nearest neighbors, 118
MNIST handwritten digits dataset, 111–112
model accuracy, 121
probability and mathematics behind, 114–115
random sampling of large dataset, 112
training and testing data sets, 118–119
vs. PCA, 143
temporally hybrid cellular automata, 147
textual data analysis, 240–242
3-dimensional data visualization, 106–108
three-attributed line pattern
node-pair generations for, 414
pattern detections, 415, 416
with C++ programming language, 425, 427
with Python programming language, 429, 430
three-attributed loop pattern
node-pair generations for, 414, 414–415
pattern detections, 415, 416
with C++ programming language, 425, 427
with Python programming language, 429
three-tiered educational analytics framework, 233–246
description of, 235–237
future research, 246
implementation of, 245
and learning analytics, 234–235
overview, 225
scope and boundaries of, 245
semi-structured/textual analysis, 240–242
structured data analysis, 237–240
association mining, 238
challenges in, 240
correlation analysis, 237–238
predictive modeling, 238–239
unstructured data analysis, 242–245, 244
time interval variance, 394–396
time series analysis, 238
toll plaza signaling, 156–158, 158
traffic analysis, 379
training sets, 27
trajectory of an elevator, 153–154, 155
transductive transfer learning, 15
transfer learning, 14–15
aim of, 14
in breast cancer detection, 254–257
considerations in, 14–15
relationships between environments of, 16
types of, 15
Treble, 434
trend following strategy, 293–302
true negative (TN), 273
true negative rate (TNR), 257
true positive (TP), 257
Trusted Platform Module (TPM), 435
Twitter, 305–315, 334, 410
U
UDP flood, 391
Ulam, S., 145
university attributed graph, 412, 413
unstructured data, 49, 242–245, 244; see also big data
unsupervised learning, 4, 341, 348–349, 350–351; see also machine
learning (ML)
in hyper-spectral imaging, 73–74
unsupervised transfer learning, 15
V
V3 framework, 50
valence of data, 293
value of data, 293
variability of data, 293
variational autoencoders, 340
variety of data, 293
Vendor Test Suite (VTS), 434
veracity of data, 293
VGG, 174, 192
VGG16, 175
VGG19, 249, 258, 261
VGGNet, 173, 252
video content analysis, 243
Visa, 334
Visual PROMETHEE, 397–398
VNH2SP30 motor driver, 40
volume of data, 293
volumetric attacks, 391
von Neumann, John, 145
W
Walgreens, 334
wheat rust disease, 191
wheat rust disease detection, 191–201
classification report, 200, 201
confusion matrix, 200
dataset preparation, 193
dropout rates, 201
grayscale images, 195, 198
image preprocessing, 193
image segmentation, 194–195, 195
learning rate, 197–201, 198
overview, 191
proposed model, 193, 194
RGB images, 195–201, 199
studies on, 192
WIDE backbone network, 393
Wi-Fi, 334
wireless sensor network (WSN), 163
Wireshark, 394
Wolfram, Stephen, 145
WordNet, 240
Y
Yahoo Finance, 306
YARN, 52
Z
zero crossing rate (ZCR), 361–362, 363–364