Deep Learning Applications in Image Analysis
Studies in Big Data
Volume 129
Series Editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Big Data” (SBD) publishes new developments and advances
in the various areas of Big Data, quickly and with high quality. The intent is to
cover the theory, research, development, and applications of Big Data, as embedded
in the fields of engineering, computer science, physics, economics and life sciences.
The books of the series refer to the analysis and understanding of large, complex,
and/or distributed data sets generated from recent digital sources coming from
sensors or other physical instruments as well as simulations, crowdsourcing, social
networks or other internet transactions, such as emails or video click streams, and
others. The series contains monographs, lecture notes and edited volumes in Big
Data spanning the areas of computational intelligence including neural networks,
evolutionary computation, soft computing, fuzzy systems, as well as artificial
intelligence, data mining, modern statistics and operations research, as well as
self-organizing systems. Of particular value to both the contributors and the
readership are the short publication timeframe and the world-wide distribution,
which enable both wide and rapid dissemination of research output.
The books of this series are reviewed in a single blind peer review process.
Indexed by SCOPUS, EI Compendex, SCIMAGO and zbMATH.
All books published in the series are submitted for consideration in Web of Science.
Sanjiban Sekhar Roy · Ching-Hsien Hsu ·
Venkateshwara Kagita
Editors
Venkateshwara Kagita
Department of Computer Science
and Engineering
National Institute of Technology Warangal
Warangal, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
This book is dedicated to my mother
“Papri Roy”
–Sanjiban Sekhar Roy
Preface
The intention of compiling this book is to give readers a clear picture of both the
theory and the practice related to the above-mentioned applications by showcasing
uses of deep learning. We hope that readers will benefit significantly from learning
about the state of the art of deep learning applications in the domain of imagery.
Keep reading, learning, and inquiring.
Sanjiban Sekhar Roy is currently a Professor with the School of Computer Science
and Engineering, Vellore Institute of Technology. He received the Ph.D. degree from
the Vellore Institute of Technology, Vellore, India, in 2016. He has edited a handful of
special issues for journals and published numerous articles in SCI high-impact journals
such as IEEE Transactions on Computational Social Systems; Scientific Reports,
Nature; Computers and Electrical Engineering, Elsevier; and many other reputed
journals. Dr. Roy has published nine books with reputed international publishers such
as Springer, Elsevier, and IGI Global. His research interests are deep learning and
advanced machine learning. Dr. Roy was a recipient of the "Diploma of Excellence"
Award for academic research from the Ministry of National Education, Romania.
He was also an Associate Researcher with Ton Duc Thang University, Ho Chi Minh
City, Vietnam, from 2019 to 2020.
times talent awards from the Ministry of Science and Technology and the Ministry of
Education, and nine times the distinguished award for excellence in research from Chung
Hua University, Taiwan. Prof. Hsu is president of the Taiwan Association of Cloud
Computing, Chair of the IEEE Technical Committee on Cloud Computing (TCCLD),
Fellow of the IET (IEE), and senior member of the IEEE.
Tanzina Akter Tani, Mir Moynuddin Ahmed Shibly, Md. Shoumique Hasan,
Nilofa Yeasmin, and Shamim Ripon
1 Introduction
Handwritten character recognition has been an area of interest among deep learning
researchers and practitioners in recent years. Owing to its wide range of possible
applications, a significant number of studies have been carried out on handwritten
text and character recognition in different languages, such as English [1],
Japanese [2], and Latin [3]. Bangla is the first and official language of Bangladesh, and
it is the fourth most widely spoken language in the world, with almost 300 million speakers
[4]. Considering this large number of native users, handwritten character recognition
of the Bangla language plays a very important role in a wide range of applications,
including bank cheque processing, identifying postal codes, zip code scanning, interpreting
national ID numbers, Bangla optical character recognition (OCR), and many
more [5, 6].
In the Bangla language, there are 11 vowels, 39 consonants, and a considerable
number of vowel diacritics, consonant conjuncts and diacritics, and other digits,
symbols, and punctuation marks. Recognizing handwritten Bangla characters is
difficult and complicated for a number of reasons: (a) there are a lot of compound
characters in the Bangla alphabet, (b) the forms of certain characters are identical,
and (c) because different people write in different ways, the same character written by
different people will have different forms, sizes, and curvatures.
To overcome these problems, several efforts have been made to improve recognition
accuracy. Convolutional neural networks (CNN) [4, 7–9], deep CNNs [10], and
ensemble learning methods [11, 12] have been applied in recent years. However, the
scarcity of Bangla datasets and the class imbalance in those datasets remain a barrier
to the recognition problem. Ensemble methods and image augmentation are among
the many ways to overcome this issue. The generative adversarial network (GAN),
introduced in [13], is another way to produce new instances of data. The presence of
outliers in the dataset can also make recognition a difficult task, as they mislead
the training of the models. So, by eliminating outliers, statistically meaningful results
can be obtained.
In Bangla handwriting-related studies, researchers have used different classifica-
tion approaches. The authors in [14] suggested a hierarchical method for segmenting
characters from sentences, with multilayer perceptron (MLP) as the classification
algorithm, whereas an MLP, RBF network, and SVM fusion classifier is suggested
in [15]. In [16], Bangla handwriting images are classified into 50 groups by using a
multilayer perceptron neural network.
Deep learning methods such as convolutional neural network (CNN)-based architectures
have been used in the majority of recent works. Some of these works are
limited to simple characters [17], while others concentrate on handwritten digits
[18, 19]. Additionally, work has been done on a subset of the compound characters
of the Bangla language [20]. One of the major issues in Bangla handwritten
character recognition is the limited availability of a complete handwritten character
dataset. Generating Bangla handwritten characters is one way to solve this problem.
The deep convolutional generative adversarial network (DCGAN) [21] has been used
by some researchers to generate Bangla handwritten digits [22, 23]. However, there
has not been much work focused on generating complicated curved characters and
classifying Bangla handwritten characters using them.
Deep neural networks are widely used for analyzing and classifying
different types of images [24–27]. The residual network (ResNet) is one of the prominent
neural network-based architectures that has long been used for image classification
and identification with excellent results. For example, researchers have
used transfer learning with ResNet-50 for malaria cell-image classification [28]
and for malicious software classification [29]. Residual networks have also been applied
in some Bangla handwritten character recognition studies [30–32].
This chapter proposes a two-fold approach with a residual network classifier
to classify Bangla handwritten characters. First, a model is built using a ResNet
variant, ResNet-50, to classify the target dataset, in this case the Ekush [33] dataset.
The dataset is then stabilized by removing outliers with an autoencoder, and the
classification is performed again with the same ResNet-50 model. Finally, the classes with
fewer images are augmented with additional images generated by a DCGAN,
so that the number of images across classes becomes balanced, and this dataset is
again classified with the ResNet-50 model. In the end, a detailed comparative analysis
is conducted over the results obtained from the above-mentioned experiments,
measuring the strengths of the adopted methods.
The rest of the chapter is structured as follows. Section 2 covers a detailed review
of the state of the art in Bangla handwritten character recognition. The methods and
materials of this study and an elaborate result analysis with discussion are presented in
Sects. 3, 4, and 5. The chapter ends with a conclusion.
2 Related Work
The researchers in [4] have introduced a CNN model named EkushNet, which has
generated satisfactory results on the Ekush [33] and CMATERdb [34] datasets. The
authors have mentioned that their EkushNet model has performed extremely well
and generated the best results on Bangla character recognition relative to prior work.
Their proposed model achieved 96.90% accuracy on the training set and 97.73% on
the validation set of the Ekush dataset after 50 iterations. The authors also applied
cross-validation on the CMATERdb dataset and found that their EkushNet model is
95.01% accurate. Another research work [7] has applied only a CNN model to Bangla
handwritten character identification, and the proposed model obtained 85.96%
accuracy on the test dataset, whereas the authors in [10] achieved 95% accuracy by
using a deep CNN model. Both works have used the 50 alphabet classes of the Ekush
dataset. Another study [20] has achieved 95.05% accuracy on 122 classes of the
Ekush dataset using their implemented DCNN model. The authors have also experimented
on two other databases, CMATERdb and the BanglaLekha-Isolated dataset [35].
The authors in [36] have shown an excellent accuracy of 98.78% on the CMATERdb
dataset. The researchers have used five different approaches for classification.
The authors of [11] have found that an ensembled convolutional neural network
system outperforms a single CNN model when it comes to recognizing Bangla
handwriting. They have proposed a stacked generalization ensemble framework
consisting of six CNN models. Their research has reached 96.72% accuracy on the
test set, achieved after only 40 epochs. Another study [37] has applied three approaches:
first, seven CNN models have been applied to recognize Bangla handwritten
characters. Then the best-performing model, ResNet-50, which has given 97.81%
accuracy, has been used for feature extraction, and classification is done by
traditional classification algorithms. In the last step, the authors employed different
ensemble techniques for the classification task. The stacked generalization ensemble
method has achieved 98.68% test accuracy, which is the best result among all the
adopted methods. All the experiments of this study have been done on the Ekush and
BanglaLekha-Isolated datasets.
The authors of another study [38] experimented with six CNN models and evaluated
which DCNN model produces the best performance using the CMATERdb [34]
dataset. The results have shown that all the DCNN models worked well,
but the DenseNet model outperformed the others. They have also pointed out
that the DCNN framework works better than other object recognition methods.
Another work [17] has shown that data augmentation can improve handwritten
character identification accuracy. The authors have tested their algorithms on the
alphabets of the BanglaLekha-Isolated dataset and found their model to be 91.81% accurate
without data-augmented images and 95.25% accurate with data-augmented images.
They have also compared other machine learning approaches to find out the efficiency
of these methods. The comparative analysis has revealed that CNN
outperforms SVM and LSTM with or without data augmentation. They have also
put their proposed approaches to the test on other datasets with similar characteristics.
The experiment has demonstrated 95.07% test accuracy on the 59 classes of the
Ekush dataset.
The performance of the classifier can be enhanced by enlarging the dataset size.
GAN as a data augmentation technique can help to expand the dataset [23]. In [22],
the authors have proposed a DCGAN architecture that successfully enlarged four
Bangla handwritten character datasets. For their proposed work, the writers have
focused only on the digit datasets. However, they have not attempted to evaluate
CNN model performance on the generated data. Another study [23] has demonstrated
that adding GAN-generated images can improve the performance of a classifier on
handwritten datasets. The proposed method has been successful in increasing the accuracy on
the MNIST dataset by using the GAN approach. They have also used GAN on three
Indian numerical handwritten datasets: Bangla, Devanagari, and Oriya. The accuracy
on all the datasets has been improved. However, the results of the proposed work have
shown that combining too many GAN-generated images with the real dataset might
degrade efficiency. Another digit recognition and generation work by [39] has
proposed a network architecture and achieved 99.44% accuracy on the BHAND [40] dataset.
After that, the study applied a semi-supervised GAN (SGAN) to generate Bangla digits.
One more GAN-related work [41] has proposed a conditional GAN-based method
for generating character images based on class. Their study has used three separate
Bangla handwritten character datasets and has been able to achieve very realistic
images after 1500 epochs. However, they did not apply any classification with the
generated images.
The literature review reveals that most of the research has been done with either
CNN or deep CNN models. Apart from performing classification, there have been
very few variations of approach in Bangla handwritten character recognition work.
Only a few studies have combined the GAN method with classifiers.
Furthermore, none of the studies has used outlier detection as part of their study.
So, there is a knowledge gap, identified in this literature review, concerning outlier
identification and elimination. In this work, both approaches are explored to enhance the
recognition performance of Bangla handwritten characters.
3.1 Dataset
BanglaLekha-Isolated [35], ISI [42], NumtaDB [43], CMATERdb [34], and Ekush
[33] are a few datasets that contain Bangla handwritten characters and numerals.
The Ekush dataset is selected in this study for experimental purposes because it contains
more classes than any other Bangla handwritten dataset. The Ekush dataset consists
of basic and compound characters, numerals, and modifiers. Its 122 classes of
characters comprise 10 modifiers, 11 vowels, 39 consonants, 52 widely used compound
letters, and 10 numeral digits, and the dataset contains about 729,750 images. A few
images from the dataset are shown in Fig. 2. The images are grayscale with a size
of 28 × 28 pixels.
Outliers increase the uncertainty of the results, lowering statistical power. Therefore,
removing outliers can lead to statistically significant results. In the Ekush dataset,
Bangla handwritten characters are categorized into 122 groups. While grouping individual
characters into their respective classes, some characters were placed into the wrong
classes. As a result, a few characters from various classes are mixed up [37]. It
has already been mentioned that some Bangla handwritten characters bear a striking
resemblance to one another. During the pre-processing of the dataset, it has been
discovered that classes 87 and 97, classes 19 and 84, and classes 69, 76, 110, and 111 all
contain outliers of one another due to their resemblance. Since the character instances
are anomalous only in their specific context, they are termed contextual outliers. To
locate outliers in the individual groups, a semi-supervised outlier detection approach
using autoencoders has been applied.
In Fig. 3, the process of outlier detection and preparation of a cleaner dataset is
presented. Initially, the images of the 122 classes have been analyzed, and eight classes
that potentially contain more outliers have been identified. From each class that
contains more than 3000 images, 1000 inlier images have been selected, and the
training sets for the autoencoder-based outlier detection models have been prepared.
The number of images in the training sets was 500 for the classes with fewer than
3000 images. No outlier image has been fed to the outlier detection model during
training. After training, the rest of the images have been tested using the model, and
the outliers have been identified. The inlier images, along with the previously selected
pure training set, have been used to develop robust classifiers.
3.2.1 Autoencoder
Autoencoders are special neural networks that learn features of complex data in lower
dimensions from unlabeled data [44] and then try to reconstruct the original complex
input from the simpler encoded features. This type of neural network has been proven
to perform well in numerous fields, such as generative models, classification, clustering,
recommender systems, and dimensionality reduction [45], but in this
work it has been used as an outlier detection model.
A convolutional autoencoder, depicted in Fig. 4, has been used for detecting outliers
from the Bangla handwritten character dataset. The network consists of three major
components: an encoder network, a bottleneck layer, and a decoder network. The
encoder network starts with an input layer. After that, there are three convolutional
layers; the output of the last such layer is flattened and passed through a dense layer,
which produces a vector containing features in a lower dimension. This is also known
as the bottleneck, which is followed by a decoder network. The job of the decoder is to
reproduce the input as closely as possible to the original. Convolutional transpose layers
have been used, which perform the inverse operation of typical convolutional
layers.
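To make the structure concrete, a minimal Keras sketch of such a convolutional autoencoder is given below. The chapter does not report the exact filter counts, kernel sizes, or bottleneck width, so the values used here (32/16/8 filters, 3 × 3 kernels, a 64-dimensional bottleneck) are illustrative assumptions rather than the authors' configuration; pixel values are assumed to be scaled to [0, 1].

from tensorflow.keras import layers, Model

def build_autoencoder(input_shape=(28, 28, 1), bottleneck_dim=64):
    # Encoder: an input layer followed by three convolutional layers.
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)  # 28x28 -> 14x14
    x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(x)       # 14x14 -> 7x7
    x = layers.Conv2D(8, 3, padding="same", activation="relu")(x)                   # 7x7x8
    x = layers.Flatten()(x)                                                         # 7*7*8 = 392 features
    bottleneck = layers.Dense(bottleneck_dim, activation="relu")(x)                 # low-dimensional code

    # Decoder: rebuild the image from the bottleneck with transposed convolutions.
    y = layers.Dense(7 * 7 * 8, activation="relu")(bottleneck)
    y = layers.Reshape((7, 7, 8))(y)
    y = layers.Conv2DTranspose(16, 3, padding="same", activation="relu")(y)                     # 7x7
    y = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(y)          # 7x7 -> 14x14
    outputs = layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid")(y)  # -> 28x28

    model = Model(inputs, outputs, name="conv_autoencoder")
    model.compile(optimizer="adam", loss="mse")   # reconstruction error measured as mean squared error
    return model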
The autoencoder network has been trained with only a few inlier images. The
intuition behind using only inlier images is to train the model to be familiar with
what is normal, so that during testing the model reconstructs the outlier images poorly
and the reconstruction error becomes high. The images with reconstruction errors higher
than a specific threshold have then been labeled as outliers and have been discarded
from the dataset. The reconstruction error is calculated using the mean squared error.
The Ekush Bangla handwritten dataset contains several imbalanced classes. Data
augmentation can be a way of generating additional images in order to balance
a dataset. Data augmentation approaches such as rotation and scaling can expand
a dataset but do not always add information. A generative adversarial network
(GAN), on the other hand, can generate synthetic images that bring additional
information to the dataset. We have chosen a deep convolutional generative adversarial
network (DCGAN), as it is the most effective architecture for improving
classification and identification [46]. We have taken only five classes from the Ekush
dataset, as these classes have far fewer images than the others. For the classes that were
also part of the outlier-removal step, the outlier-removed versions are used as input data
to the proposed GAN model. Table 1 shows the classes that have been used in DCGAN.
The generative adversarial network is a method for creating new synthetic data
that consists of two models: a generator and a discriminator. The generator attempts to
create a new image from random noise and feeds it into the discriminator model,
which determines whether the image is fake or real. If the discriminator determines
it to be fake, the generator attempts again to create a new image to deceive the
discriminator. The contest between these two models continues until the generator
becomes powerful enough to create synthetic images that the discriminator model
is unable to differentiate from real ones. A general view of GAN is shown in Fig. 5.
Table 1 Classes of the Ekush dataset used in DCGAN
Class   Number of images
76      4261
97      4100
110     2012
111     986

Though a few experimental settings have been altered, we have adopted the DCGAN
model shown in [20], as that approach achieved good results in generating Ekush
dataset images. The DCGAN architecture is defined briefly in the following section.
A CNN is used for both the discriminator and the generator in the DCGAN architecture.
Before being passed to the DCGAN model, the images have been prepared by converting
them all into a single channel and resizing them all to 28 × 28 pixels. After that,
all the images are normalized to the range [−1, 1]. The GAN model has been run
separately for each of the chosen classes.
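A small sketch of this preprocessing step is shown below, assuming the images arrive as arrays of 8-bit pixel values; the use of TensorFlow utilities here is an implementation choice, not something specified in the chapter.

import tensorflow as tf

def prepare_for_dcgan(images):
    # images: array of shape (N, H, W) or (N, H, W, C) with pixel values in [0, 255].
    x = tf.convert_to_tensor(images, dtype=tf.float32)
    if len(x.shape) == 3:                       # add a channel axis if it is missing
        x = tf.expand_dims(x, axis=-1)
    if x.shape[-1] == 3:                        # convert RGB images to a single channel
        x = tf.image.rgb_to_grayscale(x)
    x = tf.image.resize(x, (28, 28))            # resize every image to 28 x 28 pixels
    x = (x / 127.5) - 1.0                       # normalize pixel values to the [-1, 1] range
    return x.numpy()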
In a GAN, the generator model is used to create new images from a random variable.
A random noise vector of size 100 is given to the generator of our model. This is
forwarded to a dense layer with 1024 hidden units. To keep the GAN
model stable, batch normalization is used in both the generator and the discriminator
[21]. The ReLU activation function is used in all layers except the output layer, where
the Tanh activation is used. The Tanh activation function produces pixel values
in the [−1, 1] range that are later used as the discriminator input [23]. Then, another
dense layer having 6272 neurons is used. After the following batch normalization
and ReLU activation, the output is reshaped. Upsampling of the data in the generator
model is required to generate a new output image. Two convolutional layers have been
used, where the first layer consists of 64 filters with a kernel size of 3, and the second
layer consists of one filter with a kernel size of 3. In both layers, zero padding has
been applied. A 2D upsampling layer is used just before each convolutional layer. The
architecture of the generator model is given in Fig. 6.
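The generator just described can be sketched in Keras as follows. The reshape target of 7 × 7 × 128 (which matches the 6272-unit dense layer) and the placement of batch normalization after the intermediate convolution are assumptions; the layer sizes otherwise follow the description above.

from tensorflow.keras import layers, Sequential

def build_generator(noise_dim=100):
    model = Sequential(name="dcgan_generator")
    model.add(layers.Input(shape=(noise_dim,)))          # 100-dimensional random noise
    # Dense layer with 1024 hidden units, followed by batch normalization and ReLU.
    model.add(layers.Dense(1024))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    # Dense layer with 6272 neurons; 6272 = 7 * 7 * 128, so the output is reshaped
    # into a 7 x 7 feature map with 128 channels (the channel split is an assumption).
    model.add(layers.Dense(6272))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.Reshape((7, 7, 128)))
    # Upsampling, then a 64-filter convolution with kernel size 3 and zero padding.
    model.add(layers.UpSampling2D())                      # 7x7 -> 14x14
    model.add(layers.Conv2D(64, kernel_size=3, padding="same"))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    # Upsampling, then a single-filter convolution; Tanh keeps pixel values in [-1, 1].
    model.add(layers.UpSampling2D())                      # 14x14 -> 28x28
    model.add(layers.Conv2D(1, kernel_size=3, padding="same"))
    model.add(layers.Activation("tanh"))
    return model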
In the discriminator model, two convolutional layers have been used. With 64 filters,
a kernel size of 5, a stride of 2, and 'same' padding, the first convolutional layer
receives an input of dimension 28 × 28 × 1. The same kernel size, stride, and padding
are used in the second convolutional layer with 128 filters.
Then the outputs are flattened and transferred to a dense layer with 256 hidden
neurons. The LeakyReLU activation function with α = 0.2 has been used
in every layer of the discriminator model, as it helps the GAN model perform well [21].
The alpha (α) parameter is the leakiness of the LeakyReLU activation
function; it controls negative inputs and allows negative values to pass through
the network, which prevents units from dying. After that, a 25% dropout has been
used to keep the discriminator model from overfitting. Finally, a single output unit
with sigmoid activation has been used in a dense layer. The architecture of the
discriminator model is given in Fig. 7.
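A corresponding Keras sketch of the discriminator is given below. The chapter does not detail where batch normalization sits in the discriminator, so it is omitted here; everything else follows the layer description above.

from tensorflow.keras import layers, Sequential

def build_discriminator(input_shape=(28, 28, 1)):
    model = Sequential(name="dcgan_discriminator")
    model.add(layers.Input(shape=input_shape))
    # First convolution: 64 filters, kernel size 5, stride 2, 'same' padding.
    model.add(layers.Conv2D(64, kernel_size=5, strides=2, padding="same"))
    model.add(layers.LeakyReLU(0.2))
    # Second convolution: 128 filters with the same kernel size, stride, and padding.
    model.add(layers.Conv2D(128, kernel_size=5, strides=2, padding="same"))
    model.add(layers.LeakyReLU(0.2))
    # Flatten and pass through a dense layer with 256 hidden neurons.
    model.add(layers.Flatten())
    model.add(layers.Dense(256))
    model.add(layers.LeakyReLU(0.2))
    # 25% dropout to keep the discriminator from overfitting.
    model.add(layers.Dropout(0.25))
    # Single sigmoid output: the probability that the input image is real.
    model.add(layers.Dense(1, activation="sigmoid"))
    return model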
Following the research in [21], we have used the Adam optimizer in our DCGAN
model. Although another study [47] has used the Adam optimizer with a learning rate
of 1 × 10−5 and a β1 momentum of 0.1, we have changed the learning rate to
1.5 × 10−4 and the momentum term to β1 = 0.5 in both the discriminator and the generator,
as we have found that these parameter values help to stabilize the
training. The β1 momentum controls the decay of the running average of
the gradient, which is exponentially multiplied by itself at the end of each batch step
[48]. Binary cross-entropy has been employed to measure the loss of the discriminator
and the generator. We have used two separate batch sizes: a batch size of 64 for classes
110 and 111, because the number of real images is limited, and a batch size of 128 for the
other three classes. For these groups, the model has been trained for 2000 epochs,
except for class 111, which has been trained for 4000 epochs. The explanation for the
higher number of epochs for class 111 is that the amount of training data in this class
is very scarce, which prevents the generator from generating quality synthetic images
in the early epochs. Every 50 epochs, we saved and inspected the produced
images. We have taken images for these five classes at various epochs and identified
the epoch where the quality of the synthetic images is good enough compared to the
actual images. We have taken a fixed number of images for each of these classes so
that the classification model is trained with at least 4000 images. Table 2 shows the
total number of images that are added to the actual training dataset.
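The optimizer and loss settings reported above can be wired together roughly as follows; the two-step pattern of compiling the discriminator separately and then freezing it inside a combined generator-discriminator model is a common DCGAN training arrangement and is assumed here rather than taken from the chapter.

from tensorflow.keras import Sequential
from tensorflow.keras.optimizers import Adam

def compile_dcgan(generator, discriminator):
    # Adam with learning rate 1.5e-4 and beta_1 = 0.5, as reported above, for both models.
    d_opt = Adam(learning_rate=1.5e-4, beta_1=0.5)
    g_opt = Adam(learning_rate=1.5e-4, beta_1=0.5)

    # The discriminator is trained directly with binary cross-entropy.
    discriminator.compile(optimizer=d_opt, loss="binary_crossentropy")

    # The combined model trains the generator against a frozen discriminator.
    discriminator.trainable = False
    gan = Sequential([generator, discriminator], name="dcgan")
    gan.compile(optimizer=g_opt, loss="binary_crossentropy")
    return gan

Training would then alternate between a discriminator update on a batch of real and generated images (of size 64 or 128, depending on the class) and a generator update through the combined model, for 2000 or 4000 epochs as described above.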
3.4 Classification
Before applying the classification model, all the images have been resized to 28 ×
28 pixels in grayscale mode. We have used ResNet-50 to classify the 122 classes
of the Ekush dataset. As the name implies, the model consists of 50 layers. A brief
description of ResNet-50 is given in the following section.
3.4.1 ResNet-50
Identity and convolutional blocks are the two different blocks used in the ResNet-50
architecture, depending on the dimensions of the input and output. Both blocks have a
skip connection over the main path, which helps the model learn an identity function.
The identity function allows layers that do not add value to the accuracy to be skipped
[49]. In the identity block, there are three Conv2D layers with stride (1, 1) and random
initialization with a zero seed. Only the second Conv2D has padding. Batch normalization
and ReLU activation follow each Conv2D, except that the shortcut is added before
the final ReLU activation. In the convolutional block, the skip connection has a Conv2D
layer and batch normalization that the identity block does not have. Except for this,
the structure is almost the same as the identity block. The first convolutional layer and
the convolutional layer on the shortcut path have a stride of (s, s), and the rest have a
stride of (1, 1).
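The two blocks can be sketched with the Keras functional API as below. The chapter names the stride pattern, the padding on the middle convolution, and the zero-seed initialization; the 1-3-1 kernel sizes, the Glorot uniform initializer, and the bottleneck-style filter tuples are assumptions based on the standard ResNet-50 design.

from tensorflow.keras import layers
from tensorflow.keras.initializers import GlorotUniform

def identity_block(x, filters):
    # filters is a (f1, f2, f3) tuple; f3 must match the channel count of x so the shortcut can be added.
    f1, f2, f3 = filters
    init = GlorotUniform(seed=0)               # "zero seed" random initialization (assumed Glorot uniform)
    shortcut = x
    y = layers.Conv2D(f1, 1, strides=1, kernel_initializer=init)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(f2, 3, strides=1, padding="same", kernel_initializer=init)(y)   # only this conv is padded
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(f3, 1, strides=1, kernel_initializer=init)(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])            # the shortcut is added before the final ReLU
    return layers.Activation("relu")(y)

def conv_block(x, filters, s=2):
    # Same main path as the identity block, but the first conv and the shortcut conv use stride (s, s).
    f1, f2, f3 = filters
    init = GlorotUniform(seed=0)
    shortcut = layers.Conv2D(f3, 1, strides=s, kernel_initializer=init)(x)
    shortcut = layers.BatchNormalization()(shortcut)
    y = layers.Conv2D(f1, 1, strides=s, kernel_initializer=init)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(f2, 3, strides=1, padding="same", kernel_initializer=init)(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(f3, 1, strides=1, kernel_initializer=init)(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)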
The ResNet-50 architecture has five stages. Before entering these stages, the
dataset image dimension of 28 × 28 × 1 is given as the input shape to the ResNet-50
architecture. The first stage of ResNet-50 has a 7 × 7 convolutional layer
with 32 filters and (1, 1) strides. Right after that, batch normalization and a 3 × 3
MaxPooling layer are used. The remaining stages of ResNet-50 have two, three, five,
and two identity blocks, respectively, each preceded by a convolutional block.
After the five stages, there is an average pooling layer with (2, 2) strides, which is used
to reduce the output. Finally, a SoftMax activation is used with a fully connected dense
layer to map to the 122 output classes. The diagram of the ResNet-50 architecture is
given in Fig. 8.
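Using the identity_block and conv_block helpers sketched above, the overall network could be assembled roughly as follows. The per-stage filter counts and the stride pattern chosen for the small 28 × 28 input are assumptions; the stage layout (a convolutional block followed by two, three, five, and two identity blocks) and the 32-filter 7 × 7 first stage follow the description above.

from tensorflow.keras import layers, Model

def build_resnet50_classifier(input_shape=(28, 28, 1), num_classes=122):
    inputs = layers.Input(shape=input_shape)
    # Stage 1: 7x7 convolution with 32 filters and stride (1, 1),
    # followed by batch normalization and 3x3 max pooling.
    x = layers.Conv2D(32, 7, strides=1, padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)   # 28x28 -> 14x14

    # Stages 2-5: one convolutional block followed by 2, 3, 5, and 2 identity blocks.
    stage_filters  = [(32, 32, 128), (64, 64, 256), (128, 128, 512), (256, 256, 1024)]  # assumed values
    identity_count = [2, 3, 5, 2]
    strides        = [1, 2, 2, 2]    # keep the spatial size in stage 2 because the input is small
    for filters, n_id, s in zip(stage_filters, identity_count, strides):
        x = conv_block(x, filters, s=s)
        for _ in range(n_id):
            x = identity_block(x, filters)

    # Average pooling with (2, 2) strides, then a softmax dense layer for the 122 classes.
    x = layers.AveragePooling2D(pool_size=2, strides=2, padding="same")(x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs, name="resnet50_bangla")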
To train the ResNet-50 models, the Adam optimizer with the default learning rate
of 0.001 has been used, and categorical cross-entropy has been utilized as the loss
function. A batch size of 1024 provided the best result in the [11] study; following
that study, the batch size has been set to 1024. Furthermore,
100 epochs have been used to train all the approaches.
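These training settings correspond to a compile-and-fit call along the following lines; the accuracy metric and the assumption that the labels are one-hot encoded (as categorical cross-entropy requires) are the only details not stated explicitly above.

from tensorflow.keras.optimizers import Adam

def train_classifier(model, x_train, y_train, x_val, y_val):
    # Adam with the default learning rate of 0.001 and categorical cross-entropy loss;
    # labels are assumed to be one-hot encoded over the 122 classes.
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # Batch size 1024 and 100 epochs, as used for all three approaches.
    return model.fit(x_train, y_train,
                     validation_data=(x_val, y_val),
                     batch_size=1024, epochs=100)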
Three different datasets have been created after outlier removal and DCGAN image
generation. The images in each approach have been split such that 70% of the images
are in the training set, 20% are in the test set, and 10% are in the validation set. The
DCGAN-generated images are used only to balance the training dataset after the split,
in order to avoid bias. The total numbers of images and of training, test, and validation
images are presented in Table 3.
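A sketch of this splitting-and-balancing procedure is given below. The stratified split and the fixed random seed are assumptions; the 70/20/10 proportions and the rule that DCGAN-generated images are appended to the training set only come from the description above.

import numpy as np
from sklearn.model_selection import train_test_split

def split_then_balance(images, labels, gan_images=None, gan_labels=None, seed=42):
    # 70% train, 20% test, 10% validation.
    x_train, x_rest, y_train, y_rest = train_test_split(
        images, labels, test_size=0.30, stratify=labels, random_state=seed)
    x_test, x_val, y_test, y_val = train_test_split(
        x_rest, y_rest, test_size=1 / 3, stratify=y_rest, random_state=seed)

    # DCGAN-generated images are appended to the training set only, after the split,
    # so the validation and test sets contain real images exclusively.
    if gan_images is not None:
        x_train = np.concatenate([x_train, gan_images])
        y_train = np.concatenate([y_train, gan_labels])
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)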
4 Results
The Ekush dataset has been classified using the ResNet-50 framework on three
different datasets (the original dataset, the outlier-removed dataset, and the
outlier-removed and DCGAN-augmented dataset). The ResNet-50 model has achieved 97.63% test
accuracy on the original dataset consisting of 122 classes. The second approach,
where the outliers are removed from seven classes, has achieved 97.95% test accuracy.
The final approach, where outliers are removed from the original dataset
and DCGAN-generated images are used to balance the original training dataset, has
achieved 97.92% test accuracy. The precision, recall, F1-score, and accuracy yielded
by the ResNet-50 model on the three approaches are shown in Table 4. It illustrates
that the ResNet-50 models trained on both the outlier-removed dataset and the dataset
balanced with DCGAN-generated images have outperformed the model trained on
the original dataset.
The model accuracy has improved from 97.63% to 97.95% after outliers are removed
from seven classes of the dataset, demonstrating the benefit of outlier removal.
When assessing changes in individual classes, the same trend of improvement can
be observed. In Table 5, the precision for classes 76, 87, and 97 increased by 4%,
1%, and 5%, respectively, suggesting that the performance has improved for these
three classes compared to the original dataset. In classes 84 and 111, the recall has
improved by 1% and 6% on the outlier-removed dataset compared to the original
dataset, which also indicates the improvement of the classifier. For four classes (76,
84, 97, and 111) among the seven discussed classes, the increased F1-score compared
to the original dataset indicates that the images have been classified better than in the
original dataset. The remaining three classes (19, 69, and 87) have not seen any
change in F1-score. However, a few classes have experienced performance drops in
terms of precision and recall even after removing the outliers. The reason is that,
even though the outliers are eliminated from those classes, some noise is still present.
Another explanation is that eliminating the outliers also removes some of the original
images from these classes, resulting in a dataset that is less balanced than the original.
But these drops can be ignored, as they are negligible. Removing outliers from
specific classes has also reduced the number of false positives and false negatives
for classes other than the ones discussed. This, along with the improved performance
in these specific classes, has been the key ingredient in achieving an overall better
performance. So, the cumulative results justify that the classifier performs well as a
result of excluding outliers from the original dataset.
The improvement in classification performance shows the effectiveness of the
autoencoder-based outlier detection model on Bangla handwritten character images.
In Table 6, the numbers of discarded images from the chosen seven classes are shown.
The greatest number of outliers has been removed from class 19, whereas the smallest
number has been removed from class 111. There is also a correlation between the
number of outliers removed and the original size of the class: the more images a
class has, the more outliers have been removed.
In Fig. 9, a few representative inlier and outlier images from class 69 that are
detected by the model are presented. By looking at the images, one can identify that
the images in the right part of the figure are anomalous, while the images on the
left side are inliers. However, there are cases where outliers have not been accurately
detected, and inliers have been wrongly identified as outliers. Despite this, the overall
outlier detection scheme has been successful, as it has improved the ResNet-50 model
performance.
Figure 10 also justifies the efficiency of the outlier detection model. The inlier
images of class 19 have been divided into four batches, and all the images of each
batch have been superimposed into a single image. Each batch consists of approximately
1500 images. In contrast, the 272 outlier images detected by the model have also
been superimposed into a single image. It is apparent from the figure that the superimposed
inlier images tend to hold the inherent shape of the character even with
1500 images. On the other hand, only 272 outliers have made the corresponding
superimposed image all jumbled up, which further validates the efficiency of the
outlier detector.
Apart from the existence of outliers in the Ekush dataset, there is also class imbalance
in it. Five such imbalanced classes (72, 76, 97, 110, and 111) have been selected, and
their training sets have been balanced using DCGAN-generated images. No
generated image has been added to either the validation or the test set. The test accuracy of
ResNet-50 has improved from 97.63% to 97.92% after adding synthesized images
to the training set. Moreover, Table 7 shows that almost all the evaluation metrics are
improved through the use of DCGAN-generated images; class 111 in particular has
improved exceptionally. Only the precision of class 72 with DCGAN and outlier-removed
images has dropped, by 2% from the original dataset. This means that when the
classifier predicts that images are from class 72, it is correct less often than on the original
dataset. The reason for the decrease in precision for class 72 may be noisy images
generated in the DCGAN experiment. But this is a very small drop, and
the corresponding recall has increased, which means the classifier can more correctly identify
the images of this class than with the original dataset. There are three classes to
which both the outlier detection model and DCGAN have been applied. For all three
classes, ResNet-50 with DCGAN-generated images has outperformed the ResNet-50
trained on the outlier-removed dataset. The reason for this improvement is that
the DCGAN model has been trained on those individual classes after the removal of
outliers, which has produced good-quality images. The overall performance justifies
that using the proposed DCGAN-generated images with the real dataset can improve
the classification result.
Figure 11 shows a comparison of original images and DCGAN-generated images.
From the figure, it is difficult to distinguish between the original and the synthesized
images without the labels, which proves that the DCGAN has generated good-quality
images. However, for class 111, the generated images have not been up to the mark
because of the smaller number of images available to train the DCGAN. Using this
generative adversarial network has helped us to tackle the class imbalance problem.
Excluding the five training classes to which the DCGAN has been applied, the
average training size had been nearly 4575 images per class; those five classes, on
the other hand, had only 2389 training images on average per class. One class, class
111, had only 770 training instances, which led the classifier to achieve only a 72%
F1-score. But after balancing only the training set with 3594 synthesized images,
the F1-score has improved to 90%.
The changes in the training sample sizes are illustrated in Fig. 12. For classes 110
and 111, more than 3000 synthesized images have been added, and for the rest of the
three classes, this number has been around 1000. For four of these five classes, the
classifier performance in terms of the F1-score has improved. Moreover, the overall
performance of the ResNet-50 classifier trained on the balanced dataset has been
better than that of the classifier trained on the original dataset. This validates the
applicability of DCGAN in generating synthesized Bangla handwritten character
images.
Fig. 12 Training size before versus training size after applying DCGAN
Training and validation accuracy and loss are illustrated in Figs. 13, 14, and 15. On
both the original dataset and the outlier-removed dataset, the ResNet-50 model has
a good fit for predicting handwritten characters, as illustrated in Figs. 13 and 14.
Figure 15 illustrates an exception, in which the model is applied to the Ekush
dataset after outliers are removed and DCGAN is used. Except for one epoch in the
validation loss, the model has a good prediction result, because the training and
validation accuracy and loss are close to each other. Additionally, in the learning
curve of each approach, the training and the validation loss are initially high and
then gradually decrease in the same direction, indicating that the model is safe
from overfitting. Though there is a slight gap between the training and validation
curves, it is too small to be considered overfitting. However, the validation
loss at epoch 88 has increased to about 1.6, which is relatively high compared to
other epochs. The spike at epoch 88 may be due to the existence of noise in the
dataset: for this particular batch of images, the model is unable to correctly predict
the images' classes. This type of spike does not exist in the outlier-removed dataset
or the original dataset, which suggests there is some noise in the DCGAN-generated
images. Also, when the training and the validation datasets are split randomly, this
particular batch may have received the noisiest images.
Fig. 15 Accuracy and loss of ResNet-50 on the outlier-removed and DCGAN-applied dataset
Outlier elimination on the Ekush dataset is a unique operation; we are the first to
experiment on the outlier-removed Ekush dataset. The authors in [22] only applied
DCGAN to enlarge the Ekush dataset, and no classification was performed on the
generated images.
A comparative analysis of the current work with others that used
only the Ekush dataset is presented in Table 8. Our proposed ResNet-50 model on
the original dataset has achieved 97.63% accuracy on the test dataset (Table 4), which
shows that this score outperforms all the prior work except EkushNet. Shibly et al.
[37] achieved the best test accuracy of 98.68% on the Ekush dataset, but that was
obtained through an ensemble of ten CNN models. Their highest performance with a
single CNN model was 97.81%, using ResNet-50, which is outperformed
by both of our proposed methods. Our work has also achieved better performance
than an ensemble technique [11] and deep CNN techniques [10, 20] applied to the same
dataset. Also, although we have applied outlier removal to only seven classes and
DCGAN to only five classes, our two approaches outperform the other related works.
The improvement is minor, as only some of the 122 classes of the Ekush dataset have
been considered in our study, but the results show that our proposed outlier-removal
and DCGAN approaches are capable of improving the classification performance.
5 Discussion
Outlier elimination and the application of DCGAN, as well as the comparison of
character recognition between these two approaches, constitute a unique experiment
conducted on the Ekush dataset. ResNet-50 is one of the most popular models and can
be used to achieve very good results on the Ekush dataset, as in [37]. In addition to
managing the vanishing gradient problem, the ResNet-50 model can achieve great
results with low error rates. Apart from that, by applying skip connections, it can
ignore layers that do not provide any benefit to the output [50]. The results have
shown that ResNet-50 gives better performance than the widely used CNN models.
Outlier detection is very beneficial if there is a probability of images being found
in the wrong classes. The result analysis has shown that the test result, as well as the
precision, recall, and F1-scores, have improved after applying outlier detection to
seven classes of the Ekush dataset. There is also an improvement in the overall
classification result for the 122 classes. Moreover, outlier detection and elimination in
three classes (76, 97, and 111) help our DCGAN to generate good-quality
images. However, certain classes chosen for this outlier detection approach have a
smaller volume of data, so training the model with this limited data reduced the
precision. The outcome could be better if outlier detection were applied to the whole
dataset.
The DCGAN approach, which generates images as an augmentation technique on
the outlier-removed dataset, has improved the test performance by 0.29% over
the original Ekush dataset. Not only has DCGAN increased the size of the dataset,
it has also created varied images that add more information to the original dataset. In our
study, only five classes of images have been augmented by the DCGAN approach, and
the number of generated images only brings the training set to nearly 4000 images
per class. The whole dataset still has imbalanced classes besides the chosen ones.
Yet even with small amounts of generated images, the study has shown an improvement
in the classification result. If we could generate more images for these classes, the
accuracy might improve further. However, as mentioned in [23], we should be
careful not to generate too large a number of images, to avoid the risk of degrading
the performance.
6 Conclusion
have taken only a few classes of the Ekush dataset for our experiments. Despite this,
the results obtained from the adopted novel approaches have demonstrated superior
performance to the majority of the related works. In the future, other Bangla
handwritten character datasets may also be used to evaluate the efficacy of these
methods. In addition, other classifier models, such as VGG-16, Xception, DenseNet,
AlexNet, etc., can also be explored with these two proposed methods.
References
1. Yuan, A., Bai, G., Jiao, L., & Liu, Y. (2012). Offline handwritten English character recognition
based on convolutional neural network. In Proceedings 10th IAPR International Workshop on
Document Analysis Systems, DAS 2012 (pp. 125–129). https://doi.org/10.1109/DAS.2012.61
2. Kimura, F., Wakabayashi, T., Tsuruoka, S., & Miyake, Y. (1997). Improvement of handwritten
Japanese character recognition using weighted direction code histogram. Pattern Recognition,
30(8), 1329–1337. https://doi.org/10.1016/S0031-3203(96)00153-7
3. Ciresan, D. C., Meier, U., & Schmidhuber, J. (2012). Transfer learning for Latin and Chinese
characters with deep neural networks. In Proceedings of the international joint conference on
neural networks (pp. 1–6). https://doi.org/10.1109/IJCNN.2012.6252544
4. Azad Rabby, A. K. M. S., Haque, S., Abujar, S., & Hossain, S. A. (2018). Ekushnet: Using
convolutional neural network for Bangla handwritten recognition. Procedia Computer Science,
143, 603–610. https://doi.org/10.1016/j.procs.2018.10.437
5. Ahmed, S., et al. (2019). Hand sign to bangla speech: A deep learning in vision based system
for recognizing hand sign digits and generating bangla speech. https://doi.org/10.2139/ssrn.3358187
6. Manisha, N., Sreenivasa, E., & Krishna, Y. (2016). Role of offline handwritten character recog-
nition system in various applications. International Journal of Computer Applications. https://doi.org/10.5120/ijca2016908349
7. Rahman, Md. M., Akhand, M. A. H., Islam, S., Chandra Shill, P., & Hafizur Rahman, M. M.
(2015). Bangla handwritten character recognition using convolutional neural network. Inter-
national Journal of Image, Graphics and Signal Processing, 7(8), 42–49. https://doi.org/10.5815/ijigsp.2015.08.05
8. Ghosh, T., Abedin, M. H. Z., Al Banna, H., Mumenin, N., & Abu Yousuf, M. (2021). Perfor-
mance analysis of state of the art convolutional neural network architectures in Bangla hand-
written character recognition. Pattern Recognition and Image Analysis, 31(1), 60–71. https://doi.org/10.1134/S1054661821010089
9. Chowdhury, R. R., Hossain, M. S., ul Islam, R., Andersson, K., & Hossain, S. (2019). Bangla
handwritten character recognition using convolutional neural network with data augmentation.
In 2019 Joint 8th international conference on informatics, electronics & vision (ICIEV) and
2019 3rd international conference on imaging, vision & pattern recognition (icIVPR) (pp. 318–
323). https://doi.org/10.1109/ICIEV.2019.8858545
10. Ahmed, S., Tabsun, F., Reyadh, A. S., Shaafi, A. I., & Shah, F. M. (2019). Bengali handwritten
alphabet recognition using deep convolutional neural network. In 5th International conference
on computer, communication, chemical, materials and electronic engineering, IC4ME2 2019.
https://doi.org/10.1109/IC4ME247184.2019.9036572
11. Shibly, M. M. A., Tisha, T. A., & Ripon, S. H. (2021). Stacked generalization ensemble method
to classify Bangla handwritten character. In Proceedings of international conference on sustain-
able expert systems. Lecture Notes in Networks and Systems 176. https://doi.org/10.1007/978-981-33-4355-9_46
12. Mamun, M. R., Al Nazi, Z., & Yusuf, M. S. (2018). Bangla handwritten digit recognition
approach with an ensemble of deep residual networks. In International conference on bangla
speech and language processing, ICBSLP 2018 (pp. 21–22). https://doi.org/10.1109/ICBSLP.2018.8554674
13. Goodfellow, I., et al. (2014). Generative adversarial nets. Advance in Neural Information
Process Systems, 27.
14. Basu, S., Das, N., Sarkar, R., Kundu, M., Nasipuri, M., & Basu, D. K. (2009). A hierarchical
approach to recognition of handwritten Bangla characters. Pattern Recognition, 42(7), 1467–
1484. https://doi.org/10.1016/j.patcog.2009.01.008
15. Bhowmik, T. K., Ghanty, P., Roy, A., & Parui, S. K. (2009). SVM-based hierarchical archi-
tectures for handwritten Bangla character recognition. International Journal on Document
Analysis and Recognition, 12(2), 97–108. https://doi.org/10.1007/s10032-009-0084-x
16. Bhattacharya, U., Gupta, B. K., & Parui, S. K. (2007). Direction code based features for recog-
nition of online handwritten characters of Bangla. In Proceedings of the international confer-
ence on document analysis and recognition, ICDAR, 2007. https://doi.org/10.1109/ICDAR.2007.4378675
17. Chowdhury, R. R., Hossain, M. S., Ul Islam, R., Andersson, K., & Hossain, S. (2019). Bangla
handwritten character recognition using convolutional neural network with data augmentation.
In 2019 Joint 8th international conference on informatics, electronics and vision, ICIEV 2019
and 3rd international conference on imaging, vision and pattern recognition, icIVPR 2019 with
international conference on activity and behavior computing, ABC 2019 (pp. 318–323). https://doi.org/10.1109/ICIEV.2019.8858545
18. Shopon, M., Mohammed, N., & Abedin, M. A. (2017). Bangla handwritten digit recognition
using autoencoder and deep convolutional neural network. In IWCI 2016-2016 International
Workshop on Computational Intelligence. https://doi.org/10.1109/IWCI.2016.7860340
19. Shopon, M., Mohammed, N., & Abedin, M. A. (2017). Image augmentation by blocky artifact
in deep convolutional neural network for handwritten digit recognition. In IEEE international
conference on imaging, vision and pattern recognition, icIVPR 2017 (pp. 1–6). https://doi.org/10.1109/ICIVPR.2017.7890867
20. Mashrukh Zayed, M., Neyamul Kabir Utsha, S. M., & Waheed, S. (2021). Handwritten bangla
character recognition using deep convolutional neural network: Comprehensive analysis on
three complete datasets. Advances in Intelligent Systems and Computing. https://doi.org/10.1007/978-981-33-4673-4_7
21. Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised representation learning with deep
convolutional generative adversarial networks. In 4th International conference on learning
representations, ICLR 2016-conference track proceedings.
22. Haque, S., Shahinoor, S. A., Rabby, A. K. M. S. A., Abujar, S., & Hossain, S. A. (2018).
OnkoGan: Bangla handwritten digit generation with deep convolutional generative adversarial
networks. In Recent Trends in image processing and pattern recognition, second international
conference, RTIP2R 2018, Solapur, India, 21–22 Dec 2018, Revised Selected Papers, Part III, 2018, vol. 1037 (pp. 108–117). https://doi.org/10.1007/978-981-13-9187-3_10
23. Jha, G., & Cecotti, H. (2020). Data augmentation for handwritten digit recognition using
generative adversarial networks. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-020-08883-w
24. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of
retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions
of Electrical Engineering, 44(1), 505–518. https://doi.org/10.1007/s40998-019-00213-7
25. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain
tumor classification. Applied Sciences, 10(14), 4915. https://doi.org/10.3390/app10144915
26. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural
network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy
Systems, 43(2), 1827–1833. https://doi.org/10.3233/JIFS-219283
27. Roy, S. S., et al. (2022). L2 regularized deep convolutional neural networks for fire detec-
tion. Journal of Intelligent & Fuzzy Systems, 43(2), 1799–1810. https://doi.org/10.3233/JIFS-219281
28. Reddy, A. S. B., & Juliet, D. S. (2019). Transfer learning with ResNet-50 for malaria cell-image
classification. In International Conference on Communication and Signal Processing (ICCSP)
(pp. 945–949). https://doi.org/10.1109/ICCSP.2019.8697909
29. Rezende, E., Ruppert, G., Carvalho, T., Ramos, F., & de Geus, P. (2017). Malicious software
classification using transfer learning of ResNet-50 deep neural network. In Proceedings of
the 16th IEEE international conference on machine learning and applications, ICMLA 2017
(pp. 1011–1014). https://doi.org/10.1109/ICMLA.2017.00-19
30. Alif, M. A. R., Ahmed, S., & Hasan, M. A. (2017). Isolated Bangla handwritten character recog-
nition with convolutional neural network. In 2017 20th International conference of computer
and information technology (ICCIT) (pp. 1–6).
31. Alom, M. Z., Sidike, P., Hasan, M., Taha, T. M., & Asari, V. K. (2018). Handwritten Bangla char-
acter recognition using the state-of-the-art deep convolutional neural networks. Computational
Intelligence and Neuroscience. https://doi.org/10.1155/2018/6747098
32. Khan, M. M., Uddin, M. S., Parvez, M. Z., & Nahar, L. (2022). A squeeze and excitation
ResNeXt-based deep learning model for Bangla handwritten compound character recognition.
Journal of King Saud University Computer and Information Sciences, 34(6), 3356–3364. https://doi.org/10.1016/j.jksuci.2021.01.021
33. Rabby, A. K. M. S. A., Haque, S., Islam, M. S., Abujar, S., & Hossain, S. A. (2019). Ekush:
A multipurpose and multitype comprehensive database for online off-line Bangla handwritten
characters. Communications in Computer and Information Science. https://doi.org/10.1007/978-981-13-9187-3_14
34. Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., & Basu, D. K. (2012). CMATERdb1:
A database of unconstrained handwritten Bangla and Bangla-English mixed script document
image. International Journal on Document Analysis and Recognition. https://doi.org/10.1007/s10032-011-0148-6
35. Biswas, M., et al. (2017). BanglaLekha-Isolated: A multi-purpose comprehensive dataset
of handwritten Bangla isolated characters. Data in Brief. https://doi.org/10.1016/j.dib.2017.03.035
36. Alom, Z., Sidike, P., Taha, T. M., & Asari, V. K. (2017). Handwritten bangla digit recognition
using deep learning, p. 1712.
37. Shibly, M. M. A., Tisha, T. A., Tani, T. A., & Ripon, S. (2021). Convolutional neural network-
based ensemble methods to recognize Bangla handwritten character. PeerJ Computer Science,
7, 1–30. https://doi.org/10.7717/peerj-cs.565
38. Alom, M. Z., Sidike, P., Hasan, M., Taha, T. M., & Asari, V. K. (2017). Handwritten bangla
character recognition using the state-of-art deep convolutional neural networks, p.1712.
39. Sikder, M. F. (2020). Bangla handwritten digit recognition and generation. In: Proceedings of
international joint conference on computational intelligence (pp. 547–556).
40. Rahman, M. S. (2016). Towards optimal convolutional neural network parameters for
bengali handwritten numerals recognition. In 19th international conference on computer and
information technology (ICCIT) (pp. 431–436).
41. Nishat, Z. K., & Shopon, M. (2019). Synthetic class specific Bangla handwritten character
generation using conditional generative adversarial networks. In 2019 International conference
on bangla speech and language processing (ICBSLP 2019). https://doi.org/10.1109/ICBSLP47725.2019.201475
42. Chaudhuri, B. B. (2006). A complete handwritten numeral database of Bangla-A major Indic
script. In 10th international workshop on frontiers of handwriting recognition (IWFHR), La
Baule, France.
43. Alam, S., Reasat, T., Doha, R. M., & Humayun, A. I. (2018). NumtaDB-assembled Bengali
handwritten digits, pp 1–4.
44. Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative neural
networks. AIChE Journal, 37(2), 233–243. https://doi.org/10.1002/aic.690370209
45. Bank, D., Koenigstein, N., & Giryes, R. (2020). Autoencoders. In Machine learning: Methods
and applications to brain disorders (pp. 193–208). https://doi.org/10.1016/B978-0-12-815739-8.00011-0
46. Alqahtani, H., Kavakli-Thorne, M., & Kumar, G. (2021). Applications of generative adversarial
networks (GANs): An updated review. Archives of Computational Methods in Engineering,
28(2), 525–552. https://doi.org/10.1007/s11831-019-09388-y
47. Haque, S., Shahinoor, S. A., Rabby, A. K. M. S. A., Abujar, S., & Hossain, S. A. (2019).
OnkoGan: Bangla handwritten digit generation with deep convolutional generative adversarial
networks. Communications in Computer and Information Science. https://doi.org/10.1007/978-
981-13-9187-3_10
48. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. Preprint at arXiv
arXiv:1412.6980.
49. Theckedath, D., & Sedamkar, R. R. (2020). Detecting affect states using VGG16, ResNet50 and
SE-ResNet50 networks. SN Computer Science. https://doi.org/10.1007/s42979-020-0114-9
50. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778).
https://doi.org/10.1109/CVPR.2016.90
Deep Learning-Based Approaches Using
Feature Selection Methods for Automatic
Diagnosis of COVID-19 Disease
from X-Ray Images
Burak Taşci
1 Introduction
The novel coronavirus (COVID-19) pandemic created a worldwide environment of chaos in a very short time. As of July 2021, over 206 million official cases had been reported in the world and the number of deaths due to COVID-19 had exceeded 4 million [1]. Many countries have developed various policies to cope with this pandemic and minimize its effects. In particular, Turkey is among the few countries that set an example to the world as a result of its early measures and social isolation rules. Taking early action against COVID-19 and similar pandemics is of vital importance. If COVID-19 cases can be detected early, these patients can be isolated so that healthy, uninfected individuals remain safe. Science and technology make great contributions to the precautionary policies implemented in this sense. One of the most important of these contributions is predicting how the pandemic will behave in the coming period. In this context, two main approaches appear: the first is statistical approaches and mathematical models, and the second is artificial intelligence-based approaches, which have received more attention in recent years.
In the literature, there are various approaches for disease detection from biomedical images based on machine learning and deep learning methods [2–8]. Javaheri et al. [9] tried to detect COVID-19-positive, CAP, and other cases from 89,145 images obtained from the data of five different hospitals using BCDU-Net (U-Net). The achieved results were 91.66%, 87.5%, 95%, and 94% for accuracy, sensitivity, AUC, and specificity, respectively. Rehmen et al. [10] used CT and X-ray images of 200 COVID-19(+), 200 healthy, 200 bacterial pneumonia, and 200 viral pneumonia cases in their study. Using the ResNet101 transfer learning method, the reported results were 98.75%, 97.5%, 96.43%, and 100% accuracy, sensitivity,
B. Taşci (B)
Fırat University Vocational School of Technical Sciences Elazığ, Elazığ, Turkey
e-mail: [email protected]
2.1 Methodology
In this research, an efficient method that uses previously trained network models to detect the COVID-19 virus with a high degree of accuracy is proposed. Figure 1 depicts the planned workflow of the approach. Preprocessing techniques are applied to the X-ray images as part of the proposed method; their primary purpose is to improve classification performance. In order to highlight the point regions in the X-ray images and reduce the overall number of gray tones, the gradient operator was applied in Sobel-operator mode. In the second step, marker-controlled watershed segmentation (MCWS) was used to segment the points in the gradient images. In the last step, feature extraction was performed with 13 pre-trained models. The extracted features were reduced in number using the Chi-square, NCA, mRMR, and ReliefF feature selection methods. The selected features obtained from the pre-trained networks were given to 13 different classifiers. High performance was observed with AlexNet and ResNet101, so these two architectures were reused for feature extraction. The FC8 layer of the AlexNet model and the FC1000 layer of the ResNet101 model each provide 1000 features. In the proposed method, feature extraction was carried out during both the training and testing phases. In total, the (1000 + 1000) 2000 features have been reduced to 200
features by the combined mRMR feature selection method.
Fig. 1 Workflow of the proposed method: pre-processed X-ray images (Sobel gradient, watershed segmentation, resizing) are fed to 13 pre-trained networks; the 1000 features from each are passed through ReliefF, NCA, mRMR and Chi-square feature selection, combined, and classified (SVM) into the COVID-19, Pneumonia and Normal classes
In the last step, the 200 selected features were given to 13 different classifiers for reclassification, and the highest performance was observed with SVM.
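As a rough illustration of this fusion-and-selection stage, the following Python sketch concatenates two 1000-dimensional feature sets, reduces them to 200 features, and classifies them with a cubic SVM. It assumes the deep features have already been extracted and saved (the file names are hypothetical), and it uses scikit-learn's SelectKBest with mutual information only as a stand-in for the mRMR selector used in the chapter, which is not available in scikit-learn.

```python
# Sketch of the feature-fusion, selection and classification stage, assuming the
# 1000-dimensional feature arrays from AlexNet (fc8) and ResNet-101 (fc1000)
# have already been extracted for each X-ray image.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

alexnet_feats = np.load("alexnet_fc8.npy")       # shape (n_images, 1000), hypothetical file
resnet_feats = np.load("resnet101_fc1000.npy")   # shape (n_images, 1000), hypothetical file
labels = np.load("labels.npy")                   # 0 = COVID-19, 1 = Normal, 2 = Pneumonia

# Combine the two feature sets (1000 + 1000 = 2000 features per image).
combined = np.concatenate([alexnet_feats, resnet_feats], axis=1)

# Reduce the 2000 combined features to 200 selected features
# (stand-in for the chapter's mRMR selection).
selector = SelectKBest(mutual_info_classif, k=200)
selected = selector.fit_transform(combined, labels)

# Cubic SVM (polynomial kernel of degree 3), as in the best-performing model.
clf = SVC(kernel="poly", degree=3)
print(cross_val_score(clf, selected, labels, cv=5).mean())
```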
2.1.1 Preprocessing
Feature selection is, in short, the creation of a feature vector that is equivalent to the principal feature vector but more functional and smaller in size, obtained by forming a subset of the features that belong to a class and are produced by the deep learning models.
NCA
NCA is a feature weighting method that can be used to select the optimal subset of features by maximizing an objective function that evaluates classification accuracy over the training data; in this way NCA is used as a feature selection method. To obtain the weight vector w corresponding to the feature vector x_i, the approach optimizes a nearest-neighbor classifier in an effort to improve performance [23]. Within the NCA framework, a reference sample point x_j is chosen for each sample x_i; the closer the two samples are, the higher the probability that x_j will be selected as the reference point for x_i. This distance is measured using the weighted distance D_w, given in Eq. 1.
D_w(x_i, x_j) = \sum_{m=1}^{r} w_m^2 \, |x_{im} - x_{jm}|   (1)
where w_m is the weight allotted to the m-th feature. A kernel function that returns large values for small D_w can be used to determine the relationship between the probability p_ij and the weighted distance D_w. The kernel function is defined as k(z) = exp(-z/\sigma), where the parameter \sigma is the kernel width; it affects the probability that sample x_j will be selected as the reference point for x_i. In addition, y_ij takes the value 1 when x_i and x_j belong to the same class (and 0 otherwise), and p_ii = 0. The probability of x_i being classified correctly is written as in Eq. 3.
P_i = \sum_{j=1,\, j \ne i}^{n} p_{ij} \, y_{ij}   (3)
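A minimal numpy sketch of the quantities above follows: the weighted distance of Eq. 1, a kernel of the form k(z) = exp(-z/sigma), and the reference-point probabilities for one sample, with the correct-classification probability of Eq. 3. The feature weights w would normally be learned by maximizing the classification objective; here they are simply given illustrative values, and the normalization of the probabilities follows the standard NCA formulation [23].

```python
import numpy as np

def weighted_distance(w, xi, xj):
    # Eq. 1: D_w(x_i, x_j) = sum_m w_m^2 * |x_im - x_jm|
    return np.sum(w ** 2 * np.abs(xi - xj))

def reference_probabilities(w, X, i, sigma=1.0):
    # p_ij proportional to k(D_w(x_i, x_j)) with k(z) = exp(-z / sigma), and p_ii = 0
    n = X.shape[0]
    k = np.array([np.exp(-weighted_distance(w, X[i], X[j]) / sigma) if j != i else 0.0
                  for j in range(n)])
    return k / k.sum()

X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1]])
y = np.array([0, 0, 1])
w = np.array([1.0, 0.5])          # illustrative feature weights
p = reference_probabilities(w, X, i=0)
# Eq. 3: probability that x_0 is classified correctly (reference points of its own class)
print(np.sum(p[y == y[0]]))
```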
ReliefF
The Relief algorithm is one of the best-known feature selection approaches. It can produce feature relevance estimates that are quite accurate and useful. These estimates are obtained by assigning weights to the features: if a feature is useful, the nearest neighbors of a sample within the same class are expected to be closer to it along that feature than the nearest neighbors from all other classes [24]. The underlying optimization problem is solved, and the result is used to determine the feature weights. However, the Relief algorithm has the limitation of only handling two-class problems and cannot process incomplete data, which is a disadvantage. The ReliefF method, an enhanced version of the Relief algorithm, was proposed to overcome these and other difficulties; it can cope with multi-class problems as well as noisy and incomplete data. In the working logic of the ReliefF algorithm, first a sample Ri is selected at random; then its k nearest neighbors from the same class, called Hj, and k nearest neighbors from each of the other classes, called Mj(C), are selected. Depending on the values of Ri, Hj, and Mj(C), the weight w[A] is updated for every feature A. Feature weights range from -1 to +1, and the largest positive values indicate the most important features. This process is repeated a number of times determined by the user. The diff function calculates the differences, that is, distances, between samples for a given feature; its form depends on whether the feature is nominal or numeric. Let I1 and I2 be samples and A a feature: for a nominal feature the calculation is as in Eq. 4, and for a numeric feature as in Eq. 5. Choosing a larger k increases the robustness of the algorithm against noisy data. This value can be set by the user, but if k is chosen as 1, the algorithm becomes sensitive to noisy data. In many studies the k value is chosen as 10, although varying k is useful when examining the importance levels of the features; choosing k too small leads to similarly poor results.
\mathrm{diff}(A, I_1, I_2) = \begin{cases} 0, & I_1 = I_2 \\ 1, & I_1 \ne I_2 \end{cases}   (4)

\mathrm{diff}(A, I_1, I_2) = \frac{|I_1 - I_2|}{\max(A) - \min(A)}   (5)
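The following simplified sketch implements the diff function of Eqs. 4 and 5 and a single Relief-style weight update for one sampled instance. A full ReliefF implementation would also average over the k nearest hits and misses and over all classes; this fragment, with made-up data, only illustrates the per-feature computation.

```python
import numpy as np

def diff(values, a, i1, i2, nominal=False):
    # Eq. 4 for nominal features, Eq. 5 for numeric features.
    if nominal:
        return 0.0 if i1[a] == i2[a] else 1.0
    return abs(i1[a] - i2[a]) / (values[:, a].max() - values[:, a].min())

X = np.array([[1.0, 10.0], [1.2, 11.0], [5.0, 30.0]])
w = np.zeros(X.shape[1])
ri, hit, miss = X[0], X[1], X[2]   # sampled instance, nearest hit, nearest miss
for a in range(X.shape[1]):
    # Weights decrease when same-class neighbors differ and increase when
    # different-class neighbors differ along the feature.
    w[a] += diff(X, a, ri, miss) - diff(X, a, ri, hit)
print(w)
```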
Chi-Square Test. This is a univariate filter method. The Chi-square method works on categorical variables and detects the relationships and dependencies between them. The chi-square test is a two-step test. In the first step, the chi-square statistic of the observed values is calculated with respect to the expected values. In the second step, the obtained chi-square statistic is compared with a determined threshold value and a decision is made accordingly. The features are scored according to the chi-square statistic, and the features with the best scores are used. The chi-square statistic is obtained using Eqs. 6–8. In Eq. 6, I is the number of intervals and J is the number of classes; N_ij is the number of samples in the i-th interval belonging to the j-th class; and E_ij is the expected number of samples in the i-th interval and j-th class when the two properties are independent. Finally, d, given in Eq. 8, is the number of degrees of freedom of the chi-square distribution used for the test statistic [25, 26].
X^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(N_{ij} - E_{ij})^2}{E_{ij}}   (6)

E_{ij} = \frac{N_i N_j}{N}   (7)

d = (I - 1)(J - 1)   (8)
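A short scikit-learn sketch of chi-square feature scoring follows; it is an illustration on a generic non-negative dataset rather than the chapter's X-ray features (chi2 requires non-negative values, e.g. counts or min-max scaled activations).

```python
# Chi-square feature scoring and selection with scikit-learn.
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)          # pixel intensities, all non-negative
selector = SelectKBest(chi2, k=20)           # keep the 20 best-scoring features
X_sel = selector.fit_transform(X, y)
print(X.shape, "->", X_sel.shape)
print(selector.scores_[:5])                  # chi-square statistic per feature
```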
Transfer learning is defined as a learning structure created by using the features obtained from deep learning models developed for specific purposes as inputs to other machine learning methods. In this study, the deep learning models AlexNet, EfficientNet B0, GoogleNet, Inception ResNet-v2, Inception-v3, DenseNet201, MobileNetV2, NasNet-Large, ResNet18, ResNet50, ResNet101, VGG16, and VGG19 were used. The layers, depth, number of parameters, and image input dimensions of these networks, together with their architectures, are given in Table 1.
AlexNet
Deep learning pioneers Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton proposed the method that became known as AlexNet [29]. This deep convolutional neural network has a total of 25 layers: 5 convolution layers, 3 max-pooling layers, 2 dropout layers, 3 fully connected layers, 7 ReLU layers, 2 normalization layers, a softmax layer, and the input and classification (output) layers. The dimensions of the image fed to the input layer of AlexNet are 227 × 227 × 3. Classification takes place in the final layer, where the class score for the input image is produced.
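The sketch below shows one way to obtain a 1000-dimensional fc8-style feature vector from a pre-trained AlexNet. It uses PyTorch/torchvision purely for illustration (the chapter's experiments are not tied to this framework), assumes a reasonably recent torchvision version for the weights argument, and the image file name is hypothetical.

```python
import torch
from torchvision import models, transforms
from PIL import Image

model = models.alexnet(weights="DEFAULT")   # ImageNet pre-trained weights
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((227, 227)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("xray_sample.png").convert("RGB")   # hypothetical file name
with torch.no_grad():
    # The final layer's 1000 activations are used as a feature vector
    # rather than as class scores.
    features = model(preprocess(img).unsqueeze(0))   # shape (1, 1000)
print(features.shape)
```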
DenseNet201
In DenseNet (Densely Connected Convolutional Network), forward connections are made between each layer and the other layers. Each layer of the DenseNet design takes as input the feature maps of all preceding layers, and its own feature maps are passed on to all subsequent layers [30]. DenseNet topologies have the advantage of improving feature propagation and reducing the number of parameters by allowing feature reuse [31]. The DenseNet-121 design is composed of four dense blocks and three transition layers, with 121 layers in total (117 convolutional layers, 3 transition layers, and 1 classification layer).
MobileNetV2
MobileNet designs are built on a modular architecture that allows for the develop-
ment of both shallow and deep neural networks. This architecture’s two basic global
hyperparameters provide an optimal balance of latency and precision. Based on the
restrictions of the problem, these hyperparameters allow the model builder to select
the appropriate-sized model for their application.
Nasnet-Large
EfficientNet B0
GoogleNet
GoogLeNet won the ImageNet 2014 image classification competition with a success rate of 93.33%. The GoogLeNet architecture consists of 144 layers, and it demonstrated that larger amounts of training data increase the performance of the network.
Inception ResNet-V2
Inception V3
ResNet-18
The pre-trained ResNet-18 model, which provides rich features, was trained on more than one million images from the ImageNet dataset with an input size of 224 × 224. Although it has 71 layers and a depth of 18, it gives successful and faster results compared to some deeper models [35].
ResNet-50
The ResNet micro-architecture module differs from other architectures in its structure: it may be preferable to skip the transformation between some layers and pass the signal directly to a lower layer. By allowing this in the ResNet architecture, the performance was increased to higher levels.
The ResNet-50 architecture consists of a network of 177 layers with a depth of 50. In addition to this layered structure, the architecture specifies how the inter-layer connections are made [36].
ResNet-101
The ResNet-101 structure has 347 layers and a depth of 101. ResNet's bypass (skip) connections between layers are referred to as ResBlocks. Even if nothing is learned in the previous layer, a ResBlock makes the model more robust by passing the information from the previous layer on to the new layer; ResBlocks thereby alleviate the vanishing gradient problem. Gradient descent is used as the optimization algorithm. The ResNet-101 input layer dimensions are 224 × 224 × 3 [36].
VGG16
The VGG16 model consists of a total of 41 layers, 16 of which have learnable weights, followed by ReLU and pooling layers. The learnable layers include thirteen convolutional and three fully connected layers. Similar to AlexNet, the VGG16 model employs a stride of 1 pixel and 3 × 3 filters in all convolutional layers, and maximum pooling layers follow the convolutional layers. Maximum pooling uses a 2 × 2 filter with a stride of two. To extract feature vectors, the activations in the first and second fully connected layers (fc6, fc7) are utilized; the fc6 and fc7 output vectors together contain 4096 features each. Training uses 224 × 224 RGB images [37].
VGG19
VGG19 was developed by the Visual Geometry Group (VGG) at the University of Oxford. It consists of 19 weight layers, of which 16 are convolutional and 3 are fully connected, together with 5 maximum pooling layers and 1 softmax layer. The input of this network is an image of dimensions (224, 224, 3). Approximately 144 million trainable parameters are available. Filters of size 3 × 3 with a stride of one pixel were employed so that the overall notion of the image could be captured [37].
SVM is a machine learning model developed by Vapnik in 1995 that is used in clustering and regression problems and, especially, in classification. In recent years it has been one of the most successful machine learning algorithms for solving classification problems. The purpose of the SVM model is basically to detect the hyperplane that separates the classes of the target variable from each other in the most appropriate way [38].
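A brief scikit-learn sketch of the "Cubic SVM" configuration, i.e. a support vector classifier with a third-degree polynomial kernel, is shown below. The synthetic data stand in for the 200 selected deep features; they are not the chapter's dataset.

```python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic three-class data with 200 features, standing in for the selected features.
X, y = make_classification(n_samples=600, n_features=200, n_classes=3,
                           n_informative=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="poly", degree=3, C=1.0)   # cubic (degree-3 polynomial) SVM
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```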
Although the k-NN classifier is a simple type of classifier, it is one of the classifiers that gives good results. It is called "simple" because it does not require any training step; this property distinguishes it from other classifiers, as the training data are used directly during the classification process. Given a test sample, the k nearest neighbors of this sample in the training set are found and the number of neighbors belonging to each class is counted; the test sample is then assigned to the class with the largest number of neighbors [39]. There are specific mathematical formulas for the concept of distance in the k-NN classifier, given in Eqs. 9–11. In the Minkowski distance equation, choosing k = 1 gives the Manhattan distance and k = 2 gives the Euclidean distance.
\mathrm{Euclidean} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}   (9)

\mathrm{Manhattan} = \sum_{i=1}^{n} |x_i - y_i|   (10)

\mathrm{Minkowski} = \left( \sum_{i=1}^{n} |x_i - y_i|^k \right)^{1/k}   (11)
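The following scikit-learn sketch illustrates the distance choices of Eqs. 9–11: the Minkowski metric with p = 1 reduces to the Manhattan distance and with p = 2 to the Euclidean distance. The synthetic data are illustrative only.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for p in (1, 2):
    # p = 1: Manhattan distance (Eq. 10); p = 2: Euclidean distance (Eq. 9).
    knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=p)
    knn.fit(X_tr, y_tr)
    print(f"p={p} accuracy:", knn.score(X_te, y_te))
```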
Decision trees allow the rapid processing of data and perform classification using data with certain attribute values. For this process, some features are designated as inputs and some as outputs and presented to the algorithm; the output values are obtained by following the decisions in the tree according to the input values. One of the methods used to create such a model is the ensemble bagged trees (EBT) method.
To increase the prediction accuracy of individual learning algorithms, ensemble approaches combine various learning methods. They are a linear combination of different modeling methods that produce better prediction outcomes without significantly increasing complexity. Bagged and boosted ensembles are two of the most widely used ensemble methods. While bagged approaches mainly reduce the error variance of the base learning algorithms, boosted methods specifically reduce their bias [40, 41].
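A small sketch contrasting a bagged and a boosted tree ensemble, in the spirit of the ensemble bagged trees classifier mentioned above, follows; the data are synthetic and illustrative only.

```python
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

# Bagging: averages many trees trained on bootstrap samples (reduces variance).
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
# Boosting: fits shallow trees sequentially on re-weighted data (reduces bias).
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=50, random_state=0)

print("bagged :", cross_val_score(bagged, X, y, cv=5).mean())
print("boosted:", cross_val_score(boosted, X, y, cv=5).mean())
```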
2.2 Dataset
The dataset consists of 1061 X-ray images labeled by radiologists. The dataset was edited after being downloaded from the Kaggle website [42, 43]. The X-ray images belong to three classes: COVID-19, Pneumonia, and Normal. There are 361 COVID-19, 500 Pneumonia, and 200 Normal chest X-ray images in the dataset. The COVID-19 cases consist of chest X-ray images of 200 male and 161 female patients, with a mean age of over 45. These images range in height from 143 to 1637 pixels (average 491 pixels) and in width from 76 to 1225 pixels (average 383 pixels). Figure 2 shows examples of X-ray scans of COVID-19, Normal, and Pneumonia patients from the dataset.
was referred to as a false positive (FP). For the suggested method, the performance measurement metrics were computed using the TP, TN, FP, and FN numbers from the confusion matrix. Performance measures were developed using the values of accuracy, sensitivity, specificity, precision, and F-score, which were computed with the following equations.
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}   (12)

\mathrm{Sensitivity} = \frac{TP}{TP + FN}   (13)

\mathrm{Specificity} = \frac{TN}{TN + FP}   (14)

\mathrm{Precision} = \frac{TP}{TP + FP}   (15)

\mathrm{F\text{-}score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}   (16)
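For reference, the following small function computes Eqs. 12–16 from per-class TP, TN, FP, and FN counts of a confusion matrix; the counts in the example call are made up for illustration.

```python
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)                          # Eq. 12
    sensitivity = tp / (tp + fn)                                        # Eq. 13
    specificity = tn / (tn + fp)                                        # Eq. 14
    precision = tp / (tp + fp)                                          # Eq. 15
    f_score = 2 * precision * sensitivity / (precision + sensitivity)   # Eq. 16
    return accuracy, sensitivity, specificity, precision, f_score

print(metrics(tp=355, tn=687, fp=13, fn=6))   # illustrative counts only
```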
4 Experimental Studies
The experimental results of this study were obtained in the MATLAB environment, using an all-in-one computer with an i7 processor, 16 GB of RAM, and a 4 GB graphics card. The images in the data set were resized to 224 × 224, 227 × 227, 299 × 299, and 331 × 331, and classification was performed. In the study, the convolutional neural networks AlexNet, EfficientNet B0, GoogleNet, Inception ResNet-v2, Inception-v3, DenseNet201, MobileNetV2, NasNet-Large, ResNet18, ResNet50, ResNet101, VGG16, and VGG19 were used, together with the Chi-square, NCA, mRMR, and ReliefF feature selection methods. A total of 2000 features were selected, 1000 from the FC8 layer of AlexNet and 1000 from the FC1000 layer of ResNet101. The selected features were reduced to 200 features with the mRMR feature selection method, and the 200 features were given to 13 different classifiers. In this study, the highest performance was observed with SVM. Figure 3 shows the confusion matrices of the configurations in which the 13 pre-trained networks and the combined networks reached their highest accuracy. The ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection had the best accuracy with 98.21%, and the Inception ResNet-v2 network with the Cubic SVM classifier and NCA feature selection had the worst accuracy with 95.00%.
Figure 4 shows the accuracy values of the pre-trained networks according to the classifiers and feature selection methods.
Fig. 3 Confusion matrices (COVID-19, Normal, Pneumonia classes) of the highest-accuracy configurations of the pre-trained and combined networks
Fig. 4 Graphs of truth values of pre-trained networks according to classifiers and feature selections
For the AlexNet network, the Cubic SVM classifier had the highest accuracy with 96.42%, while the Medium Gaussian SVM classifier with mRMR feature selection had the worst accuracy with 89.2%. For the DenseNet-201 network, the Cubic SVM classifier with NCA feature selection had the highest accuracy with 96.61%, and the Quadratic Discriminant classifier with Chi2 feature selection had the worst accuracy with 89.6%. For the EfficientNet B0 network, the Cubic SVM classifier with NCA feature selection had the highest accuracy with 96.51%, and the Fine Tree classifier with Chi2 feature selection had the worst accuracy with 89.3%. For the GoogleNet network, the Cubic SVM classifier with NCA feature selection had the highest accuracy with 96.06%, and the Quadratic SVM classifier with mRMR feature selection had the worst accuracy with 89.7%. For the Inception ResNet-v2 network, the Cubic SVM classifier had the highest accuracy with 95.0%, and the Medium Gaussian SVM classifier with NCA feature selection had the worst accuracy with 89.7%.
For the Inception-v3 network, the Cubic SVM classifier with Chi2 feature selection had the highest accuracy with 96.14%, and the Bilayered Neural Network had the worst accuracy with 89.2%. For the MobileNetV2 network, the Cubic SVM classifier with Chi2 feature selection had the highest accuracy with 96.14%, and the Quadratic Discriminant with ReliefF feature selection had the worst accuracy with 90.0%. For the NasNet-Large network, the Cubic SVM classifier with ReliefF feature selection had the highest accuracy with 96.32%, and the Medium Gaussian SVM with ReliefF feature selection had the worst accuracy with 89.7%. For the ResNet18 network, the Cubic SVM classifier had the highest accuracy with 96.04%, and the Quadratic Discriminant with ReliefF feature selection had the worst accuracy with 90.0%. For the ResNet50 network, the Cubic SVM classifier with NCA feature selection had the highest accuracy with 97.08%, and the Quadratic Discriminant with NCA feature selection had the worst accuracy with 90.1%. For the ResNet101 network, the Quadratic Discriminant with NCA feature selection had the highest accuracy with 96.04%, and the Fine Tree classifier with ReliefF feature selection had the worst accuracy with 90.4%. For the VGG16 network, the Cubic SVM classifier with mRMR feature selection had the highest accuracy with 96.42%, and the Medium Gaussian SVM with NCA feature selection had the worst accuracy with 90.0%. For the VGG19 network, the Quadratic Discriminant classifier with NCA feature selection had the highest accuracy with 95.66%, and the Medium Gaussian SVM with NCA feature selection had the worst accuracy with 89.3%.
Table 2 gives the sensitivity, specificity, precision, and F-score results of the classifiers used in the proposed method. For the Pneumonia class, the Sensitivity, Specificity, Precision, and F-score metrics were all 100%. In the COVID-19 class, for the Sensitivity metric, the GoogleNet network with the Cubic SVM classifier and NCA feature selection had the best result with 100%, and the Inception ResNet-v2 network with the Cubic SVM classifier and NCA feature selection had the worst result with 94.18%. For the Specificity metric, the ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection had the best result with 98.14%, and the VGG19 network with the Cubic SVM classifier and mRMR feature selection had the worst result with 93.71%. For the Precision metric, the ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection had the best result with 96.47%, and the VGG19 network with the Cubic SVM classifier and mRMR feature selection had the worst result with 89.08%. For the F-score metric, the ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection had the best result with 97.39%, and the Inception ResNet-v2 network with the Cubic SVM classifier and NCA feature selection had the worst result with 92.77%.
In the Normal class, for the Sensitivity metric, the VGG19 network with the Cubic SVM classifier and mRMR feature selection had the best result with 99.77%, and the GoogleNet network with the Cubic SVM classifier and NCA feature selection had the worst result with 96.04%. For the Specificity metric, the Inception ResNet-v2 network with the Cubic SVM classifier and NCA feature selection had the best result with 98.95%, and the VGG19 network with the Cubic SVM classifier and mRMR feature selection had the worst result with 78.00%.
Table 2 (continued)

                                      Class       Accuracy (%)  Sensitivity (%)  Specificity (%)  Precision (%)  F-Score (%)
                                      Pneumonia                 100.00           100.00           100.00         100.00
 VGG19-Cubic SVM-mRMR                 COVID-19    95.66         99.45            93.71            89.08          93.98
                                      Normal                    99.77            78.00            95.13          97.39
                                      Pneumonia                 100.00           100.00           100.00         100.00
 ResNet50 + AlexNet-Cubic SVM-mRMR    COVID-19    98.21         98.34            98.14            96.47          97.39
                                      Normal                    99.30            93.50            98.50          98.90
                                      Pneumonia                 100.00           100.00           100.00         100.00
For the Precision metric, the ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection had the best result with 98.50%, and the Inception ResNet-v2 network with the Cubic SVM classifier and NCA feature selection had the worst result with 84.00%. For the F-score metric, the ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection had the best result with 98.90%, and the Inception-v3 network with the Cubic SVM classifier and Chi2 feature selection had the worst result with 96.38%.
5 Discussion
In this section, the performance criteria of studies with pre-trained models and the
proposed method, consisting of accuracy, sensitivity and specificity, are discussed.
Evaluations in the literature are usually made on combined data sets. Since the data
sets used in the studies are different and the evaluation criteria are different, it cannot
be said that they are completely superior to each other. The performance scores of
these methods are given in Table 3.
Abbas et al. [44] established a modified deep neural network, effective on X-ray images, to distinguish COVID-19 cases more effectively. Their model, called DeTraC, includes three inner layers; it was created using ResNet18 as the backbone and achieved 95.12% accuracy on the X-ray dataset. Wang et al. [45] used 44 COVID-19(+) and 55 typical viral pneumonia CT images in their study. As preprocessing, ROI extraction based on visual inspection was performed. With the applied M-Inception algorithm, the obtained results were 82.9%, 81%, 84%, 77%, and 90% for accuracy, sensitivity, F1-score, AUC, and specificity, respectively. Alqudah et al. [46] used SVM, random forest, and CNN classifiers; 95.2% accuracy, 93.3% sensitivity, 100% specificity, and 100% precision were achieved.
Hemdan et al. [47] suggested the COVIDX-Net deep learning classifier architecture for COVID-19 diagnosis using X-ray pictures. In addition, they validated seven distinct DCNN models, such as VGG19 and DenseNet201, in their investigation, and demonstrated that the VGG19 and DenseNet classifications are superior. Narin et al. [48] used deep CNN-based models to classify X-ray images for COVID-19 illness. Using chest X-ray radiographs, CNN-based models (InceptionResNetV2, ResNet50, and InceptionV3) were utilized to detect people infected with coronavirus pneumonia. Based on the results of the experiments, 98.00% accuracy was reached with the ResNet50 model.
The proposed approach reached a success rate of 98.21%. It reached a 100% success rate in the sensitivity and specificity criteria for the pneumonia class. For the COVID-19 class, Sensitivity, Specificity, Precision, and F-score values of 98.34%, 98.14%, 96.47%, and 97.39% were obtained, respectively.
6 Results
The rapid spread of the COVID-19 pandemic all over the world and its negative effects on people clearly demonstrate the importance of detecting positive cases at an early stage and of rapid and correct intervention. In this study, a three-class data set consisting of X-ray images obtained during the COVID-19 epidemic was classified with a transfer learning approach. Preprocessing techniques were applied to the X-ray images to improve classification performance: the gradient operator in Sobel mode was used to highlight the point regions in the X-ray images and reduce the number of gray tones. The Chi-square, NCA, mRMR, and ReliefF feature selection methods were used. First, the results of 13 pre-trained models were compared. Then, a total of 2000 features were selected from AlexNet and ResNet101 and reduced to 200 features with the mRMR feature selection method. The 200 features were given to 13 different classifiers. In this study, the highest performance, 98.21%, was obtained with SVM after applying mRMR feature selection to the combined ResNet50 + AlexNet models. For the COVID-19 class, the highest accuracy, sensitivity, specificity, precision, and F-score values were 98.21% (ResNet50 + AlexNet, Cubic SVM), 100% (GoogleNet, Cubic SVM), 98.14% (ResNet50 + AlexNet, Cubic SVM), 96.47% (ResNet50 + AlexNet, Cubic SVM), and 97.39% (ResNet50 + AlexNet, Cubic SVM), respectively. The proposed approach shows that pre-trained CNN architectures and feature selection methods can be used together. In addition, this study confirmed that combining features can be more efficient than considering the performance of each feature selection method separately. The major limitation of this study is that the method requires more powerful hardware if applied to larger datasets.
References
20. Shi, F., Xia, L., Shan, F., Song, B., Wu, D., Wei, Y., Yuan, H., Jiang, H., He, Y., & Gao, Y. (2021).
Large-scale screening to distinguish between COVID-19 and community-acquired pneumonia
using infection size-aware classification. Physics in Medicine & Biology, 66(6), 065031.
21. Tarabalka, Y., Chanussot, J., & Benediktsson, J. A. (2010). Segmentation and classification of
hyperspectral images using watershed transformation. Pattern Recognition, 43(7), 2367–2379.
22. Gauch, J. M. (1999). Image segmentation and analysis via multiscale gradient watershed
hierarchies. IEEE Transactions on Image Processing, 8(1), 69–79.
23. Yang, W., Wang, K., & Zuo, W. (2012). Neighborhood component feature selection for high-
dimensional data. Journal of Computers, 7(1), 161–168.
24. Robnik-Šikonja, M., & Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF
and RReliefF. Machine Learning, 53(1), 23–69.
25. Liu, H., Li, J., & Wong, L. (2002). A comparative study on feature selection and classification
methods using gene expression profiles and proteomic patterns. Genome Informatics, 13, 51–60.
26. McHugh, M. L. (2013). The chi-square test of independence. Biochemia Medica, 23(2), 143–
149.
27. Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of
max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 27(8), 1226–1238.
28. Ding, C., & Peng, H. (2005). Minimum redundancy feature selection from microarray gene
expression data. Journal of Bioinformatics and Computational Biology, 3(02), 185–205.
29. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). Imagenet classification with deep
convolutional neural networks. Communications of the ACM, 60(6), 84–90.
30. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected
convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 4700–4708).
31. Iandola, F., Moskewicz, M., Karayev, S., Girshick, R., Darrell, T., & Keutzer, K. (2014). Densenet: Implementing efficient convnet descriptor pyramids. Preprint at arXiv:1404.1869.
32. Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural
networks. In International conference on machine learning, PMLR (pp. 6105–6114).
33. Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, inception-resnet
and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial
intelligence.
34. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke,
V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE
conference on computer vision and pattern recognition (pp. 1–9).
35. Ou, X., Yan, P., Zhang, Y., Tu, B., Zhang, G., Wu, J., & Li, W. (2019). Moving object detection
method via ResNet-18 with encoder–decoder structure in complex scenes. IEEE Access, 7,
108152–108160.
36. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778)
37. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. Preprint at arXiv:1409.1556.
38. Vapnik, V. (1999). The nature of statistical learning theory. Springer science & business media.
39. McRoberts, R. E., Tomppo, E. O., Finley, A. O., & Heikkinen, J. (2007). Estimating areal
means and variances of forest attributes using the k-Nearest Neighbors technique and satellite
imagery. Remote Sensing of Environment, 111(4), 466–480.
40. Bühlmann, P. (2012). Bagging, boosting and ensemble methods. In Handbook of computational
statistics. Springer, pp 985–1022.
41. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
42. COVID-19 chest xray. (2022). https://www.kaggle.com/bachrr/covid-chest-xray
43. Chest X-Ray Images (Pneumonia). (2022). Retrieved from https://www.kaggle.com/paultimot
hymooney/chest-xray-pneumonia
44. Abbas, A., Abdelsamea, M. M., & Gaber, M. M. (2021). Classification of COVID-19 in chest
X-ray images using DeTraC deep convolutional neural network. Applied Intelligence, 51(2),
854–864.
45. Wang, S., Kang, B., Ma, J., Zeng, X., Xiao, M., Guo, J., Cai, M., Yang, J., Li, Y., & Meng,
X. (2021). A deep learning algorithm using CT images to screen for Corona Virus Disease
(COVID-19). European Radiology, 31(8), 6096–6104.
46. Alqudah, A. M., Qazan, S., Alquran, H., Qasmieh, I. A., & Alqudah, A. (2020). COVID-2019
detection using X-ray images and artificial intelligence hybrid systems. Biomedical Signal and
Image Analysis and Project.
47. Hemdan, E. E.-D., Shouman, M. A., & Karar, M. E. (2020). Covidx-net: A framework of deep learning classifiers to diagnose covid-19 in x-ray images. Preprint at arXiv:2003.11055.
48. Narin, A., Kaya, C., & Pamuk, Z. (2021). Automatic detection of coronavirus disease (covid-19)
using x-ray images and deep convolutional neural networks. Pattern Analysis and Applications,
24(3), 1207–1220.
49. Cohen, J. P., Morrison, P., Dao, L., Roth, K., Duong, T. Q., & Ghassemi, M. (2020). Covid-19 image data collection: Prospective predictions are the future. Preprint at arXiv:2006.11988.
Image Captioning Using Deep Transfer
Learning
1 Introduction
Generating a textual description of an image is an easy task for a human being; however, for a machine to explain an image requires computer vision to visualise the image and NLP to describe it [1]. Hence, in order to generate a caption automatically for a particular photograph, the system must be trained to recognise the content of the image and thereafter to express that content in natural language [2]. With the advent of deep learning methods, especially for image feature extraction and processing [3], this particular problem has been swiftly addressed.
Deep learning techniques such as convolutional neural networks (CNNs) are widely used for image processing tasks for their ability to deal with millions of underlying features [4]. It has been well perceived that CNN techniques are quite efficient for a variety of medical image processing tasks, e.g. COVID-19 lung CT scans [5], MRI images for brain tumor diagnosis [6, 7], retinal blood vessels [8], angiograms [9], chest X-rays [10], and many more.
By just seeing the picture depicted in Fig. 1, some of us might say "A Little is talking brown guiding grassy", some may say "Little boy is playing with toys", and yet others might say "A little boy is designing the house". All these observations are true, and even a few additional captions are possible. None of these findings require any special training or effort from a human being; however, this is not the case for a machine, which cannot produce an appropriate description just by glancing at the image.
T. K. Das (B)
School of Information Technology and Engineering, Vellore Institute of Technology,
Vellore 632014, India
e-mail: [email protected]
This study of generating captions for images has the following significance.
• The experiments are based on transfer learning coupled with a convolutional neural network (CNN).
• We aim to boost the model performance by making subtle changes to the block diagram.
• The objective is to produce semantically and syntactically correct captions for the input images by using phrases as elementary units instead of words.
Motivation
This problem is immensely useful in real-world applications. We list below a few applications where this study can be applied:
• Self-driving cars: by automatically generating a caption of the scene around the car, the self-driving system would become truly autonomous.
• Aid to the blind: a product that converts the scene around a blind person into text, followed by text-to-speech, could guide them while walking on the road and fulfil a lot of aspirations.
• Google image search: like Google text search, image search could become popular if an image is first transformed into a caption and then the underlying text is searched.
2 Related Studies
Different techniques for image captioning exist; they are either retrieval based or template based. Recently, deep learning based captioning has become very popular due to the quality and appropriateness of the resulting textual descriptions. Deep learning based attention mechanisms also deliver promising results in captioning [11]. Most of the models are encoder-decoder based, and LSTM and bidirectional LSTM networks are used as the decoder in most systems [12]. Similarly, for encoding, VGG16 and ResNet50 are employed for their effectiveness in vectorising images [13].
A few studies on image captioning that have used deep learning for image processing and text description are summarized in Table 1.
3 Methodology
We used a combined CNN-RNN model to extract the features from the image and text; further, we used an evaluation model to check the accuracy of the proposed model, and the performance of the model at each epoch was tracked with the help of the error rate. We use a top-down approach and transfer learning to extract the features, to train the model, and to obtain accurate captions for the image. In fact, the concept of transfer learning is applied twice in our model: InceptionV3 for extracting features from images and GloVe for extracting features from the text/captions for better accuracy. Finally, we test the model with some test images to assess its accuracy. The detailed methodology consists of the following steps:
• Data collection.
• Data cleaning and pre-processing. The result of pre-processing is a vocabulary of 1652 unique words from the training dataset.
• We employed the InceptionV3 transfer learning model and encoded all the training and testing images that are input to our model (a sketch of this encoding step follows after this list).
• After removing the stop-words during data cleaning, we have 7578 words in our vocabulary.
• We also used a transfer learning model (GloVe) to extract the features from our pre-processed text data.
• Then we built and trained our network/model. Finally, we evaluated the performance on the test data.
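The following hedged Keras/TensorFlow sketch shows one way to perform the InceptionV3 encoding step: the classification head is dropped and the 2048-dimensional bottleneck activations are used as the image feature vector. The image path is hypothetical, and the exact layer used in the chapter's model is not specified here.

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

base = InceptionV3(weights="imagenet")
encoder = Model(base.input, base.layers[-2].output)   # drop the final softmax layer

def encode(path):
    # Load, resize to InceptionV3's expected 299x299 input and preprocess.
    img = image.load_img(path, target_size=(299, 299))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return encoder.predict(x, verbose=0).reshape(2048)

vec = encode("flickr8k/example.jpg")   # hypothetical image path
print(vec.shape)                       # (2048,)
```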
3.1 Dataset
We have utilised the Flickr8k dataset, which contains around 8000 images, of which 6000 are used for training the model, 1000 for validating the model, and the remaining 1000 for testing the model in order to determine its efficiency. Each image has five captions (Fig. 2).
Figure 3 exhibits a few sample images from the Flickr8k dataset. As shown in Fig. 4, each individual image has five different captions.
The Flickr dataset is loaded into the repository, and the data are then pre-processed by removing extra whitespace, punctuation, and other distractions. A CNN is used for encoding: the input image is fed to the CNN to extract the features, and after the features are processed by a series of layers, the last hidden state of the CNN is connected to the decoder. In this framework, an RNN serves as the decoder, performing language modelling at the word level. A schematic diagram of the encoder-decoder based image captioning process is shown in Fig. 2.
Here we have used the pretrained InceptionV3 model to extract the features from the images. Inception v3 is a widely used image recognition model which has shown a remarkable accuracy of 98.1% on the standard ImageNet dataset.
The architecture of InceptionV3 is depicted in Fig. 5.
The encoding and decoding processes, the detailed layers of those models, and the parameters involved are represented in Figs. 6 and 7, respectively. A summary of the caption model, depicting the total parameters trained by the proposed model and the detailed network layers, is shown in Fig. 8.
4 Result
The main objective is to predict the caption for the image. For prediction, we applied an efficient predictive model using deep learning techniques, focussing on the model's ability to find a suitable caption for a given image in the dataset.
For evaluating the calibre of the generated text, we used BLEU (Bilingual Evaluation Understudy), since it works by matching each generated text against a set of reference texts composed by humans; it yields a score which reflects the overall quality of the generated text. We achieved a BLEU score of 0.645 for the considered dataset.
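A small sketch of BLEU evaluation with NLTK follows: each generated caption is compared against the image's human-written reference captions. The tokenized captions below are toy examples, not drawn from the Flickr8k experiments.

```python
from nltk.translate.bleu_score import corpus_bleu

# One image: two reference captions and one generated (candidate) caption.
references = [
    [["a", "dog", "runs", "on", "the", "grass"],
     ["a", "brown", "dog", "is", "running", "outside"]],
]
candidates = [["a", "dog", "is", "running", "on", "grass"]]

print("BLEU:", corpus_bleu(references, candidates))
```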
To test the effectiveness of the designed model, we applied it to a few images from the Flickr8k dataset and exhibit the output captions obtained in Figs. 9, 10, 11, 12 and 13.
5 Conclusion
In this chapter, we have executed the image captioning task by integrating two deep learning techniques, i.e. CNN with RNN. For training the encoder-decoder model, we used the Flickr8k dataset. The trained model achieved state-of-the-art performance when tested with unseen images of the dataset. The efficiency of content-based image retrieval is assessed by the quality of the textual description of the image. This image caption generation can widen the scope of application areas such as medicine, security, and other fields where the underlying image speaks a lot and has some implicit meaning. Moreover, the image captioning framework can automate and promote large-scale image annotation, which can lead to video captioning and video dialog.
References
1. Sharma, H., Agrahari, M., Singh, S. K., Firoj, M., & Mishra, R. K. (2020). Image captioning:
A comprehensive survey. In 2020 International Conference on Power Electronics & IoT
Applications in Renewable Energy and its Control (PARC) (pp. 325–328). IEEE.
2. Stefanini, M., Cornia, M., Baraldi, L., Cascianelli, S., Fiameni, G., & Cucchiara, R. (2022).
From show to tell: a survey on deep learning-based image captioning. IEEE Transactions on
Pattern Analysis and Machine Intelligence.
3. Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). A comprehensive survey of
deep learning for image captioning. ACM Computing Surveys (CsUR), 51(6), 1–36.
4. Chohan, M., Khan, A., Mahar, M. S., Hassan, S., Ghafoor, A., & Khan, M. (2020). Image
captioning using deep learning: A systematic. Image, 11(5).
5. Tiwari, R. S., Das, T. K., Srinivasan, K., & Chang, C. Y. (2022). Conceptualising a channel-
based overlapping CNN tower architecture for COVID-19 identification from CT-scan images.
Scientific Reports, 12(1), 1–15.
6. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain
tumor classification. Applied Sciences, 10(14), 4915.
7. Das, T. K., Roy, P. K., Uddin, M., Srinivasan, K., Chang, C. Y., & Syed-Abdul, S. (2021). Early
tumor diagnosis in brain MR images via deep convolutional neural network model. Computers,
Materials and Continua, 68(2), 2413–2429.
8. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of
retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions
of Electrical Engineering, 44(1), 505–518.
9. Roy, S. S., Hsu, C., Samaran, A., Goyal, R., Pande, A., et al. (2023). Vessels segmentation
in angiograms using convolutional neural network: A deep learning based approach. CMES-
Computer Modeling in Engineering & Sciences, 136(1), 241–255.
10. Das, T. K., Chowdhary, C. L., & Gao, X. Z. (2020). Chest X-ray investigation: a convolutional
neural network approach. Journal of Biomimetics, Biomaterials and Biomedical Engineering,
45, 57–70. Trans Tech Publications Ltd.
11. Zohourianshahzadi, Z., & Kalita, J. K. (2022). Neural attention for image captioning: Review
of outstanding methods. Artificial Intelligence Review, 55(5), 3833–3862.
12. Wang, C., Yang, H., Bartz, C., & Meinel, C. (2016). Image captioning with deep bidirectional
LSTMs. In Proceedings of the 24th ACM International Conference on Multimedia (pp. 988–
997).
13. Rampal, H., & Mohanty, A. (2020). Efficient CNN-LSTM based image captioning using neural
network compression. Preprint retrieved from arXiv:2012.09708.
14. Chen, X., & Zitnick, C. L. (2014). Learning a recurrent visual representation for image caption
generation. Preprint retrieved from arXiv:1411.5654.
15. Sharma, H., & Jalal, A. S. (2020). Incorporating external knowledge for image captioning using
CNN and LSTM. Modern Physics Letters B, 34(28), 2050315.
16. You, Q., Jin, H., Wang, Z., Fang, C., & Luo, J. (2016). Image captioning with semantic attention.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4651–
4659).
17. Rampal, H., & Mohanty, A. (2020). Efficient CNN-LSTM based image captioning using neural
network compression. Preprint retrieved from arXiv:2012.09708.
18. Arnav, J. H., & Pulkit, M. (2018). Image captioning using deep learning.
19. Yao, T., Pan, Y., Li, Y., Qiu, Z., & Mei, T., (2017). Boosting image captioning with attributes.
In Proceedings of the IEEE International Conference on Computer Vision (pp. 4894–4902).
20. Singh, Y. P., Ahmed, S. A. L. E., Singh, P., Kumar, N., & Diwakar, M. (2021). Image captioning
using artificial intelligence. In Journal of Physics: Conference Series (Vol. 1854, No. 1,
p. 012048). IOP Publishing.
21. Wang, C., Yang, H., Bartz, C., & Meinel, C. (2016). Image captioning with deep bidirectional
LSTMs. In Proceedings of the 24th ACM International Conference on Multimedia (pp. 988–
997).
Vehicle Over Speed Detection System
1 Introduction
Every year, many individuals die all across the world, and one of the most common causes of death is vehicle accidents. Accidents not only kill people but also injure a large number of them. Among the several causes of accidents, high speed is the most important one, so high-speed vehicles must be managed. As a result, different government organisations, academic institutions, and automobile manufacturers have begun various studies and projects to lower the likelihood of accidents and provide safety to passengers and drivers. Several researchers have used different kinds of mechanisms to detect vehicle over-speeding on highways, such as VANET technology connected to a cloud server [1], video-based region of interest (ROI) analysis [2], and electronic toll collection data based speed prediction [3]. To manage high-speed vehicles on the highway, the Tamil Nadu government planned to install an over-speed detecting device in the toll plaza. Figure 1 depicts a block diagram of over-speed detection in a toll plaza. This architecture is made up of a vehicle detection system, a common cloud server that is linked to an RTO server, and an over-speed detection system.
K. Ganesan (B)
Professor, Higher Academic Grade, School of Information Technology and Engineering, Vellore
Institute of Technology (VIT), Vellore 632014, Tamil Nadu, India
e-mail: [email protected]
N. S. Manikandan
Senior System Architect, TIFAC-CORE Automotive Infotronics, Vellore Institute of Technology,
Vellore 632014, Tamil Nadu, India
V. Sugumaran
Distinguished Professor of Management Information Systems, School of Business
Administration, Oakland University, Rochester, MI, USA
Fig. 1 Block diagram of the proposed system for high speed detection
and character segmentation. To recognize characters, a two-layer feed-forward back-propagation ANN is used; the results show an overall accuracy of 89.5%. To handle Indian license plate irregularities, Ravirathinam et al. [20] built a pipeline using a number of cutting-edge Faster Regional Convolutional Neural Networks to effectively address the Indian situation in a variety of scenarios. Since there is no publicly accessible dataset for Indian licence plates, they created a balanced dataset using frames from videos and images from mobile devices, accounting for all the irregularities. Their pipeline achieved an overall total correctness of 88.5% and a partial correctness of 10% for Indian plates. The overall correctness increased to 91% with the addition of a new heuristics system. The accuracy of licence plate detection for all kinds of vehicles was 94.98%. Sometimes the extracted license plate information is incorrect; for OCR corrections in chaotic Indian traffic videos with complicated licence plate patterns, Singh et al. [18] proposed a modular framework. These patterns are produced by a cutting-edge deep learning model trained on video frames. Because it reads text from videos rather than single images, the model incorporates multi-frame consensus into the framework for generating suggestions. Their human-interactive framework uses an object detector and a tracker to first separate the multi-vehicle videos into multiple clips, each containing a single vehicle, to aid in the correction process. The framework then offers recommendations for a single vehicle using multi-frame consensus. The user is given interactive suggestions that show only certain extracted clips, allowing them to quickly and easily verify or correct the predictions. This high-quality output can be used to continuously update a sizable surveillance database, which will improve the accuracy of deep models in difficult real-world scenarios.
In view of the cloud platform, an IoT-based system that uses two detection points
with surveillance cameras to measure the average speed between them was proposed
by Khan et al. [21]. To enforce speed limits, the measured data is sent to the cloud
for additional processing. Entry and exit points are used to detect any uncertainty
in a particular area. The failure of a car, for instance, to reach the end point after
passing through the entrance point, can be highlighted. The system is made up of a
mobile phone application and a web network that exchange real-time data, including
information about passing vehicles like entrance time, pictures, and license plate
registration numbers. Such a system has the advantages of requiring little human
involvement, requiring fewer speed guns to be installed, and monitoring vehicles
even when they are not in the camera’s field of view.
The speed limit between two toll gates is determined by traffic density or govern-
ment traffic rules and regulations. However, the roads between the toll plazas are
generally curvy and have speed limits. The majority of cloud-based vehicle over-
speed detection systems are unaware of road curvature. In terms of extracting data
about horizontal curves from road GIS maps, Li et al. [22] present a fully automated
method. Their proposed methodology aims at four different things: (a) Regardless of
the type of curve, each road’s curves in the selected road’s surface layers are identi-
fied; (b) each curve is automatically classified as either simple or compound; (c) Each
simple curve’s radius, degree of curvature, length, and compound curve’s radius are
all automatically determined; and (d) curve characteristics and layers are automati-
cally created in the GIS for all detected curves. 96.7% of curves were correctly iden-
tified and their geometric information was computed using the proposed technique.
However, the existing road curvature extraction method is unaware of curvature noise
and curvature in hilly terrain.
Thus, existing over-speed detection systems have some gaps, such as not being aware of the curvature on the highway and not being aware of curvature noise. To bridge this gap, the proposed system includes the following features:
• The YOLO object detection model has been proposed for vehicle detection and
vehicle type extraction.
• An image processing technique is used to locate and extract licence plates from
detected vehicle images.
• The information on the localised licence plate is extracted using the CRNN deep
learning text extraction model.
• The proposed curvature aware travel time estimation model calculates the travel
time between two toll plazas, and the cloud-based system detects over-speed of
vehicles.
The remaining portions of this paper are arranged as follows. Section 2 describes the vehicle detection and license plate extraction system, which is subdivided into vehicle detection and type classification, license plate localization, and license plate recognition, as well as the travel time estimation and over-speed detection system. The speed detection system is further subdivided into a new curve finding method, curve speed limit database creation, curve-aware travel time estimation, and the vehicle over-speed detection system. Section 3 discusses the results of vehicle detection, license plate localization and text extraction, the new curve finding method, curve-aware travel time estimation, and vehicle over-speed detection. Finally, Sect. 4 provides the conclusion and future work.
2 Proposed Model
Figure 2 depicts the proposed system’s architecture. This system has been subdivided
into three subsystems. The first subsystem detects vehicles at toll gates and extracts
license plate information as well as vehicle type. The second subsystem uses a road
curvature extraction module, a curve aware speed limitation module, and a curvature
aware travel time estimator to characterize the curvature aware journey time between
two toll gates. Over speed detection is the third and final subsystem. It is made up of
a toll gate system and a common cloud server infrastructure. The three subsystems
are briefly described below.
This system consists of three modules: vehicle detection and vehicle type classifica-
tion, license plate localization, and license plate recognition. Details of each of these
modules are provided below.
YOLO [23] divides the image into M × M grids by applying a single CNN to the
entire image. For each grid cell, B bounding boxes and their associated confidence
scores are predicted. The class confidence score assesses these bounding boxes using
the formulas given below.
Class confidence score = conditional class probability × box confidence score.
It assesses the level of certainty in both classification and localization. The
mathematical definitions are as follows:

box confidence score ≡ Pr(object) · IoU
conditional class probability ≡ Pr(class_i | object)
class confidence score ≡ Pr(class_i) · IoU
where Pr(object) denotes the likelihood that an object is present in the box, and IoU is
the intersection over union between the predicted box and the ground truth. The
probability that an object belongs to a given class i, given its presence, is
Pr(class_i | object), and the overall probability that an object belongs to class i is
Pr(class_i).
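To make the relationship between these quantities concrete, the following Python sketch computes the IoU of two boxes and the resulting confidence scores for a single predicted box. The box format and the numeric values are illustrative assumptions, not the chapter's implementation.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# Hypothetical outputs for one predicted box (illustrative values)
pr_object = 0.9                                    # Pr(object)
pr_class_given_object = np.array([0.7, 0.2, 0.1])  # Pr(class_i | object), e.g. car/bus/truck
predicted_box = (50, 60, 200, 180)
ground_truth_box = (55, 65, 210, 175)

box_confidence = pr_object * iou(predicted_box, ground_truth_box)  # Pr(object) * IoU
class_confidence = pr_class_given_object * box_confidence          # Pr(class_i) * IoU
print(class_confidence)
```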
YOLO resizes an input image to 448 × 448 pixels. The image is then passed through a
convolutional network, yielding a 7 × 7 × 30 tensor. The tensor contains: (1) the
coordinates of the bounding box rectangles, and (2) the probability distribution over all
classes for which the system has been trained. Class labels whose confidence scores
(probabilities) are below 30% are eliminated.
To calculate the loss when comparing predictions to the ground truth, YOLO uses the
sum-squared error. The loss function comprises the classification loss, the localization
loss (the error between the predicted boundary box and the ground truth), and the
confidence loss, which also penalizes the confidence scores of boxes that do not contain
any object. The overall formula is:
$$
\begin{aligned}
\lambda_{coord} &\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
+\ \lambda_{coord} &\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
+ &\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
+ \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
+ &\sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
\tag{2}
$$
Finding the location of the license plate in the vehicle image is a critical task. Grayscale
conversion, thresholding, and morphological operations such as dilation and erosion are
used to localize plates. A Canny edge detector is used to detect the license plate edges
and crop the located license plate from the vehicle image [24].
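A minimal OpenCV sketch of this localization pipeline is given below. The thresholds, kernel sizes, and the aspect-ratio filter are illustrative assumptions rather than the chapter's exact parameters.

```python
import cv2

def localize_license_plate(vehicle_bgr):
    """Return a cropped candidate plate region from a detected-vehicle image (or None)."""
    gray = cv2.cvtColor(vehicle_bgr, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    morphed = cv2.dilate(cv2.erode(thresh, kernel), kernel)   # erosion followed by dilation
    edges = cv2.Canny(morphed, 100, 200)                      # Canny edge detection
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for cnt in sorted(contours, key=cv2.contourArea, reverse=True):
        x, y, w, h = cv2.boundingRect(cnt)
        if h > 0 and 2.0 < w / h < 6.0:            # plate-like aspect ratio (assumed range)
            return vehicle_bgr[y:y + h, x:x + w]   # cropped plate candidate
    return None
```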
The CNN, Bi-directional LSTM, and CTC layer that make up the CRNN [25] can be
viewed as an encoder-decoder structure. A feature sequence encoder known as CNN
creates image feature sequences. Character sequences are produced by a decoder
made up of the bi-directional LSTM and CTC layers.
To maintain the original aspect ratio, the CNN sets the input image's width and height to
(W × 32)/H and 32 pixels, respectively, where W and H are the original image's width
and height. Because a character is tall and thin, with height greater than width, the CNN
uses a 2 × 1 stride rather than a 2 × 2 stride for some of the pooling layers. As a result,
each pixel in the final feature map corresponds to a thin and tall receptive field in the
original image. The input image is downsampled using two pooling layers with a 2 × 2
stride and three pooling layers with a 2 × 1 stride. The final dimension of the feature
map is b × 1 × [(W × 8)/H] × C, where b is the batch size, 1 is the height, (W × 8)/H is
the width, and C represents the number of channels. The structure of the CNN used for
feature extraction is displayed in Table 1.
The CRNN decoder is composed of the Bi-directional LSTM layer and the CTC layer.
The Bi-directional LSTM receives its input from the feature map's column vectors. Its
output is a probability matrix of size (W × 8)/H × C, where C is the number of character
labels (26 English uppercase letters, 26 English lowercase letters, and a space); it
represents the probabilities of the characters in each column vector. The recovered
feature map has a width of (W × 8)/H. The likelihood of the label sequence is determined
by applying the CTC layer to the Bi-directional LSTM's output. During training, the
likelihood of the label sequence is given by the conditional probability defined in the
CTC layer, and the negative log-likelihood of this conditional probability serves as the
loss function for training the network. The CTC layer sums the probabilities of all paths
that map to the true label sequence. For example, the paths 'hee-ll-o' and 'hh-ee-ll-oo'
(where '-' denotes the blank symbol) both collapse, after removing duplicates and blanks,
to the label sequence 'helo'. At test time, the recognition result is the character sequence
with the highest probability.
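The collapsing rule applied by the CTC layer can be illustrated with a short greedy decoder: consecutive duplicate symbols in a path are merged and blank symbols are removed. This sketch only mirrors the collapsing step described above, not the full probability computation.

```python
def ctc_collapse(path, blank='-'):
    """Collapse a CTC path: merge consecutive duplicates, then drop blank symbols."""
    out = []
    prev = None
    for ch in path:
        if ch != prev:          # merge runs of identical symbols
            if ch != blank:     # drop the blank symbol
                out.append(ch)
        prev = ch
    return ''.join(out)

print(ctc_collapse('hee-ll-o'))     # -> 'helo'
print(ctc_collapse('hh-ee-ll-oo'))  # -> 'helo'
```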
This system has four modules: road curvature identification, curve speed restriction
declaration, curve aware travel time computation, and vehicle overspeed detection.
They are described in detail below.
A path from the source to the destination is constructed. The path, as shown in Fig. 3a,
is made up of a series of segment points (S1 to S9), with adjacent segment points
connected by straight lines. (Note: In India, all vehicles drive on the left side [26].)
The equations for detecting a curve using the sequence of segment points are shown
below. Before calculating the radius of curvature, we must first calculate the great-circle
distance between two points using the Haversine formula.
$$ \text{radius} = \frac{a \times b \times c}{\sqrt{(a+b+c) \times (b+c-a) \times (c+a-b) \times (a+b-c)}} \tag{6} $$
According to the Indian Roads Congress [27, 28], a vehicle can travel at a speed
of 70 to 80 km/h in a 1000 m radius curve on the Indian highways [26]. So, we
assume that the maximum radius of an Indian road curve is 1000 m. Using Eq. (6)
at segment points S1, S2, and S3 from Fig. 3a, we find that the radius of these three
segment points is more than 1000 m because they are nearly collinear. So, we check
the next three adjacent segment points S2, S3, and S4. The radius
of these three-segment points (S2, S3, and S4) is less than 1000 m because it looks
like a curve. These three segment points’ radius values are recorded in the radius list,
and this procedure is repeated for subsequent segment points until we reach the final
set of segment points along the path.
The segment points S3–S9 in Fig. 3a yield six curve radii R1, R2, R3, R4, R5, and R6.
This information is saved in the radius list. After that, the average curve radius
(R1 to R6) of the segment points S2 to S9 is calculated. The detected curve of the
path in Fig. 3a is shown in Fig. 3b (explained in Algorithm 1).
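A minimal Python sketch of this sliding three-point procedure is shown below: haversine distances between the three points of each window give the side lengths a, b, c (assumed here, since the excerpt does not define them explicitly), Eq. (6) gives the circumscribed radius, and windows whose radius falls below the assumed 1000 m threshold are recorded. Function and variable names are illustrative.

```python
import math

def haversine_m(p1, p2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    R = 6371000.0
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def circum_radius(a, b, c):
    """Radius of the circle through three points with pairwise distances a, b, c (Eq. 6)."""
    denom = (a + b + c) * (b + c - a) * (c + a - b) * (a + b - c)
    return float('inf') if denom <= 0 else (a * b * c) / math.sqrt(denom)

def find_curve_windows(segment_points, max_radius=1000.0):
    """Slide over consecutive triples of segment points and keep curve-like windows."""
    radius_list = []
    for i in range(len(segment_points) - 2):
        s1, s2, s3 = segment_points[i:i + 3]
        r = circum_radius(haversine_m(s1, s2), haversine_m(s2, s3), haversine_m(s1, s3))
        if r < max_radius:                 # nearly collinear triples give r > 1000 m
            radius_list.append((i, r))
    return radius_list

# Illustrative segment points (lat, lon) along a path
path = [(12.9320, 79.1385), (12.9330, 79.1400), (12.9345, 79.1405), (12.9360, 79.1402)]
print(find_curve_windows(path))
```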
The curve list keeps track of the curve’s starting segment point S2, ending segment
point S8, mid-segment point S6, and computed average curve radius. The method is
utilized with the route between the origin and destination, and the found curves and
their attributes (curve starting point, curve ending point, curve mid-point, and average
curve radius) are saved in the curve list. Figure 4 shows that the source location is
on a highway, but the destination location is on a mountainous (hilly) terrain. The
curves on the highway are always large. A single curve that is 1000 m long, as seen
in Fig. 4 (top red solid circle), is a good example. The mountainous landscape here
features several hairpin bends. These curves have a radius of 50 to 150 m, and some
curves are 500 m long. As the assumed maximum curve radius of roads in India is
1000 m, multiple hairpin curves form a single curve, as seen in Fig. 4 (bottom two
red solid circles).
$$ \text{Bearing}(\theta) = \tan^{-1}\!\left(\frac{\sin\Delta\lambda \times \cos\varphi_2}{\cos\varphi_1 \times \sin\varphi_2 - \sin\varphi_1 \times \cos\varphi_2 \times \cos\Delta\lambda}\right) \tag{7} $$
Fig. 5 a Result of proposed curve finding algorithm b Curve in tunnel road (Twin tunnel Mumbai)
The Indian Roads Congress (IRC) articles [27, 28] provide the basis for the development
of the curve speed restriction database for Indian roads. Following the IRC articles,
Table 2 maps the planned speed limit to the curve radius. The radius ranges are
established based on the super-elevation of the curve.
This module determines the travel time between two toll gates, as described by the
following equations.
$$ S_{rd} = T_{rd} - C_{rd} \tag{8} $$
$$ S_t = \frac{S_{rd}}{d_s} \tag{9} $$
$$ C_{t1} = \frac{C_{1d}}{C_{1s}}, \quad C_{t2} = \frac{C_{2d}}{C_{2s}}, \quad \ldots, \quad C_{tn} = \frac{C_{nd}}{C_{ns}} $$
$$ C_{at} = S_t + C_{t1} + C_{t2} + \cdots + C_{tn} \tag{10} $$
where the straight road distance $S_{rd}$ is obtained by subtracting the total curvature road
distance $C_{rd}$ from the toll road distance $T_{rd}$. The time taken to travel only on the
straight road, $S_t$, is the straight road distance $S_{rd}$ divided by the declared speed $d_s$.
Here, $C_{t1}, C_{t2}, \ldots, C_{tn}$ are the times to travel over each curvature, computed by
dividing the curvature distances $C_{1d}, C_{2d}, \ldots, C_{nd}$ by the corresponding curvature
speed restrictions $C_{1s}, C_{2s}, \ldots, C_{ns}$. Finally, the curvature-aware travel time
$C_{at}$ is obtained by adding the straight-road travel time $S_t$ to every curvature travel
time $C_{t1}, C_{t2}, \ldots, C_{tn}$.
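A direct transcription of Eqs. (8)–(10) into Python might look as follows; the units (distances in km, speeds in km/h, times in hours) and variable names are assumptions made for illustration.

```python
def curvature_aware_travel_time(toll_road_distance, declared_speed,
                                curve_distances, curve_speed_limits):
    """Travel time between two toll gates accounting for curve speed restrictions.

    toll_road_distance : total distance between the toll gates, Trd (km)
    declared_speed     : declared speed on straight road sections, ds (km/h)
    curve_distances    : list of curve lengths C1d..Cnd (km)
    curve_speed_limits : list of curve speed restrictions C1s..Cns (km/h)
    Returns the curvature-aware travel time Cat in hours.
    """
    curvature_road_distance = sum(curve_distances)                          # Crd
    straight_road_distance = toll_road_distance - curvature_road_distance   # Eq. (8)
    straight_time = straight_road_distance / declared_speed                 # Eq. (9)
    curve_times = [cd / cs for cd, cs in zip(curve_distances, curve_speed_limits)]
    return straight_time + sum(curve_times)                                 # Eq. (10)

# Example: 40 km between toll gates, 100 km/h declared speed, two curves
print(curvature_aware_travel_time(40, 100, [1.2, 0.8], [60, 40]))
```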
Figure 7 depicts the block diagram of the over-speed detection system. At a toll gate,
every booth has an over-speed detection system. The system connects to a camera
situated outside the toll booth and focused on the vehicle, which is used for detecting the
vehicle type and extracting the license plate information. The system also connects to a
cloud server, which stores vehicle information with timestamps and in turn connects to
the RTO server.
The system uses the camera to detect the vehicle type, extract the license plate, and add
the current timestamp. Beforehand, the system downloads from the cloud server the
vehicle information and entry timestamps recorded at the previous toll gate, along with
the RTO information of the vehicle. When a vehicle enters the toll booth, the over-speed
detection system in the booth checks the vehicle against the information downloaded
from the cloud server. If there is no match, the vehicle is considered new, and its
information together with the current timestamp is added to the cloud server. If the
vehicle information matches the downloaded information, the system checks for
over-speed, which is determined using the following formula.
$$ O_s = V_{tt} < C_{at} \tag{12} $$
where $V_{tt}$ is the vehicle travel time, calculated by subtracting the previous toll booth
timestamp $P_{bt}$ from the current toll booth timestamp $C_{bt}$. The over-speed flag $O_s$
indicates whether the vehicle travel time is less than the declared curve-aware travel
time $C_{at}$ between the toll gates (Eq. 10). If vehicle over-speed is detected, the vehicle
is entered into the violator database and fined by the field inspector.
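The per-booth over-speed check of Eq. (12) reduces to a timestamp comparison against the records downloaded from the previous toll gate. The following sketch assumes a simple in-memory record keyed by plate number; the names and data structures are illustrative, not the deployed system's API.

```python
from datetime import datetime

def check_over_speed(plate, vehicle_type, now, previous_gate_records, curve_aware_time_s):
    """Return True if the vehicle travelled faster than the curve-aware travel time allows.

    previous_gate_records : dict {plate: entry_timestamp} downloaded from the cloud server
    curve_aware_time_s    : dict {vehicle_type: Cat in seconds} for this pair of toll gates
    """
    if plate not in previous_gate_records:
        previous_gate_records[plate] = now        # first entry: just log the timestamp
        return False
    travelled_s = (now - previous_gate_records[plate]).total_seconds()  # Vtt = Cbt - Pbt
    return travelled_s < curve_aware_time_s[vehicle_type]               # Eq. (12)

records = {"TN23AB1234": datetime(2022, 5, 1, 10, 0, 0)}
cat = {"car": 30 * 60, "truck": 40 * 60}          # assumed Cat values in seconds
print(check_over_speed("TN23AB1234", "car", datetime(2022, 5, 1, 10, 20, 0), records, cat))
```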
3 Result
The vehicle over-speed detection system's testbed was set up at two toll plazas in
Tamil Nadu, India: Pallikonda and Ranipet. The results from this testbed are described
below.
Using the camera outside the toll booth, which focused on vehicles moving line by line,
YOLO detected each vehicle and classified its type. The YOLO result is shown in Figs. 8
and 9 at the top left (Vehicle detection) and bottom left (Detected vehicle type). The
detected vehicle image was then sent to the license plate localization process described
in Sect. 2.1.2; the bottom right of Figs. 8 and 9 (Number plate cropped) shows the output
of this license plate localization.
The cropped license plate image is then fed into the CRNN text recognition algorithm,
which produces the extracted text of the license plate, as shown in the bottom right
(Predicted num) of Figs. 8 and 9. The output of this vehicle detection and license plate
extraction system, namely the vehicle type and the recognized license plate text, is sent
to the vehicle over-speed detection module.
The detected curves are assessed using the Type 1 error, Type 2 error, and Type 2 error
ratio (TIIR) metrics [22]. A Type 2 error occurs when the detected curve extends beyond
the ground truth curve by an additional segment. A Type 1 error occurs when a curve is
not detected at all, or when the detected curve misses 25, 50, or 75% of the ground truth
curve.
Figure 10a and b show 25% and 50% Type 1 curve identification errors, and Fig. 10c
shows a Type 2 error. Type 1 errors are the risky ones.
The Type 2 error ratio is defined as TIIR = m/n, where m is the number of Type 2 errors
and n is the number of ground truth curves.
Table 3 displays the Type 1, Type 2, TIIR, actual, predicted curve numbers, the
overall distance between source and destination, performance delay, types of curves
predicted, and noise corrected curve numbers for locations in India (rows 1 to 5),
France (row 6), and the United States (row 7). The data from Google Maps Road
segments is the same all over the world. As a result, the proposed method can extract
curves from road segments anywhere in the world. One minor distinction is that
vehicles in India travel on the left side of the road, whereas vehicles in France and the
United States drive on the right side. As a result, depending on the vehicle travelling
direction, the starting and ending point of the curve varies from country to country.
For each of the seven rows of Table 3, the Google Maps road segment data from source
to destination was collected and the starting and ending point of each curve was manually
identified; this ground truth data was compared with the curves recognized by the
proposed model. Here, one highway road (row 1), a hilly terrain road (row 2), a university road
Fig. 10 a 25% type 1 curve identification error b 50% type 1 curve identification error c type 2
error
(row 3), and a tunnel road (row 5) have all been tested in India. The curves observed on
a route with a highway starting location and a hilly terrain destination are shown in
Table 3, row 4. The proposed method can also extract curves from tunnel roads; Fig. 5b
shows the curve of row 5 in the twin tunnel in Mumbai.
Noise is the result of a GPS segment being drawn incorrectly over an existing segment.
This form of noise may be corrected using the proposed approach, as shown in Fig. 6b.
However, such noise frequently misrepresents a straight road as a curving road. Because
a hilly terrain road includes more noise segments than a road segment in the plains, this
kind of error is classified as a Type 2 error. It is not dangerous, because one is still
alerted if a curve exists. In contrast, if the proposed curve detection algorithm fails to
detect a curve with a radius of less than 60 m, that is dangerous, because such roads
include sharp or blind bends and are prone to accidents. This type of curve was
successfully identified using the proposed method. The final column (column 11) of
Table 3 displays the number of curves predicted at the specified locations by the existing
method [22]. The existing method lacks the capability of removing curve noise, and
therefore noise is declared as a curve.
Table 4 details the study of the proposed curve-aware travel time estimation; column
2 in the table shows the latitude and longitude of the source and destination toll
plazas, column 3 lists the information about the toll plaza and the type of highway,
column 4 lists the distance between the two toll plazas, column 5 lists the number
of curves in the highways that are a result of Sect. 2.2.1, column 6 lists the declared
speed of the car and truck, column 7 displays the declared reaching times for cars
and trucks between toll plazas, while column 8 displays the results of the Sect. 2.2.3
curve aware reaching times.
Ten booth systems and one server system are used at each toll gate in the Vehicle
over Speed Detection System architecture. As shown in Fig. 11a, the GUI for each
toll booth system is connected to a camera to automatically gather information about
the vehicle’s type and license plate number. This information can then be updated
or corrected by a toll booth attendant. When a user clicks the check button, these
details are sent to the local server system, which then uses its local database to check
the vehicle number and type. If this is the vehicle's first entry, the system records the
vehicle details and a timestamp (date and time) in the local database and pushes the data
to the next toll gate via the cloud server. Details of a vehicle that has just entered the
Pallikonda toll gate are shown in Fig. 11b. If the vehicle was previously registered at a
toll gate, its information is already stored locally, so the local server system calculates
Table 3 Curve detection research and analysis

S.no | Source and destination | Total distance | No. of actual curves | No. of predicted curves | Type 1 error | Type 2 error | No. of corrected noise on path | Type of detected curve | TIIR | No. of curves predicted by ref. [22]
1 | 12.932459, 79.138573 & 13.047688, 80.081534 | 113 km | 18 | 19 | 2 (1–25%, 1–50%) | 1 | 2 | Simple curve = 17, compound = 2, reverse = 0, sharp = 0 | 0.05 | 20
2 | 12.600237, 78.596748 & 12.593029, 78.631709 | 13 km | 48 | 48 | 1–25% | 9 | 33 | Simple curve = 33, compound = 0, reverse = 2, sharp = 13 | 0.19 | 93
3 | 12.968459, 79.155885 & 12.971857, 79.163570 | 1.5 km | 6 | 6 | 0 | 1 | 8 | Simple curve = 2, compound = 0, reverse = 1, sharp = 3 | 0.17 | 15
4 | 12.931632, 79.135380 & 12.593029, 78.631709 | 91.5 km | 64 | 64 | 1–25% | 10 | 33 | Simple curve = 44, compound = 0, reverse = 2, sharp = 18 | 0.15 | 98
5 | 19.059460, 72.913796 & 18.949226, 72.840700 | 10.7 km | 5 | 5 | 0 | 1 | 0 | Simple curve = 5, compound = 0, reverse = 2, sharp = 18 | 0.05 | 5
6 | 47.082396, 3.929467 & 47.136291, 4.016592 | 12.5 km | 40 | 45 | 4 (2–25%, 2–50%) | 6 | 7 | Simple curve = 28, compound = 5, reverse = 6, sharp = 6 | 0.13 | 52
7 | 37.202597, −87.010464 & 37.312765, −86.614955 | 46.9 km | 24 | 25 | 3 (1–25%, 2–50%) | 3 | 2 | Simple curve = 21, compound = 0, reverse = 2, sharp = 2 | 0.12 | 30
the average speed of the vehicle between the two toll gates; if the speed limit is exceeded,
a fine is assessed, as shown in Fig. 11c. A vehicle travelling at normal speed is simply
recorded with its timestamp in the local server, as shown in Fig. 11d. The Google Cloud
platform, which is used to store data on each toll gate vehicle, is depicted in Fig. 12.
Figure 13a demonstrates how to search the log cloud database: the log details can be
searched by date, time, vehicle number, or toll gate ID. Figure 13b shows how to search
the violator cloud database, where the violator details can be searched by vehicle
number, date, time, or toll gate ID.
Fig. 11 a License plate no. and vehicle type entry b Vehicle entered to toll plaza first time c Vehicle
over-speed detected d Vehicle passed between toll plazas in normal speed
3.5 Discussion
In this test-bed covering both toll plazas, a total of 3552 vehicles passed through all toll
booths during the two-hour test period. Figure 14 displays a bar graph of vehicle passes
broken down by booth. During the two hours of testing, two vehicles received fines for
exceeding the government-mandated speed limits of 100 km/h for cars and 80 km/h for
trucks. A larger number of vehicles would have received fines if the vehicle speed limit
had been set using curve-aware travel time estimation. According to Fig. 15, 13 to 14
vehicles would have been fined if the speed limit for cars were 90 km/h.
Testing was carried out for two hours at the Pallikonda and Ranipet toll plazas. The test
site was overseen by the Vellore branch of the RTO of the Tamil Nadu government.
Figure 16a depicts the RTO officials and inspector present at the Pallikonda toll plaza
during the test-bed period. The experts testing the vehicle over-speed detection system
at the Ranipet toll plaza are depicted in Fig. 16b.
Figures 14 and 15 (bar graphs): vehicle passes per booth (b1–b10) at the Pallikonda and
Ranipet toll plazas, and the number of vehicles that would be fined.
4 Conclusion
References
1. Nayak, R. P., Sethi, S., & Bhoi, S. K. (2018). PHVA: A position based high speed vehicle detec-
tion algorithm for detecting high speed vehicles using vehicular cloud. In 2018 International
Conference on Information Technology (ICIT). https://doi.org/10.1109/icit.2018.00054
2. Krishnakumar, B., Kousalya, K., Mohana, R., Vellingiriraj, E., Maniprasanth, K., & Krish-
nakumar, E. (2022). Detection of vehicle speeding violation using video processing techniques.
In 2022 International Conference on Computer Communication and Informatics (ICCCI).
https://doi.org/10.1109/iccci54379.2022.9740909
3. Zou, F., Ren, Q., Tian, J., Guo, F., Huang, S., Liao, L., & Wu, J. (2022). Expressway speed
prediction based on electronic toll collection data. Electronics, 11(10), 1613. https://doi.org/
10.3390/electronics11101613
4. Shen, J., Zhou, W., Liu, N., Sun, H., Li, D., & Zhang, Y. (2022). An anchor-free lightweight deep
convolutional network for vehicle detection in aerial images. IEEE Transactions on Intelligent
Transportation Systems.
5. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain
tumor classification. Applied Sciences, 10(14), 4915.
6. Samui, P., Roy, S. S., & Balas, V. E. (Eds.). (2017). Handbook of neural computation. Academic
Press.
7. Biswas, R., Vasan, A., & Roy, S. S. (2019). Dilated deep neural network for segmentation of
retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions
of Electrical Engineering, 1–14.
8. Rajput, S. K., Patni, J. C., Alshamrani, S. S., Chaudhari, V., Dumka, A., Singh, R., Rashid,
M., Gehlot, A., & AlGhamdi, A. S. (2022). Automatic vehicle identification and classification
model using the YOLOv3 algorithm for a toll management system. Sustainability, 14(15),
9163. https://doi.org/10.3390/su14159163
9. Wang, W., Yang, J., Chen, M., & Wang, P. (2019). A light CNN for end-to-end car license
plates detection and recognition. IEEE Access, 7, 173875–173883. https://doi.org/10.1109/acc
ess.2019.2956357
10. Huang, Q., Cai, Z., & Lan, T. (2021). A new approach for character recognition of multi-style
vehicle license plates. IEEE Transactions on Multimedia, 23, 3768–3777. https://doi.org/10.
1109/tmm.2020.3031074
11. Seo, T., & Kang, D. (2022). A robust layout-independent license plate detection and recognition
model based on attention method. IEEE Access, 10, 57427–57436. https://doi.org/10.1109/acc
ess.2022.3178192
12. Henry, C., Ahn, S. Y., & Lee, S. (2020). Multinational license plate recognition using general-
ized character sequence detection. IEEE Access, 8, 35185–35199. https://doi.org/10.1109/acc
ess.2020.2974973
13. Park, S., Yu, S., Kim, J., & Yoon, H. (2022). An all-in-one vehicle type and license plate
recognition system using YOLOv4. Sensors, 22(3), 921. https://doi.org/10.3390/s22030921
14. Alam, N., Ahsan, M., Based, M. A., & Haider, J. (2021). Intelligent system for vehicles number
plate detection and recognition using convolutional neural networks. Technologies, 9(1), 9.
https://doi.org/10.3390/technologies9010009
15. Alghyaline, S. (2022). Real-time Jordanian license plate recognition using deep learning.
Journal of King Saud University-Computer and Information Sciences, 34(6), 2601–2609.
https://doi.org/10.1016/j.jksuci.2020.09.018
16. Raghunandan, K. S., Shivakumara, P., Jalab, H. A., Ibrahim, R. W., Kumar, G. H., Pal, U., & Lu,
T. (2018). Riesz fractional based model for enhancing license plate detection and recognition.
IEEE Transactions on Circuits and Systems for Video Technology, 28(9).
17. Dalarmelina, N. D., Teixeira, M. A., & Meneguette, R. I. (2019). A real-time automatic plate
recognition system based on optical character recognition and wireless sensor networks for
ITS. Sensors, 20(1), 55. https://doi.org/10.3390/s20010055
18. Singh, P., Patwa, B., Saluja, R., Ramakrishnan, G., & Chaudhuri, P. (2019). StreetOCRCorrect:
An interactive framework for OCR corrections in chaotic Indian street videos. In 2019 Inter-
national Conference on Document Analysis and Recognition Workshops (ICDARW). https://
doi.org/10.1109/icdarw.2019.10036
19. Jagtap, J., & Holambe, S. (2018). Multi-style license plate recognition using artificial neural
network for Indian vehicles. In 2018 International Conference on Information, Communication,
Engineering and Technology (ICICET). https://doi.org/10.1109/icicet.2018.8533707
20. Ravirathinam, P., & Patawari, A. (2019). Automatic license plate recognition for Indian roads
using Faster-RCNN. In 2019 11th International Conference on Advanced Computing (ICoAC).
https://doi.org/10.1109/icoac48765.2019.246853
21. Khan, S. U., Alam, N., Jan, S. U., & Koo, I. S. (2022). IoT-enabled vehicle speed monitoring
system. Electronics, 11(4), 614. https://doi.org/10.3390/electronics11040614
22. Li, Z., Chitturi, M., Bill, A., & Noyce, D. (2012). Automated identification and extrac-
tion of horizontal curve information from geographic information system roadway maps.
Transportation Research Record: Journal of the Transportation Research Board, 2291, 80–92.
23. Horzyk, A., & Ergun, E. (2020). YOLOv3 precision improvement by the weighted centers of
confidence selection. In 2020 International Joint Conference on Neural Networks (IJCNN).
https://doi.org/10.1109/ijcnn48605.2020.9206848
24. Jayaraman, S., Esakkirajan, S., Veerakumar, T. (2015). Digital image processing. Tata McGraw
Hill publication, Indian Edition.
25. Shi, B., Bai, X., & Yao, C. (2017). An end-to-end trainable neural network for image-based
sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 39(11), 2298–2304. https://doi.org/10.1109/tpami.2016.
2646371
26. Bains, M. S., Bhardwaj, A., Arkatkar, S., Velmurugan, S. (2013). Effect of speed limit compli-
ance on roadway capacity of Indian expressways. Procedia-Social and Behavioral Sciences,
104, 458−467
27. IRC: 73. (1980). Geometric design standards for rural (Non-urban) highways. Indian Roads
Congress.
28. IRC: 38. (1988). Guidelines for design of horizontal curves for highways and design tables.
Indian Roads Congress.
An Intelligent System for Video-Based
Proximity Analysis
1 Introduction
appear in proximity, are among the quantities of interest in the context of contact
tracing purposes.
$$ Prc = \frac{TP}{TP + FP} \tag{1} $$
and recall
$$ Rec = \frac{TP}{TP + FN}. \tag{2} $$
Similarly to the approach taken in [16], we also calculate the widely used detection
accuracy measure mAP, obtained as the area under the Pr c×Rec curve. By definition,
both precision and recall are bounded between 0 and 1, and thus mAP is also bounded
between 0 and 1. It is common to estimate mAP from interpolated Pr c × Rec curves
$$ mAP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p_{interp}(r), \tag{3} $$
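A short sketch of the 11-point interpolation of Eq. (3) for a single class is given below; the precision and recall arrays are assumed to be already sorted by decreasing detector confidence, and the numeric values are purely illustrative.

```python
import numpy as np

def average_precision_11pt(precision, recall):
    """11-point interpolated average precision (Eq. 3) for one object class."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):                 # r in {0, 0.1, ..., 1}
        mask = recall >= r
        p_interp = precision[mask].max() if mask.any() else 0.0  # max precision at recall >= r
        ap += p_interp / 11.0
    return ap

# Illustrative precision/recall values from a hypothetical detector
print(average_precision_11pt([1.0, 0.8, 0.66, 0.5], [0.2, 0.4, 0.6, 0.8]))
```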
Next, before proceeding to walking trajectories, one has to transform from the homo-
geneous (also known as projective) coordinates to the world coordinates (corre-
sponding to the bird’s eye view) by means of projective geometry techniques [31].
For simplification purposes, each detected object represented by a bounding box is
associated with its pivot point, resulting in a simplified transformation expressed by
3 × 3 matrices
$$ \begin{bmatrix} x_i' \\ y_i' \\ w_i \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad \text{where } x, y \text{ are the coordinates of the plane,} \tag{4} $$
and H is the projection matrix, sometimes also referred to as the homography matrix,
which can be estimated using a number of approaches [33], such as the direct linear
transformation (DLT) and robust estimation (RANSAC). Assuming that the pivot point
of each bounding box is located at the center of its lower edge, it can be
found as
$$ x_i = x_{min} + \frac{x_{max} - x_{min}}{2}, \tag{6} $$
$$ y_i = y_{min}, \tag{7} $$
where (xmin , xmax , ymin , ymax ) are the bounding box coordinates. Thus, for a given
homography matrix, transformation to the world coordinates can be expressed as
$$ \begin{bmatrix} w_i x_i' \\ w_i y_i' \\ w_i \end{bmatrix} = H \begin{bmatrix} x_{min} + \dfrac{x_{max} - x_{min}}{2} \\ y_{min} \\ 1 \end{bmatrix}. \tag{8} $$
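In practice, Eq. (8) can be carried out with OpenCV: the homography is estimated from four or more point correspondences (e.g., via RANSAC, as mentioned above) and applied to the pivot point of each bounding box. The correspondences in this sketch are purely illustrative; the pivot follows the convention of Eqs. (6)–(7).

```python
import cv2
import numpy as np

# Illustrative correspondences between image pixels and world (bird's-eye) coordinates in metres
image_pts = np.array([[100, 700], [1180, 700], [900, 300], [380, 300]], dtype=np.float32)
world_pts = np.array([[0, 0], [10, 0], [10, 20], [0, 20]], dtype=np.float32)
H, _ = cv2.findHomography(image_pts, world_pts, cv2.RANSAC)

def pivot_to_world(bbox, H):
    """Map a bounding box (x_min, y_min, x_max, y_max) to world coordinates via its pivot point."""
    x_min, y_min, x_max, y_max = bbox
    pivot = np.array([[[x_min + (x_max - x_min) / 2.0, y_min]]], dtype=np.float32)  # Eqs. (6)-(7)
    return cv2.perspectiveTransform(pivot, H)[0, 0]   # division by w_i is done internally

print(pivot_to_world((600, 400, 700, 690), H))
```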
The idea of constructing walking trajectories based on locations obtained from indi-
vidual video frames requires linking the location points corresponding to the same
person observed in consecutive frames. For that, the first step is commonly searching
for the nearest neighbor points. The latter is usually performed using one of the algo-
rithms such as linear (full) search, search in kd-trees [35], search in BSP-trees [28],
LS-hashing [36], method with keywords [50] and search in VP-trees [54]. As linear
search is computationally inefficient due to its linear complexity of O(n), alternative
algorithms are in focus. The LS-hashing algorithm is based on finding a simple hash
function that can be used instead of direct comparison of point coordinates, resulting
in superior efficacy once a suitable hash function is known, although finding such a
function is not straightforward in many real-world scenarios. The idea
of the keyword algorithm is to store a list of objects with rarely observed coordi-
nates, which also limits its applicability. Therefore, in our case the remaining options,
namely kd-tree, BSP-tree and VP-tree search algorithms, are of greater interest. In
the following, we focus on the VP-tree search, since it searches for other points
in a circular vicinity around the current pivot point, which is relevant to the contact
proximity analysis problem.
and to the set of points in the outer (right) subtree otherwise. The same operation is
repeated for each subtree. Thus, each node in the tree has a vantage point and a radius
that determines which points belong to its inner subtree. The complexity of the tree
construction algorithm is O(n log n).
The algorithm for finding the nearest neighbor to the point x is also recursive. At
any given step, one focuses on a tree node that has a vantage point q and a radius r .
Let us assume that point x is located at some distance d from q. If d is below r , a
recursive search of the node's inner subtree, which contains the points closer to the
vantage point than the radius r, is activated; upon reaching a leaf subtree, we perform a
linear search among its points. Otherwise, the search proceeds to the outer subtree of the
node, which contains the points displaced from q further than the given radius r. When
constructing the trajectory of a single walker movement, x is obtained from
the coordinates of this person in the previous frame, and the desired nearest point is
the coordinates of the same person in the current frame.
In the context of contact tracing, the next step is typically finding all points that
appear in a close proximity, usually determined by a circular area of a certain radius
around each person, first for a given video frame corresponding to a single point of
time. Since the original VP-tree based search algorithm focuses on finding single
nearest neighbor only, it should be generalized to search for potentially multiple
nearest neighbors within a circle of a given radius. In this context, there are several
possible situations:
1. The Entire Search Area is Included in the Internal Subtree
This case occurs when d(q, s) + r ≤ T, where d(q, s) is the distance from the center of
the node to the search point, r is the search radius, and T is the node radius determining
the border of the inner subtree. The
world scale of the distances between two bird’s eye viewpoints is determined using
the size of the camera pixel obtained from the calibration procedure, and the distance
between two points is calculated as
$$ d(p_i, p_j) = \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2} \tag{11} $$
If this condition is met, the search can continue in the internal subtree only.
2. The entire search area is included in the external subtree (i.e., d(q, s) − r ≥ T).
If this condition is met, the search can only continue in the external subtree.
3. The entire search area is distributed over both subtrees.
In this case, the search is performed in both subtrees. The complexity of searching for
nearest neighbors is O(log n).
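A compact Python sketch of a VP-tree with a radius query covering the three cases above is given below. It is a simplified illustration (the first point of each subset is used as the vantage point, with the Euclidean metric of Eq. (11)), not the exact implementation used in the system.

```python
import math
import random

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])   # Eq. (11)

class VPNode:
    def __init__(self, points):
        self.vantage = points[0]
        self.radius = 0.0
        self.inner = self.outer = None
        rest = points[1:]
        if not rest:
            return
        d = sorted(dist(self.vantage, p) for p in rest)
        self.radius = d[len(d) // 2]               # median distance splits inner/outer subtrees
        inner = [p for p in rest if dist(self.vantage, p) <= self.radius]
        outer = [p for p in rest if dist(self.vantage, p) > self.radius]
        self.inner = VPNode(inner) if inner else None
        self.outer = VPNode(outer) if outer else None

    def query_radius(self, s, r, found):
        """Collect all points within distance r of the search point s."""
        d = dist(self.vantage, s)
        if d <= r:
            found.append(self.vantage)
        if self.inner and d - r <= self.radius:    # search ball may intersect the inner region
            self.inner.query_radius(s, r, found)
        if self.outer and d + r >= self.radius:    # search ball may intersect the outer region
            self.outer.query_radius(s, r, found)
        return found

random.seed(0)
pts = [(random.uniform(0, 20), random.uniform(0, 20)) for _ in range(200)]
tree = VPNode(pts)
print(tree.query_radius((10.0, 10.0), 2.0, []))    # neighbours within a 2 m proximity circle
```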
the DNN (Deep Neural Network) approach. The above algorithm consists of three
consecutive steps: the first is responsible for image rescaling; the second, known as the
Proposal Network (P-Net), looks for candidate facial regions; it is followed by the Refine
Network (R-Net), which filters bounding boxes, and finally by the Output Network
(O-Net), which focuses on localizing facial landmarks (such as eyes and mouth). Another
recent and powerful alternative is the MMOD algorithm introduced by Davis E. King
and implemented in the Dlib library [23]. Since it appears to be one of the most accurate
among the methods discussed above, while also working well for different face
orientations and even under substantial occlusion, it has been chosen as the instrument
used in this work.
However, it is also important to note that deep learning algorithms, while typically
outperforming other approaches in terms of accuracy, require considerably higher
computational resources, which may be a limiting factor for their application in
resource-limited scenarios, with large amounts of data, or under online analysis
requirements.
Face rescaling and alignment is an intermediate step between face detection and
face recognition. Common solutions are based on finding specific face landmarks
that can be used in the rescaling and alignment procedure as pivot points.
Face recognition techniques are also well developed. Early approaches were
largely based on such algorithms as Eigenface [21], Fisherface [3] and Local Binary
Patterns Histogram (LBPH) [46]. As these algorithms proved to have numerous
drawbacks, here we follow a more recent approach based on Convolutional Neural
Networks (CNNs), which remain among the most effective and reliable solutions to
date. Prominent examples include Google FaceNet [45], based on convolutional layers
learning face representations directly from the image. FaceNet was trained on the
Labeled Faces in the Wild (LFW) [19] dataset to achieve invariance to illumination,
pose, and other variable conditions. Other notable examples include OpenFace [2].
In this work, we also used a neural network based solution implemented in the Dlib
library.
Finally, recognized faces should be associated with IDs of particular persons. This
is a typical problem for machine learning classification algorithms. If no matches are
found, a new ID is added to the database. In this work, we used a KNN classifier,
although many alternative classifiers would do the job.
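The face detection, embedding, and ID assignment steps can be sketched with the dlib and scikit-learn APIs as follows. The model file names are the standard dlib downloads, and the distance threshold for declaring a new anonymized ID is an assumption; this is not the system's exact implementation.

```python
import numpy as np
import dlib
from sklearn.neighbors import KNeighborsClassifier

# Standard dlib model files (downloaded separately)
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")      # MMOD detector
shape_predictor = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")   # alignment landmarks
encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def face_embeddings(rgb_image):
    """Detect faces and return one 128-d embedding per detected face."""
    embeddings = []
    for det in detector(rgb_image, 1):
        shape = shape_predictor(rgb_image, det.rect)            # landmarks used for alignment
        embeddings.append(np.array(encoder.compute_face_descriptor(rgb_image, shape)))
    return embeddings

def assign_id(embedding, known_embeddings, known_ids, threshold=0.6):
    """Match an embedding to a known anonymized ID, or create a new one (assumed threshold)."""
    if known_embeddings:
        knn = KNeighborsClassifier(n_neighbors=1).fit(known_embeddings, known_ids)
        distances, _ = knn.kneighbors([embedding], return_distance=True)
        if distances[0][0] < threshold:
            return int(knn.predict([embedding])[0])
    new_id = len(set(known_ids))                                # no match: add a new ID
    known_embeddings.append(embedding)
    known_ids.append(new_id)
    return new_id
```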
4 Experiments
Next, we evaluated the approach using several sample videos recorded by surveillance
cameras in busy outdoor public places. For neural network training, we combined two
different datasets that are among the most popular for training object detection
algorithms, PASCAL VOC [16] and COCO [25]. Although they differ in the amount
By processing a video frame, the system detects people using a previously trained
neural network model “ssd lite mobile net coco v2”.
After detecting people by the trained network, their bounding box coordinates
were subjected to the homography matrix based transformation and nearest neighbor
search algorithm, followed by face detection and recognition algorithms, as described
above. Figure 6 exemplifies a processed video frame with indicated bounding boxes,
where those appearing within a close proximity (for an arbitrary 2 m threshold) are
shown in red, while others are shown in green. Figure 7 shows the corresponding
bird’s eye view for the same frame, using similar color notation.
Another example is shown in Figs. 8 and 9, respectively.
transmission, one can focus on the reduction of the number of links above a certain
threshold weight, i.e., the number of pairs of individuals that appear in close proximity
to each other for durations above a certain threshold value.
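Given per-frame bird's-eye positions and anonymized IDs, such pairwise contact durations reduce to counting frames in which two IDs stay within the proximity threshold. The following sketch assumes a fixed frame rate and a 2 m threshold (both illustrative), mirroring the kind of matrices shown in Fig. 11.

```python
import itertools
import math
from collections import defaultdict

def contact_durations(frames, threshold_m=2.0, frame_period_s=0.2):
    """Accumulate pairwise proximity durations (in seconds) from per-frame positions.

    frames : list of dicts {person_id: (x, y)} in world coordinates, one dict per frame
    """
    durations = defaultdict(float)
    for positions in frames:
        for (id_a, pa), (id_b, pb) in itertools.combinations(positions.items(), 2):
            if math.hypot(pa[0] - pb[0], pa[1] - pb[1]) <= threshold_m:
                durations[tuple(sorted((id_a, id_b)))] += frame_period_s
    return dict(durations)

# Two toy frames: persons 0 and 1 stay close, person 2 remains far away
frames = [{0: (0.0, 0.0), 1: (1.0, 0.5), 2: (10.0, 10.0)},
          {0: (0.2, 0.1), 1: (1.1, 0.4), 2: (10.5, 9.8)}]
print(contact_durations(frames))   # {(0, 1): 0.4}
```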
Now after more than two years since the onset of the pandemic, public attention
is increasingly shifting towards finding optimal exit strategies, including adaptation
of the technologies that have been rapidly deployed earlier in the course of the
pandemic, and finding their place in the post-pandemic society. In the following, we
consider how the above solutions could be used in scenarios other than individual
contact tracing or epidemiological surveillance of crowds, for example, leading to
improved public space planning.
Planning of public spaces strongly affects the probability of congestion, the formation
of crowds, and the organization of queues, which in turn largely determine the total
number of contacts that remain in close proximity above a certain duration. There
Fig. 11 Examples of pairwise contact duration matrices for six representative short scenes captured
from a street video surveillance camera for TQ = 5. Matrix sizes are determined by the total number
of individuals captured in each scene, with their total pairwise duration of proximity (in seconds)
indicated by color
by an exponential with only one free parameter that is the average value, normaliza-
tion by division by this average value for each distribution results in a data collapse
indicated by all curves following the same pattern close to a simple exponential with
the unit average. Deviations from this simple theoretical scenario can be attributed
to the discreteness and finite size effects. As one can see from the figure, these devi-
ations are comparable for the observational and for the simulated data, given that
the simulated data contains similar number of frames, average inter-arrival intervals
and durations of individuals remaining within the scene, and thus also similar total
numbers of individual trajectories in the entire scene.
However, in most real-world scenarios walking trajectories strongly deviate from
the simplest random model. Typical reasons for that are localization of the objects
of attraction (e.g. counters, doors, passages etc.), as well as obstacles (e.g. barriers,
billboards, kiosks etc.) in both indoor and outdoor public spaces, leading to the spatial
clustering of the walking trajectories. In addition, traffic regulations (e.g. revolving
doors, traffic lights at crosswalks etc.) lead to additional temporal clustering of the
walking trajectories. Among various models used to characterize motion from the
statistical physics viewpoint, long-range correlations appear the most relevant in the
context of human dynamics (for a recent and comprehensive review of literature
on the topic, we refer to [22]). To account for both spatial and temporal clustering,
two-dimensional long-range correlated fields seem to be a relevant model.
Recent data including our own results indicate that long-range correlations are
strongly associated with clustering of events, generally leading to heavy-tailed distri-
butions of both inter-event times and event durations, with the latter being crucial
for the contact proximity durations. The impact of long-range temporal correlations
on the event dynamics has been investigated both analytically [32] and numeri-
cally [1, 14] indicating that the interval distributions between consecutive events in
a series broaden from a single exponential for the simplest Poisson process scenario
to a stretched exponential for linear long-term correlations, and finally converge to
a power-law decay for strong long-term correlations, especially in the presence of
nonlinear interactions in the system [6, 7]. Moreover, in recent years similar distribu-
tions of the inter-event times have been observed in a number of real-world complex
systems, ranging from bursty access patterns driven by user interactions in public
computer networks [6, 29, 34, 49] to various natural phenomena, e.g. in geophysics
[8, 13]. Finally, our recent data indicate that spatial long-range correlations lead to
the manifestations of similar laws in biological polymer structures [9–11].
Figure 13 exemplifies similar distributions obtained by computer simulations
for walks with random increments with Hurst exponent H = 0.5 and long-
range spatiotemporally correlated increments with Hurst exponent H = 1.5. The
figure shows explicitly that stronger spatio-temporal correlations lead to broader
contact duration distributions, indicating that a larger fraction of pairs of individuals
remain within the same proximity thresholds for longer times (depicted by a more
pronounced initial decay in the exceedance probability distributions), compared to
the random increments scenario. The figure also shows that, while some general
qualitative conclusions are possible based on these simulations, particular functional
forms of the distributions obtained for finite systems exhibit non-trivial shapes that
are determined by a complex interplay of correlations, discreteness and finite size
effects, and thus are determined not only by their asymptotic behaviors that could be
eventually derived from known theoretical assumptions, but also depend explicitly
on the system size.
As a remark, obtaining Fig. 13 required simulated datasets that contained 110
times more time steps and 11 times more individual walkers, altogether resulting
in ~10³ more walker positions, and potentially up to ~10⁶ more pairwise distances,
Fig. 13 Pairwise proximity duration distributions obtained by computer simulations for a random
walk with random increments with Hurst exponent H = 0.5 (blue curves) and long-range
spatiotemporally correlated increments with Hurst exponent H = 1.5 (red curves), respectively
compared to the observational video examples used in our study. Since the amount
of video analysis required to obtain comparable statistics for different public places
requires considerable computational efforts, we believe that more detailed analysis
including long-term video analysis and best correlated walkers model fitting, for a
better understanding of how public space planning affects both the spatio-temporal
walking trajectory correlation patterns and contact proximity distributions, remains
beyond the scope of this study, and could be considered as an outlook for future
research directions.
To summarize, digital technologies played a major role in the global response to
COVID-19 from the onset of the pandemic, especially in the context of digital
epidemiological surveillance and contact tracing, and proved their effectiveness in the
real-world context, being strongly associated with a number of success stories leading
to the rapid suppression of community transmission and reduction of the incidence
rates.
While AI and machine learning techniques have been widely applied in web-based
epidemic information support tools and online case tracing, they have not yet been fully
explored in the context of proximity tracing and subsequent analysis for more informed
public space planning aimed at reducing contacts and contact durations.
In this paper, we have proposed a framework based on video surveillance for proximity
tracing. However, as with the use of mobile apps and Bluetooth, privacy
considerations cannot be emphasized enough for any approach to be of practical use.
This is one of the fundamental ideas in our framework, realized by using anonymized
IDs to identify individuals. Further exploring how privacy can be integrated in the
proposed solution is the most immediate future research direction. Other directions
include training other neural network models and comparing them to find the best
model. Trained models will be evaluated based on the above parameters, such as
mAP with a set IoU threshold of 0.5, the error of the trained model, and the number
of frames per second (FPS) spent on object detection. In addition, we will further
evaluate the approach using large datasets from crowded streets.
Now, after more than two years since the onset of the pandemic, public attention
increasingly shifts towards finding optimal exit strategies, including adaptation of
these technologies and finding their place in the post-pandemic society. Looking
forward towards this goal, we also consider how the proximity tracing based on
video surveillance in public places could be adapted to facilitate the improved public
spaces planning.
Acknowledgment The work of Sergey Antonov was supported by the Ministry of Science and
Higher Education of the Russian Federation “Goszadanie” No 075-01024-21-02 from 29.09.2021
(Project No. FSEE-2021-0014).
References
1. Altmann, E., & Kantz, H. (2005). Recurrence time analysis, long-term correlations, and extreme
events. Physical Review E, 71(5), 056106.
2. Amos, B., Ludwiczuk, B., & Satyanarayanan, M. (2016). Openface: A general-purpose face
recognition library with mobile applications. Technical report, CMU-CS-16–118, CMU School
of Computer Science.
3. Anggo, M., & Arapu, L. (2018). Face recognition using fisherface method. Journal of Physics:
Conference Series, 1028, 012119. https://doi.org/10.1088/1742-6596/1028/1/012119
4. Balaban, S. (2015). Deep learning and face recognition: the state of the art. In Biometric and
Surveillance Technology for Human and Activity Identification XII (vol. 9457, p. 94570B).
International Society for Optics and Photonics.
5. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of
retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions
of Electrical Engineering, 44(1), 505–518.
6. Bogachev, M., & Bunde, A. (2009). On the occurrence and predictability of overloads in
telecommunication networks. EPL (Europhysics Letters), 86(6), 66002.
7. Bogachev, M., Eichner, J., & Bunde, A. (2007). Effect of nonlinear correlations on the statistics
of return intervals in multifractal data sets. Physical Review Letters, 99(24), 240601.
8. Bogachev, M., Eichner, J., & Bunde, A. (2008). On the occurence of extreme events in long-term
correlated and multifractal data sets. Pure and Applied Geophysics, 165, 1195–1207.
9. Bogachev, M., Kayumov, A., & Bunde, A. (2014). Universal internucleotide statistics in full
genomes: A footprint of the dna structure and packaging? PLoS ONE, 9(12), e112534.
10. Bogachev, M., Kayumov, A., Markelov, O., & Bunde, A. (2016). Statistical prediction of
protein structural, localization and functional properties by the analysis of its fragment mass
distributions after proteolytic cleavage. Scientific Reports, 6, 22286.
11. Bogachev, M., Markelov, O., Kayumov, A., & Bunde, A. (2017). Superstatistical model of
bacterial DNA architecture. Scientific Reports, 7, 43034.
12. Budd, J., Miller, B. S., Manning, E. M., Lampos, V., Zhuang, M., Edelstein, M., Rees, G.,
Emery, V. C., Stevens, M. M., Keegan, N., et al. (2020). Digital technologies in the public-health
response to covid-19. Nature Medicine, 1–10.
13. Bunde, A., Bogachev, M., & Lennartz, S.: Precipitation and river flow: Long-term memory
and predictability of extreme events. Extreme Events and Natural Hazards: The Complexity
Perspective, 139–152.
14. Bunde, A., Eichner, J., Havlin, S., & Kantelhardt, J. (2004). Return intervals of rare events
in records with long-term persistence. Physica A: Statistical Mechanics and its Applications,
342(1), 308–314.
15. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In 2005
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005)
(vol. 1, pp. 886–893). IEEE (2005). https://doi.org/10.1109/cvpr.2005.177
16. Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A.
(2015). The pascal visual object classes challenge: A retrospective. International Journal of
Computer Vision, 111(1), 98–136.
17. Ferretti, L., Wymant, C., Kendall, M., Zhao, L., Nurtay, A., Abeler-Dörner, L., Parker, M.,
Bonsall, D., & Fraser, C. (2020). Quantifying sars-cov-2 transmission suggests epidemic control
with digital contact tracing. Science, 368(6491).
18. Hossain, M. S., Muhammad, G., & Guizani, N. (2020). Explainable ai and mass surveillance
system-based healthcare framework to combat covid-i9 like pandemics. IEEE Network, 34(4),
126–132.
19. Huang, G. B., Ramesh, M., Berg, T., & Learned-Miller, E. (2007). Labeled faces in the wild: A
database for studying face recognition in unconstrained environments. Technical Report 07-49,
University of Massachusetts, Amherst.
20. Jalali, S. M. J., Ahmadian, M., Ahmadian, S., Hedjam, R., Khosravi, A., & Nahavandi, S.
(2022). X-ray image based COVID-19 detection using evolutionary deep learning approach.
Expert Systems with Applications, 201, 116942.
21. Jalled, F. (2017). Face recognition machine vision system using eigenfaces.
22. Karsai, M., Jo, H. H., Kaski, K., et al. (2018). Bursty human dynamics. Springer
23. King, D. E. (2015). Max-margin object detection
24. Lellouche, S., & Souris, M. (2020). Distribution of distances between elements in a compact
set. Stats, 3(1), 1–15.
25. Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., Perona, P., Ramanan,
D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. CoRR
abs/1405.0312
26. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., Berg, A. C. (2016). Ssd:
Single shot multibox detector (pp. 21–37). Lecture Notes in Computer Science. https://doi.org/
10.1007/978-3-319-46448-0_2
27. Li, Z., Yang, W., Peng, S., & Liu, F. (2020). A survey of convolutional neural networks: Analysis,
applications, and prospects
28. Maneewongvatana, S., & Mount, D. M. (2001). An empirical study of a new approach to nearest
neighbor searching. In Algorithm Engineering and Experimentation (pp. 172–187). Springer
Berlin Heidelberg. https://doi.org/10.1007/3-540-44808-x_14
29. Markelov, O., Nguyen, V., & Bogachev, M. (2017). Statistical modeling of the internet traffic
dynamics: To which extent do we need long-term correlations? Physica A: Statistical Mechanics
and its Applications, 485, 48–60.
30. Moltchanov, D. (2012). Distance distributions in random networks. Ad Hoc Networks, 10(6),
1146–1166.
31. Mundy, J. L., Zisserman, A., et al. (1992). Geometric invariance in computer vision (Vol. 92).
MIT press Cambridge.
32. Newell, G., & Rosenblatt, M. (1962). Zero crossing probabilities for gaussian stationary
processes. The Annals of Mathematical Statistics, 33(4), 1306–1313.
33. Nguyen, T., Chen, S.W., Shivakumar, S. S., Taylor, C. J., & Kumar, V. (2017). Unsupervised
deep homography: A fast and robust homography estimation model.
34. Nguyen, V., Markelov, O., Serdyuk, A., Vasenev, A., & Bogachev, M. (2018). Universal rank-
size statistics in network traffic: Modeling collective access patterns by zipf’s law with long-
term correlations. EPL (Europhysics Letters), 123(5), 50001.
35. Panigrahy, R. (2008). An improved algorithm finding nearest neighbor using kd-trees. Lecture
Notes in Computer Science, pp. 387–398. Springer Berlin Heidelberg. https://doi.org/10.1007/
978-3-540-78773-0_34
36. Pan, J., & Manocha, D. (2011). Fast gpu-based locality sensitive hashing for k-nearest neighbor
computation. In Proceedings of the 19th ACM SIGSPATIAL international conference on
advances in geographic information systems, GIS, pp. 211–220. Association for Computing
Machinery, New York, NY, USA. https://doi.org/10.1145/2093973.2094002
37. Pönisch, W., & Zaburdaev, V. (2018). Relative distance between tracers as a measure of
diffusivity within moving aggregates. The European Physical Journal B, 91(2), 1–7.
38. Punn, N. S., Sonbhadra, S. K., & Agarwal, S. (2020). Monitoring covid-19 social distancing
with person detection and tracking via fine-tuned yolo v3 and deepsort techniques.
39. Rezaei, M., & Azarmi, M. (2020). Deepsocial: Social distancing monitoring and infection risk
assessment in covid-19 pandemic. arXiv preprint arXiv:2008.11672
40. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D., Paraschiv, N. & Mohammadi-
Ivatloo, B. (2014). L2 regularized deep convolutional neural networks for fire detection. Journal
of Intelligent & Fuzzy Systems, 1–12.
41. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain
tumor classification. Applied Sciences, 10(14), 4915.
42. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural
network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy
Systems, 1–7.
43. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). Mobilenetv2: Inverted
residuals and linear bottlenecks.
44. Samui, P., Roy, S. S., & Balas, V. E. (Eds.). (2017). Handbook of neural computation. Academic
Press.
45. Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for
face recognition and clustering. In 2015 IEEE conference on computer vision and pattern
recognition (CVPR), pp. 815–823. https://doi.org/10.1109/CVPR.2015.7298682
46. Singh, S., Kaur, A., & Taqdir, A. (2015). A face recognition technique using local binary pattern
method. IJARCCE, 165–168. https://doi.org/10.17148/IJARCCE.2015.4340
47. Skliros, A., & Chirikjian, G. S. (2008). Position and orientation distributions for locally self-
avoiding walks in the presence of obstacles. Polymer, 49(6), 1701–1715.
48. Sokolova, A., Uljanitski, Y., Kayumov, A. R., & Bogachev, M. I. (2021). Improved online event
detection and differentiation by a simple gradient-based nonlinear transformation: Implications
for the biomedical signal and image analysis. Biomedical Signal Processing and Control, 66,
102470.
49. Tamazian, A., Nguyen, V., Markelov, O., & Bogachev, M. (2016). Universal model for collective
access patterns in the internet traffic dynamics: A superstatistical approach. EPL (Europhysics
Letters), 115(1), 10008.
50. Tao, Y., & Sheng, C. (2014). Fast nearest neighbor search with keywords. IEEE Transactions
on Knowledge and Data Engineering, 26, 878–888. https://doi.org/10.1109/TKDE.2013.66
51. Tejedor, V., Schad, M., Bénichou, O., Voituriez, R., & Metzler, R. (2011). Encounter distribution
of two random walkers on a finite one-dimensional interval. Journal of Physics A: Mathematical
and Theoretical, 44(39), 395005.
52. Vannoorenberghe, P., Motamed, C., Blosseville, J. M., & Postaire, J. G. (1997). Automatic
pedestrian recognition using real-time motion analysis. In International conference on image
analysis and processing (pp. 493–500). Springer.
53. Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features.
In Proceedings of the 2001 IEEE computer society conference on computer vision and pattern
recognition (CVPR 2001, vol. 1, pp. I–I). IEEE
54. Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general
metric spaces. In Proceedings of the fourth annual ACM-SIAM symposium on discrete
algorithms, SODA, pp. 311–321. Society for Industrial and Applied Mathematics, USA.
55. Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and alignment using
multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10), 1499–
1503. https://doi.org/10.1109/lsp.2016.2603342
56. Apple and google framework. https://www.apple.com/newsroom/2020/04/apple-and-google-
partner-on-covid-19-contact-tracing-technology/
57. Covidsafe app, Australia. https://www.health.gov.au/resources/apps-and-tools/covidsafe-app
58. The dp-3t project. https://github.com/DP-3T/documents
59. Hamagen app, israel. https://govextra.gov.il/ministry-of-health/hamagen-app/download-en/
60. Norway halting smittestop app. https://www.amnesty.org/en/latest/news/2020/06/norway-cov
id19-contact-tracing-app-privacy-win/
61. Pepp-pt project. https://github.com/pepp-pt/pepp-pt-documentation/blob/master/PEPP-PT-
high-level-overview.pdf
Deep Learning-Based Conjunctival
Melanoma Detection Using Ocular
Surface Images
1 Introduction
The eye is a crucial and among the most intricate sensory organs which we have
as humans. It aids in our ability to visualize objects as well as our perception
of light, depth, and colour. Conjunctival nevus [1], which is a relatively ordinary
disorder, possesses several distinct clinical presentations [2]. Sufferers who ask about
conjunctival lesions are frequently encountered during ordinary clinical treatment
K. K. Podder
Department of Biomedical Physics and Technology, University of Dhaka, Dhaka 1000,
Bangladesh
M. K. Alam · Z. S. Siam
Department of Electrical, Electronic and Systems Engineering, Universiti Kebangsaan Malaysia,
43600 Bangi, Malaysia
Z. S. Siam
Department of Electrical and Computer Engineering, Presidency University, Dhaka, Bangladesh
K. R. Islam · A. Khandakar · M. E. H. Chowdhury (B)
Department of Electrical Engineering, Qatar University, 2713 Doha, Qatar
e-mail: [email protected]
P. Dutta
Department of Electrical and Electronic Engineering, Chittagong University of Engineering and
Technology, Chittagong 4349, Bangladesh
A. Mushtak
Clinical Imaging Department, Hamad Medical Corporation, Doha, Qatar
S. Pedersen
Department of Basic Medical Sciences, College of Medicine, Qatar University, 2713 Doha, Qatar
study. Considering the research gap in the field of classifying conjunctival
melanomas, the following contributions are made in this study:
• A well-curated dataset for conjunctival melanoma, validated by medical experts, is proposed.
• An effective and faster augmentation technique is proposed, as an alternative to CycleGAN-based
augmentation [3], for expanding a small conjunctival melanoma dataset.
• A high-performing deep learning model is proposed in this study which can
classify the different eye conditions with high accuracy.
Additionally, we incorporated interpretability of our findings. This study
intends to verify the hypothesis that conjunctival lesions can be classified, and
conjunctival melanoma detected, from ocular surface images with the help
of deep learning. This investigation may facilitate the prompt identification of
conjunctival lesions.
The outline of this study is described in the sections below. The following parts give
further information about the materials and methods that were utilized. Afterwards,
the findings are revealed and discussed. Finally, we present the conclusion and
potential future research as we wrap up our study.
2 Methodology
This study proposed a system where an image of the eye taken using a smartphone
can be classified as normal or other eye-related medical conditions. The methods
involved in this system start from data collection, data cleaning and validation, CNN
training and evaluation and visual interpretation. Figure 1 illustrates the step-by-step
workflow of the methodology proposed in this study.
The focus of this research was on analyzing the anterior segment utilizing a deep
learning system and images of the eye’s surface. The preliminary melanoma dataset
on which our dataset is built was taken from [3]. Normal, Pterygium, Nevus,
and Conjunctival melanosis were the four categories present in that dataset. The
dataset suggested by [3] contains some irrelevant and problematic images, identified
by the medical experts of our team. Ocular images of subjects with conjunctival
anomalies are widely available online and can be accessed through various keyword
searches (for example, “normal conjunctiva”, “pterygium”, “conjunctival nevus”,
“conjunctival melanosis”), so we removed irrelevant data from the dataset proposed
in [3] and added new images to the dataset. Expert physicians double-checked the
data to make sure it was accurate and valuable. The details of the original dataset and
the proposed dataset in this study are illustrated in Fig. 2 and a sample representation
of the different classes in the dataset is available in Fig. 3.
The dataset proposed in this study is labelled the “Four Class” dataset. It was from
this “Four Class” dataset that another dataset was developed. Here we
Fig. 3 A sample representation of the proposed dataset displaying images from class “Normal
Conjunctiva”, “Conjunctival Melanoma”, “Nevus”, and “Pterygium”
Fig. 4 A representation of single and multiple augmentation techniques on an ocular surface image
Table 1 Augmentation techniques and ranges used in the training set of proposed datasets

Augmentation technique | Range
Random rotation | +20 to −20 degrees
Random affine | Degree = 0; Translate range = (0.05, 0.15); Scaling range = (0.9, 0.95)
Padding | Range = (0, 10); Fill = (black, white); Mode = (‘Constant’, ‘Edge’)
Colour correction | Brightness = (0, 0.2); Contrast = (0, 0.2)
used was determined randomly in the augmentation model. The augmentation model
would then randomly decide which combination of augmentations to use if multiple
augmentations were chosen. Single and multiple augmentation techniques applied
to an image of the ocular surface are shown in Fig. 4.
In each of the two datasets, the size of the training set for each class was expanded
to three thousand samples by applying these four augmentation techniques. As the
validation and test sets were used for evaluating deep learning models in a real-world
setting, these two sets were left unchanged throughout the process. Table 2 contains
a description of the sizes of the datasets along with the augmentation [37] factors.
Table 2 The detailed description of the proposed datasets. The curated dataset is validated by
expert doctors and the training samples are increased by an augmentation factor using different
augmentation techniques

Dataset | Class | Original data samples | Validation set | Testing set | Training set | Augmentation factor | Training set after augmentation
Binary class | Normal conjunctiva | 125 | 13 | 25 | 87 | 34.48 | 3000
Binary class | Abnormal conjunctiva | 285 | 28 | 57 | 200 | 15 | 3000
Four class dataset | Normal conjunctiva | 125 | 13 | 25 | 87 | 34.48 | 3000
Four class dataset | Nevus | 85 | 8 | 17 | 60 | 50 | 3000
Four class dataset | Pterygium | 70 | 7 | 14 | 49 | 61.44 | 3000
Four class dataset | Conjunctival melanoma | 130 | 13 | 26 | 91 | 32.97 | 3000
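As an illustration only, an augmentation strategy of the kind summarized in Table 1 could be assembled with torchvision transforms roughly as follows; the transform classes, parameter values and the file name are our assumptions, not the authors' code.

```python
import random
from torchvision import transforms

def random_augmentation_pipeline():
    """Pick one or several of the Table 1 augmentations at random and chain them."""
    candidates = [
        transforms.RandomRotation(degrees=20),                    # +/- 20 degrees
        transforms.RandomAffine(degrees=0,
                                translate=(0.05, 0.15),
                                scale=(0.9, 0.95)),
        transforms.Pad(padding=random.randint(0, 10),
                       fill=0, padding_mode="constant"),          # black fill, constant mode
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
    ]
    k = random.randint(1, len(candidates))                        # single or multiple augmentations
    return transforms.Compose(random.sample(candidates, k) + [transforms.ToTensor()])

# Hypothetical usage on one training image:
# from PIL import Image
# augmented = random_augmentation_pipeline()(Image.open("normal_conjunctiva_001.jpg"))
```

Repeatedly sampling such pipelines until every class reaches 3000 training images would reproduce the augmentation factors listed in Table 2.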
This project utilized state-of-the-art CNNs for classifying ocular surface images of
normal and different eye conditions. Four CNN architectures, ResNet, DenseNet,
GoogLeNet, and EfficientNet, were used in this study with pre-trained weights. We
selected these architectures due to their efficacy in previous publications [38]. These
four CNN architectures were trained on the large benchmark dataset “ImageNet” [39],
and the weights adopted in this study are the pre-trained weights obtained from that
training, utilized through the well-known concept of transfer learning. The CNN models are
initialized with the pre-trained weights and optimized during training on the ocular
surface images. Details of the trained CNN architectures are given below:
2.3.1 GoogLeNet
GoogLeNet was proposed in the literature [40] and is built on the Inception
module. The authors of GoogLeNet proposed a wider and deeper Inception network, which
achieved slightly better performance in the ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) 2014 competition. Inside the Inception module with dimensionality
reduction in GoogLeNet, a 1 × 1 convolution is added before every 3 × 3 and
5 × 5 convolution. The model is 22 layers deep (27 layers counting pooling), with 9 Inception
modules stacked linearly. The end of the Inception modules is connected
to a global average pooling layer. The detailed model architecture, with convolutional
layers, pooling, and activations, is available in the literature [40].
2.3.2 ResNet
The ResNet architecture proposed in the literature [41] was designed to counter the
vanishing gradient problem in deeper CNN architectures. In a deep CNN archi-
tecture, the features of the earlier layers start vanishing from the network as it goes
deeper and is introduced to more complex feature extractors. As a result, the vanishing
gradient problem occurs; the residual connection in the ResNet architecture solves this
problem by implementing a skip connection which carries the features from earlier
layers to deeper layers. In this study, ResNet18, ResNet50 and ResNet152 were used.
The designation ResNet followed by a number simply indicates the ResNet architecture
with that specific number of neural network layers. So, in this ocular surface image
classification research, ResNet architectures with 18, 50, and 152 layers were utilized
for evaluation and comparison with the other counterpart CNN architectures.
2.3.3 DenseNet
The authors in [42] observed that deeper CNN models are more accurate and effi-
cient when the short connections are built among layers closer to input and closer
to output. By applying this observation, authors in [42] proposed DenseNet, which
works in a feed-forward fashion to connect each layer to every other layer. The
authors discovered that utilizing DenseNet had several benefits, including the elim-
ination of the vanishing-gradient problem, which resulted in better feature propaga-
tion and reuse. This particular sort of connection achieved benchmark results on the
ImageNet dataset while also significantly reducing the number of parameters. Both
the Densenet-161 and the Densenet-201 architectures were utilized in this study;
respectively, the depth of each design is 161 and 201 layers.
2.3.4 EfficientNet
CNNs such as VGGNet, ResNet, MobileNet, and SENet employ a variety of
methods to improve the accuracy of the network. These methods may increase one or
more of three dimensions (width, depth, or resolution). The authors in [43] addressed
these methods of scaling in the literature. The integration of all these strategies into
EfficientNet was accomplished by proposing a scaling mechanism that scales consistently
across all of these dimensions. EfficientNet_B7, a member of the EfficientNet family,
achieved 84.3% top-1 accuracy on ImageNet, and the pre-trained weights of this model
were used in our ocular surface image classification.
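The transfer-learning setup described above can be sketched with torchvision's pre-trained model zoo (an assumption on our part, since the chapter only states that PyTorch was used). This minimal illustration assumes that only the final classification head is replaced with a four-way layer:

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # Normal, Nevus, Pterygium, Conjunctival melanoma

def build_pretrained_model(name="efficientnet_b7", num_classes=NUM_CLASSES):
    """Load an ImageNet-pretrained backbone and swap its classification head."""
    if name == "efficientnet_b7":
        model = models.efficientnet_b7(weights="IMAGENET1K_V1")
        in_feats = model.classifier[1].in_features
        model.classifier[1] = nn.Linear(in_feats, num_classes)
    elif name == "resnet18":
        model = models.resnet18(weights="IMAGENET1K_V1")
        model.fc = nn.Linear(model.fc.in_features, num_classes)
    elif name == "densenet201":
        model = models.densenet201(weights="IMAGENET1K_V1")
        model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    else:
        raise ValueError(f"Unknown architecture: {name}")
    return model
```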
Intuition about how a CNN performs, and the reasoning behind its decision-making, has always
been an intriguing topic. Over the years, with the development of visualization tools, this
curiosity about how a CNN works can be satisfied effectively. Such tools expose the model's
functioning by showing the rationale behind an inference in a way that a human can
understand, which builds confidence in the CNNs’ outputs.
Among the various visualization tools, Grad-CAM [44] was chosen for this investiga-
tion, as Grad-CAM has shown promising performance in recent computer vision problems
[45]. The Gradient-Weighted Class Activation Mapping method utilizes the gradient
of the features at the final CNN layer to yield a localization map on images, identifying
which region contributes to the decision-making. The benefit of using Grad-CAM
over other visualization techniques is that it is applicable to a wide variety of CNN
architectures, with or without fully connected layers [45]. Because sensi-
tive medical condition classification was carried out in this study, it was necessary
to confirm, with visualization, the region of interest that the CNN model takes
into consideration. As a result, this ultimately strengthened the trust in the decision-
making of the models. At the very end of the results section, a discussion
regarding the visual representation and explanation of the Grad-CAM used in this
ocular surface image classification is provided.
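A compact sketch of the Grad-CAM computation itself (gradient-weighted pooling of the final convolutional feature maps followed by a ReLU), written here with plain PyTorch hooks rather than the specific tooling used by the authors:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Compute a Grad-CAM heatmap for one image tensor of shape [1, 3, H, W]."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output            # feature maps of the target layer

    def bwd_hook(_, grad_in, grad_out):
        gradients["value"] = grad_out[0]         # gradients w.r.t. those feature maps

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    model.eval()
    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove()
    h2.remove()

    # Global-average-pool the gradients into per-channel weights, then take a
    # weighted sum of the activation maps, keep positive evidence, and normalize.
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam.squeeze().detach(), class_idx
```

For a ResNet18 backbone the target layer could be, for example, `model.layer4[-1]`; for EfficientNet_B7 it would be the last block of `model.features` (an assumption about torchvision's layer naming).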
Five-fold cross-validation was used in the investigation of the “Binary Class” and
“Four Class” datasets. The PyTorch library and Python 3.7 were utilized in this
study. The Google Colab Pro platform, with a 16 GB Tesla T4 GPU and 120 GB of high
RAM, was utilized for the training, validation and testing processes. Apart from that, the
hyperparameters used in this study for all investigations are given in Table 3.
Specificity = δ / (γ + δ)    (1)

Precision = α / (α + γ)    (2)

Recall = α / (α + θ)    (3)

F1 score = (2 × Precision × Recall) / (Precision + Recall)    (4)

Overall Accuracy = (α + δ) / (α + γ + δ + θ)    (5)

where α, γ, δ, and θ denote the true positives, false positives, true negatives, and false negatives, respectively.
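A small sketch of how these metrics can be computed per class from a confusion matrix, using the one-vs-rest reading of α, γ, δ and θ given above (illustrative code, not the evaluation script used in the study):

```python
import numpy as np

def classification_metrics(conf_matrix):
    """Per-class metrics from a confusion matrix (rows = true class, columns = predicted)."""
    cm = np.asarray(conf_matrix, dtype=float)
    total = cm.sum()
    per_class = {}
    for c in range(cm.shape[0]):
        tp = cm[c, c]                       # alpha
        fp = cm[:, c].sum() - tp            # gamma
        fn = cm[c, :].sum() - tp            # theta
        tn = total - tp - fp - fn           # delta
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[c] = dict(precision=precision, recall=recall,
                            specificity=specificity, f1=f1)
    accuracy = (np.trace(cm) + 0.0) / total
    return per_class, accuracy
```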
The confusion matrix and ROC curve are important evaluation tools for assessing deep
learning models’ performance on medical image classification. In this study,
the confusion matrix and ROC curve of each CNN model were examined to identify
the best-performing model by comparison with its counterpart models.
3 Results
Fig. 5 Representation of mean and standard deviation in the five-fold accuracy of all models for
binary classification
Fig. 6 The a ROC curve and b confusion matrix for best-performing EfficientNet_B7 model, which
has been trained and tested on binary class data. The confusion matrices and the ROC curves of the
other models can be found in the supplementary materials
The seven CNN models used in the binary classification were also used in the four-class
classification. The learning curves of these models are available in Supplemen-
tary Tables 8 to 14, displaying the trend of well-fitted models. Figure 7 gives
a graphical illustration of the mean and standard deviation of the accuracies in the five-fold
cross-validation of all models on the multi-class classification.
Multi-class classification of three cases of ocular illness plus the normal condition
based on ocular images presents significant challenges. EfficientNet_B7, a recently
developed and robust CNN, had the highest mean accuracy across all five folds
(94.43%). Although other CNNs, such as DenseNet161, demonstrated larger
Fig. 7 Graphical representation of mean and standard deviation in the five-fold accuracy of all the
models on multi-class classification
Table 5 The performance metrics of different state-of-the-art CNN models in the detection of
conjunctival melanoma with a five-fold cross-validation method on a four-class dataset

Model | Trainable parameters | Inference time (second) | Overall accuracy (%) | Precision (%) | Recall (%) | F1 score (%) | Specificity (%)
ResNet18 | 11,178,564 | 0.00226 | 93.45 | 93.49 | 93.45 | 93.46 | 97.86
ResNet50 | 23,516,228 | 0.00539 | 91.02 | 91.22 | 91.02 | 91.01 | 96.91
ResNet152 | 58,152,004 | 0.01494 | 92.72 | 92.92 | 92.72 | 92.76 | 97.66
GoogLeNet | 5,604,004 | 0.00627 | 91.75 | 91.76 | 91.75 | 91.72 | 96.98
DenseNet161 | 26,480,836 | 0.01789 | 91.02 | 91.34 | 91.02 | 91.1 | 96.90
DenseNet201 | 18,100,612 | 0.02269 | 94.42 | 94.43 | 94.42 | 94.42 | 98.15
EfficientNet_B7 | 63,797,204 | 0.03188 | 94.42 | 94.55 | 94.42 | 94.43 | 98.20
Fig. 8 The a ROC curve and b confusion matrix for best performing EfficientNet_B7 model (the
“others” class is labelled as the “abnormal” class) trained and tested on the multi-class dataset
The proposed method of using data augmentation and pre-trained CNNs showed an
improvement in model performance. The comparative analysis between the previous
literature [3] and the method proposed in this study is tabulated in Table 6. In the multi-
class (four-class) classification, the method proposed in this study achieved a 13.42%
improvement in accuracy and a 0.036 improvement in AUC. The EfficientNet_B7 with image
Table 6 Comparative analysis of the performance of the proposed method with counterpart
literature

Datasets | Reference | Technique | AUC | Accuracy
Four class | Yoo et al. | CycleGAN-based image augmentation, MobileNetV2 | 0.954 | 81
Four class | Proposed method | Dataset cleaning, inclusion of related images, image augmentations, EfficientNet_B7 | 0.99 | 94.42
Binary class | Yoo et al. | CycleGAN-based image augmentation, MobileNetV2 | 0.976 | 96.5
Binary class | Proposed method | Dataset cleaning, inclusion of related images, image augmentations, EfficientNet_B7 | 1.00 | 99.73
Fig. 9 Visual interpretation of ResNet18 and EfficientNet_B7 model predictions on the “Binary
Class” dataset
Fig. 10 Visual interpretation of DenseNet201 and EfficientNet_B7 model predictions on the “Four
Class” dataset
4 Conclusion
In conclusion, the proposed study used state-of-the-art CNN models with data cura-
tion, validation and single and multiple augmentation techniques to classify ocular
surface images for different medical condition investigations (“Binary Class” and
“Four Class”). EfficientNet_B7 was the best-performing model with 99.73% and
94.42% accuracy for “Binary Class” and “Multi-Class” respectively utilizing the
methodology proposed in this study. The results for both types of investigation
outperformed previously published literature [3]. Moreover, this model showed a
high degree of sensitivity of 99.51% and 99.42% for the “Binary Class” and “Four
Class” investigations, respectively. The performance of the best model, EfficientNet_
B7, was also evaluated through Grad-CAM-based visual interpretation as this study
includes the diagnosis of sensitive medical conditions using ocular surface images.
In the future, the proposed model can be deployed on a server so that it can produce
predictions with visual interpretation for clinicians and patients. Such a server-based
deployment of the proposed model could be used in remote areas for telemedicine
facilities and help people in rural areas to easily obtain an assessment of eye conditions
with visual interpretation.
Funding This work was made possible by Qatar National Research Fund (QNRF) NPRP12S-
0227–190164 and International Research Collaboration Co-Fund (IRCC) grant: IRCC-2021–001.
The statements made herein are solely the responsibility of the authors.
References
1. Damato, B., & Coupland, S. E. (2008). Conjunctival melanoma and melanosis: a reappraisal
of terminology, classification and staging. Clinical & Experimental Ophthalmology, 36(8),
786–795.
2. Oellers, P., & Karp, C. L. (2012). Management of pigmented conjunctival lesions. The Ocular
Surface, 10(4), 251–263.
3. Yoo, T. K., Choi, J. Y., Kim, H. K., Ryu, I. H., & Kim, J. K. (2021). Adopting low-shot deep
learning for the detection of conjunctival melanoma using ocular surface images. Computer
Methods and Programs in Biomedicine, 205, 106086.
4. Shields, C. L., Fasiudden, A., Mashayekhi, A., & Shields, J. A. (2004). Conjunctival nevi:
clinical features and natural course in 410 consecutive patients. Archives of Ophthalmology,
122(2), 167–175.
5. Wong, J. R., Nanji, A. A., Galor, A., & Karp, C. L. (2014). Management of conjunctival
malignant melanoma: a review and update. Expert Review of Ophthalmology, 9(3), 185–204.
6. Isager, P., Engholm, G., Overgaard, J., & Storm, H. (2002). Uveal and conjunctival malignant
melanoma in Denmark 1943–97: observed and relative survival of patients followed through
2002. Ophthalmic Epidemiology, 13(2), 85–96.
7. Chang, A. E., Karnell, L. H., & Menck, H. R. (1998). The National Cancer Data Base report
on cutaneous and noncutaneous melanoma: A summary of 84,836 cases from the past decade.
Cancer: Interdisciplinary International Journal of the American Cancer Society, 83(8), 1664–
1678.
8. Larsen, A. C., Dahmcke, C. M., Dahl, C., Siersma, V. D., Toft, P. B., Coupland, S. E., et al.
(2015). A retrospective review of conjunctival melanoma presentation, treatment, and outcome
and an investigation of features associated with BRAF mutations. JAMA Ophthalmology, 133
(11), 1295–1303.
9. Kao, A., Afshar, A., Bloomer, M., & Damato, B. (2016). Management of primary acquired
melanosis, nevus, and conjunctival melanoma. Cancer Control, 23(2), 117–125.
10. Damato, B., & Coupland, S. E. (2008). Conjunctival melanoma and melanosis: a reappraisal
of terminology, classification and staging. Clinical & Experimental Ophthalmology, 36 (8),
786–795.
11. Hallak, J. A., Scanzera, A., Azar, D. T., & Chan, R. P. (2020). Artificial intelligence in ophthal-
mology during COVID-19 and in the post COVID-19 era. Current Opinion in Ophthalmology,
31(5), 447.
12. Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P.,
et al. (2018). Opportunities and obstacles for deep learning in biology and medicine. Journal
of The Royal Society Interface, 15(141), 20170387
13. Topol, E. J. (2019). High-performance medicine: the convergence of human and artificial
intelligence. Nature Medicine, 25(1), 44–56.
14. DuBois, K. N. (2019). Deep medicine: How artificial intelligence can make healthcare human
again. Perspectives on Science and Christian Faith, 71(3), 199–201.
15. Rahman, T., Akinbi, A., Chowdhury, M. E., Rashid, T. A., Şengür, A., Khandakar, A., et al.
(2022). COV-ECGNET: COVID-19 detection using ECG trace images with deep convolutional
neural network. Health Information Science and Systems, 10(1), 1–16.
16. Rahman, T., Khandakar, A., Islam, K. R., Soliman, M. M., Islam, M. T., Elsayed, A., et al.
(2022). HipXNet: Deep learning approaches to detect aseptic loosening of hip implants using
X-ray images. IEEE Access, 10, 53359–53373.
17. Abir, F. F., Alyafei, K., Chowdhury, M. E., Khandakar, A., Ahmed, R., Hossain, M. M., et al.
(2022). PCovNet: A presymptomatic COVID-19 detection framework using deep learning
model using wearables data. Computers in Biology and Medicine, 147, 105682.
18. Chowdhury, M. H., Shuzan, M. N. I., Chowdhury, M. E., Reaz, M. B. I., Mahmud, S., Al Emadi,
N., et al. (2022). Lightweight end-to-end deep learning solution for estimating the respiration
rate from photoplethysmogram signal. Bioengineering, 9(10), 558.
19. Wang, G., Ye, J. C., Mueller, K., & Fessler, J. A. (2018). Image reconstruction is a new frontier
of machine learning. IEEE Transactions On Medical Imaging, 37(6), 1289–1296.
20. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical
image segmentation. In International Conference on Medical image computing and computer-
assisted intervention (pp. 234–241).
21. Haskins, G., Kruger, U., & Yan, P. (2020). Deep learning in medical image registration: A
survey. Machine Vision and Applications, 31(1), 1–18.
22. Karimi, D., Dou, H., Warfield, S. K., & Gholipour, A. (2020). Deep learning with noisy labels:
Exploring techniques and remedies in medical image analysis. Medical Image Analysis, 65,
101759.
23. Rahman, T., Chowdhury, M. E., Khandakar, A., Mahbub, Z. B., Hossain, M. S. A., Alhatou,
A., et al. (2022). BIO-CXRNET: A robust multimodal stacking machine learning technique
for mortality risk prediction of COVID-19 patients using chest x-ray Images and clinical data.
Neural Computing and Applications.
24. Tahir, A. M., Qiblawey, Y., Khandakar, A., Rahman, T., Khurshid, U., Musharavati, F., et al.
(2022). Deep learning for reliable classification of COVID-19, MERS, and SARS from chest
X-ray images. Cognitive Computation, 1–21.
25. Tahir, A. M., Chowdhury, M. E., Khandakar, A., Rahman, T., Qiblawey, Y., Khurshid, U.,
et al. (2021). COVID-19 infection localization and severity grading from chest X-ray images.
Computers in Biology and Medicine, 139, 105002.
26. Qiblawey, Y., Tahir, A., Chowdhury, M. E., Khandakar, A., Kiranyaz, S., Rahman, T., et al.
(2021). Detection and severity classification of COVID-19 in CT images using deep learning.
Diagnostics, 11(5), 893.
27. Pacheco, A. G. C., & Krohling, R. A. (2020). The impact of patient clinical information on
automated skin cancer detection. Computers in Biology and Medicine, 116, 103545.
28. Han, S. S., Park, G. H., Lim, W., Kim, M. S., Na, J. I., Park, I., et al. (2018). Deep neural networks
show an equivalent and often superior performance to dermatologists in onychomycosis diag-
nosis: Automatic construction of onychomycosis datasets by region-based convolutional deep
neural network. PloS one, 13(1), e0191493.
29. Bhimavarapu, U., & Battineni, G. (2022). Skin lesion analysis for melanoma detection using
the novel deep learning model fuzzy GC-SCNN. In Healthcare, p. 962.
30. Martin-Gonzalez, M., Azcarraga, C., Martin-Gil, A., Carpena-Torres, C., Jaen, P., & Health,
P. (2022). Efficacy of a deep learning convolutional neural network system for melanoma
diagnosis in a hospital population. International Journal of Environmental Research and Public
Health, 19(7), 3892.
31. Haenssle, H. A., Fink, C., Schneiderbauer, R., Toberer, F., Buhl, T., Blum, A., et al. (2018). Man
against machine: diagnostic performance of a deep learning convolutional neural network for
dermoscopic melanoma recognition in comparison to 58 dermatologists. Annals of Oncology,
29(8), 1836–1842.
32. Brinker, T. J., Hekler, A., Enk, A. H., Klode, J., Hauschild, A., Berking, C., et al. (2019). A
convolutional neural network trained with dermoscopic images performed on par with 145
dermatologists in a clinical melanoma image classification task. European Journal of Cancer,
111, 148–154.
33. Yin, G., Gendler, S., & Teichman, J. (2022). Ocular surface squamous neoplasia in a patient
following oral steroids for contralateral necrotising scleritis. BMJ Case Reports CP, 15(12),
e253300.
34. Rahman, T., Chowdhury, M. E., Khandakar, A., Mahbub, Z. B., Hossain, M. S. A., Alhatou,
A., et al. (2022). BIO-CXRNET: A robust multimodal stacking machine learning technique
for mortality risk prediction of COVID-19 patients using chest x-ray images and clinical data.
arXiv preprint arXiv:2206.07595
35. Khandakar, A., Chowdhury, M. E., Reaz, M. B. I., Ali, S. H. M., Kiranyaz, S., Rahman, T.,
et al. (2022). A novel machine learning approach for severity classification of diabetic foot
complications using thermogram images. Sensors, 22(11), 4249.
36. Rahman, T., Khandakar, A., Islam, K. R., Soliman, M. M., Islam, M. T., Elsayed, A. et al.
(2022). HipXNet: Deep learning approaches to detect aseptic loosening of hip implants using
x-ray images. IEEE Access, 10, 53359–53373.
37. Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S., et al. (2020). Score-CAM: Score-
weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/
CVF conference on computer vision and pattern recognition workshops (pp. 24–25).
38. Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., et al. (2019).
Attention gated networks: Learning to leverage salient regions in medical images. Medical
Image Analysis, 53, 197–207.
39. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet
large scale visual recognition challenge. International Journal of Computer Vision, 115(3),
211–252.
40. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). Going deeper
with convolutions. In Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 1–9).
41. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
42. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K. Q. (2017). Densely connected convo-
lutional networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 4700–4708).
43. Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural
networks. In International conference on machine learning (pp. 6105–6114).
44. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-
cam: Visual explanations from deep networks via gradient-based localization. In Proceedings
of the IEEE international conference on computer vision (pp. 618–626).
45. Podder, K. K., Chowdhury, M. E., Tahir, A. M., Mahbub, Z. B., Khandakar, A., Hossain, M. S.,
et al. (2022). Bangla sign language (bdsl) alphabets and numerals classification using a deep
learning model. Sensors, 22(2), 574.
Plant Diseases Classification Using
Neural Network: AlexNet
1 Introduction
Not so long ago, India was primarily an agricultural country. Even today, there are roughly
118 million farmers in the country [1]. One of the major issues that these farmers and
cultivators face is the range of diseases that affect their plants. Not only does this exacerbate
their economic problems, it also affects their social life; several hours, and sometimes years,
of hard work can be lost. There are several chemicals that can be employed to alleviate this
problem. The major issue here is diagnosis, and unless farmers have a laboratory in their
vicinity, it is likely that diseases will be misidentified. Furthermore, the situation
may worsen, as the disease often spreads to other farms. India has seen a large
increase in smartphone sales, coupled with the rise of the middle class. Various
telecommunication companies want to capture this growing market, which has
driven the cost of internet usage to nearly zero. There are nearly 833 million
internet users [2], equal to 59.28% of the population of India. In this chapter,
we work towards a tool that, if provided to all farmers and cultivators with smartphones and
internet access, could reduce food loss in the country.
In order to help these farmers, David P. Hughes and Marcel Salathé created, in their
paper, a database called PlantVillage, which is an open-access database
of 50,000+ images of healthy and diseased crops. The PlantVillage platform covers more than 150
crops and 1800 diseases. PlantVillage is a community of people helping each other
by answering questions and identifying diseases by looking at the pictures in the
questions. It is helpful, but it has the drawbacks stated above [3]. In their paper, David
P. Hughes and Marcel Salathé described the advantage of computer diagnostic
tools over human diagnosis. We cannot download all the images in their
dataset, but in April 2016 PlantVillage released a subset of their dataset for an image
classification challenge on CrowdAI [3].
In this section, we discuss machine learning and neural networks in detail.
2.1 History
Deep learning was an underappreciated field for several reasons, such as the absence
of powerful GPUs, the absence of the required data, and limited scientific work. In fact,
"deep learning" is a term coined to attract interest in neural networks again. There have been
three phases of development in the field: it was known as cybernetics in the 1940s−1960s,
connectionism in the 1980s−1990s, and deep learning from the late 2000s. These models
are also known as artificial neural networks (ANNs) because their design is inspired by
biological neural networks [4].
The earliest neural network models were simple linear models. They were designed
to take inputs {x_1, x_2, …, x_N} at the input layer, corresponding to an output y. The
network would learn the weights {w_1, w_2, …, w_N} such that

f(x, w) = y = x_1 w_1 + x_2 w_2 + … + x_N w_N    (1)
The main challenge in machine learning is that our trained model must perform
well on new data points. This ability to perform well on new data points is called
generalization. When we train a model on a dataset, we have an error measure known
as the training error, and we want this error to be as low as possible. But in order to have a
working model, we also want our model to generalize well, which means
that the test error should be low [4, 5].
Take linear regression, for example: we train the model by minimizing the training
error, which is

(1 / m^(train)) ‖X^(train) w − y^(train)‖²₂    (2)
Generally, machine learning algorithms have several parameters that control the
behaviour of the training algorithm; these parameters are called hyperparameters. We
usually do not learn hyperparameters on the training set, because it is not appropriate to
do so: if we learn hyperparameters on the training set, the model will almost
always overfit. To solve this problem, we need another dataset, known as the validation
set. The validation set is taken from the training set but is not included in the training process.
It is used during and after training in order to estimate the generalization
or test loss, and we update the hyperparameters accordingly [4]. Typically, we
take 80% of the training dataset for training and 20% for validation.
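As a simple illustration of the split described above (the chapter does not state which tooling was used for this step), the 80/20 partition could be produced as follows:

```python
import random

def train_val_split(samples, val_fraction=0.2, seed=42):
    """Shuffle the labelled samples and hold out a fraction for validation."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]   # training set, validation set

# Hypothetical usage:
# train_set, val_set = train_val_split(list_of_labelled_images)
```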
Gradient descent and its variations are widely used in several deep learning algorithms
[6]. They minimize an error function

E_in(w) = (1/N) Σ_{n=1}^{N} e(h(x_n), y_n)    (4)
In order to compute the error, or the gradient of the error, we have to evaluate the hypoth-
esis at every point in the sample. We go down the error surface along the direction
dictated by the gradient. The steps used in this case are iterative; we take
one step at a time, and one step is a full epoch. Simply put, we consider an epoch when we
take all the examples at once. So, the weight-update direction in this case is

E_n[−∇e(h(x_n), y_n)]    (6)

If we take the error measure that we are going to minimize, in this case just one
example, and take the expected value, we get an equation which is similar to the equation
mentioned above [4, 6, 7].
Average direction:

E_n[−∇e(h(x_n), y_n)] = (1/N) Σ_{n=1}^{N} −∇e(h(x_n), y_n)    (7)
So, this is as if we are actually going in the direction we want, except that we only
use one example in the computation and then keep repeating. Thus, we will always
get the expected value in that direction and with time, the noise will average out and
we’ll go along the desired direction.
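The stochastic update described above, applied to the linear model of Eq. (1) with a squared error measure, can be sketched as follows; this is a toy illustration, not the chapter's training code:

```python
import numpy as np

def sgd_linear(X, y, lr=0.01, epochs=10, seed=0):
    """Stochastic gradient descent on a linear model with squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(X)):      # visit one example at a time
            err = X[n] @ w - y[n]              # h(x_n) - y_n
            grad = 2 * err * X[n]              # gradient of e(h(x_n), y_n)
            w -= lr * grad                     # step along the negative gradient
    return w

# Hypothetical usage:
# w = sgd_linear(np.random.rand(100, 3), np.random.rand(100))
```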
Suppose we denote the weights by w_{ij}^{l}, where l indexes the layers [7, 8] (Fig. 1):

Weight w_{ij}^{l}:  1 ≤ l ≤ L (layers),  0 ≤ i ≤ d^{(l−1)} (inputs),  1 ≤ j ≤ d^{(l)} (outputs)    (8)
θ(s) = tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s})    (9)
x_j^{(l)} = θ(s_j^{(l)}) = θ( Σ_{i=0}^{d^{(l−1)}} w_{ij}^{(l)} x_i^{(l−1)} )    (10)
We take one example at a time, apply it to the network, and adjust the weights of
the network in the direction of the negative gradient, which makes the procedure
stochastic [7].
All the weights w = {w_{ij}^{l}} determine h(x).
The error on example (x_n, y_n) is

e(h(x_n), y_n) = e(w) = (h(x_n) − y_n)²    (11)
So, to implement SGD, all we have to do is implement the gradient ∇e(w):

∇e(w): ∂e(w)/∂w_{ij}^{l}  for all i, j, l    (12)
Fig. 3 Backpropagation:
phase I
All we have to do is compute this for every i, j, l, take the entire set of
weights, and move along the negative gradient (Fig. 3).
We can evaluate ∂e(w)/∂w_{ij}^{l} using the chain rule:

∂e(w)/∂w_{ij}^{l} = (∂e(w)/∂s_j^{l}) × (∂s_j^{l}/∂w_{ij}^{l}),  where  ∂e(w)/∂s_j^{l} = δ_j^{l}  and  ∂s_j^{l}/∂w_{ij}^{l} = x_i^{l−1}    (13)
Now let us find δ for the final layer. In the forward computation we obtained the x values
for the first layer and propagated them forward until we reached the output. The reason is that,
if we know δ for the final layer, we will be able to use it to find δ for the previous layers by
propagating backwards, hence the name backpropagation.

∂e(w)/∂s_j^{l} = δ_j^{l},  for the final layer l = L and j = 1.

So,

δ_1^{L} = ∂e(w)/∂s_1^{L}  and  e(w) = e(h(x_n), y_n)    (14)
Fig. 4 Backpropagation:
phase II
e(w) is the error measure. This is applied at each layer until we reach the output h(x_n),
which is compared to the target output y_n.

e(w) = (x_1^{L} − y_n)²,  where  x_1^{L} = θ(s_1^{L})  and  θ′(s) = 1 − θ²(s) for tanh    (16)
δ_i^{l−1} = ∂e(w)/∂s_i^{l−1}
         = Σ_{j=1}^{d^{(l)}} (∂e(w)/∂s_j^{l}) × (∂s_j^{l}/∂x_i^{l−1}) × (∂x_i^{l−1}/∂s_i^{l−1})    (17)

where  ∂e(w)/∂s_j^{l} = δ_j^{l},  ∂s_j^{l}/∂x_i^{l−1} = w_{ij}^{l},  ∂x_i^{l−1}/∂s_i^{l−1} = θ′(s_i^{l−1}),

so  δ_i^{l−1} = Σ_{j=1}^{d^{(l)}} δ_j^{l} × w_{ij}^{l} × θ′(s_i^{l−1})
δ_i^{l−1} = (1 − (x_i^{l−1})²) Σ_{j=1}^{d^{(l)}} w_{ij}^{l} × δ_j^{l}    (18)
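A compact numerical sketch of the forward pass of Eq. (10) and the backward pass of Eqs. (13)–(18) for a network with one hidden layer of tanh units; this is an illustrative implementation under the convention that index 0 of each layer is the constant bias unit:

```python
import numpy as np

def forward(x, W1, W2):
    """Forward pass of Eq. (10): x_j^(l) = tanh(sum_i w_ij^(l) x_i^(l-1))."""
    x0 = np.append(1.0, x)                 # input layer with bias unit x_0 = 1
    s1 = W1 @ x0                           # signals of the hidden layer
    x1 = np.append(1.0, np.tanh(s1))       # hidden activations with bias unit
    x2 = np.tanh(W2 @ x1)                  # network output h(x)
    return x0, x1, x2

def backprop_step(x, y, W1, W2, lr=0.1):
    """One stochastic gradient step using the deltas of Eqs. (14)-(18)."""
    x0, x1, x2 = forward(x, W1, W2)
    # Output delta for e(w) = (x_1^L - y)^2, with tanh'(s) = 1 - tanh(s)^2.
    delta2 = 2.0 * (x2 - y) * (1.0 - x2 ** 2)
    # Hidden deltas, Eq. (18): (1 - (x_i^(l-1))^2) * sum_j w_ij^(l) delta_j^(l).
    delta1 = (1.0 - x1[1:] ** 2) * (W2[:, 1:].T @ delta2)
    # Gradient w.r.t. each weight is x_i^(l-1) * delta_j^(l), Eq. (13).
    W2 -= lr * np.outer(delta2, x1)
    W1 -= lr * np.outer(delta1, x0)
    return W1, W2

# Hypothetical usage with 2 inputs and 3 hidden units:
# rng = np.random.default_rng(0)
# W1, W2 = rng.normal(size=(3, 3)), rng.normal(size=(1, 4))
# W1, W2 = backprop_step(np.array([0.5, -0.2]), 1.0, W1, W2)
```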
A convolutional neural network is a special kind of neural network. It was given this
name because it uses convolution in at least one layer. It is widely used in computer
vision, image segmentation, classification, and other tasks [9, 10].
2.3.1 Convolution
2.3.2 Pooling
A layer of a convolutional network has three stages: a convolution stage, an activation function
such as ReLU, and a pooling stage. A pooling layer changes the output of the network by
replacing some regions of its input with a statistical summary of them. It performs downsampling
along the height and width dimensions. The most commonly used pooling layer is max pooling.
2.3.3 ReLU
Convolutional nets were some of the first working deep networks trained with
backpropagation. It is not fully clear why convolutional networks succeeded when
general backpropagation networks were considered to have failed.
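The three stages described above, convolution, a ReLU activation f(x) = max(0, x), and max pooling, can be sketched in PyTorch as follows; the layer sizes are illustrative only:

```python
import torch
import torch.nn as nn

# One convolutional "layer" in the three-stage sense: convolution,
# a ReLU non-linearity, and 2x2 max pooling that halves height and width.
conv_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(1, 3, 227, 227)      # a dummy RGB image batch
print(conv_block(x).shape)           # -> torch.Size([1, 16, 113, 113])
```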
There are several deep learning libraries to choose from. Some popular ones are:
2.4.1 Theano
Theano is a framework based on python developed by the LISA group and run by
Yoshua Bengio at the University of Montreal [11].
2.4.2 Torch
2.4.3 Caffe
Caffe is a deep learning framework (with Python bindings) developed by Yangqing Jia at the Berkeley
Vision and Learning Center. The biggest advantage of Caffe is the number of pre-
trained networks that can be downloaded from its Model Zoo [13].
2.4.4 Tensorflow
In this section, we discuss the experimental results and the model used.
3.1 Dataset
The dataset on CrowdAI consists of 54,309 images for training the neural network.
It has 14 different species of crop, 17 fungal diseases, 4 bacterial diseases, 2 mold
diseases, 2 viral disease, 1 disease caused by a mite and 12 crop species that are
visibly healthy. This means that there are 38 classes of images.
These 14 crop species are: Apple, Blueberry, Cherry, Corn, Grape, Orange, Peach,
Bell Pepper, Potato, Raspberry, Soybean, Squash, Strawberry, and Tomato (Fig. 5).
In the Fig. 1 above, there are 38 images each corresponding to different class of
diseases [3].
Since we are trying to fine-tune AlexNet, we have to make sure that the images
are of exactly the same size as those used to originally train it. AlexNet was
trained on images of size 256 × 256 pixels with a central crop of 227 × 227 pixels.
This means that we have to resize all the images of the PlantVillage dataset. Instead
of having to deal with images straight from the disk, we store them in LMDB,
which is a high-performance embedded transactional database. While Caffe does
support reading images directly from the disk, using LMDB as the data store gives
quite significant performance gains. Finally, we compute the mean of all the
images. This will be useful in both the training and testing processes. After correctly
updating the LMDB store references, fine-tuning the parameters in the configuration files,
and changing the hyperparameters in the solver configuration file, we train the model.
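A hedged sketch of the resizing and mean-image computation described above, using PIL and NumPy (the LMDB creation itself is done with Caffe's own tooling and is omitted here; the file paths are placeholders):

```python
import numpy as np
from PIL import Image

def resize_image(path, size=(256, 256)):
    """Load one PlantVillage image and resize it to the 256 x 256 input AlexNet expects."""
    return np.asarray(Image.open(path).convert("RGB").resize(size), dtype=np.float64)

def mean_image(paths, size=(256, 256)):
    """Running per-pixel mean over the training set, subtracted during training and testing."""
    mean = np.zeros((size[1], size[0], 3))
    for p in paths:
        mean += resize_image(p, size)
    return mean / len(paths)

# Hypothetical usage:
# mu = mean_image(["plantvillage/apple_scab_0001.jpg", "plantvillage/tomato_blight_0002.jpg"])
```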
3.3 Architecture
In 2012, Alex Krizhevsky, Ilya Sutskever and Geoff Hinton submitted a convolutional
network called AlexNet to the ImageNet ILSVRC challenge. The ILSVRC challenge,
also known as the ImageNet challenge, is conducted every year; participants have
to build a model that can classify millions of images into 1000 object classes.
They won the challenge the same year, and since then it has always been a variation of a
CNN that has won the challenge (Fig. 6).
The input layers in AlexNet are formed by the raw pixel values obtained from the
image, and the final layer gives a probability distribution across all the classes. The
intermediate layers use a “processed version” of the output of the previous layer as
their input, and over the whole training period they learn to activate against more and
more complex features depending on how deep they are in the overall architecture.
Neural networks such as AlexNet are computationally very expensive and intensive; training
on the ImageNet dataset usually takes several weeks. Fortunately, the features
learnt by the earlier layers are very generic in nature and can therefore be used on a new
dataset with totally different classes. This approach is known as transfer learning or
fine-tuning. In transfer learning, we take a pre-trained model, use the learnt
weights and, after modifying the final fully connected layers, train on the
new dataset. This gives us better results. In our PlantVillage dataset, we have
38 classes instead of the 1000 classes from ImageNet. So, we have to change the num_
output parameter of the final fully connected layer in the Caffe training configuration file
[3, 8, 15, 16].
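The chapter performs this step through Caffe configuration files. As an equivalent illustration of the same idea, replacing the 1000-way output layer with a 38-way layer and fine-tuning, here is a hedged PyTorch sketch (not the setup actually used):

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 38  # PlantVillage classes instead of the 1000 ImageNet classes

# Load AlexNet with ImageNet-pretrained weights, then swap the last
# fully connected layer so that it outputs 38 scores.
model = models.alexnet(weights="IMAGENET1K_V1")
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_CLASSES)

# Optionally freeze the early convolutional layers, whose features are generic,
# and fine-tune only the classifier.
for param in model.features.parameters():
    param.requires_grad = False
```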
3.4 Results
If the data is pre-processed and the files are correctly configured, there will be no problem
in training the model. When we train the model, we have to make sure that we
maintain a log file. This is done in order to understand the training process;
the log file can also be used to generate graphs. It took roughly around 2 h to
train the model for 2000 iterations (Fig. 7).
We can see the development of three performance measures: training loss, test
loss and test accuracy. Training and test loss decreased significantly, from nearly 1
to 0.1, whereas the accuracy on the test dataset was around 91.3%, which is quite
impressive. The two most important factors to be considered in transfer learning are
the size of the data and the similarity of the data to the original dataset. If the new dataset is
small and similar to the original dataset, there is a high chance that the model will
overfit. If we have a large dataset, this may work, given that both datasets are similar
[17–27].
Fig. 7 Training curve for accuracy and loss with 2000 iterations
4 Conclusion
In conclusion, the use of deep learning in the form of image classification can provide
a budget-friendly and efficient solution to the problem of plant diseases affecting
farmers and cultivators; otherwise, farmers would need well-equipped labs to deter-
mine the disease. AlexNet is able to obtain 98 to 99% accuracy on the training set and
91.3% accuracy on the test set. In the future, we would like to employ different deep
learning models and perform different types of augmentation.
References
1. Agarwal, K. (2021). Indian agriculture’s enduring question: Just how many farmers does the
country have?. The Wire. Retrieved, 22.
2. BBC. (2023, January 23). India media guide. BBC News. https://www.bbc.com/news/world-south-asia-12557390
3. Hughes, D., & Salathé, M. (2015). An open access repository of images on plant health to
enable the development of mobile disease diagnostics. arXiv preprint arXiv:1511.08060.
4. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Book in preparation for
MIT Press. http://www.deeplearningbook.org
5. Jabbar, H., & Khan, R. Z. (2015). Methods to avoid over-fitting and under-fitting in
supervised machine learning (comparative study). Computer Science, Communication and
Instrumentation Devices, 70, 163–172.
6. Robbins, H., & Monro, S. (1951). A stochastic approximation method. The annals of
mathematical statistics, 400–407.
7. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine
Learning, 2, 1–127. Also published as a book. Now Publishers.
8. Hecht-Nielsen, R. (1992). Theory of the backpropagation neural network. In Neural networks
for perception (pp. 65–93). Academic Press.
9. Roy, S. S., Awad, A. I., Amare, L. A., Erkihun, M. T., & Anas, M. (2022). Multimodel phishing
URL detection using LSTM, bidirectional LSTM, and GRU models. Future Internet, 14(11),
340.
10. O’Shea, K., & Nash, R. (2015). An introduction to convolutional neural networks. arXiv
preprint arXiv:1511.08458.
11. Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., Ballas, N., Bastien,
F., Bayer, J., Belikov, A., Belopolsky, A., Bengio, Y., Bergeron, A., Bergstra, J., Bisson, V.,
Snyder, J. B., Bouchard, N., Boulanger-Lewandowski, N., Bouthillier, X., de Brébisson, A.,
… Zhang, Y. (2016). Theano: A python framework for fast computation of mathematical
expressions. arXiv e-prints, arXiv-1605.
12. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z.,
Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani,
A., Chilamkurthy, S., Steiner, B., Fang, L., ... Chintala, S. (2019). Pytorch: An imperative style,
high-performance deep learning library. Advances in neural information processing systems, 32.
13. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Guadarrama, S. &
Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings
of the 22nd ACM international conference on Multimedia (pp. 675–678).
14. Gibson, A., Nicholson, C., Patterson, J., Warrick, M., Black, A. D., Kokorin, V., ... & Eraly, S.
(2016). Deeplearning4j: Distributed, opensource deep learning for Java and Scala on hadoop
and spark. Towards Data Science.
15. Fei Fei, L., Karpathy, A., Johnson, J. CS231N–Stanford University
16. Krizhevsky, A., Sutskever, I., Hinton, G. E. ImageNet classification with deep convolutional
neural networks. University of Toronto.
17. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D., Paraschiv, N., Mohammadi-
Ivatloo, B. (2014). L2 regularized deep convolutional neural networks for fire detection. Journal
of Intelligent & Fuzzy Systems, (Preprint), 1–12.
18. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022) Deep convolutional neural
network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy
Systems Preprint, 1–7.
19. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain
tumor classification. Applied Sciences, 10(14), 4915.
20. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of
retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions
of Electrical Engineering, 44(1), 505–518.
21. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural
network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy
Systems, (Preprint), 1–7.
22. Deep learning research should be encouraged for diagnosis and treatment of antibiotic
resistance of microbial infections in treatment associated emergencies in hospitals.
23. Lee, K. C., Roy, S. S., Samui, P., & Kumar, V. (Eds.). (2020). Data analytics in biomedical
engineering and healthcare. Academic Press.
24. Samui, P., Roy, S. S., & Balas, V. E. (Eds.). (2017). Handbook of neural computation. Academic
Press.
25. Forecasting stock price by hybrid model of cascading multivariate adaptive regression splines
and deep neural network
26. Roy, S. S., & Taguchi, Y. H. (2021). Identification of genes associated with altered gene
expression and m6A profiles during hypoxia using tensor decomposition based unsupervised
feature extraction. Scientific reports, 11(1), 1–18.
27. Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H. T. Learning from data. https://amlbook.com/
Hyperspectral Images: A Succinct
Analytical Deep Learning Study
1 Introduction
Since the advent of imaging spectroscopy in the 1980s, hyperspectral images (HIs) have
been acquired owing to the computational capability to classify fine spectra, which
provides resolving power for a diverse range of applications. Some include remote-
sensing-based environmental, atmospheric and ocean observations [66], meteorolog-
ical applications, military applications [37], geological exploration and mining [53], crop, vege-
tation and food analysis, and standalone biomedical fields [56]. In addition to having
high spectral and spatial resolution, HIs have many bands and abundant information
because they cover the ultraviolet, visible, near-infrared, and mid-infrared wavelengths.
This opens avenues of research in HI-based image correction [77], noise reduction
[40], transformation [48], dimensionality reduction, and classification [8].
For machine learning (ML) based methods to process HIs, a large number of legitimate
samples need to be labelled for training. Early research in this regard
focused on spectral-information-based HI classification methods such as support
vector machines [72], random forests, neural networks [20, 67], and polynomial logistic
regression [45]. An HI represents the image as a “hypercube” (x, y, λ), in which
the first two dimensions indicate spatial coordinates and the third indicates the
spectral band. As a result, each pixel represents a pattern with as many attributes
as there are bands. With the complexity associated with the large number of bands in HIs,
the high data volume (which grows exponentially) to be processed further motivates the
need to reduce the dimensionality and to minimize the computational complexity
L. S. Kumar
Biju Patnaik University of Technology, Rourkela, Odisha, India
G. K. Panda
MITS School of Biotechnology, Utkal University, Bhubaneswar, Odisha 765017, India
B. K. Tripathy (B)
School of Information Technology and Engineering, VIT, Vellore, Tamil Nadu 632014, India
e-mail: [email protected]
Fig. 1 Bhitarkanika mangrove (Source Google Maps): a Binary image b Grayscale image c RGB
image
2 Related Research
In the process of classifying HIs, the spectral dimension (Fig. 2) helps in iden-
tifying the significant variations of reflectance between image pixels, which change
with wavelength [38]. In one study [31], it was observed that the classification accu-
racy drops dramatically once the number of spectral bands increases beyond a certain value.
Since a majority of spectral bands are redundant in nature, taking all bands into consid-
eration affects the model’s performance. Dimension reduction techniques [28, 57]
are therefore used to identify such unnecessary bands without compromising the
image’s information content. The modified broken-stick rule for HIs [3] contributes
a phenomenal aspect of dimension reduction. In the majority of cases, the reduced
band features suffer from anomalies of object identification and necessitate
discriminative spatial features. As per a study [19], pixels next to each other
tend to belong to the same class in HIs; hence the application of an HI’s spatial features along
with its spectral features is an intuitive motivation for effective classification.
Several feature extraction methodologies, such as the
gray-level co-occurrence matrix [44, 54], the stationary wavelet transform (SWT) [43,
73], discrete wavelet transforms [10, 22], and morphological profiles [4, 55], have been
used in many real-world applications.
Neural network-based techniques have been implemented to tackle many complex
problems in remote sensing [67]. DL techniques have become extremely popular in
recent years, with several real-life applications such as the study of gene characteristics [25],
text-based image retrieval [60], audio signal classification [9], image processing [2],
health care analysis [36], measuring confidence in interviewees [61], face mask
detection [64], classification of skin cancer [63] and computer vision [70]. In general,
DL has influenced research in AI in a major way. Some such studies are presented
in [1, 69]. The application of deep learning (DL) to ANNs has led to the development
of deep neural networks (DNNs) [6]. Some of the DL algorithms that are used in
HI classification are stacked autoencoders (SAEs) [5] and deep belief networks (DBNs)
[30]. Convolutional neural networks (CNNs) [21, 49] are also used in HI classification
[11]. CNNs have wide applications, such as MRI segmentation [68], diabetic retinopathy
[58], the study of Big Data [7], classification of pests [16], COVID-19 detection [62]
and weather classification [23]. Innovative approaches with 2D-CNNs
[50], 3D-CNNs [26, 46], spectral-spatial LSTMs [80], SSUN [76] and SSRN [79] have
also been employed in HI classification. The literature shows that a 2D-CNN alone is
not able to generate discriminative features for classification [59], whereas a 3D-CNN is
found to be suitable for volumetric samples. However, the latter lacks the ability to generate discrim-
inative features for classes that have textural similarity across several spectral bands.
Taking these shortcomings into account, the HybridSN model [59] was proposed, which comprises
2-D and 3-D convolution layers to generate discriminative spectral and spatial
features. The MCNNCP model [78] also achieves promising accuracy using 3-D
and 2-D convolution layer based solutions.
DL has achieved noteworthy performance in the domains of visual information
processing and AI. Special DNNs such as gated recurrent unit networks are used
for detecting toxicity [39], and wide ResNets are used for age and gender estimation
[14]. This approach pioneered the automatic extraction of hierarchical deep features
in a practicable way for an HI. It considers an image to be organized with
hierarchical components such as pixels, edges, parts and objects. In contrast to shallow
handcrafted features, end-to-end deep features are capable of representing more
abstract and complex shapes in the image. They perform well even in circumstances
where there are rapid regional changes in an image.
Conventional image classification presumes that the data follow a uniform distri-
bution across the diverse classes and is prone to favouring samples belonging to
the majority classes, leading to an imbalance phenomenon in the case of HIs. Hence,
special care or measures need to be taken to tackle such imbalanced character-
istics of HI classification [65]. Studies in [29, 47, 74] respectively use data augmen-
tation, pixel-pairing and automatic allocation of unlabeled samples, and
demonstrate their efficacy in HI classification. Studies in [27, 41–43] model
recent novel concepts of SWT and CNN, decomposition and deep residual nets, 3D-,
2D- and depthwise-separable-1D CNNs, CNN with Grey Wolf optimization, and
1D-EWT with 3D-CNN.
The following are some of the intuitive outcomes of the literature that motivate the
proposed models undertaken in the following sections:
1. To ensemble a DL model that addresses hierarchical feature extraction.
2. To perform and learn with limited training data.
3. To demonstrate minimum information loss due to dimension reduction, convo-
lution and max pooling operations.
4. To address the issues of vanishing gradients, minimum computational time, the class
imbalance problem and tolerance to noise.
3 Hyperspectral Images
The idea behind Deep Learning (DL) is to train computers/machines to model complex
algorithms that learn from experience and classify and recognize data or images much
as a human brain does. As a type of ANN, the CNN is widely used for image and object
recognition (processing images, analyzing videos, and detecting obstacles in autonomous
vehicles). There have been phenomenal developments in devising ANN-based methods
in the DL classification and object/image recognition domain. Three core DL layer types
(dense, convolution and output) support supervised, semi-supervised and unsupervised
HI learning solutions.
HI-based DL models built on these three designs are being developed for many
classification and object-identification purposes. Which design suits a given application
depends on the availability of labeled HI data. Specifically, when the HI model maps
labeled datasets to the ground truth, the supervised design is used; to extract properties
of HI data from unlabeled datasets, the unsupervised design is applied; and when only a
small portion of labeled HI data is available, the semi-supervised design is preferred.
Further, convolutional neural networks
(CNNs) in contrast to deep forward neural networks (DNNs) and autoencoders (AEs)
In HIs with high spectral resolution, the information at each pixel is generally arranged
into a one-dimensional spectral vector. A 1D-CNN model identifies specific features of
the HI from this pool of spectral information, pixel by pixel, for further classification.
In simple terms, a 1D-CNN takes labeled HI data as input, trains against the class labels
by updating the network weights iteratively with stochastic gradient descent, and yields
a per-pixel classification. Convolution operations on 1-D feature vectors are performed
using the 1-D convolution kernel defined in Eq. 1.
$$
v_{l,j}^{\,x} \;=\; f\!\left(\sum_{m}\;\sum_{h=0}^{H_{l}-1} k_{l,j,m}^{\,h}\; v_{(l-1),m}^{\,(x+h)} \;+\; bias_{l,j}\right) \qquad (1)
$$
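As an illustration of this per-pixel spectral classification, the following is a minimal, hedged Keras sketch of a 1D-CNN over spectral vectors; the band count, class count, layer sizes and optimizer settings are assumptions and not the configuration used in the study.

```python
# Minimal sketch (not the authors' exact network) of a 1D-CNN that classifies
# each HI pixel from its spectral vector; band and class counts are assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

num_bands, num_classes = 200, 16           # assumed dataset-dependent values

model_1d = models.Sequential([
    layers.Input(shape=(num_bands, 1)),     # one spectral vector per pixel
    layers.Conv1D(32, kernel_size=7, activation='relu'),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(64, kernel_size=5, activation='relu'),
    layers.Flatten(),
    layers.Dense(num_classes, activation='softmax'),
])
model_1d.compile(optimizer='sgd',           # stochastic gradient descent, as in the text
                 loss='categorical_crossentropy', metrics=['accuracy'])
```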
To perform a convolution operation on 3-D data, 3-D CNNs use 3-D convolution
kernels (R_i denotes the size of each kernel). As the main objective is to extract the
low-level features contained in the HIs, 3-D filters are applied to the input image,
producing a cube or cuboid in the 3-D volume space. In 3-D convolution, the same
3-D kernel is applied to overlapping 3-D cubes in the input to extract features.
Max pooling, dropout, batch normalization and flatten operations are generally used to
process the multi-scale feature maps generated by each 3-D convolution layer.
$$
map_{i,j}^{\,x,y,z} \;=\; \tanh\!\left(\sum_{m}\;\sum_{p=0}^{P_{i}-1}\;\sum_{q=0}^{Q_{i}-1}\;\sum_{r=0}^{R_{i}-1} w_{i,j,m}^{\,p,q,r}\; map_{(i-1),m}^{\,(x+p)(y+q)(z+r)} \;+\; bias_{i,j}\right) \qquad (3)
$$
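The following is a small illustrative Keras sketch of such a 3-D convolution block, with batch normalization, 3-D max pooling, dropout and flattening; all shapes and layer sizes are assumed values rather than the study's configuration.

```python
# Illustrative 3-D convolution block (assumed shapes), mirroring the pipeline
# described above: Conv3D -> BatchNormalization -> MaxPooling3D -> Dropout -> Flatten.
import tensorflow as tf
from tensorflow.keras import layers, models

patch, bands, n_classes = 13, 30, 16        # assumed spatial size and band count

model_3d = models.Sequential([
    layers.Input(shape=(patch, patch, bands, 1)),
    layers.Conv3D(8, kernel_size=(3, 3, 7), activation='relu'),
    layers.BatchNormalization(),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.Dropout(0.4),
    layers.Flatten(),
    layers.Dense(n_classes, activation='softmax'),
])
```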
$$
m_{t} = \beta_{1}\, m_{t-1} + (1-\beta_{1})\,\varepsilon_{t}, \qquad
v_{t} = \beta_{2}\, v_{t-1} + (1-\beta_{2})\,\varepsilon_{t}^{2} \qquad (4)
$$
$$
w_{t+1} = w_{t} + \Delta w_{t}, \qquad \text{where } \Delta w_{t} = -\eta\,\frac{v_{t}}{\sqrt{s_{t}+\epsilon}}\,\varepsilon_{t} \qquad (5)
$$
The decay rates (the hyperparameters β1 and β2) control the relative contribution of
past history with respect to the present gradient, and each weight w is updated to w_{t+1}.
Here η is the initial learning rate, ε_t is the gradient at time t, v_t is the exponential
average of the gradient and s_t is the exponential average of the squared gradient.
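For illustration, the sketch below implements a standard Adam-style step in plain NumPy (exponential averages of the gradient and of its squared value, followed by a scaled update). It follows the usual textbook form rather than reproducing Eq. (5) exactly, and every name and default value is illustrative.

```python
# Plain NumPy sketch of an Adam-style update; bias correction is omitted because
# it is not shown above. All variable names and defaults are illustrative.
import numpy as np

def adam_step(w, grad, m, s, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # exponential average of the gradient
    s = beta2 * s + (1 - beta2) * grad ** 2     # exponential average of the squared gradient
    w = w - eta * m / (np.sqrt(s) + eps)        # scaled parameter update
    return w, m, s
```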
Sundarban is one of the largest mangrove areas in the world, stretching from India to
Bangladesh across a delta formed by the Brahmaputra, Meghna and Padma rivers in the
Bay of Bengal. It comprises around 106 islands and supports rich biodiversity (Fig. 5),
being home to a wide range of wildlife species, including endangered ones.
Fig. 5 Satellite data: Sundarban mangrove a Composite image b Ground truth image c 12-band
HI visualization
The six ground-truth classes are as follows: Barren land (BL), land devoid of vegetation
or sand dunes; River (RV), water bodies; Dense Mangrove (DM), mangrove forest with
dense canopy cover; Open Mangrove (OM), mangrove forest with open canopy and
mudflats with very little mangrove cover; Agriculture (AG), land under active
agricultural practice; and Human habitat (HUM), human habitation, often under the
canopy shade of non-mangrove plants.
Principal Component Analysis (PCA) and the TensorFlow-based Keras package of
Python are used to extract 3D patches (containing true classes) and to encode the
reduced high-dimensional input with a 0.7/0.3 split.
Next, the data are processed by a 3D-CNN built from convolution, dropout and dense
layers with 1,204,098 trainable parameters. The model evaluates the six optimizers
discussed in Sect. 2.3 and selects the best (here, Adam). The TensorBoard,
EarlyStopping and ModelCheckpoint callbacks were used, respectively, to keep track of
learning logs during every batch, to monitor the learning status and counter overfitting,
and to checkpoint losses at the epoch level.
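A hedged sketch of these callbacks is shown below; the log directory, checkpoint filename and patience value are assumptions, not taken from the original experiment.

```python
# Sketch of the training callbacks mentioned above; file paths and the patience
# value are illustrative assumptions.
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping, ModelCheckpoint

callbacks = [
    TensorBoard(log_dir='logs'),                               # track per-batch learning logs
    EarlyStopping(monitor='val_loss', patience=5,
                  restore_best_weights=True),                  # guard against overfitting
    ModelCheckpoint('best_model.h5', monitor='val_loss',
                    save_best_only=True),                      # epoch-level checkpointing
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, callbacks=callbacks)
```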
The classification accuracy obtained for the input HI is shown below.
To handle the unbalanced classes and to minimize the loss during training and validation
of the HI patches, categorical cross-entropy (CCE) is used (Fig. 6). CCE can be written as
$CCE = -\sum_{i}^{C} t_{i}\,\log f(S_{i})$ with $f(S_{i}) = e^{S_{i}} \big/ \sum_{j}^{C} e^{S_{j}}$,
where C is the set of classes, $t_{i}$ the ground truth and $S_{i}$ the corresponding CNN
score for class i under the softmax activation function. The data were augmented using
random horizontal and vertical flips. After tuning, monitor = ‘val_loss’ and restore_best_
weights = True were set, the batch size was (1024 × 6), and the optimizer used was Adam
with CCE.
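As a quick numerical illustration of the CCE and softmax definitions above, the following standalone snippet evaluates them on made-up scores and a one-hot target; the numbers are purely illustrative.

```python
# Standalone numerical check of the CCE/softmax definitions above (illustrative
# scores and one-hot target only; not data from the study).
import numpy as np

scores = np.array([2.0, 1.0, 0.1])                    # CNN scores S_i for C = 3 classes
target = np.array([1.0, 0.0, 0.0])                    # one-hot ground truth t_i

softmax = np.exp(scores) / np.sum(np.exp(scores))     # f(S_i)
cce = -np.sum(target * np.log(softmax))               # categorical cross-entropy
print(softmax, cce)
```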
Scientists often combine two or more types of architecture instead of relying on a
single approach (hybrid models), which can yield better results on complex problems.
In other words, hybrid models are a class of methods that integrate the advantages of
different models in the same system. The following sections describe the methodology
for deep classification of hyperspectral images (HIs) from three HI datasets.
We have used the following three popular HI-datasets [13] to validate our Hybrid-
MSSN model (Table 1).
The detailed process of the model is outlined in Algorithm 1. The experimental setup is
based on the Google Colaboratory Pro cloud platform with Python, Jupyter notebooks
and GPUs. Keras, a deep learning tool, was used to validate the model.
In deep CNN classification, as the network becomes deeper the spatial dimensions of
the feature maps shrink sharply, which results in information loss. Conventionally, the
FC layer is attached to the deepest convolutional (or pooling) layer, so the network
depends heavily on global information, which leads to high computation time. To
overcome these issues, we use both shallow and deep convolution features [76] to
account for the complexity of HIs, where distinct objects are likely to have varying
scales, and we use the Spinal Fully Connected network (SpinalNet [34]) instead of the
dense layer.
To experiment with the HIs, we first apply a PCA transformation to extract the r most
informative spectral bands (r = 30 for IPD and r = 15 for PUD and SD), following the
Modified Broken-Stick Rule (MBR) [3]. With iterative noise filtering we obtain HI
cubes of reduced dimension (13 × 13 × r), consistent with the findings in [59].
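A hypothetical sketch of this band-reduction and patch-extraction step is given below; the function names, the use of scikit-learn's PCA and the reflect padding are illustrative assumptions.

```python
# Hypothetical sketch of PCA-based band reduction and 13x13 patch extraction;
# variable names and the padding strategy are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(cube, r=30):
    h, w, b = cube.shape
    flat = cube.reshape(-1, b)
    reduced = PCA(n_components=r).fit_transform(flat)    # keep r informative bands
    return reduced.reshape(h, w, r)

def extract_patch(cube, row, col, size=13):
    m = size // 2
    padded = np.pad(cube, ((m, m), (m, m), (0, 0)), mode='reflect')
    return padded[row:row + size, col:col + size, :]      # (13, 13, r) cube around the pixel
```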
The HI cubes were further divided into two groups with distinct training and testing
samples (Fig. 8). One group uses a 10/90 percent train-to-test split and the other a 30/70
percent split, which compensates for the class imbalance problem. Table 2 presents the
classification outcomes of the three datasets with oversampling.
The approach also achieves impressive results on the 3-D patches of all three datasets
without oversampling; for instance, with a (13 × 13) patch size, the accuracies for the
three datasets are reported in Table 3. We use Overall Accuracy (OA), Average
Accuracy (AA), the Kappa value (KA) and class-wise accuracy to evaluate the model.
Fig. 8 Testing: IPD, SD, PUD a Overall accuracy b Average accuracy c Kappa score
We use the first category, with 10 percent training samples, for model validation.
With 2⁶, 2⁷ and 2⁸ (i.e., 64, 128 and 256) filters of dimension 3 × 3 × 3 in the first,
second and third 3D-convolution layers respectively, we adopt the ReLU activation
function. Each 3D convolution layer is followed by 3D max pooling with a pool size of
2 and a dropout ratio of 0.5. The 2D-convolution layer in the model has 256 filters of
dimension 3 × 3 and a dropout ratio of 0.25. In all SFCNs (1–5), the layer width is set
to 256 and the half-width to the rounded integer value of half the layer width, which
plays a significant role. We use the Adam optimizer with the categorical cross-entropy
loss function (Fig. 9). The learning rate and decay were set to 0.001 and
1e−06 respectively. The model is trained over 20 epochs with a batch size of 256.
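The sketch below approximates the 3-D/2-D stem just described (2⁶, 2⁷ and 2⁸ 3-D filters, then a 256-filter 2-D layer, Adam with categorical cross-entropy). It is only an approximation: the SpinalNet head is not reproduced, an ordinary dense output layer stands in for it, and the reshape step that folds the spectral depth into channels is an assumed implementation detail.

```python
# Rough functional sketch of the hybrid 3-D/2-D stem described above. The SpinalNet
# head is NOT reproduced here; an ordinary dense layer stands in for it.
import tensorflow as tf
from tensorflow.keras import layers, models

def hybrid_stem(patch=13, r=30, n_classes=16):
    inp = layers.Input(shape=(patch, patch, r, 1))
    x = inp
    for filters in (64, 128, 256):                     # 2^6, 2^7, 2^8 3-D filters
        x = layers.Conv3D(filters, (3, 3, 3), padding='same', activation='relu')(x)
        x = layers.MaxPooling3D(pool_size=2, padding='same')(x)
        x = layers.Dropout(0.5)(x)
    shape = x.shape
    x = layers.Reshape((shape[1], shape[2], shape[3] * shape[4]))(x)  # fold depth into channels
    x = layers.Conv2D(256, (3, 3), padding='same', activation='relu')(x)
    x = layers.Dropout(0.25)(x)
    x = layers.Flatten()(x)
    out = layers.Dense(n_classes, activation='softmax')(x)            # SpinalNet replaced by Dense
    return models.Model(inp, out)

model = hybrid_stem()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy', metrics=['accuracy'])
```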
The model is compared with four published methods (Fig. 10): EMP-SVM [18],
MCNN-CP [78], 2D-CNN [50] and HybridSN [59].
The performance of the model is also investigated (Table 4) by repeating the
experiments with data that contains noise, with and without weak class oversampling
and with different spatial sizes and train-test ratios (Fig. 11).
Fig. 9 Epochs and training/validation loss a–c 10% Training IPD, SD, PUD d–f 30% Training IPD,
SD, PUD
Fig. 10 Class-wise classification accuracy: Training sample (T.S.) with oversample (O.S.) a & b
IPD c & d SD e & f PUD
This chapter has addressed basic issues related to satellite imaging techniques and
hyperspectral classification techniques. In the first part of the experimental analysis, we
used a Sentinel-2 satellite image of the Sundarban mangrove and classified the land
cover with respect to six ground-truth labels with comparatively better accuracy. Then,
to address identified issues such as limited training size, computational time and
classification performance under noise, we adopted a combined 3D-2D DL approach for
generating hierarchical, discriminative deep spectral-spatial features and for HI
classification. A multi-scale feature learning technique is employed in the framework,
which increases the ability of the model to classify objects of diverse shapes even after
the information loss caused by the convolution mechanism. The use of the SpinalNet
model enhances accuracy and controls the error. Experimental results demonstrate that
our model can classify with a limited number of training samples, thus avoiding the need
for oversampling, and performs well even in the presence of Gaussian and Poisson noise.
On three benchmark datasets the model gives consistent and competitive values for
Overall Accuracy (OA), Average Accuracy (AA) and Kappa Accuracy (KA) compared
to the four state-of-the-art models. Being a supervised classification model, it is best
used on labeled hyperspectral datasets and is most suitable for applications such as land
cover mapping, agriculture and global climate studies.
Fig. 11 Classification maps: a–c Ground truth of IPD, SD and PUD d–f Our model (with 30%
training) of IPD, SD and PUD g–i Our model (with 10% training and oversampling) of IPD, SD
and PUD
References
1. Adate, A., Arya, D., Shaha, A., & Tripathy, B. K. (2020). Impact of deep neural learning on
artificial intelligence research. In S. Bhattacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy
(Ed.), Deep Learning Research and Applications (pp.69–84). De Gruyter Publications. https:/
/doi.org/10.1515/9783110670905-004
2. Adate, A., & Tripathy, B. K. (2018). Deep learning techniques for image processing. In S.
Bhattacharyya, H. Bhaumik, A. Mukherjee & S. De (Eds.), Machine Learning for Big Data
Analysis (pp. 69–90). De Gruyter. https://doi.org/10.1515/9783110551433-00357
3. Bajorski, P. (2010). Investigation of virtual dimensionality and broken stick rule for hyperspec-
tral images. In 2010 2nd Workshop on Hyperspectral Image and Signal Processing: Evolution
in Remote Sensing (pp. 1–4).
4. Benediktsson, J. A., Palmason, J. A., & Sveinsson, J. R. (2005). Classification of hyperspec-
tral data from urban areas based on extended morphological profiles. IEEE Transactions on
Geoscience and Remote Sensing, 43(3), 480–491.
5. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., et al. (2007). Greedy layer-wise training
of deep networks. Advances in neural information processing systems, 19, 153.
6. Bhattacharyya, S., Snasel, V., Hassanian, A. E., Saha, S., & Tripathy, B. K. (2020). Deep
learning research with engineering applications. De Gruyter Publications. ISBN: 3110670909,
9783110670905. https://doi.org/10.1515/9783110670905
7. Bhardwaj, P., Guhan, T., & Tripathy, B. K. (2021). Computational biology in the lens of CNN,
Studies in Big Data. In S.S. Roy, & Y.-H. Taguchi (Eds.), Handbook of Machine Learning
Applications for Genomics, (Chapter 5) (vol. 103). ISBN: 978–981–16–9157–7 496166_1_En
8. Binol, H. (2018). Ensemble learning based multiple kernel principal component analysis for
dimensionality reduction and classification of hyperspectral imagery. Mathematical Problems
in Engineering, 2018, 14. Article ID 9632569.
9. Bose, A., & Tripathy, B. K. (2020). Deep learning for audio signal classification. In S. Bhat-
tacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy (Ed.), Deep Learning Research and
Applications (pp. 105–136). De Gruyter Publications. https://doi.org/10.1515/9783110670905-
00660
10. Bruce, L. M., Li, J., & Huang, Y. (2002). Automated detection of subpixel hyperspectral targets
with adaptive multichannel discrete wavelet transform. IEEE Transactions on Geoscience and
Remote Sensing, 40(4), 977−980
11. Chen, Y., Lin, Z., Zhao, X., Wang, G., & Gu, Y. (2014). Deep learning-based classification of
hyperspectral data. IEEE Journal of Selected topics in applied earth observations and remote
sensing, 7(6), 2094–2107.
12. COAH: Copernicus Open Access Hub. https://scihub.copernicus.eu
13. Grupo de Inteligencia Computacional. (2014). Hyperspectral remote sensing scenes. http://
www.ehu.eus/ccwintco/index.php
14. Debgupta, R., Chaudhuri, B. B., Tripathy, B. K. (2020). A wide ResNet-based approach for
age and gender estimation in face images. In A. Khanna, D. Gupta, S. Bhattacharyya, V.
Snasel, J. Platos, A. Hassanien (Eds.), International Conference on Innovative Computing and
Communications, Advances in Intelligent Systems and Computing (vol. 1087, pp. 517–530).
Springer. https://doi.org/10.1007/978-981-15-1286-5_44
15. Deepa, P., & Thilagavathi, K. (2015). Feature extraction of hyperspectral image using prin-
cipal component analysis and folded-principal component analysis. In 2015 2nd International
Conference on Electronics and Communication Systems (ICECS) (pp. 656–660).
16. Dharmasastha, K. N. S., Banu, K. S., Kalaichevlan, G., Lincy, B., & Tripathy, B.K. (2022).
Classification of pest in tomato plants using CNN. In M. N. Mohanty, S. Das, M. Ray, B. Patra
(Eds.), Meta Heuristic Techniques in Software Engineering and Its Applications. METASOFT
2022. Artificial Intelligence-Enhanced Software and Systems Engineering (vol. 1). Springer.
https://doi.org/10.1007/978-3-031-11713-8_6
17. Du, Q. (2007). Modified fisher’s linear discriminant analysis for hyperspectral imagery. IEEE
Geoscience and Remote Sensing Letters, 4(4), 503–507.
18. Fauvel, M., Benediktsson, J. A., Chanussot, J., & Sveinsson, J. R. (2008). Spectral and spatial
classification of hyperspectral data using svms and morphological profiles. IEEE Transactions
on Geoscience and Remote Sensing, 46(11), 3804–3814.
19. Fauvel, M., Tarabalka, Y., Benediktsson, J. A., Chanussot, J., & Tilton, J. C. (2012). Advances
in spectral-spatial classification of hyperspectral images. Proceedings of the IEEE, 101(3),
652–675.
20. Fu, A., Ma, X., & Wang, H. (2018). Classification of hyperspectral image based on hybrid
neural networks. In: IGARSS 2018 2018 IEEE International Geoscience and Remote Sensing
Symposium (pp. 2643–2646).
21. Fukushima, K., & Miyake, S. (1982). Neocognitron: A self-organizing neural network model
for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets
(pp. 267–285). Springer.
22. Ghasemzadeh, A., & Demirel, H. (2016) Hyperspectral face recognition using 3d discrete
wavelet transform. In 2016 Sixth International Conference on Image Processing Theory, Tools
and Applications (IPTA) (pp. 1–4).
23. Ghiya, A.S., Vijay, V., Ranganath, A., Chaturvedi, P., Tripathy, B.K. & Banu, K. S. (2021).
Weather classification: Image embedding using convolutional autoencoder and predictive
analysis using stacked generalization. In ANTIC conference. BHU.
24. Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., & Lew, M. S. (2016). Deep learning for visual
understanding: A review. Neurocomputing, 187, 27–48.
25. Gupta, P., Bhachawat, S., Dhyani, K., & Tripathy, B. K. (2021). A study of gene characteristics
and their applications using deep learning, (Chapter 4), Studies in Big Data. In S. S. Roy, &
Y.-H. Taguchi (Eds.), Handbook of Machine Learning Applications for Genomics (vol. 103).
ISBN: 978–981–16–9157–7, 496166_1_En
26. Hamida, A. B., Benoit, A., Lambert, P., & Amar, C. B. (2018). 3-d deep learning approach
for remote sensing image classification. IEEE Transactions on geoscience and remote sensing,
56(8), 4420–4434.
27. Harikiran, J., Ladi, S. K., Panda, G. K., Dash, R., Ladi, P. K. (2020). Hyperspectral image
classification bi-dimensional empirical mode decomposition and deep residual networks. In
2020 International Conference on Artificial Intelligence and Signal Processing (AISP) (pp.1–
6).
28. Harsanyi, J. C., & Chang, C.-I. (1994). Hyperspectral image classification and dimensionality
reduction: An orthogonal subspace projection approach. IEEE Transactions on geoscience and
remote sensing, 32(4), 779–785.
29. Haut, J. M., Paoletti, M. E., Plaza, J., Plaza, A., & Li, J. (2019). Hyperspectral image clas-
sification using random occlusion data augmentation. IEEE Geoscience and Remote Sensing
Letters, 16(11), 1751–1755.
30. Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief
nets. Neural computation, 18(7), 1527–1554.
31. Hughes, G. (1968). On the mean accuracy of statistical pattern recognizers. IEEE transactions
on information theory, 14(1), 55–63.
32. Imani, M., & Ghassemian, H. (2014). Principal component discriminant analysis for feature
extraction and classification of hyperspectral images. In 2014 Iranian Conference on Intelligent
Systems (ICIS) (pp. 1–5).
33. Jayaprakash, C., Damodaran, B. B., Sowmya, V., & Soman, K. P. (2018). Dimensionality
reduction of hyperspectral images for classification using randomized independent component
analysis. In 2018 5th International Conference on Signal Processing and Integrated Networks
(SPIN) (pp. 492–496)
34. Kabir, H. M. D., Abdar, M., Jalali, S. M. J., Khosravi, A., Atiya, A.F., Nahavandi, S., &
Srinivasan, D. (2020). SpinalNet: Deep neural network with gradual input
35. Kathuria, A. (2018) Intro to optimization in deep learning: Momentum, Rmsprop and Adam.
https://blog.paperspace.com/intro-to-optimization-momentum-rmsprop-adam/
36. Kaul, D., Raju, H., & Tripathy, B. K. (2022). Deep learning in healthcare, in: Deep Learning in
Data Analytics. In: D.P. Acharjya, A. Mitra, N. Zaman (Eds,), Deep Learning in Data Analytics-
Recent Techniques, Practices and Applications, Studies in Big Data (vol. 91, pp. 97–115).
Springer. https://doi.org/10.1007/978-3-030-75855-4_6
37. Ke, C. (2017). Military object detection using multiple information extracted from hyperspec-
tral imagery. In 2017 International Conference on Progress in Informatics and Computing
(PIC) (pp. 124–128).
38. Khan, M.J., Khan, H.S., Yousaf, A., Khurshid, K., & Abbas, A. (2018). Modern trends in
hyperspectral image analysis: A review. IEEE Access. 6, 14118−14129
39. Kumar, V., & Tripathy, B. K. (2020). Detecting toxicity with bidirectional gated recurrent unit
networks. In V. Bhateja, S. Satapathy, Y.D. Zhang, V. Aradhya (Eds.), Intelligent Computing
and Communication. ICICC 2019. Advances in Intelligent Systems and Computing (vol. 1034).
Springer. https://doi.org/10.1007/978-981-15-1084-7_57
40. Kwon, H., Hu, X., Theiler, J., Zare, A, & Gurram, P. (2013). Algorithms for multispectral
and hyperspectral image analysis. Journal of Electrical and Computer Engineering, 2013, 2.
Article ID 908906
41. Ladi, S. K., Panda, G. K., Dash, R., et al. (2022). A novel grey wolf optimisation based CNN
classifier for hyperspectral image classification. Multimed Tools Appl, 81, 28207–28230.
42. Ladi, S. K., Panda, G. K., Dash, R. et al. (2022). A novel strategy for classifying spectral-spatial
shallow and deep hyperspectral image features using 1D-EWT and 3D-CNN. Earth science
informatics
43. Ladi, S. K., Dash, R., Panda, G. K., Ladi, P. K., & Dhupar, R. (2019). Hyperspectral image
classification using swt and cnn. In 2019 International Conference on Information Technology
(ICIT) (pp. 172–177).
44. Li, C., Zuo, H., Fan, T. (2017). Hyperspectral image classification based on gray level co-
occurrence matrix and local mean decomposition. In 2017 4th International Conference on
Systems and Informatics (ICSAI) (pp. 1219–1223).
45. Li, J., Bioucas-Dias, J. M., & Plaza, A. (2010). Semisupervised hyperspectral image segmen-
tation using multinomial logistic regression with active learning. IEEE Transactions on
Geoscience and Remote Sensing, 48(11), 4085–4098.
46. Li, Y., Zhang, H., & Shen, Q. (2017). Spectral–spatial classification of hyperspectral imagery
with 3d convolutional neural network. Remote Sensing, 9(1), 67.
47. Li, W., Wu, G., Zhang, F., & Du, Q. (2017). Hyperspectral image classification using deep
pixel-pair features. IEEE Transactions on Geoscience and Remote Sensing, 55(2), 844–853.
48. Ma, Y., Li, R., Yang, G., Sun, L., & Wang, J. (2018). A research on the combination strategies
of multiple features for hyperspectral remote sensing image classification. Journal of Sensors,
2018, 14. Article ID 7341973.
49. Maheswari, K., Shaha, A., Arya, D., Tripathy, B. K., & Rajkumar, R. (2020). Convolutional
neural networks: A bottom-up approach. In S. Bhattacharyya, A. E. Hassanian, S. Saha, &
B.K. Tripathy (Ed.), Deep Learning Research with Engineering Applications (pp.21–50). De
Gruyter Publications. https://doi.org/10.1515/9783110670905-002
50. Makantasis, K., Karantzalos, K., Doulamis, A., & Doulamis, N. (2015). Deep supervised
learning for hyperspectral data classification through convolutional neural networks. In 2015
IEEE International Geoscience and Remote Sensing Symposium (IGARSS) (pp. 4959–4962).
51. Makantasis, K., Doulamis, A. D., Doulamis, N. D., & Nikitakis, A. (2018). Tensor-based
classification models for hyperspectral data analysis. IEEE Transactions on Geoscience and
Remote Sensing, 56(12), 6884–6898.
52. Makantasis, K., Doulamis, A., Doulamis, N., Nikitakis, A., & Voulodimos, A. (2018). Tensor-
based nonlinear classifier for highorder data analysis. In 2018 IEEE International Conference
53. Notesco, G., Dor, E. B., & Brook, A. (2014). Mineral mapping of makhtesh ramon in israel
using hyperspectral remote sensing day and night LWIR images. In 2014 6th Workshop on
Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS) (pp. 1–
4).
54. Pesaresi, M., Gerhardinger, A., & Kayitakire, F. (2008). A robust built-up area presence index
by anisotropic rotation-invariant textural measure. IEEE Journal of selected topics in applied
earth observations and remote sensing, 1(3), 180–192.
55. Pesaresi, M., & Benediktsson, J. A. (2001). A new approach for the morphological segmentation
of high-resolution satellite imagery. IEEE transactions on Geoscience and Remote Sensing,
39(2), 309–320.
56. Pike, R., Lu, G., Wang, D., Chen, Z. G., & Fei, B. (2016). A minimum spanning forest-based
method for noninvasive cancer detection with hyperspectral imaging. IEEE Transactions on
Biomedical Engineering, 63(3), 653–663.
57. Plaza, A., Martínez, P., Plaza, J., Pérez, R. (2005). Dimensionality reduction and classification
of hyperspectral image data using sequences of extended morphological transformations. IEEE
Transactions on Geoscience and remote sensing, 43(3), 466–479.
58. Prabhavathy, P., Tripathy, B.K., Venkatesan, M. (2022). Analysis of diabetic retinopathy detec-
tion techniques using CNN Models. In: S. Mishra, H. K. Tripathy, P. Mallick, K. Shaalan
(Eds.), Augmented Intelligence in Healthcare: A Pragmatic and Integrated Analysis. Studies
in Computational Intelligence (vol. 1024). Springer, https://doi.org/10.1007/978-981-19-107
6-0_6
59. Roy, S. K., Krishna, G., Dubey, S. R., & Chaudhuri, B. B. (2020). Hybridsn: Exploring 3-d-2-d
cnn feature hierarchy for hyperspectral image classification. IEEE Geoscience and Remote
Sensing Letters, 17(2), 277–281.
60. Singhania, U., & Tripathy, B. K. (2021). Text-based image retrieval using deep learning. In
Encyclopedia of Information Science and Technology (5th ed., p. 11). https://doi.org/10.4018/
978-1-7998-3479-3.ch007
61. Rungta, R. K., Jaiswal, P., Tripathy, B. K. (2022). A deep learning based approach to measure
confidence for virtual interviews. In A. K. Das et al. (Eds.), Proceedings of the 4th International
Conference on Computational Intelligence in Pattern Recognition (CIPR) (pp. 278–291). CIPR
2022, LNNS 480.
62. Sihare, P., Khan, A. U., Bardhan, P., & Tripathy, B. K. (2022). COVID-19 detection using
deep learning: A comparative study of segmentation algorithms. In A. K. Das et al. (Eds.),
Proceedings of the 4th International Conference on Computational Intelligence in Pattern
Recognition (CIPR) (pp. 1–10). CIPR 2022, LNNS 480.
63. Jain, S., Singhania, U., Tripathy, B.K., Abouel, E. N., Aboudaif, M. K., & Ali, K. K. (2021).
Deep learning based transfer learning for classification of skin cancer. Sensors (Basel), 21(23),
8142 https://doi.org/10.3390/s21238142. (IF:4.35)
64. Surya, Y. S., Geetha Rani, K. T., & Tripathy, B. K. (2022). Social distance monitoring and face
mask detection using deep learning. In: J. Nayak, H. Behera, B. Naik, S. Vimal, D. Pelusi (Eds.),
Computational Intelligence in Data Mining. Smart Innovation, Systems and Technologies (vol.
281). Springer. https://doi.org/10.1007/978-981-16-9447-9_36
65. Sun, T., Jiao, L., Feng, J., Liu, F., & Zhang, X. (2015). Imbalanced hyperspectral image clas-
sification based on maximum margin. IEEE Geoscience and Remote Sensing Letters, 12(3),
522–526.
66. Teng, M. Y., Mehrubeoglu, R., King, S. A., Cammarata, K., & Simons, J. (2013). Investigation
of epifauna coverage on seagrass blades using spatial and spectral analysis of hyperspectral
images. In 2013 5th Workshop on Hyperspectral Image and Signal Processing: Evolution in
Remote Sensing (WHISPERS) (pp. 1–4).
67. Tripathy, B. K., & Anuradha, J. (2015). Soft computing-advances and applications. Cengage
Learning publishers. ASIN: 8131526194, ISBN-109788131526194
68. Tripathy, B. K., Parikh, S., Ajay, P., & Magapu, C. (2022). Brain MRI segmentation techniques
based on CNN and its variants, (Chapter-10). In J. Chaki (Ed.), Brain Tumor MRI Image
Segmentation Using Deep Learning Techniques (pp. 161−182). Elsevier publications. https://
doi.org/10.1016/B978-0-323-91171-9.00001-6
69. Tripathy, B. K., & Adate, A. (2021). Impact of deep neural learning on artificial intelligence
research, Chapter-8. In D. P. Acharjya et al (Ed.), Springer publications.
70. Voulodimos, A. (2018). Deep learning for computer vision: a brief review. Computational
Intelligence and Neuroscience, 2018, 13. Article ID 7068349.
71. Wang, & Chang, C. I. (2006). Independent component analysis based dimensionality reduction
with applications in hyperspectral image analysis. In IEEE Transactions on Geoscience and
Remote Sensing (vol. 44, no. 6, pp. 1586–1600).
72. Wang, X., & Feng, Y. (2008). New method based on support vector machine in classification
for hyperspectral data. In 2008 International Symposium on Computational Intelligence and
Design (pp. 76–80)
73. Wang, Y., & Cui, S. (2014). Hyperspectral image feature classification using stationary wavelet
transform. In 2014 International Conference on Wavelet Analysis and Pattern Recognition
(pp. 104–108)
74. Wu, Y., Mu, G., Qin, C., Miao, Q., Ma, W., & Zhang, X. (2020). Semi-supervised hyperspectral
image classification via spatial-regulated self-training. Remote Sensing, 12(1)
75. Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., & Woo, W.C. (2015). Convolu-
tional LSTM network: A machine learning approach for precipitation nowcasting. In Proceed-
ings of the 28th International Conference on Neural Information Processing Systems (Vol. 1,
pp. 802–810).
76. Xu, Y., Zhang, L., Du, B., & Zhang, F. (2018). Spectral–spatial unified networks for hyper-
spectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 56(10),
5893–5909.
77. Zhang, X., Zhang, A., & Meng, X. (2015). Automatic fusion of hyperspectral images and laser
scans using feature points. Journal of Sensors, 2015, 9. Article ID 415361
78. Zheng, J., Feng, Y., Bai, C., & Zhang, J. (2021). Hyperspectral image classification using mixed
convolutions and covariance pooling. IEEE Transactions on Geoscience and Remote Sensing,
59(1), 522–534.
79. Zhong, Z., Li, J., Luo, Z., & Chapman, M. (2018). Spectral–spatial residual network for
hyperspectral image classification: A 3-d deep learning framework. IEEE Transactions on
Geoscience and Remote Sensing, 56(2), 847–858
80. Zhou, F., Hang, R., Liu, Q., & Yuan, X. (2019). Hyperspectral image classification using
spectral-spatial lstms. Neurocomputing, 328, 39–47.
Chest X-Ray Image Classification
of Pneumonia Disease Using EfficientNet
and InceptionV3
1 Introduction
Pneumonia is a type of respiratory infection that affects the lungs. It leads to inflam-
mation in the lungs and fluid buildup in the air sacs within, causing difficulties in
breathing and simultaneous cardiovascular health effects. Pneumonia is considered
to be the single largest cause of death in children worldwide, leading to an estimated
count of 5.9 million deaths of children under 5 years old annually [1]. Chest X-rays and
radiography methods have been prevalent in the medical industry for quite some time,
and such methods and tools have been used in diagnosing and treating illnesses such as
cancer, infections, emphysema and pneumonia. The specialized analysis and diagnosis
of an illness from X-ray outputs is generally conducted in person by expert radiologists.
In recent times, the number of cases requiring chest X-rays has increased substantially
[2], so radiologists working on these outputs must now devote more time to this task.
The expertise required stems from the extremely detailed and specific characteristics of
the components of the lung, which have to be analyzed through intricate
characterizations and traits that coherently point towards a general illness category.
Because of this increased frequency of chest X-ray examinations, the vast volume of
data to be processed manually can lead to time delays, cost problems and errors, all of
which any medical institution needs to avoid. Through the work
described in this chapter, we propose an automated medical image diagnosis system
which essentially will allow the radiologists and staff alike to gain an alternate and
handy method to efficiently process and analyze data without much hassle or manual
work. For our problem statement, we have used two Convolutional Neural Network
(CNN) based algorithms to classify chest X-ray scans for pneumonia.
CNN-based algorithms work well for this specific image classification problem because
of their inherent ability to reduce the dimensionality of the data and process it
efficiently for accurate results [3]. These advantages come from the network's layers
and their tasks: the convolution layer breaks the image into smaller sub-parts, giving an
efficient, lower-dimensional representation; the pooling layer takes the convolution
output as input and reduces the dimensionality further; and the fully connected layer,
the final layer, learns which sub-parts are necessary for the classification problem at
hand.
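A minimal Keras sketch of this convolution, pooling and fully connected pipeline is given below; the input resolution, filter counts and layer widths are assumptions and do not correspond to the models evaluated later in the chapter.

```python
# Minimal sketch of a convolution -> pooling -> fully-connected classifier for a
# binary chest X-ray task; sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Input(shape=(224, 224, 1)),                 # grayscale chest X-ray
    layers.Conv2D(32, (3, 3), activation='relu'),      # convolution layer: local sub-parts
    layers.MaxPooling2D((2, 2)),                       # pooling layer: reduce dimensionality
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),              # fully connected layer
    layers.Dense(1, activation='sigmoid'),             # binary: normal vs pneumonia
])
cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```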
2 Literature Survey
To date, there have been a few proposals and advances towards similar and
specific medical diagnosis problems. CNNs and Deep Neural Networks have
allowed researchers to build sophisticated models towards medical issues including
pneumonia, tuberculosis, Covid-19, lung cancer and many more [4].
There are many different techniques and methodologies used to progress the
specific tasks of medical diagnosis employed by various researchers in their respec-
tive fields, some of them include, Convolutional Neural Networks, Transfer Learning,
Image-Level Prediction, Segmentation networks, Localization Networks, Image
Generation Networks, Domain Adaption Networks, and likewise [4].
For example, Crosby et al. employed the use of deep CNNs for distinguishing
between binary labelled chest radiograph data [5]. Deep Learning has also been
employed in detection of foreign objects in chest radiographs using similar data
[6]. The use of General Adversarial Networks can also be seen in deployment of
technology for organ segmentation and bone suppression tasks in Chest X-Rays [7].
Transfer learning based image classifier models have been researched by Showkat
et al. in detection of Covid-19 pneumonia [8]. Deep Learning techniques are used by
Hirata et al. in the pursuit of detecting pulmonary artery wedge pressure metrics using
Standard Chest X-Ray data. The research community pertaining to these specific
tasks have produced a foothold in the use of CNNs in computer vision problems
like these and in 2015 and 2016 more than 300 papers were published on applica-
tions of deep learning in workshops, conferences, journals, and special issues in this
domain [9, 10].
3 Dataset
The dataset used to train our proposed models was obtained from the internet website
named Kaggle, and is named “Chest X-Ray Images (Pneumonia)”. It consists of
5863 images as training samples each of which has a binary feature associated with
it depicting the individual datapoints as either ‘normal’ or ‘pneumonia’. A point to
note here is that, the feature category for this specific dataset is binary in nature,
hence the proposed models will be tasked with the duty of analysing the image for
the presence of the disease of pneumonia in contrast to the task of finding specific
types of pneumonia ranging from bacterial to viral. The images present in the dataset
are formatted X-Ray images of the lungs (Fig. 1).
The dataset consists of 27% images of normal lung x-rays and the remaining
pertaining to those corresponding to pneumonia (Fig. 2).
4 Data Pre-processing
For data pre-processing, every image is converted to grayscale and Gaussian blur is
applied. Grayscale conversion tunes the dataset for this image classification task by
converting the image pixels into values representing light intensity. Gaussian blur is
applied to reduce the noise and redundant data present in the pixels; it smooths the
edges and boundaries of objects, enhancing the object data and smoothing the
transitions between boundaries. Image erosion is also applied, whereby the erosion
function reduces or removes pixels on object boundaries; how many pixels are affected
depends on the inherent characteristics of the image. The Canny edge detection
algorithm, as implemented in OpenCV, is also used; it reduces noise, finds the intensity
gradient of the image and suppresses unwanted pixels (Figs. 3, 4 and 5).
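The following is a hedged OpenCV sketch of these pre-processing steps; the kernel sizes, iteration count and Canny thresholds are assumed values.

```python
# Hedged sketch of the preprocessing steps described above (grayscale, Gaussian
# blur, erosion, Canny edges); kernel sizes and thresholds are assumptions.
import cv2
import numpy as np

def preprocess(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)         # grayscale conversion
    blurred = cv2.GaussianBlur(img, (5, 5), 0)           # Gaussian blur to reduce noise
    kernel = np.ones((3, 3), np.uint8)
    eroded = cv2.erode(blurred, kernel, iterations=1)    # erode pixels on object boundaries
    edges = cv2.Canny(eroded, 100, 200)                  # Canny edge detection
    return edges
```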
5 Proposed Model
We have used two distinct models for this classification problem: the EfficientNet
model and the InceptionV3 model. Both are based on CNNs (Convolutional Neural
Networks).
5.1 EfficientNet
$$
\text{Depth: } d = a^{\phi}, \qquad \text{Width: } w = b^{\phi}, \qquad \text{Resolution: } r = c^{\phi}
$$
$$
\text{s.t. } a \cdot b^{2} \cdot c^{2} \approx 2, \qquad a \ge 1,\; b \ge 1,\; c \ge 1
$$
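For illustration, the small function below applies this compound-scaling rule; the per-dimension coefficients (1.2, 1.1, 1.15) are the values reported for the EfficientNet-B0 baseline in [11], and the function itself is only a sketch of the idea.

```python
# Minimal sketch of EfficientNet-style compound scaling; the default coefficients
# are those reported for the B0 baseline in [11], and phi is chosen by the user.
def compound_scale(phi, a=1.2, b=1.1, c=1.15):
    depth_mult = a ** phi         # scales the number of layers
    width_mult = b ** phi         # scales the number of channels
    resolution_mult = c ** phi    # scales the input image resolution
    return depth_mult, width_mult, resolution_mult

print(compound_scale(phi=1))      # e.g. the scaling multipliers for phi = 1
```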
5.2 InceptionV3
Fig. 8 Input layer and output layer dimensions for InceptionV3 model
6.1 InceptionV3
Figure 9 shows the training and validation accuracy curves for the InceptionV3 model,
which was trained for 15 epochs. The peak accuracy achieved by the model is a high
92.93%. The curve shows gradual increases and decreases in the metric, caused by the
fine-tuning of the model's prediction confidence, until it reaches its peak accuracy and
then declines. The validation accuracy curve follows a similar shape until it drops to an
extremely low value but then stabilizes, reflecting how the overall accuracy fluctuates as
the model parameters change.
The loss function, shown graphically in Fig. 10, takes a large initial decline and then
reaches its lowest value in a stable, coherent manner. The validation loss curve does not
take a steep dive but passes through a sudden high peak partway along its path, after
which it stabilizes and settles close to the final values of the training loss curve.
These results set a benchmark for pneumonia diagnosis using CNN-based algorithms.
Compared with other models for similar tasks, this outcome is demonstrably better and,
at the same time, more efficient owing to the built-in characteristics of the baseline
Inception models described in Sect. 5.2.
6.2 EfficientNet
Figure 11 shows the training and validation accuracy curves for the EfficientNet model,
which was trained for 10 epochs. The peak accuracy achieved by the model is a high
95.39%. The accuracy increases steeply after the first epoch and then rises gradually
and stably to its peak value at the last epoch. The validation accuracy curve follows a
similar shape until it drops to an extremely low value, increases steeply in the
subsequent epoch, but drops extremely low again after two more epochs.
The loss function for the model, as depicted in Fig. 12, declines initially and reaches its
lowest value while showing only negligible ups and downs along the way. The
validation loss curve shows a steep initial decline similar to the training loss. It reaches
its lowest value in the following steps, but then suddenly increases enormously,
decreases in the following epoch, and increases substantially again after two more
epochs.
These results likewise set a benchmark for pneumonia diagnosis using CNN-based
algorithms. Compared with other models for similar tasks, this outcome is demonstrably
better and, at the same time, more efficient and customizable owing to the built-in
characteristics of the baseline EfficientNet models described in Sect. 5.1.
7 Discussion
Among the pressing needs of radiologists, clinicians and staff working on detecting and
treating pneumonia and related conditions are the time, frequency and volume of data to
be processed, as well as the expertise required. Classifiers already exist for other
medical diagnosis tasks, including breast cancer detection [13], and CNNs have recently
been used for brain tumour classification [14]. Almost all of these needs can be
addressed to a significant extent through machine learning and neural network based
models that ease the task. At the same time, it must be noted that the final diagnosis and
the inferences drawn from it should ultimately be made by a trained professional; for
now, these classification models exist only to aid clinicians and trained experts in
streamlining their work. Limitations of a model like this include the difficulty of
explaining the achieved metrics and the reasoning behind them, and its inability to
characterize key indicators that point to a subtype of the general illness, which could
call for alternative remedies covering multiple disorders either causing or caused by the
pneumonia. The accuracies achieved in this chapter can be improved further by
incorporating a larger dataset or by developing more specific, custom models based
exclusively on X-ray diagnostics. Another route to improvement is to incorporate the
patient's medical history in some meaningful form as a feature variable in the dataset.
Furthermore, data augmentation techniques can be identified and incorporated in future
models to achieve higher output metrics [15–30].
8 Conclusion
In this chapter, we have discussed the outcomes and experimental use of the
EfficientNet and InceptionV3 models for the medical diagnosis of pneumonia from
chest X-rays. We achieved high accuracies of 95.39% and 92.93% at a significantly low
computational cost. The discussed frameworks can therefore be highly beneficial in the
medical diagnosis of the disease and can assist the professional medical practitioners
and radiologists working on this problem. Further refinement of the approaches and
methodologies will have a strongly positive impact towards this cause and pave the way
for further improvements.
References
1. Yadav, K. K., & Awasthi, S. (2016). The current status of community-acquired pneumonia
management and prevention in children under 5 years of age in India: A review. Therapeutic
Advances in Infectious Disease, 3(3–4), 83–97.
2. Çallı, E., Sogancioglu, E., van Ginneken, B., van Leeuwen, K. G., & Murphy, K. (2021). Deep
learning for chest X-ray analysis: A survey. Medical Image Analysis, 72, 102125.
3. Li, Q., Cai, W., Wang, X., Zhou, Y., Feng, D. D., & Chen, M. (2014). Medical image clas-
sification with convolutional neural network. In 13th International Conference on Control
Automation Robotics & Vision (ICARCV), Singapore, pp. 844–848. https://doi.org/10.1109/
ICARCV.2014.7064414
4. Çallı, E., Sogancioglu, E., van Ginneken, B., van Leeuwen, K. G., & Murphy, K. (2021).
Deep learning for chest X-ray analysis: A survey. Medical Image Analysis, 72, 102125. ISSN
1361-8415 https://doi.org/10.1016/j.media.2021.102125
5. Crosby, J., et al. (2020). Deep convolutional neural networks in the classification of dual-energy…
Journal of Medical Imaging, 7(1), 016501. https://doi.org/10.1117/1.JMI.7.1.016501
6. Deshpande, H., Harder, T., Saalbach, A., Sawarkar, A., Buelow, T. (2020). Detection of foreign
objects in chest radiographs using deep learning. In IEEE 17th International Symposium on
Biomedical Imaging Workshops (ISBI Workshops). Iowa City, IA, USA, pp. 1–4. https://doi.
org/10.1109/ISBIWorkshops50223.2020.9153350
7. Eslami, M., Tabarestani, S., Albarqouni, S., Adeli, E., Navab, N., & Adjouadi, M. (2020).
Image-to-images translation for multi-task organ segmentation and bone suppression in chest
X-ray radiography. IEEE Transactions on Medical Imaging, 39(7), 2553–2565. https://doi.org/
10.1109/TMI.2020.2974159
8. Showkat, S., & Qureshi, S. (2022). Efficacy of transfer learning-based resnet models in chest
x-ray image classification for detecting COVID-19 pneumonia. Chemometrics and Intelligent
Laboratory Systems, 224, 104534.
9. Hirata, Y., Kusunose, K., Tsuji, T., Fujimori, K., Kotoku, J. I., & Sata, M. (2021). Deep learning
for detection of elevated pulmonary artery wedge pressure using standard chest x-ray. Canadian
Journal of Cardiology, 37(8), 1198–1206.
10. Greenspan, H., Summers, R. M., & van Ginneken, B. (2016). Deep learning in medical imaging:
Overview and future promise of an exciting new technique. IEEE Transactions on Medical
Imaging, 35(5), 1153–1159.
11. Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural
networks. In International Conference on Machine Learning (pp. 6105–6114). PMLR.
12. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception
architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (pp. 2818–2826).
13. Mittal, D., Gaurav, D., & Sekhar Roy, S. (2015). An effective hybridized classifier for breast
cancer diagnosis. In 2015 IEEE International Conference on Advanced Intelligent Mecha-
tronics (AIM), Busan, Korea (South), pp. 1026–1031. https://doi.org/10.1109/AIM.2015.722
2674
14. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain
tumor classification. Applied Sciences 10(14):4915. https://doi.org/10.3390/app10144915
15. Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep
learning. Journal of Big Data, 6, 60.
16. Roy, S. S., Hsu, C., Samaran, A., Goyal, R., Pande, A., et al. (2023). Vessels segmentation
in angiograms using convolutional neural network: A deep learning based approach. CMES-
Computer Modeling in Engineering & Sciences, 136(1), 241–255.
17. Turki, T., & Roy, S. S. (2022). Novel hate speech detection using word cloud visualization and
ensemble learning coupled with count vectorizer. Applied Sciences, 12(13), 6611.
18. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D., Mohammadi-Ivatloo, B.,
et al. (2014). L2 regularized deep convolutional neural networks for fire detection. Journal of
Intelligent & Fuzzy Systems, 1–12.
19. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural
network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy
Systems, 1–7.
20. Forecasting stock price by hybrid model of cascading multivariate adaptive regression splines
and deep neural network.
21. Bose, A., Hsu, C. H., Roy, S. S., Lee, K. C., Mohammadi-Ivatloo, B., & Abimannan, S. (2021).
Forecasting stock price by hybrid model of cascading multivariate adaptive regression splines
and deep neural network. Computers and Electrical Engineering, 95, 107405.
22. Roy, S. S., & Taguchi, Y. H. (2021). Identification of genes associated with altered gene
expression and m6A profiles during hypoxia using tensor decomposition based unsupervised
feature extraction. Scientific Reports, 11(1), 1–18.
23. Roy, S. S., & Samui, P. (2021). Predicting longitudinal dispersion coefficient in natural streams
using minimax probability machine regression and multivariate adaptive regression spline.
International Journal of Advanced Intelligence Paradigms, 19(2), 119–127.
24. Marques, G., Agarwal, D., & de la Torre, I. (2020). Automated medical diagnosis of COVID-19
through EfficientNet convolutional neural network. Applied Soft Computing, 96, 106691.
25. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of
retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions
of Electrical Engineering, 44(1), 505–518.
26. Roy, S. S., Samui, P., Nagtode, I., Jain, H., Shivaramakrishnan, V., & Mohammadi-Ivatloo,
B. (2020). Forecasting heating and cooling loads of buildings: A comparative performance
analysis. Journal of Ambient Intelligence and Humanized Computing, 11(3), 1253–1264.
27. Roy, S. S., Chopra, R., Lee, K. C., Spampinato, C., & Mohammadi-Ivatlood, B. (2020).
Random forest, gradient boosted machines and deep neural network for stock price fore-
casting: A comparative analysis on South Korean companies. International Journal of Ad
Hoc and Ubiquitous Computing, 33(1), 62–71.
28. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural
network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy
Systems, 1–7.
29. Chakraborty, C., Bhattacharya, M., Sharma, A. R., Roy, S. S., Islam, M. A., Chakraborty,
S., Dhama, K., et al. (2022). Deep learning research should be encouraged for diagnosis and
treatment of antibiotic resistance of microbial infections in treatment associated emergencies
in hospitals. International Journal of Surgery (London, England), 105, 106857.
30. Lee, K. C., Roy, S. S., Samui, P., & Kumar, V. (Eds.). (2020). Data analytics in biomedical
engineering and healthcare. Academic Press.
Detection of Cancer Using Deep
Learning Techniques
1 Introduction
Cancer is a dreaded disease that poses a threat to human society; according to data
provided by the World Health Organisation, cancer accounted for 13% of all fatalities
in 2018 [1]. In the coming years it is predicted to rank among the deadliest diseases in
the world, with roughly 12 million individuals projected to be affected by cancer in
2030, and the number of cancer cases is expected to rise dramatically. Experts,
specialists and medical professionals are developing new methods to combat cancer,
but it is well recognized that this battle is quite challenging [2–4].
The evaluation of medical imagery by technicians, supported by computers, is referred
to as interpretation. Diagnostic ultrasound images, by contrast, present the physician
with a large volume of data and require thorough analysis in a short amount of time;
such imaging processes involve high-energy electromagnetic radiation. Digital images
are analyzed by computer-assisted methods to detect the presence or absence of cancer
in the early stages [5].
Analysis of medical images using computer tools supports medical professionals in
interpreting the information inherent in the images. On the other hand, diagnosing
ultrasound images produced by specific imaging processes, such as high-intensity
electromagnetic radiation, requires the doctor to handle a significant quantity of data
and involves thorough analysis in a short amount of time. Digital
There are many other CNN variations, including those with shorter connections, such as
the DenseNet architecture, which significantly reduces the number of hyperparameters
needed to develop effective designs and benefits feature circulation [21].
ResNets, Xception and GoogLeNet are other varieties of CNN architecture that have
proven effective recently. These networks arose because multiscale processing is
required, because overall performance degrades as the network gets deeper, and because
better topologies with fewer parameters are sought [22–25].
Another critical challenge in DL is the ability of an architecture to store information
over long time periods. Long Short-Term Memory (LSTM) has been suggested as a
potential remedy for this issue. Through the states of specialized units, the LSTM
design enforces constant error flow that is local in time and space [26].
Transfer learning is another DL concept worth mentioning. It involves applying features
learned by deep convolutional neural networks to new and different tasks. The need for
it arises because new tasks may differ significantly from the original ones and there may
not be enough labels or inputs to train a DL architecture for them from scratch. Transfer
learning also allows the learned features to be adapted with ease so that they generalize
reliably [27–29].
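A minimal transfer-learning sketch in Keras is shown below: an ImageNet-pretrained backbone is frozen and only a new classification head is trained. The choice of InceptionV3, the input size and the two-class head are illustrative assumptions, not tied to any specific study cited above.

```python
# Illustrative transfer-learning sketch: a pre-trained network is reused as a
# frozen feature extractor and only a new classification head is trained.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet',
                                          input_shape=(299, 299, 3))
base.trainable = False                                   # keep pre-trained features fixed

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation='softmax'),               # e.g. cancerous vs non-cancerous
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```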
DL techniques utilized in cancer detection and treatment are investigated in this chapter.
The purpose of the study is to demonstrate, with the help of the literature, the
effectiveness of a deep learning approach, one of the machine learning techniques, in
addressing a condition like cancer, as well as the methodologies and techniques that are
employed and how they are applied [30].
2 Deep Learning
DL has gained a lot of popularity and success in nearly every industry and has emerged
as a useful tool for understanding how machines perceive the world. DL techniques are
applied in fields including speech recognition, image classification, video analysis and
natural language learning [31]. Analysis is performed with a mathematical model
created by DL, without using any separate feature extractor. The scope for
generalization is one of the key benefits of DL techniques: a learnt neural network can
be reused for additional applications and data types. When the data set is inadequate,
however, DL performs poorly [32].
DL is a kind of machine learning approach that capitalizes on layers of nonlinear
processing units [15]. The output of the preceding layer is fed into the subsequent layer
as input. In the DL approach, the data representation is built up by learning multiple
levels of features [33]. A hierarchy is created in the representation by deriving low-level
features
When making a diagnosis, doctors frequently draw on their own knowledge, abilities
and experience. A doctor can never guarantee that a diagnosis is accurate, regardless of
how talented he or she is, and diseases are often misdiagnosed. Technologies involving
AI therefore come onto the agenda, because AI has the capacity to evaluate vast
quantities of data, resolve complicated issues and make very accurate predictions [4].
One of the most modern AI methods, the DNN, describes a family of computational
methods that are useful for extracting information from images. Many medical
disciplines, such as radiology and pathology, have used DL algorithms for various
tasks. Good efficiency has also been achieved using DL tools for tumor biology and
other fields, such as medical imaging across many species [16].
A basic neural network consists of an input layer connected directly to the output
layer. DNNs, in contrast, contain several hidden layers that are efficient at
handling complicated problems; each layer's weights are adjusted using the delta
learning rule. Deep neural networks can also model complex nonlinear interactions
by including more hidden layers. Although learning is relatively slow, DNNs are
employed in both unsupervised and supervised settings. They can produce good
performance and are typically used for classification and regression [34, 35].
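The following minimal sketch, assuming Keras and purely synthetic feature vectors, illustrates such a DNN with a few hidden layers used for binary classification; layer sizes and data are illustrative, not values taken from the chapter.

```python
# Minimal deep neural network sketch (Keras): several hidden layers mapping
# tabular features to a binary prediction, trained on synthetic data.
import numpy as np
from tensorflow.keras import layers, models

X = np.random.rand(500, 30).astype("float32")   # hypothetical feature vectors
y = np.random.randint(0, 2, size=(500,))        # hypothetical labels

model = models.Sequential([
    layers.Input(shape=(30,)),
    layers.Dense(64, activation="relu"),    # hidden layers add nonlinearity
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
```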
In [37], lesions were identified and differentiated using a DNN and endoscopic
imaging. No appreciable difference in diagnostic performance was found between the
artificial intelligence system and skilled endoscopists. The neural network approach
they built demonstrated high sensitivity and great accuracy in discriminating
non-cancerous lesions [37].
The ability of deep neural networks to identify cancer, specifically lung cancer, from
low-dose computed tomography and positron emission tomography scans was
examined in [38]. It was shown that the DNN algorithm gives excellent results in
detecting lung cancer. Their work also demonstrated that efforts to screen
for lung cancer became more successful as a result of the continued development of
this technique [38].
• A DNN is a neural network with more than two layers and a certain level of
complexity [27].
• Advanced mathematical modeling is used to get deeper understanding, and as a
result, the processing of data or features is considered to be complex.
• The task of pattern recognition is carried out by a neural network, which is
a metaphor for the activity of the human brain [8, 20]. In particular, patterns
are recognized to classify cells as cancerous or non-cancerous by tracking the
input through several layers of simulated neural associations [39, 40].
• Dealing with unlabeled data is the major objective of using this network, with
each layer carrying out specific types of tasks [11].
Based on the learning technique, the DL neural network architectures are classi-
fied into 4 categories: supervised, semi-supervised, unsupervised, and reinforcement
learning [41]. Figure 1 shows how DL neural networks are categorized.
Deep unsupervised learning architectures examine the internal representation of the
data using a few features, without the need for any labeled data. Dimensionality
reduction and clustering are typical tasks addressed with unsupervised methods.
Restricted Boltzmann Machines (RBM) and Auto-Encoders (AE) are examples of deep
unsupervised learning architectures [42].
Supervised learning architectures are trained on labeled data: the inputs together with
their target outputs are fed to the network [43]. The model learned in the training
phase is then validated on test data. Recurrent Neural Networks (RNN), Long
Short-Term Memory (LSTM), Convolutional Neural Networks (CNN), and gated
recurrent units (GRU) are a few typical methods used for supervised learning [17].
Deep semi-supervised learning architectures use partially labeled data in the training
phase. A few semi-supervised learning architectures include LSTM, RNN,
Generative Adversarial Networks (GAN), GRU, and deep reinforcement learning
[44].
CNNs have proved effective for the analysis of both 2D and 3D images.
A gradient-based algorithm is used to train the majority of CNN systems [26].
Compared to other neural network models, there are fewer parameters to be tuned.
The CNN architecture comprises both feature extraction and classification components
[45]. Each feature extraction layer receives its input from the layer before it and passes
its output to the layer after it. Convolution, max-pooling, and classification are
the three types of layers that make up the CNN architecture. The even-numbered layers
are used for convolutions, while the odd-numbered layers are used for max-pooling.
The classification layer, the final stage of the architecture, is a fully
connected layer. For higher accuracy, back propagation is used during classification.
Max pooling, global average pooling, average pooling, and min pooling are some of
the available pooling procedures. Using a kernel with a linear or nonlinear activation
function, the convolution layer convolves the data to create feature maps. Common
activation functions include the rectified linear unit, sigmoid, softmax, identity, and
hyperbolic tangent functions. The downsampling operation occurs in the pooling layer,
which is also known as the subsampling layer. Depending on the application, there are
different numbers of classification layers. Figure 2 shows the convolutional neural
network architecture.
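A minimal sketch of this layer arrangement, assuming Keras and illustrative layer sizes, is shown below: convolution layers produce feature maps, pooling layers downsample them, and a fully connected softmax layer performs the classification.

```python
# Minimal CNN sketch (Keras) following the layer types described above.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),               # assumed grayscale patches
    layers.Conv2D(16, (3, 3), activation="relu"),  # convolution -> feature maps
    layers.MaxPooling2D((2, 2)),                   # downsampling (subsampling)
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),         # classification layer
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```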
4.3 LeNet-5
4.4 AlexNet
4.5 ZFNet
Although ZFNet's architecture was similar to AlexNet, its settings had been fine-
tuned, making it the winner of the 2013 challenge. It brought the error rate down to
14.8%. The number of weights is reduced by using 7 × 7 kernels rather than 11 × 11
kernels, and the resulting reduction in tuning parameters increases precision [49].
4.6 GoogLeNet
The GoogLeNet design builds on LeNet and incorporates an inception structure. It has
22 layers, and during testing the error rate gradually decreased from 6.66 to 3.66%.
The architecture was the winner of ILSVRC 2014 [46]. When compared to the
conventional CNN architecture, it has reduced computational complexity. Even so,
it was used less frequently than architectures like AlexNet and VGG [50]. Figure 5
shows the GoogLeNet architecture.
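The sketch below shows a single inception-style block in the Keras functional API; the filter counts and input shape are illustrative assumptions rather than the published GoogLeNet settings.

```python
# Sketch of one inception-style block: parallel 1x1, 3x3, 5x5 and pooled
# branches whose feature maps are concatenated. Filter counts are illustrative.
from tensorflow.keras import layers, Input, Model

inp = Input(shape=(28, 28, 64))
b1 = layers.Conv2D(16, (1, 1), padding="same", activation="relu")(inp)
b2 = layers.Conv2D(16, (1, 1), padding="same", activation="relu")(inp)
b2 = layers.Conv2D(24, (3, 3), padding="same", activation="relu")(b2)
b3 = layers.Conv2D(4, (1, 1), padding="same", activation="relu")(inp)
b3 = layers.Conv2D(8, (5, 5), padding="same", activation="relu")(b3)
b4 = layers.MaxPooling2D((3, 3), strides=1, padding="same")(inp)
b4 = layers.Conv2D(8, (1, 1), padding="same", activation="relu")(b4)
out = layers.Concatenate()([b1, b2, b3, b4])  # merge the parallel branches
block = Model(inp, out)
```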
4.7 VGGNet
The VGGNet, which consists of sixteen weight layers with several filters, was the
runner-up at ILSVRC 2014 [39]. With this architecture, feature extraction has been improved.
4.8 ResNet
In contrast to the classical CNN, the fully connected layer in the fully convolutional
network is replaced with one up-sampling layer, one de-convolution layer, and one
fully connected layer, as shown in Fig. 8. This architecture was designed so that the
up-sampling and de-convolution layers act as the reversed equivalents of the pooling
and convolution layers. Adding the up-sampling and de-convolution layers to the
design increased its accuracy [40, 41].
4.10 U-Net
U-Net, which has two routes, was created for the segmentation of medical images.
The first path has an encoder that records the context of the image, while
the second path consists of transposed convolutions as well as a decoder [53, 54].
Figure 9 shows the U-Net.
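A minimal U-Net-style sketch, assuming Keras and illustrative depth and filter counts, is given below: an encoder path captures context, and a decoder path with a transposed convolution and a skip connection produces a per-pixel mask.

```python
# Minimal U-Net-style sketch: encoder (context path), decoder with a
# transposed convolution, and one skip connection reusing encoder features.
from tensorflow.keras import layers, Input, Model

inp = Input(shape=(128, 128, 1))
# Encoder (context path)
c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
p1 = layers.MaxPooling2D(2)(c1)
c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
# Decoder (expansion path)
u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c2)
u1 = layers.Concatenate()([u1, c1])                    # skip connection
c3 = layers.Conv2D(16, 3, padding="same", activation="relu")(u1)
out = layers.Conv2D(1, 1, activation="sigmoid")(c3)    # per-pixel mask
unet = Model(inp, out)
unet.compile(optimizer="adam", loss="binary_crossentropy")
```

Real U-Nets repeat this pattern over several resolution levels; a single level is shown only to keep the sketch short.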
Figure 10 shows the RNN’s fundamental structure. In [55], various RNN design
variations are described. Numerous functional blocks are included in the recurrent
neural network, as seen in Fig. 10. Recurrent neural networks are susceptible to the
vanishing gradient problem. Recurrent neural networks require memory because they
use prior states as input to determine their present state. They make use of sequential
data, and the connections among nodes form a directed graph. RNNs are used to
convert input sequences into fixed-size vectors. Using an RNN in combination with
a convolutional layer extends the effective pixel neighborhood. RNNs are applied in
machine translation, time series prediction, and NLP. An example of an RNN is the
long short-term memory network (LSTM) [56].
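The following minimal sketch, assuming Keras and an illustrative sequence shape, shows an LSTM layer whose gated memory carries information across time steps before a classifier reads the final state.

```python
# Minimal LSTM sketch (Keras): a recurrent layer consumes a sequence and its
# final state feeds a binary classifier. Sequence length and feature size are
# illustrative assumptions.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(50, 8)),   # 50 time steps, 8 features per step
    layers.LSTM(32),               # gated memory mitigates vanishing gradients
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```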
4.12 Autoencoders
The autoencoder functions as a potent unsupervised learning architecture with three
layers: the encoder, the code, and the decoder. The encoder maps the data into a more
compact representation (the code), from which the decoder reconstructs the original
input (Fig. 11).
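A minimal autoencoder sketch, assuming Keras and an illustrative input size, is shown below; the network is trained to reproduce its own input, so the narrow middle layer learns a compact code.

```python
# Minimal autoencoder sketch (Keras): encoder -> compact code -> decoder,
# trained to minimize reconstruction error on unlabeled data.
from tensorflow.keras import layers, Input, Model

inp = Input(shape=(784,))                               # e.g. a flattened image
code = layers.Dense(32, activation="relu")(inp)         # encoder -> compact code
recon = layers.Dense(784, activation="sigmoid")(code)   # decoder -> reconstruction
autoencoder = Model(inp, recon)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10)  # trained to reproduce its input
```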
A deep belief network consists of a feed-forward network for the fine-tuning phase
and a Restricted Boltzmann Machine (RBM) as a pre-trained model. The feed-forward
network receives the
features that the RBM has extracted from the input data vectors. Deep belief networks
use a back-propagation design with a slower learning rate, and they contain numerous
hidden levels. The deep belief network's primary advantage is its capacity to learn
higher-level features from those present in earlier levels, thanks to its layer-by-layer
learning strategy [59, 60]. Its architecture is shown in Fig. 12.
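As a rough approximation of this RBM-then-fine-tune idea, the sketch below uses scikit-learn's BernoulliRBM to learn unsupervised features and then fits a supervised classifier on top; a full DBN would stack several RBMs and fine-tune the whole network with back propagation, and the data here are synthetic.

```python
# Rough sketch of RBM pre-training followed by a supervised stage, using
# scikit-learn. Not a full deep belief network: only one RBM layer is used.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = np.random.rand(200, 64)              # hypothetical feature vectors in [0, 1]
y = np.random.randint(0, 2, size=200)    # hypothetical labels

model = Pipeline([
    ("rbm", BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)                           # unsupervised features + supervised head
```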
Medical imaging techniques such as MRI, CT, and ultrasound are used to evaluate
the function of anatomical organs and to analyze diseases [61]. Cancer diagnosis and
therapy planning depend crucially on medical imaging modalities. Preprocessing,
often known as filtering, is the initial step in the processing of medical images. The
goal of filtering is either to eliminate image noise introduced during acquisition or to
enhance image quality so that more accurate details can be obtained [62]. The term
“segmentation” describes the method of identifying the ROI, or region of interest; in
the context of medical images, the ROI corresponds to anatomical organs or any
abnormalities associated with them, such as tumors or cysts. To
classify cancer intensity, the classification step typically uses an ML algorithm.
Compression is the process of using machine-assisted techniques to make files
smaller so that they can be stored and transferred more easily. The table shows
the machine learning methods that can be used in each stage of cancer diagnosis
[63].
When assessing an ailment, professionals depend heavily on their first-hand
observations, abilities, and experience. A doctor can never be completely certain that
an assessment of a condition is entirely right, and misdiagnoses do occur. This
motivates reliance on AI-powered automated systems, because artificial intelligence
(AI) can evaluate enormous volumes of information, handle complicated problems,
and make accurate predictions. One of the most modern AI methods, the deep neural
network, describes a number of computational models that are useful for extracting
information from digital images. DL algorithms are utilized in several medical
professions [4, 16].
The steps of cancer diagnosis are as follows.
The initial stage in the identification process is pre-processing, since the raw images
contain noise. Pre-processing boosts the quality of an image that will be used in the
later stages by eliminating unnecessary image data, known as image noise. If this
issue is not addressed, improper classification may occur. It is therefore crucially
important to properly clean the images and convert them into standard forms in
order to achieve high accuracy [3].
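A minimal pre-processing sketch, assuming OpenCV and a hypothetical input file, is shown below: the raw scan is denoised, resized to a standard shape, and rescaled to the [0, 1] range. The filter choice is one common option, not the chapter's prescribed pipeline.

```python
# Minimal pre-processing sketch (OpenCV + NumPy): denoise, standardize size,
# and normalize intensities. The file name is hypothetical.
import cv2

img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)   # hypothetical raw scan
denoised = cv2.medianBlur(img, 5)                    # suppress acquisition noise
resized = cv2.resize(denoised, (256, 256))           # convert to a standard size
normalized = resized.astype("float32") / 255.0       # intensities in [0, 1]
```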
Image segmentation refers to dividing an image into different sections. It is separated
into pixel- and region-based, model-based, and threshold-based segmentation. There
are additional histogram-cutoff, adaptive cutoff-point, and boundary-detection
approaches. These strategies are also used in combination [3, 64, 65].
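As one example of a threshold-based approach, the sketch below (assuming OpenCV and a hypothetical pre-processed image file) applies Otsu's method, which picks a global histogram cutoff automatically.

```python
# Minimal threshold-based segmentation sketch (OpenCV): Otsu's method chooses
# a global histogram cutoff to separate a candidate region of interest from
# the background. The input file name is hypothetical.
import cv2

img = cv2.imread("scan_preprocessed.png", cv2.IMREAD_GRAYSCALE)  # hypothetical
_, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```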
After image segmentation, closing and opening operations, island removal, region
merging, border expansion, and smoothing are performed [3].
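A minimal post-processing sketch, assuming OpenCV and a hypothetical binary mask file, is given below: morphological opening removes small islands and closing fills small gaps, smoothing the mask before classification.

```python
# Minimal post-processing sketch (OpenCV): opening drops small isolated
# islands, closing fills small holes and smooths borders of a binary mask.
import cv2

mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)          # hypothetical mask
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)      # remove islands
cleaned = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)  # close gaps
```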
Table 1 shows DL architectures for various cancer diagnoses. Neural network
architectures have been extremely useful in illness detection and have also contributed
to research on cancers affecting different organs. The convolutional sparse autoencoder
was found to be appropriate for all categories of 3-dimensional datasets in the proposed
work [66]. In [67], lesion identification was achieved and the stage of cancer diagnosis
was accomplished using a CNN combined with handcrafted features. In another work
[68], GoogLeNet was found to be more successful, with an accuracy of 85%, compared
to AlexNet, with an accuracy of 82%, and VGGNet, with an accuracy of about 84%.
Compared to a conventional predictor based on texture analysis, a model combining a
pre-trained CNN with an SVM was more successful at categorizing tumor tissues in
digital mammograms [69].
The researchers in [70] used a DL method to study breast cancer patients. They used
a Cox prediction model and genomic datasets to make predictions, and showed that
performance improves when abundant information is integrated with simplified
biomarkers and gene-regulation data to enable prediction. Shimizu and Nakayama
[71] used the TCGA database to identify breast cancer genes and build a prognostic
prediction. They employed AI to identify 184 genes, after which they applied ML
algorithms such as the Random Forest classifier along with DL networks.
Furthermore, they derived a prognostic genetic score that utilized just 23 of the 184
identified genes.
Liu et al. [72] propose a CNN model capable of identifying tiny cancerous tumors in
gigapixel pathology slides. The system suggested in Cruz-Roa et al. [73] identifies
aggressive lesions in whole-slide images while minimizing human effort and time.
On breast ultrasound lesion images, alternative CNN architectures such as LeNet,
U-Net, transfer learning, and AlexNet were thoroughly analyzed, and patch-based
LeNet and AlexNet were found to be the most accurate architectures [74].
Even before DNN-based tumor identification, the ROI was extracted using watershed
and Gaussian mixture model (GMM) algorithms in Das et al. [75]. For the segmentation
of liver tumors, the FCN-based U-Net structure was proposed, with subsequent
processing via 3D connected-component labeling to obtain better segmentation results
[76]. CNNs were shown to be more accurate classifiers than classical machine learning
algorithms [77]. The DNN was shown to be effective for segmenting cancerous cell
growth, and it is also appropriate for segmenting tiny lung nodules; deep neural
network efficiency grows as the training data increases [78].
Table 1 (continued)

| References | Cancer type(s) | Type of data/imaging | DL architecture used | Performance metrics |
|---|---|---|---|---|
| [85] | Prostate | Multiparametric magnetic resonance imaging (mpMRI)/3D | XmasNet (CNN) | AUC: 0.84 |
| [86] | Prostate | Multiparametric magnetic resonance imaging (mpMRI) | Deep convolutional neural networks (DCNN) | AUC: 0.897 |
| [87] | Brain | Magnetic resonance imaging (MRI) | Input cascade convolutional neural network | Sensitivity: 0.84; Specificity: 0.9 |
The convolutional neural network suggested in Golan et al. [79] is divided into two
stages: the first gathers spatial characteristics, while the second performs the
categorization. A DL structure was also used with an SVM classifier to identify lung
nodules, with a rule-based method to reduce false positives. A new ResNet design
outperforms the traditional ResNet structure for lesion segmentation [80].
Additionally, a CNN was employed to detect melanoma from conventional camera
images [81]. A CNN has also been proposed for detecting skin lesion borders in
dermoscopy images [82]. A smaller network is used to analyze multi-dimensional
gene data in order to diagnose cancerous cell growth in histological images of the
colon [83]. Petalidis et al. [84] published genomic data for astrocytic malignancies.
To address the need for accurate categorization of these cancers, they used a neural
network technique to merge characteristics from their histological subtypes. They
were able to identify 59 genes in this research and classified the variants on custom
and independent data with an accuracy of 96.15%.
Prostate cancer was identified in MRI images using XmasNet, a CNN-based algorithm
[85]. A deep convolutional neural network reached an AUC of 0.897 in [86]. On the
BRATS dataset, brain tumors were segmented using an input-cascade CNN, which
achieved good performance through its cascaded design [87].
7 Conclusions
DL techniques helped in the early detection of cancer and contributed to patient
recovery or life extension.
DL-based technological innovation has started to benefit local and national medical
sectors. Consequently, it is advantageous to use DL technology in cancer diagnostics
and general medicine in order to gain further theoretical understanding. Researchers
studying ML algorithms for diagnosing diseases, as well as experts in treatment
planning, stand to gain from this work's conclusions.
References
1. Grisold, W., Soffietti, R., Oberndorfer, S., & Cavaletti, G. (Eds.). (2021). Effects of cancer
treatment on the nervous system.
2. Tang, J., Rangayyan, R. M., Xu, J., El Naqa, I., & Yang, Y. (2009). Computer-aided detection
and diagnosis of breast cancer with mammography: Recent advances. IEEE Transactions on
Information Technology in Biomedicine, 13(2), 236–251.
3. Munir, K., Elahi, H., Ayub, A., Frezza, F., & Rizzi, A. (2019). Cancer diagnosis using deep
learning: A bibliographic review. Cancers, 11(9), 1235.
4. Huang, S., Yang, J., Fong, S., & Zhao, Q. (2020). Artificial intelligence in cancer diagnosis
and prognosis: Opportunities and challenges. Cancer letters, 471, 61–71.
5. Cancer Facts and Figures. (2019). American Cancer Society. https://www.cancer.org/content/
dam/cancer-org/research/cancer-facts-and-statistics/annualcancerfacts-andfigures/2019/can
cer-facts-and-figures-2019.pdf
6. Bhardwaj, P., Guhan, T., & Tripathy, B. K. (2021). Computational biology in the lens of CNN.
In S. S. Roy, Y. H. Taguchi (eds.), Handbook of machine learning applications for genomics
(Chapter 5). Studies in Big Data. ISBN: 978-981-16-9157-7.
7. Tripathy, B. K., & Anuradha, J. (2015). Soft computing-advances and applications. Cengage
Learning Publishers, New Delhi. ASIN : 8131526194. ISBN-10: 9788131526194.
8. Rungta, R. K., Jaiswal, P, & Tripathy, B. K. (2022) A deep learning based approach to measure
confidence for virtual interviews. In A. K. Das et al. (Eds.), Proceedings of the 4th International
Conference on Computational Intelligence in Pattern Recognition (CIPR), CIPR 2022 (pp. 278–
291). LNNS 480.
9. Bhandari, A., Tripathy, B. K., Jawad, K., Bhatia, S., Rahmani, M. K. I., & Mash, A. (2022).
Cancer detection and prediction using genetic algorithms. Comput Intell Neurosci 2022, 18.
https://doi.org/10.1155/2022/1871841
10. Allahyar, A., Ubels, J., & de Ridder, J. (2019). A data-driven interactome of synergistic genes
improves network-based cancer outcome prediction. PLoS Computational Biology, 15(2),
e1006657.
11. Adate, A., Tripathy, B. K., Arya, D., & Shaha, A. (2020) Impact of deep neural learning on
artificial intelligence research. In S. Bhattacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy
(Eds.), Deep learning research and applications (pp.69–84). De Gruyter Publications. https://
doi.org/10.1515/9783110670905-004
12. Mitchell, M. J., Jain, R. K., & Langer, R. (2017). Engineering and physical sciences in oncology:
Challenges and opportunities. Nature Reviews Cancer, 17(11), 659–675.
13. Obermeyer, Z., & Emanuel, E. J. (2016). Predicting the future—big data, machine learning,
and clinical medicine. The New England Journal of Medicine, 375(13), 1216.
14. Graves, A., Mohamed, A. R., & Hinton, G. (2013). Speech recognition with deep recurrent
neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal
Processing (pp. 6645–6649). IEEE.
15. Bhattacharyya, D. S., Snasel, V., Hassanian, A. E., Saha, S., & Tripathy, B. K. (2020). Deep
learning research with engineering applications. De Gruyter Publications. ISBN: 3110670909,
9783110670905. https://doi.org/10.1515/9783110670905
16. Bose, A., & Tripathy, B. K. (2020) Deep learning for audio signal classification. In S. Bhat-
tacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy (Eds.), Deep learning research and
applications (pp. 105–136). De Gruyter Publications. https://doi.org/10.1515/9783110670905-
00660
17. Singhania, U., & Tripathy, B. K. (2021). Text-based image retrieval using deep learning. In
Encyclopedia of information science and technology (5th edn, p. 11). https://doi.org/10.4018/
978-1-7998-3479-3.ch007
18. Yagna Sai Surya, K., Geetha Rani, T., & Tripathy, B. K. (2022). Social distance monitoring
and face mask detection using deep learning. In J. Nayak, H. Behera, B. Naik, S. Vimal, & D.
Pelusi (Eds.), Computational intelligence in data mining (Vol. 281). Smart Innovation, Systems
and Technologies. Springer, Singapore. https://doi.org/10.1007/978-981-16-9447-9_36
19. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In International Conference on Machine Learning (pp. 448–
456). PMLR.
20. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected
convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (pp. 4700–4708).
21. Kyi, C. W., Birriel, P. C., Davidsen, T. M., Ferguson, M. L., Gesuwan, P., Griner, N. B., Gerhard,
D. S., et al. (2020). NCI office of cancer genomics supports multidisciplinary genomics research
initiatives to advance precision oncology. Cancer Research, 80(16_Supplement), 5862–5862.
22. Pogorelov, K., Randel, K. R., Griwodz, C., Eskeland, S. L., de Lange, T., Johansen, D.,
Halvorsen, P., et al. (2017). Kvasir: A multi-class image dataset for computer aided gastroin-
testinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference
(pp. 164–169).
23. Mesri, M., An, E., Hiltke, T., Robles, A. I., Rodriguez, H., & CPTAC Investigators. (2022).
NCI’s clinical proteomic tumor analysis consortium: A proteogenomic cancer analysis
program. Cancer Research, 82(12_Supplement), 6331–6331.
24. Gupta, P., Bhachawat, S., Dhyani, K., & Tripathy, B. K. (2021). A study of gene characteristics
and their applications using deep learning, (Chapter 4). In S. S. Roy, & Y. H. Taguchi (Eds.),
Handbook of Machine Learning Applications for Genomics (Vol. 103). Studies in Big Data.
ISBN: 978-981-16-9157-7.
25. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8),
1735–1780.
26. Maheswari, K., Shaha, A., Arya, D., Tripathy, B. K., & Rajkumar, R. (2020). Convolutional
neural networks: A bottom-up approach. In S. Bhattacharyya, A. E. Hassanian, S. Saha, & B.
K. Tripathy (Eds.), Deep Learning Research with Engineering Applications (pp. 21–50). De
Gruyter Publications. https://doi.org/10.1515/9783110670905-002
27. Tripathy, B. K., & Deepthi, P. H. (2015). Application of spatial FCM in detecting cancer cells.
IIMT Research Network (pp. 1–6, 96–100). ISBN 878-93-82208-77-8.
28. Zhong, Z., Sun, L., & Huo, Q. (2019). An anchor-free region proposal network for Faster
R-CNN-based text detection approaches. International Journal on Document Analysis and
Recognition (IJDAR), 22(3), 315–327.
29. Hanefi Calp, M. (2021). Use of deep learning approaches in cancer diagnosis. In Deep Learning
for Cancer Diagnosis (pp. 249–267). Springer, Singapore.
30. Karahan, Ş., & Akgül, Y. S. (2016). Eye detection by using deep learning. In 2016 24th Signal
Processing and Communication Application Conference (SIU) (pp. 2145–2148). IEEE.
31. İnik, Ö., & Ülker, E. (2017). Derin öğrenme ve görüntü analizinde kullanılan derin öğrenme
modelleri [Deep learning and deep learning models used in image analysis]. Gaziosmanpaşa
Bilimsel Araştırma Dergisi, 6(3), 85–104.
32. Şeker, A., Diri, B., & Balık, H. H. (2017). Derin öğrenme yöntemleri ve uygulamaları hakkında
bir inceleme [A review of deep learning methods and applications]. Gazi Mühendislik Bilimleri
Dergisi, 3(3), 47–64.
33. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends® in Machine
Learning, 2(1), 1–127.
34. Tripathy, B. K., Raju, H., & Kaul, D. (2018). Deep learning in health care, accepted in deep
learning for remote sensing and GIS: Frontier advancements and applications. In V. Santhi
(Eds.) CRC publications
35. Ravì, D., Wong, C., Deligianni, F., Berthelot, M., Andreu-Perez, J., Lo, B., & Yang, G. Z. (2016).
Deep learning for health informatics. IEEE Journal of Biomedical and Health Informatics,
21(1), 4–21.
36. Küçük, D., & Arici, N. (2018). Doğal dil işlemede derin öğrenme uygulamaları üzerine bir
literatür çalışması [A literature study on deep learning applications in natural language
processing]. Uluslararası Yönetim Bilişim Sistemleri ve Bilgisayar Bilimleri Dergisi, 2(2), 76–86.
37. Ohmori, M., Ishihara, R., Aoyama, K., Nakagawa, K., Iwagami, H., Matsuura, N., & Tada,
T., et al. (2020). Endoscopic detection and differentiation of esophageal lesions using a deep
neural network. Gastrointestinal Endoscopy, 91(2), 301–309.
38. Schwyzer, M., Ferraro, D. A., Muehlematter, U. J., Curioni-Fontecedro, A., Huellner, M. W.,
Von Schulthess, G. K., Kaufmann, P. A., Burger, I. A., & Messerli, M. (2018). Automated
detection of lung cancer at ultralow dose PET/CT by deep neural networks–initial results. Lung
Cancer, 126, 170–173.
39. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Rabinovich, A., et al. (2015).
Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (pp. 1–9).
40. Sihare, P., Ullah Khan, A., Bardhan, P., & Tripathy, B. K. (2022). COVID-19 detection using
deep learning: A comparative study of segmentation algorithms. In A. K. Das et al. (Eds.),
Proceedings of the 4th International Conference on Computational Intelligence in Pattern
Recognition (CIPR) (pp. 1–10), CIPR 2022, LNNS 480.
41. Dai, J., Li, Y., He, K., & Sun, J. (2016). R-fcn: Object detection via region-based fully
convolutional networks. Advances in Neural Information Processing Systems, 29.
42. Raina, R., Madhavan, A., & Ng, A. Y. (2009). Large-scale deep unsupervised learning using
graphics processors. In Proceedings of the 26th Annual International Conference on Machine
Learning (pp. 873–880).
43. Tripathy, B. K., Dash, S., & Patro, B. N. (2012). Study of classification accuracy of microarray
data for cancer classification using multivariate and hybrid feature selection method. IOSR
Journal of Engineering (IOSRJEN), 2(8), 112–119 ISSN: 2250-302.
44. Adate, A., & Tripathy, B. K. (2017). Understanding single image super-resolution techniques
with generative adversarial networks. In J. Bansal, K. Das, A. Nagar, K. Deep, & A. Ojha (Eds.),
Soft computing for problem solving (Vol. 816, pp. 833–840). Advances in Intelligent Systems
and Computing. Springer.
45. Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., & Alsaadi, F. E. (2017). A survey of deep neural
network architectures and their applications. Neurocomputing, 234, 11–26.
46. Mustafa, H. T., Yang, J., & Zareapoor, M. (2019). Multi-scale convolutional neural network
for multi-focus image fusion. Image and Vision Computing, 85, 26–35.
47. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
48. Kaul, D., Raju, H., & Tripathy, B. K. (2022). Deep learning in healthcare. In D. P. Acharjya, A.
Mitra, & N. Zaman (Eds.), Deep learning in data analytics, deep learning in data analytics-recent
techniques, practices and applications (Vol. 91, pp. 97–115). Studies in Big Data. Springer,
Cham. https://doi.org/10.1007/978-3-030-75855-4_6
49. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image
recognition. Preprint retrieved from arXiv:1409.1556.
50. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Fei-Fei, L., et al. (2015).
Imagenet large scale visual recognition challenge. International Journal of Computer Vision,
115(3), 211–252.
51. Tripathy, B. K., Garg, N., & Nikhitha, P. (2014). Image retrieval using latent feature learning
by deep architecture. In Proceedings of the IEEE ICCIC2014 (pp. 663–666)
52. Targ, S., Almeida, D., & Lyman, K. (2016). Resnet in resnet: Generalizing residual architec-
tures. Preprint retrieved from arXiv:1603.08029.
53. Tripathy, B. K., Parikh, S., Ajay, P., & Magapu, C.: Brain MRI segmentation techniques based
on CNN and its variants (Chapter-10). In J. Chaki (Ed.), Brain tumor MRI image segmentation
using deep learning techniques (pp.161–182.). Elsevier publications. https://doi.org/10.1016/
B978-0-323-91171-9.00001-6
54. Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T., & Ronneberger, O. (2016). 3D U-Net:
learning dense volumetric segmentation from sparse annotation. In International Conference
on Medical Image Computing and Computer-Assisted Intervention (pp. 424–432). Springer,
Cham.
55. Baktha, K., & Tripathy, B. K. (2017). Investigation of recurrent neural networks in the field
of sentiment analysis. In International Conference on Communication and Signal Processing
(ICCSP), (pp. 2047–2050). https://doi.org/10.1109/ICCSP.2017.8286763
56. Adate, A., & Tripathy, B. K. (2019). S-LSTM-GAN: Shared recurrent neural networks with
adversarial training. In A. Kulkarni, S. Satapathy, T. Kang, A. Kashan (Eds.), Proceedings of
the 2nd International Conference on Data Engineering and Communication Technology (Vol.
828, pp. 107–115). Advances in Intelligent Systems and Computing. Springer, Singapore.
57. Loey, M., El-Sawy, A., & El-Bakry, H. (2017). Deep learning autoencoder approach for
handwritten arabic digits recognition. Preprint retrieved from arXiv:1706.06720.
58. Thomas, S. A., Race, A. M., Steven, R. T., Gilmore, I. S., & Bunch, J. (2016). Dimensionality
reduction of mass spectrometry imaging data using autoencoders. In 2016 IEEE Symposium
Series on Computational Intelligence (SSCI) (pp. 1–7). IEEE.
59. Keyvanrad, M. A., & Homayounpour, M. M. (2014). A brief survey on deep belief networks and
introducing a new object oriented toolbox (DeeBNet). Preprint retrieved from arXiv:1408.3264.
60. Hinton, G. E. (2009). Deep belief networks. Scholarpedia, 4(5), 5947.
61. Jeong, J. (2017). Deep learning for cancer screening in medical imaging. Hanyang Medical
Reviews, 37(2), 71–76.
62. Pereira, G. C., Traughber, M., & Muzic, R. F. (2014). The role of imaging in radiation therapy
planning: past, present, and future. BioMed Research International.
63. Adate, A., & Tripathy, B. K. (2018) Deep learning techniques for image processing. In S.
Bhattacharyya, H. Bhaumik, A. Mukherjee, & S. De (Eds.), Machine learning for big data
analysis (pp. 69–90). De Gruyter, Berlin, Boston. https://doi.org/10.1515/9783110551433-
00357
64. Jain, S., Singhania, U., Tripathy, B., Nasr, E. A., Aboudaif, M. K., & Kamrani, A. K. (2021).
Deep learning-based transfer learning for classification of skin cancer. Sensors (Basel), 21(23),
8142. https://doi.org/10.3390/s21238142
65. Tong, N., Lu, H., Ruan, X., & Yang, M. H. (2015). Salient object detection via bootstrap
learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(pp. 1884–1892).
66. Kallenberg, M., Petersen, K., Nielsen, M., Ng, A. Y., Diao, P., Igel, C., Lillholm, M., et al.
(2016). Unsupervised deep learning applied to breast density segmentation and mammographic
risk scoring. IEEE Transactions on Medical Imaging, 35(5), 1322–1331.
67. Wang, H., Roa, A. C., Basavanhally, A. N., Gilmore, H. L., Shih, N., Feldman, M., Madabhushi,
A., et al. (2014). Mitosis detection in breast cancer pathology images by combining handcrafted
and convolutional neural network features. Journal of Medical Imaging, 1(3), 034003.
68. Ertosun, M. G., & Rubin, D. L. (2015). Probabilistic visual search for masses within mammog-
raphy images using deep learning. In 2015 IEEE International Conference on Bioinformatics
and Biomedicine (BIBM) (pp. 1310–1315). IEEE.
69. Turkki, R., Linder, N., Kovanen, P. E., Pellinen, T., & Lundin, J. (2016). Antibody-supervised
deep learning for quantification of tumor-infiltrating immune cells in hematoxylin and eosin
stained breast cancer samples. Journal of Pathology Informatics, 7(1), 38.
70. Huang, Z., Zhan, X., Xiang, S., Johnson, T. S., Helm, B., Yu, C. Y., Huang, K., et al. (2019).
SALMON: Survival analysis learning with multi-omics neural networks on breast cancer.
Frontiers in Genetics, 10, 166.
71. Shimizu, H., & Nakayama, K. I. (2019). A 23 gene–based molecular prognostic score precisely
predicts overall survival of breast cancer patients. eBioMedicine, 46, 150–159.
72. Liu, Y., Gadepalli, K., Norouzi, M., Dahl, G. E., Kohlberger, T., Boyko, A., Stumpe, M. C.,
et al. (2017). Detecting cancer metastases on gigapixel pathology images. Preprint retrieved
from arXiv preprint arXiv:1703.02442.
73. Cruz-Roa, A., Gilmore, H., Basavanhally, A., Feldman, M., Ganesan, S., Shih, N. N.,
Tomaszewski, J., González, F. A., & Madabhushi, A. (2017). Accurate and reproducible inva-
sive breast cancer detection in whole-slide images: A deep learning approach for quantifying
tumor extent. Scientific Reports, 7(1), 1–14.
74. Yap, M. H., Pons, G., Marti, J., Ganau, S., Sentis, M., Zwiggelaar, R., Davison, A. K., & Marti,
R. (2017). Automated breast ultrasound lesions detection using convolutional neural networks.
IEEE Journal of Biomedical and Health Informatics, 22(4), 1218–1226.
75. Das, A., Acharya, U. R., Panda, S. S., & Sabut, S. (2019). Deep learning based liver cancer detec-
tion using watershed transform and Gaussian mixture model techniques. Cognitive Systems
Research, 54, 165–175.
76. Devi, P., & Dabas, P. (2015). Liver tumor detection using artificial neural networks for medical
images. International Journal of Innovative Reserach Science Technology, 2(3), 34–38.
77. Li, W. (2015). Automatic segmentation of liver tumor in CT images with deep convolutional
neural networks. Journal of Computer and Communications, 3(11), 146.
78. Gruetzemacher, R., & Gupta, A. (2016). Using deep learning for pulmonary nodule detection &
diagnosis.
79. Golan, R., Jacob, C., & Denzinger, J. (2016). Lung nodule detection in CT images using deep
convolutional neural networks. In 2016 International Joint Conference on Neural Networks
(IJCNN) (pp. 243–250). IEEE.
80. Kuan, K., Ravaut, M., Manek, G., Chen, H., Lin, J., Nazir, B., Chen, C., Howe, T. C., Zeng,
Z., & Chandrasekhar, V. (2017). Deep learning for lung cancer detection: tackling the kaggle
data science bowl 2017 challenge. Preprint retrieved from arXiv:1705.09435.
81. Jafari, M. H., Karimi, N., Nasr-Esfahani, E., Samavi, S., Soroushmehr, S. M. R., Ward, K., &
Najarian, K. (2016). Skin lesion segmentation in clinical images using deep learning. In 2016
23rd International Conference on Pattern Recognition (ICPR) (pp. 337–342). IEEE.
82. Sabouri, P., & GholamHosseini, H. (2016). Lesion border detection using deep learning. In
2016 IEEE Congress on Evolutionary Computation (CEC) (pp. 1416–1421). IEEE.
83. Chen, H., Zhao, H., Shen, J., Zhou, R., & Zhou, Q. (2015). Supervised machine learning model
for high dimensional gene data in colon cancer detection. In 2015 IEEE International Congress
on Big Data (pp. 134–141). IEEE.
84. Petalidis, L. P., Oulas, A., Backlund, M., Wayland, M. T., Liu, L., Plant, K., Happerfield, L.,
Freeman, T.C., Poirazi, P., & Collins, V. P. (2008). Improved grading and survival prediction
of human astrocytic brain tumors by artificial neural network analysis of gene expression
microarray data. Molecular Cancer Therapeutics, 7(5), 1013–1024.
85. Liu, S., Zheng, H., Feng, Y., & Li, W. (2017). Prostate cancer diagnosis using deep learning
with 3D multiparametric MRI. In Medical Imaging 2017: Computer-Aided Diagnosis (Vol.
10134, pp. 581–584). SPIE.
86. Tsehay, Y. K., Lay, N. S., Roth, H. R., Wang, X., Kwak, J. T., Turkbey, B. I., Pinto, P. A.,
Wood, B. J., & Summers, R. M. (2017). Convolutional neural network based deep-learning
architecture for prostate cancer detection on multiparametric magnetic resonance images. In
Medical Imaging 2017: Computer-Aided Diagnosis (Vol. 10134, pp. 20–30). SPIE.
87. Havaei, M., Davy, A., Warde, D., Biard, A., Courville, A., Bengio, Y., Pal, C., Jodoin, P. M., &
Larochelle, H. (2017). Brain tumor segmentation with deep neural networks. Medical Image
Analysis, 35, 18–31.