Deep Learning Applications in Image Analysis

This document provides information about a book titled "Deep Learning Applications in Image Analysis". It is edited by Sanjiban Sekhar Roy, Ching-Hsien Hsu, and Venkateshwara Kagita. The book contains 12 chapters that apply various deep learning techniques to solve image-related problems in domains such as medical imaging, plant disease classification, hyperspectral image analysis, and cancer detection. It is intended to inform academic researchers about the state-of-the-art in applying deep learning to image analysis problems. The editors hope readers will benefit from learning about these applications of deep learning.


Studies in Big Data 129

Sanjiban Sekhar Roy


Ching-Hsien Hsu
Venkateshwara Kagita Editors

Deep Learning
Applications
in Image
Analysis
Studies in Big Data

Volume 129

Series Editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Big Data” (SBD) publishes new developments and advances
in the various areas of Big Data, quickly and with high quality. The intent is to
cover the theory, research, development, and applications of Big Data, as embedded
in the fields of engineering, computer science, physics, economics and life sciences.
The books of the series refer to the analysis and understanding of large, complex,
and/or distributed data sets generated from recent digital sources coming from
sensors or other physical instruments as well as simulations, crowd sourcing, social
networks or other internet transactions, such as emails or video click streams and
others. The series contains monographs, lecture notes and edited volumes in Big
Data spanning the areas of computational intelligence including neural networks,
evolutionary computation, soft computing, fuzzy systems, as well as artificial
intelligence, data mining, modern statistics and operations research, as well as
self-organizing systems. Of particular value to both the contributors and the
readership are the short publication timeframe and the world-wide distribution,
which enable both wide and rapid dissemination of research output.
The books of this series are reviewed in a single blind peer review process.
Indexed by SCOPUS, EI Compendex, SCIMAGO and zbMATH.
All books published in the series are submitted for consideration in Web of Science.
Editors
Sanjiban Sekhar Roy
School of Computer Science and Engineering
Vellore Institute of Technology
Vellore, TN, India

Ching-Hsien Hsu
College of Information and Electrical Engineering
Asia University
Musashino, Taiwan

Venkateshwara Kagita
Department of Computer Science
and Engineering
National Institute of Technology Warangal
Warangal, India

ISSN 2197-6503
ISSN 2197-6511 (electronic)
Studies in Big Data
ISBN 978-981-99-3783-7
ISBN 978-981-99-3784-4 (eBook)
https://doi.org/10.1007/978-981-99-3784-4

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
This book is dedicated to my mother
“Papri Roy”
–Sanjiban Sekhar Roy
Preface

In recent times, deep learning applications have achieved cutting-edge results on
various image-related problems. Deep learning models are fascinating because they
can understand images and perform vision tasks without requiring a complex series
of specialized methods. In recent years, deep learning has emerged as the
fastest-growing field in artificial intelligence. It has found widespread application
across various domains, showcasing its effectiveness and rapid development. These
applications range from handwritten character recognition, automatic diagnosis of
COVID-19 from X-ray images, and classification of imbalanced image datasets to
image captioning, vehicle overspeed detection systems, and many others. The topics
included in this book will interest academicians and researchers working on
image-related problems in deep learning and machine learning. Graduates,
postgraduates, and Ph.D. scholars working in these fields will also benefit greatly.
This edited book covers the following works on the applications of deep learning to
various image-related problems.
• Autoencoder and Deep Convolutional Generative Adversarial Network in
  Improving the Performance of Bangla Handwritten Character Recognition
• Deep Learning-Based Approaches Using Feature Selection Methods for
  Automatic Diagnosis of COVID-19 Disease from X-Ray Images
• Image Captioning Using Deep Transfer Learning
• Vehicle Over Speed Detection System
• An Intelligent System for Video-Based Proximity Analysis
• Melanoma Cancer Detection Using Deep Learning
• Plant Diseases Classification Using Neural Network: AlexNet
• Hyperspectral Images: A Succinct Analytical Deep Learning Study
• Chest X-Ray Image Classification of Pneumonia Disease Using EfficientNet and
  InceptionV3
• Detection of Cancer Using Deep Learning Techniques


This book aims to give readers a clear picture of both the theory and the practice
behind the above-mentioned applications by showcasing the uses of deep learning.
We hope that readers will benefit significantly from learning the state of the art of
deep learning applications in the domain of imagery.
Keep reading, learning, and inquiring.

Vellore, TN, India
September 2020

Dr. Sanjiban Sekhar Roy
Professor, School of Computer Science and Engineering
Vellore Institute of Technology
Vellore, India
[email protected]
Contents

Autoencoder and Deep Convolutional Generative Adversarial Network in Improving
the Performance of Bangla Handwritten Character Recognition
Tanzina Akter Tani, Mir Moynuddin Ahmed Shibly, Md. Shoumique Hasan,
Nilofa Yeasmin, and Shamim Ripon

Deep Learning-Based Approaches Using Feature Selection Methods for Automatic
Diagnosis of COVID-19 Disease from X-Ray Images
Burak Taşci

Image Captioning Using Deep Transfer Learning
Tapan Kumar Das

Vehicle Over Speed Detection System
K. Ganesan, N. S. Manikandan, and Vijayan Sugumaran

An Intelligent System for Video-Based Proximity Analysis
Sergey Antonov, Mikhail Bogachev, Pavel Leyba, Aleksandr Sinitca,
and Dmitrii Kaplun

Deep Learning-Based Conjunctival Melanoma Detection Using Ocular Surface Images
Kanchon Kanti Podder, Mohammad Kaosar Alam, Zakaria Shams Siam,
Khandaker Reajul Islam, Proma Dutta, Adam Mushtak, Amith Khandakar,
Shona Pedersen, and Muhammad E. H. Chowdhury

Plant Diseases Classification Using Neural Network: AlexNet
Mohd Anas, Sanjiban Sekhar Roy, Kunwar S. Srivastava, and Jashabir Chakraborty

Hyperspectral Images: A Succinct Analytical Deep Learning Study
L. Sandeep Kumar, G. K. Panda, and B. K. Tripathy

Chest X-Ray Image Classification of Pneumonia Disease Using EfficientNet and
InceptionV3
Neel Ghoshal, Mohd Anas, and Sanjiban Sekhar Roy

Detection of Cancer Using Deep Learning Techniques
Apoorv Singh, Arjunaditya, and B. K. Tripathy
About the Editors

Sanjiban Sekhar Roy is currently a Professor with the School of Computer Science
and Engineering, Vellore Institute of Technology. He received his Ph.D. degree from
the Vellore Institute of Technology, Vellore, India, in 2016. He has edited a handful
of special issues for journals and published numerous articles in high-impact SCI
journals such as IEEE Transactions on Computational Social Systems; Scientific
Reports, Nature; Computers and Electrical Engineering, Elsevier; and many other
reputed journals. Dr. Roy has published nine books with reputed international
publishers such as Springer, Elsevier, and IGI Global. His research interests are deep
learning and advanced machine learning. Dr. Roy was a recipient of the “Diploma of
Excellence” Award for academic research from the Ministry of National Education,
Romania. He was also an Associate Researcher with Ton Duc Thang University,
Ho Chi Minh City, Vietnam, from 2019 to 2020.

Ching-Hsien Hsu is Chair Professor of the College of Information and Electrical
Engineering, Asia University, Taiwan; Professor in the Department of Computer
Science and Information Engineering, National Chung Cheng University; and Research
Consultant, Department of Medical Research, China Medical University Hospital,
China Medical University, Taiwan. His research includes cloud and edge computing,
big data analytics, high performance computing systems, parallel and distributed
systems, artificial intelligence, medical AI, and natural language processing. He
has published 350+ papers in top journals such as IEEE TPDS, IEEE TSC, ACM
TOMM, IEEE TCC, IEEE TETC, IEEE System, and IEEE Network, in top conferences,
and in book chapters in these areas. Dr. Hsu is the editor-in-chief of the International
Journal of Grid and High Performance Computing and the International Journal of
Big Data Intelligence, and serves on the editorial boards of a number of prestigious
journals, including IEEE Transactions on Service Computing, IEEE Transactions on
Cloud Computing, International Journal of Cloud Computing, Journal of Communication
Systems, International Journal of Computational Science, and AutoSoft Journal. He
has been an author/co-author or an editor/co-editor of 10 books from Elsevier,
Springer, IGI Global, World Scientific, and McGraw-Hill. Dr. Hsu has received talent
awards seven times from the Ministry of Science and Technology and the Ministry of
Education, and the distinguished award for excellence in research nine times from
Chung Hua University, Taiwan. Prof. Hsu is president of the Taiwan Association of
Cloud Computing; Chair of the IEEE Technical Committee on Cloud Computing
(TCCLD); Fellow of the IET (IEE); and senior member of the IEEE.

Venkateswara Rao Kagita is an Assistant Professor at NIT Warangal. He obtained
his Ph.D. from the University of Hyderabad. His research interests are data
mining, machine learning, and deep learning, with a specific focus on machine
learning techniques for recommender systems. His research has been published in
various reputed journals and conference proceedings. He has also delivered guest
lectures at several international and national workshops, IITs, NITs, and
universities.
Autoencoder and Deep Convolutional
Generative Adversarial Network
in Improving the Performance of Bangla
Handwritten Character Recognition

Tanzina Akter Tani, Mir Moynuddin Ahmed Shibly, Md. Shoumique Hasan,
Nilofa Yeasmin, and Shamim Ripon

1 Introduction

Handwritten character recognition has been an area of interest among deep learning
researchers and practitioners in recent years. Because of its wide range of possible
applications, a significant number of studies have been carried out on handwritten
text and character recognition in different languages, such as English [1],
Japanese [2], Latin [3], etc. Bangla is the first and official language of Bangladesh,
and it is the fourth most popular language in the world, spoken by almost 300 million
people [4]. Given this large number of native users, handwritten character recognition
of the Bangla language plays a very important role in a wide range of applications,
including bank cheque processing, identifying postal codes, zip code scanning,
interpreting national ID numbers, Bangla optical character recognition (OCR), and
many more [5, 6].
In the Bangla language, there are 11 vowels, 39 consonants, and a considerable
number of vowel diacritics, consonant conjuncts and diacritics, as well as digits,
symbols, and punctuation marks. Recognizing handwritten Bangla characters is
difficult and complicated for a number of reasons: (a) there are many compound
characters in the Bangla alphabet, (b) the forms of certain characters are nearly
identical, and (c) because different people write in different ways, the same
character written by different people will have different forms, sizes, and curvatures.

T. A. Tani · Md. S. Hasan · N. Yeasmin · S. Ripon (B)
Department of Computer Science and Engineering, East West University, Dhaka, Bangladesh
e-mail: [email protected]

M. M. A. Shibly
Department of Computer Science and Engineering, United International University, Dhaka,
Bangladesh
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis,
Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_1

To overcome these problems, several efforts have been made to improve recognition
accuracy. Convolutional neural networks (CNN) [4, 7–9], deep CNN [10], and
ensemble learning methods [11, 12] have been applied in recent years. However, the
scarcity of Bangla datasets and the class imbalance in those datasets are a barrier
to solving the recognition problem. Ensemble methods and image augmentation are
among the many ways to overcome this issue. The generative adversarial network
(GAN) introduced in [13] is another way to produce new instances of data. The
presence of outliers in the dataset can also make recognition difficult, as they
mislead the training of the models. Eliminating outliers can therefore yield
statistically meaningful results.
In Bangla handwriting-related studies, researchers have used different classification
approaches. The authors in [14] suggested a hierarchical method for segmenting
characters from sentences, with a multilayer perceptron (MLP) as the classification
algorithm, whereas a fusion classifier combining an MLP, an RBF network, and an
SVM is suggested in [15]. In [16], Bangla handwriting images are classified into
50 groups by using a multilayer perceptron neural network.
Deep learning methods such as convolutional neural network (CNN)-based
architectures have been used in the majority of recent works. Some of these works
are limited to simple characters [17], while others concentrate on handwritten
digits [18, 19]. Additionally, work has been done on a subset of the compound
characters of the Bangla language [20]. One of the major issues in Bangla handwritten
character recognition is the limited availability of a complete handwritten character
dataset. Generating Bangla handwritten characters is a way to solve this problem.
The deep convolutional generative adversarial network (DCGAN) [21] has been used
by some researchers to generate Bangla handwritten digits [22, 23]. However, little
work has focused on generating complicated curative characters and on classifying
Bangla handwritten characters using them.
Deep neural networks are widely used for analyzing and classifying different types
of images [24–27]. The residual network (ResNet) is one of the prominent neural
network-based architectures and has long been used for image classification and
identification with excellent results. For example, researchers have used transfer
learning with ResNet-50 for malaria cell-image classification [28] and for malicious
software classification [29]. Residual networks have also been applied in some
Bangla handwritten character recognition studies [30–32].
This chapter proposes a two-fold approach with a residual network classifier to
classify Bangla handwritten characters. First, a model is created using a ResNet
variant called ResNet-50 to classify the target dataset, which in this case is the
Ekush [33] dataset. After this baseline classification, the dataset is cleaned by
removing outliers with an autoencoder, and the classification is performed again
using the same ResNet-50 model. Finally, the classes with fewer images are
augmented with additional images generated by DCGAN, so that the number of images
across the classes becomes balanced. This dataset is then classified with the
ResNet-50 model. In the end, a detailed comparative analysis is conducted over the
results obtained from the above-mentioned experiments to measure the strengths of
the adopted methods.

The rest of the chapter is structured as follows. Section 2 covers a detailed review
of the state-of-the-art of Bangla handwritten character recognition. The methods and
materials of this study and elaborate result analysis with discussion are presented in
Sects. 3, 4, and 5. The chapter ends with an appropriate conclusion section.

2 Related Work

The researchers in [4] introduced a CNN model named EkushNet, which produced
satisfactory results on the Ekush [33] and CMATERdb [34] datasets. The authors
report that EkushNet outperformed prior work on Bangla character recognition.
Their model reached 96.90% accuracy on the training set and 97.73% on the
validation set of the Ekush dataset after 50 iterations. The authors also applied
cross-validation on the CMATERdb dataset and found EkushNet to be 95.01% accurate.
Another work [7] applied only a CNN model to Bangla handwritten character
identification and obtained 85.96% accuracy on the test dataset, whereas the
authors in [10] achieved 95% accuracy by using a deep CNN model. Both works used
50 alphabet classes of the Ekush dataset. Another study [20] achieved 95.05%
accuracy on 122 classes of the Ekush dataset using a DCNN model; its authors also
experimented on two other databases, CMATERdb and the BanglaLekha-Isolated
dataset [35]. The authors in [36] reported an accuracy of 98.78% on the CMATERdb
dataset, using five different approaches for classification.
The authors of [11] found that an ensemble of convolutional neural networks
outperforms a single CNN model at recognizing Bangla handwriting. They proposed
a stacked generalization ensemble framework consisting of six CNN models, which
reached 96.72% accuracy on the test set after only 40 epochs. Another study [37]
applied three approaches. First, seven CNN models were applied to recognize
Bangla handwritten characters. Then the best performing model, ResNet-50, which
gave 97.81% accuracy, was used for feature extraction, with classification done by
traditional classification algorithms. In the last step, the authors employed
different ensemble techniques for the classification task. The stacked
generalization ensemble method achieved 98.68% test accuracy, the best result among
all the adopted methods. All the experiments of this study were done on the Ekush
and BanglaLekha-Isolated datasets.

The authors of another study [38] experimented with six CNN models and evaluated
which DCNN model performs best on the CMATERdb [34] dataset. The results showed
that all the DCNN models performed well, with the DenseNet model outperforming
the others. They also pointed out that the DCNN framework works better than other
object recognition methods.
Another work [17] showed that data augmentation can improve handwritten character
identification accuracy. The authors tested their algorithms on the alphabets of
the BanglaLekha-Isolated dataset and obtained 91.81% accuracy without augmented
images and 95.25% accuracy with augmented images. They also compared other machine
learning approaches to assess the efficiency of these methods. The comparative
analysis revealed that CNN outperforms SVM and LSTM with or without data
augmentation. They also tested their proposed approaches on other datasets with
similar characteristics, obtaining 95.07% test accuracy on the 59 classes of the
Ekush dataset.
The performance of a classifier can be enhanced by enlarging the dataset, and GAN
as a data augmentation technique can help to expand the dataset [23]. In [22],
the authors proposed a DCGAN architecture that successfully enlarged four Bangla
handwritten character datasets. Their work focused only on digit datasets, and
they did not attempt to measure CNN model performance. Evidence that adding
GAN-generated images improves classifier performance on handwritten datasets was
provided by another study [23]. The proposed method increased the accuracy on the
MNIST dataset by using the GAN approach. The authors also used GAN on three Indian
handwritten numeral datasets: Bangla, Devanagari, and Oriya. The accuracy improved
on all the datasets. However, the results also showed that combining too many
GAN-generated images with the real dataset might degrade efficiency. Another digit
recognition and generation work [39] proposed a network architecture that achieved
99.44% on the BHAND [40] dataset. After that, the study applied Semi-Supervised GAN
(SGAN) to generate Bangla digits. One more GAN-related work [41] proposed a
conditional GAN-based method for generating character images based on class. That
study used three separate Bangla handwritten character datasets and was able to
produce very realistic images after 1500 epochs. However, the authors did not apply
any classification to the generated images.

The literature review reveals that most of the research has been done with either
CNN or deep CNN models. Apart from performing classification, there have been
very few variations of approach in Bangla handwritten character recognition works.
Only a few studies have combined the GAN method with classifiers. Furthermore,
none of the studies has used outlier detection. The literature review therefore
identifies a knowledge gap concerning outlier identification and elimination. In
this work, both GAN-based augmentation and outlier elimination are explored to
enhance the recognition performance of Bangla handwritten characters.

3 Materials and Methods

In order to improve handwritten character recognition of the Bangla language, a
series of steps has been followed and a set of experiments has been conducted to
evaluate the effectiveness of the proposed model. These steps are described in this
section. The schematic view of the overall steps is shown in Fig. 1.
As shown in Fig. 1, various experiments are conducted over the dataset. Algorithm 1
outlines the pseudocode of the proposed methods. The adopted steps of the model
are described in the subsequent sections.

Fig. 1 Schematic view of the proposed model



Algorithm 1: Bangla Handwritten Character Recognition

1. Procedure Handwritten Character Recognition
   Input: Ekush Dataset as D.
   Output: Prediction of Handwritten characters
2. Rename D with classification labels
3. TRAIN_DATA, TEST_DATA ← TRAIN_TEST_SPLIT(D, 0.7, 0.3)
4. ResNet-50_MODEL ← TRAIN_ResNet-50_MODEL(TRAIN_DATA)
5. prediction_results ← ResNet-50_MODEL(TEST_DATA)
6. performance ← PERFORMANCE_SCORE(prediction_results, Labels of TEST_DATA)
7. For i = 0 to NUMBER_OF_OUTLIER_CLASS do
8.     N ← Sample few inlier images from class i
9.     TEST_SET ← All images from class i - N
10.    OUTLIER_DETECTOR ← Create_autoencoder(N)
11.    OUTLIERS, INLIERS ← OUTLIER_DETECTOR(TEST_SET)
12.    DISCARD(OUTLIERS)
13. Dnew ← New pure dataset after discarding outliers
14. Calculate prediction performance using steps 3-6 using Dnew and continue
    next step
15. G ← GAN_CLASS_DATA
16. For i = 0 to NUMBER_OF_GAN_CLASS do
17.    G ← GAN_CLASS_DATA[i]
18.    For each EPOCH do
19.        For each image_batch in G do
20.            generated_image ← GENERATOR(random_noise, training=TRUE)
21.            real_output ← DISCRIMINATOR(image_batch, training=TRUE)
22.            fake_output ← DISCRIMINATOR(generated_image, training=TRUE)
23.            GENERATOR_LOSS(fake_output)
24.            DISCRIMINATOR_LOSS(real_output, fake_output)
25.            GRADIENTS_OF_GENERATOR // to update the generator
26.            GRADIENTS_OF_DISCRIMINATOR // to update the discriminator
27.            optimizer ← Apply ADAM_OPTIMIZER on GENERATOR, DISCRIMINATOR
28.        if (epoch % 50 == 0) then
29.            Gnew ← Save GENERATED_IMAGES
30. Dnew ← Merge Gnew and Dnew
31. D ← Dnew
32. Calculate prediction performance using steps 3-6 and continue next step
33. COMPARE prediction performance from all three cases
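
Steps 3 to 6 of Algorithm 1 amount to a 70/30 split followed by scoring predictions against the held-out labels. The following is a minimal, library-free sketch of that part only. The function names `train_test_split` and `accuracy`, the fixed seed, and the toy data are illustrative assumptions rather than the chapter's implementation, and the ResNet-50 training step is replaced by a placeholder predictor.

```python
import random


def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle (image, label) pairs and split them, 70/30 by default."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the true labels."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("prediction and label lists differ in length")
    if not true_labels:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


# Toy stand-in for the dataset: ten "images" (ids only) with labels 0 or 1.
toy_dataset = [(f"img_{i}", i % 2) for i in range(10)]
train_data, test_data = train_test_split(toy_dataset)

# Placeholder for the trained classifier: always predicts class 0.
predictions = [0 for _ in test_data]
score = accuracy(predictions, [label for _, label in test_data])
```

Running the same two functions on the original, the outlier-free, and the GAN-balanced datasets would give the three scores that the last step of the algorithm compares.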

Fig. 2 Different handwritten characters from the Ekush dataset

3.1 Dataset

BanglaLekha-Isolated [35], ISI [42], NumtaDB [43], CMATERdb [34], and Ekush
[33] are a few datasets that contain Bangla handwritten characters and numerals.
The Ekush dataset is selected for the experiments in this study because it contains
more classes than any other Bangla handwritten dataset. It consists of basic and
compound characters, numerals, and modifiers. The 122 classes of characters are
grouped into four categories: 10 modifiers, 50 basic characters (11 vowels and
39 consonants), 52 widely used compound letters, and 10 numeral digits. The dataset
contains about 729,750 images. A few images from the dataset are shown in Fig. 2.
The images are grayscale with a size of 28 × 28 pixels.
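
As a quick consistency check on the figures quoted above, the category sizes can be summed to confirm the class count and the per-image input size. The dictionary keys are illustrative labels only.

```python
# Category sizes as stated in the text.
category_sizes = {
    "modifiers": 10,
    "vowels": 11,
    "consonants": 39,
    "compound_letters": 52,
    "numeral_digits": 10,
}

total_classes = sum(category_sizes.values())
basic_characters = category_sizes["vowels"] + category_sizes["consonants"]

# Each image is single-channel and 28 x 28 pixels.
image_shape = (28, 28)
pixels_per_image = image_shape[0] * image_shape[1]
```

The sum is 122 classes, of which 50 are basic characters, and each image contributes 784 input values.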

3.2 Outlier Detection

Outliers increase the uncertainty of the results and lower statistical power, so
removing them can lead to statistically significant results. In the Ekush dataset,
Bangla handwritten characters are categorized into 122 classes. While individual
characters were being grouped into their respective classes, some were placed in
the wrong classes. As a result, a few characters from various classes are mixed
up [37]. As mentioned earlier, some Bangla handwritten characters bear a striking
resemblance to one another. During pre-processing of the dataset, it was discovered
that classes 87 and 97, classes 19 and 84, and classes 69, 76, 110, and 111 all
contain outliers of one another due to their resemblance. Since these character
instances are anomalous only in a specific context, they are termed contextual
outliers. To locate outliers in the individual classes, a semi-supervised outlier
detection approach using autoencoders has been applied.
Fig. 3 presents the process of outlier detection and of preparing a purer dataset.
Initially, the images of the 122 classes were analyzed and the eight classes that
potentially contain more outliers were identified. From each class that contains
more than 3000 images, 1000 inlier images were selected to form the training set
for the autoencoder-based outlier detection model. For classes with fewer than
3000 images, the training set contained 500 images. No outlier image was fed to
the outlier detection model during training. After training, the rest of the images
were tested using the model and the outliers were identified. The inlier images,
along with the previously selected pure training set, were then used to develop
robust classifiers.

Fig. 3 Workflow of the outlier detection
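
The selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the inlier pool is assumed to have been verified by hand beforehand, and the behavior at exactly 3000 images (which the text does not specify) is an assumption.

```python
import random


def inlier_training_size(class_size, large_threshold=3000, large_n=1000, small_n=500):
    """1000 training images for classes with more than 3000 images, else 500.

    A class with exactly 3000 images falls into the smaller bucket here;
    that boundary case is an assumption.
    """
    return large_n if class_size > large_threshold else small_n


def split_for_outlier_detection(inlier_ids, all_ids, seed=0):
    """Return (training_ids, ids_to_screen) for one class.

    training_ids are sampled from the verified inliers; every other image
    in the class is left for the trained autoencoder to screen.
    """
    n = min(inlier_training_size(len(all_ids)), len(inlier_ids))
    training = random.Random(seed).sample(sorted(inlier_ids), n)
    chosen = set(training)
    to_screen = [i for i in all_ids if i not in chosen]
    return training, to_screen


# Hypothetical class with 3500 images, 2000 of them hand-checked as inliers.
all_ids = list(range(3500))
verified_inliers = all_ids[:2000]
train_ids, screen_ids = split_for_outlier_detection(verified_inliers, all_ids)
```

For this hypothetical class, 1000 images train the autoencoder and the remaining 2500 are screened for outliers.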

3.2.1 Autoencoder

Autoencoders are special neural networks that learn lower-dimensional features of
complex data from unlabeled data [44] and then try to reconstruct the original
input from the simpler encoded features. This type of neural network has been
shown to perform well in numerous fields, such as generative models, classification,
clustering, recommender systems, and dimensionality reduction [45]. In this work,
it is used as an outlier detection model.
A convolutional autoencoder, depicted in Fig. 4, has been used for detecting
outliers in the Bangla handwritten character dataset. The network consists of three
major components: an encoder network, a bottleneck layer, and a decoder network.
The encoder network starts with an input layer followed by three convolutional
layers. The output of the last convolutional layer is flattened and passed through
a dense layer, which produces a vector containing features in a lower dimension.
This vector is known as the bottleneck and is followed by the decoder network. The
job of the decoder is to reproduce the input as closely as possible. It uses
transposed convolutional layers, which perform the inverse operation of typical
convolutional layers.
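
To make the encoder, bottleneck, and decoder structure concrete, the sketch below traces tensor shapes through such a network for a 28 × 28 grayscale input. The chapter does not state strides, filter counts, or the bottleneck size, so the values used here (strides 1, 2, 2; filters 32, 64, 64; a 32-dimensional bottleneck; "same" padding; a dense layer and reshape at the start of the decoder) are assumptions chosen only so that the decoder returns to the input size.

```python
import math


def conv_same(size, stride):
    """Spatial size after a 'same'-padded convolution."""
    return math.ceil(size / stride)


def conv_transpose_same(size, stride):
    """Spatial size after a 'same'-padded transposed convolution."""
    return size * stride


def trace_shapes(input_size=28, strides=(1, 2, 2), filters=(32, 64, 64), latent_dim=32):
    """List (layer_name, shape) pairs for the assumed autoencoder layout."""
    shapes = [("input", (input_size, input_size, 1))]
    size = input_size
    # Encoder: three convolutional layers.
    for k, (stride, n_filters) in enumerate(zip(strides, filters), start=1):
        size = conv_same(size, stride)
        shapes.append((f"conv{k}", (size, size, n_filters)))
    flat = size * size * filters[-1]
    shapes.append(("flatten", (flat,)))
    shapes.append(("bottleneck", (latent_dim,)))
    # Decoder: expand, reshape, then mirror the encoder with transposed convolutions.
    shapes.append(("dense_expand", (flat,)))
    shapes.append(("reshape", (size, size, filters[-1])))
    mirrored = zip(reversed(strides), reversed(filters))
    for k, (stride, n_filters) in enumerate(mirrored, start=1):
        size = conv_transpose_same(size, stride)
        shapes.append((f"conv_transpose{k}", (size, size, n_filters)))
    shapes.append(("output", (size, size, 1)))
    return shapes


layer_shapes = trace_shapes()
```

Under these assumptions the spatial size shrinks from 28 to 14 to 7 in the encoder and grows back to 28 in the decoder, so the reconstruction can be compared pixel by pixel with the input.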

Fig. 4 Convolutional autoencoder for outlier detection

The autoencoder network has been trained with only a few inlier images. The
intuition behind using only inlier images is to make the model familiar with
what is normal, so that at test time it reconstructs outlier images poorly and
their reconstruction error becomes high. Images whose reconstruction error exceeds
a chosen threshold have then been labeled as outliers and discarded from the
dataset. The reconstruction error is calculated as the mean squared error between
an image and its reconstruction.
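The thresholding rule described above can be sketched in a few lines. This is a minimal illustration rather than the chapter's implementation: the function names, the flat-list image representation, the toy pixel values, and the threshold of 0.1 are all assumptions made for the example, and the reconstructions would in practice come from the trained autoencoder.

```python
def mean_squared_error(original, reconstructed):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def flag_outliers(originals, reconstructions, threshold):
    """Return indices of images whose reconstruction error exceeds the threshold."""
    return [
        i
        for i, (o, r) in enumerate(zip(originals, reconstructions))
        if mean_squared_error(o, r) > threshold
    ]


# Toy data: hypothetical 4-pixel "images" and made-up reconstructions.
originals = [[0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
reconstructions = [[0.1, 0.9, 1.0, 0.0], [0.0, 0.2, 0.9, 1.0]]
outlier_indices = flag_outliers(originals, reconstructions, threshold=0.1)
```

In this toy case only the second image is reconstructed poorly, so it is the only one flagged. How the threshold is chosen is not specified here and is left as a free parameter.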

3.3 Generative Adversarial Network

The Ekush Bangla handwritten dataset contains several imbalanced classes. Data
augmentation is one way of generating additional images in order to balance a
dataset. Conventional augmentation approaches such as rotation and scaling can
expand a dataset but do not always add information. A Generative Adversarial
Network (GAN), on the other hand, can generate synthetic images that bring
additional information to the dataset. We have chosen a deep convolutional
generative adversarial network (DCGAN) because it has been reported to be an
effective architecture for improving classification and identification [46]. We
have taken only five classes from the Ekush dataset, as these classes have far
fewer images than the others. For those of the five classes from which outliers
have also been removed, the outlier-removed images are used as input data in the
proposed GAN model. Table 1 shows the classes that have been used in DCGAN.
A generative adversarial network is a method for creating new synthetic data
that consists of two models: a generator and a discriminator. The generator
attempts to create a new image from random noise and feeds it into the
discriminator, which determines whether the image is fake or real. If the
discriminator judges it to be fake, the generator tries again to create an image
that deceives the discriminator. This contest between the two models continues
until the generator produces synthetic images that the discriminator is unable to
distinguish from real ones. A general view of GAN is shown in Fig. 5.
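The contest between the two models can be made concrete through their loss functions. The sketch below uses binary cross-entropy, which this chapter names as its loss later on; the function names and the per-sample formulation are assumptions for illustration, and the generator loss shown is the commonly used form that rewards fake images being scored as real, which may differ in detail from the authors' code.

```python
import math


def binary_cross_entropy(label, prob, eps=1e-7):
    """Binary cross-entropy for one prediction; prob is clipped for stability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def discriminator_loss(prob_real, prob_fake):
    """The discriminator wants real images scored 1 and generated images scored 0."""
    return binary_cross_entropy(1, prob_real) + binary_cross_entropy(0, prob_fake)


def generator_loss(prob_fake):
    """The generator wants its images to be scored as real (label 1)."""
    return binary_cross_entropy(1, prob_fake)
```

Here `prob_real` and `prob_fake` stand for the discriminator's sigmoid outputs on a real and a generated image. The generator's loss falls as `prob_fake` rises, while the discriminator's loss rises, which is the opposition described in the text.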
Although a few experimental settings have been altered, we have adopted the DCGAN
model presented in [20], as that approach has achieved good results for generating
Ekush dataset images. The DCGAN architecture is described briefly in the following
sections. A CNN is used for both the discriminator and the generator in the DCGAN
architecture. Before being passed to the DCGAN model, the images have been prepared
by converting all the

Table 1 Classes used in DCGAN

Class No    Number of images    Image example
72          4186                [image]
76          4261                [image]
97          4100                [image]
110         2012                [image]
111         986                 [image]

Fig. 5 General overview of GAN

images into a single channel and by resizing them all to 28 × 28 pixels. After that,
all the images are normalized to the range [-1, 1]. The GAN model has been run
separately for each of the chosen classes.
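The normalization step can be illustrated as follows. This is a small sketch under the assumption that the source images are 8-bit grayscale with pixel values from 0 to 255; the function names are hypothetical, and the inverse mapping is included only because generated images in [-1, 1] usually need to be converted back before they are saved.

```python
def normalize_pixels(pixels):
    """Map 8-bit grayscale values in [0, 255] to floats in [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]


def denormalize_pixels(values):
    """Map values in [-1, 1] back to integers in [0, 255]."""
    return [int(round((v + 1.0) * 127.5)) for v in values]
```

Dividing by 127.5 and subtracting 1 sends 0 to -1 and 255 to 1, which matches the output range of a Tanh activation.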

3.3.1 Generator Architecture

In a GAN, the generator model is used to create new images from a random variable.
A random noise vector of size 100 is given as input to the generator of our model.
This is forwarded to a dense layer with 1024 hidden units. To keep the GAN
model stable, batch normalization is used in both the generator and the
discriminator [21]. The ReLU activation function is used in all layers except the
output layer, where the Tanh activation is used. The Tanh activation function keeps
the output pixel values in the [-1, 1] range, matching the images that are later
used as the discriminator input [23]. Next, another

Fig. 6 Generator architecture

dense layer having 6272 neurons is used. After the subsequent batch normalization
and ReLU activation, the output is reshaped. The generator must then upsample this
data to produce an output image of the required size. Two convolutional layers have
been used: the first has 64 filters and a kernel size of 3, and the second has one
filter and a kernel size of 3. Zero padding has been applied in both layers. A 2D
upsampling layer is used just before each convolutional layer. The architecture of
the generator model is given in Fig. 6.
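To see how these layers arrive at a 28 × 28 × 1 image, the tensor shapes can be traced by hand. The sketch below does this arithmetic in plain Python. Two details are assumptions, since the text does not state them: that the 6272-unit output is reshaped to 7 × 7 × 128 (which is consistent because 7 · 7 · 128 = 6272), and that each upsampling layer doubles the height and width.

```python
def trace_generator_shapes(noise_dim=100):
    """Return (layer name, output shape) pairs for the generator sketch."""
    shapes = [("noise", (noise_dim,)), ("dense", (1024,)), ("dense", (6272,))]
    h, w, c = 7, 7, 128          # assumed reshape target: 7 * 7 * 128 == 6272
    assert h * w * c == 6272
    shapes.append(("reshape", (h, w, c)))
    for filters in (64, 1):      # filter counts stated in the text
        h, w = h * 2, w * 2      # assumed 2x upsampling before each convolution
        shapes.append(("upsample", (h, w, c)))
        c = filters              # 3x3 convolution; zero padding keeps h and w
        shapes.append(("conv3x3", (h, w, c)))
    return shapes


generator_shapes = trace_generator_shapes()
```

Under these assumptions the spatial size grows from 7 to 14 to 28, and the final one-filter convolution yields a single-channel 28 × 28 image.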

3.3.2 Discriminator Architecture

In the discriminator model, two convolutional layers have been used. The first
convolutional layer has 64 filters, a kernel size of 5, a stride of 2, and ‘same’
padding, and it receives an input of shape 28 × 28 × 1. The second convolutional
layer uses the same kernel size, stride, and padding with 128 filters. The outputs
are then flattened and passed to a dense layer with 256 hidden neurons. The
LeakyReLU activation function with α = 0.2 has been used in every layer of the
discriminator model, as it helps the GAN model perform well [21]. The α parameter
is the slope that LeakyReLU applies to negative inputs; it lets a scaled-down
version of negative values pass through the network, which prevents the dying ReLU
state. After that, a 25% dropout has been used to keep the discriminator model from
overfitting. Finally, a dense layer with a single output unit and sigmoid activation
has been used. The architecture of the discriminator model is given in Fig. 7.
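Two pieces of this description lend themselves to a short worked example: the LeakyReLU function itself and the number of features that reach the dense layer. The sketch assumes the common convention (used, for instance, by Keras) that a convolution with 'same' padding and stride s produces an output of size ceil(n / s); the function names are hypothetical.

```python
import math


def leaky_relu(x, alpha=0.2):
    """LeakyReLU: identity for positive inputs, slope alpha for negative ones."""
    return x if x > 0 else alpha * x


def same_padding_output(size, stride):
    """Spatial output size of a convolution with 'same' padding."""
    return math.ceil(size / stride)


def discriminator_flatten_size(input_size=28, filters=(64, 128), stride=2):
    """Number of features reaching the dense layer after the conv stack."""
    size = input_size
    for _ in filters:
        size = same_padding_output(size, stride)
    return size * size * filters[-1]
```

With a 28 × 28 input, the two stride-2 convolutions reduce the spatial size to 14 and then 7, so the flattened vector has 7 · 7 · 128 = 6272 entries under this convention.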
Following [21], we have used the Adam optimizer in our DCGAN model. Although
another study [47] has used an Adam optimizer with a learning rate of 1 × 10^-5
and a β1 momentum of 0.1, we have changed the learning rate to 1.5 × 10^-4 and the
momentum term to β1 = 0.5 in both the discriminator and the generator, as we have
found that these values helped to stabilize the training. The β1 term controls the
exponential decay of the running average of the gradient: at each batch step, the
previous average is multiplied by β1 before the new gradient is blended in [48].
Binary cross-entropy has been employed to measure the loss of the discriminator
and the generator. We have used two batch sizes: 64 for classes 110 and 111,
because the number of real images has been limited, and 128 for the

Fig. 7 Discriminator architecture

Table 2 DCGAN generated dataset details

Class no    Epoch number    Total DCGAN image
72          1900            964
76          1750            1360
97          750             896
110         1850            3267
111         3750            3594

other three classes. The model has been trained for 2000 epochs for each class
except class 111, for which it has been trained for 4000 epochs. Class 111 needs
more epochs because its training data are very scarce, which prevents the generator
from producing good-quality synthetic images in the early epochs. We have saved and
inspected the generated images every 50 epochs. For each of the five classes, we
have identified the epoch at which the quality of the synthetic images is good
enough compared to the actual images and have taken the images from that epoch. A
fixed number of images has been taken for each class so that the classification
model is trained with at least 4000 images of each of these classes. Table 2 shows
the total number of images that are added to the actual training dataset.
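The role of β1 in the Adam optimizer, mentioned earlier in this section, can be illustrated with the first-moment update on its own. This is a simplified sketch: it shows only the running average of the gradient and leaves out the second-moment estimate and bias correction that the full optimizer also uses. The function name and the toy gradient sequence are assumptions.

```python
def first_moment_trace(gradients, beta1=0.5):
    """Running average m_t = beta1 * m_{t-1} + (1 - beta1) * g_t, with m_0 = 0."""
    m = 0.0
    trace = []
    for g in gradients:
        m = beta1 * m + (1.0 - beta1) * g
        trace.append(m)
    return trace


# A single gradient of 1 followed by zeros shows how its influence decays.
decay = first_moment_trace([1.0, 0.0, 0.0], beta1=0.5)
```

With β1 = 0.5, the contribution of the first gradient is halved at every subsequent step, so older gradients fade quickly; a value closer to 1 would make the average remember them for longer.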

3.4 Classification

Before applying the classification model, all the images have been resized to 28 ×
28 pixels in grayscale mode. We have used ResNet-50 to classify the 122 classes
of the Ekush dataset. As the name implies, the model consists of 50 layers. A brief
description of ResNet-50 is given in the following section.

3.4.1 ResNet-50

The ResNet-50 architecture uses two kinds of blocks, identity blocks and
convolutional blocks, depending on the dimensions of the input and output. Both
blocks have a skip connection over the main path, which helps the model learn an
identity function. Learning the identity function lets the network effectively
skip layers that do not add to accuracy [49]. In the identity block, there are
three Conv2D layers with stride (1,1) and random initialization with a seed of
zero. Only the second Conv2D has padding. Batch normalization and ReLU activation
follow each Conv2D, except that the shortcut is added before the final ReLU
activation. In the convolutional block, the skip connection has a Conv2D layer and
batch normalization, which the identity block does not have. Apart from this, the
structure is almost the same as the identity block. The first Conv2D on the main
path and the Conv2D on the shortcut path have a stride of (s,s), and the rest
have (1,1).
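The difference between the two blocks comes down to what happens on the shortcut path before the addition. The sketch below captures only that idea on plain Python lists; the main path and the shortcut are passed in as arbitrary functions, which stand in for the Conv2D and batch normalization layers and are purely illustrative.

```python
def relu(vector):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in vector]


def residual_block(x, main_path, shortcut=None):
    """Compute relu(main_path(x) + shortcut(x)); the shortcut defaults to identity."""
    skipped = x if shortcut is None else shortcut(x)
    transformed = main_path(x)
    if len(transformed) != len(skipped):
        raise ValueError("main path and shortcut must produce the same size")
    return relu([t + s for t, s in zip(transformed, skipped)])
```

With `shortcut=None` the function behaves like an identity block: if the main path outputs zeros, the block simply passes its (rectified) input through. Supplying a `shortcut` function mimics the convolutional block, where the shortcut must transform the input so that its size matches the main path.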
The ResNet-50 architecture has five stages. The dataset image dimension of
28 × 28 × 1 is given as the input shape to the architecture. The first stage has a
7 × 7 convolutional layer with 32 filters and (1,1) strides, followed by batch
normalization and a 3 × 3 MaxPooling layer. The remaining four stages contain two,
three, five, and two identity blocks, respectively, each together with one
convolutional block. After the five stages, there is an average pooling layer with
(2,2) strides, which reduces the size of the output. Finally, a fully connected
(dense) layer with SoftMax activation produces the prediction over the 122
classes. The diagram of ResNet-50 architecture is given in Fig. 8.
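The stage layout above also accounts for the "50" in the model's name. The count below is a sketch that assumes the usual convention of counting the first convolution, the three Conv2D layers on the main path of every block, and the final dense layer, while leaving out the shortcut convolutions; the constant names are hypothetical.

```python
IDENTITY_BLOCKS_PER_STAGE = (2, 3, 5, 2)   # stages 2-5, as stated in the text
CONV_BLOCKS_PER_STAGE = 1                  # one convolutional block per stage
CONVS_PER_BLOCK = 3                        # three Conv2D layers on each main path


def count_weighted_layers():
    """Count layers under the convention that ignores shortcut convolutions."""
    blocks = sum(n + CONV_BLOCKS_PER_STAGE for n in IDENTITY_BLOCKS_PER_STAGE)
    stem_conv = 1        # the 7 x 7 convolution in stage 1
    final_dense = 1      # the fully connected softmax layer
    return stem_conv + blocks * CONVS_PER_BLOCK + final_dense
```

The four later stages hold 3 + 4 + 6 + 3 = 16 blocks, giving 48 convolutions, and the first convolution and the dense layer bring the total to 50.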
To train the ResNet-50 models, the Adam optimizer with its default learning rate
of 0.001 has been used, with categorical cross-entropy as the loss function. A
batch size of 1024 provided the best accuracy in [11]; following that study, the
batch size has been set to 1024. Furthermore, all the approaches have been trained
for 100 epochs.

Fig. 8 ResNet-50 architecture



Table 3 Overview of the dataset for all approaches

Datasets                           Total number of images    Train set    Test set    Validation set
Original dataset                   729,750                   547,131      109,777     72,842
Outlier removed dataset            727,849                   545,724      109,467     72,658
DCGAN + outlier removed dataset    737,659                   555,224      109,777     72,658

3.5 Train, Test, and Validation

Three different datasets have been prepared after the removal of outliers and
DCGAN image generation. The images in each approach have been split so that 70%
of the images are in the training set, 20% are in the test set, and 10% are in the
validation set. The DCGAN-generated images are added only to the training set,
after the split, to avoid bias. The total number of images and the sizes of the
train, test, and validation sets are presented in Table 3.
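The order of operations matters here: the real images are split first, and synthetic images are added to the training portion only afterwards. The sketch below illustrates that ordering. The function name, the seeded shuffle, and the rounding of subset sizes are assumptions. The default fractions follow the 70/20/10 split stated in the text; the counts reported in Table 3 work out closer to 75/15/10, so the fractions are left as a parameter.

```python
import random


def split_then_augment(real_items, synthetic_items, fractions=(0.70, 0.20, 0.10), seed=0):
    """Split real items into train/test/validation, then add synthetic items to train only."""
    train_frac, test_frac, _ = fractions
    items = list(real_items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train_frac)
    n_test = round(len(items) * test_frac)
    train = items[:n_train] + list(synthetic_items)
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]
    return train, test, validation


# Toy example with tagged items so the origin of each one can be checked.
real = [("real", i) for i in range(100)]
synthetic = [("synthetic", i) for i in range(10)]
train, test, validation = split_then_augment(real, synthetic)
```

Because the synthetic items are appended after the split, none of them can end up in the test or validation sets, so the reported test accuracy is measured on real images only.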

4 Results

To improve the performance of Bangla handwritten character recognition, first,
a semi-supervised image outlier detection model has been proposed, and second,
a generative adversarial network model has been used to balance the dataset. For
both strategies, a subset of the 122 classes has been chosen based on the
recommendation made by other works [37] and on domain knowledge regarding
Bangla handwritten characters. Outliers have been excluded from 7 classes using an
autoencoder-based model, and 5 classes have been balanced using the DCGAN
model. In this section, the outcomes of the experiments are explained in detail.

4.1 Result Analysis

The Ekush dataset has been classified using the ResNet-50 framework on three
different datasets (the original dataset, the outlier removed dataset, and the
outlier removed and DCGAN implemented dataset). The ResNet-50 model has achieved
97.63% test accuracy on the original dataset consisting of 122 classes. The second
approach, in which outliers are removed from seven classes, has achieved 97.95%
test accuracy. The final approach, in which outliers are removed from the original
dataset and DCGAN-generated images are used to balance the training set, has
achieved 97.92% test accuracy. The precision, recall, F1-score, and accuracy yielded
by the ResNet-50 model on the three approaches are shown in Table 4. It illustrates

Table 4 Performance of all proposed approaches

Methods                            Precision (%)    Recall (%)    F1 Score (%)    Test accuracy (%)
Original dataset                   97.64            97.63         97.62           97.63
Outlier removed dataset            97.96            97.95         97.95           97.95
DCGAN + Outlier removed dataset    97.93            97.92         97.92           97.92

that the ResNet-50 models trained on the outlier-removed dataset and on the dataset
balanced with DCGAN-generated images have both outperformed the model trained on
the original dataset.

4.1.1 Result of Outlier Detection

The model accuracy has improved from 97.63% to 97.95% after outliers are removed
from seven classes of the dataset, demonstrating the benefit of outlier removal.
The same trend can be observed in individual classes. In Table 5, the precision for
classes 76, 87, and 97 has increased by 4, 1, and 5 percentage points, respectively,
compared to the original dataset. In classes 84 and 111, the recall has improved by
1 and 6 percentage points, which also indicates an improvement of the classifier.
For four (76, 84, 97, and 111) of the seven classes, the F1-score is higher than on
the original dataset, indicating that their images have been classified better. The
remaining three classes (19, 69, and 87) have not seen any change in F1-score.
However, a few classes have experienced drops in precision or recall even after the
outliers were removed. One reason is that, although outliers have been eliminated
from those classes, some noise still remains. Another explanation is that
eliminating outliers also removes some of the original images from these classes,
resulting in a dataset that is less balanced than the original. These drops are
small enough to be disregarded. Removing outliers from specific classes has also
reduced the number of false positives and false negatives for classes other than
the ones discussed. This, along with the improved performance in these specific
classes, has been the key ingredient in achieving better overall performance.
Taken together, the results indicate that the classifier performs better when
outliers are excluded from the original dataset.
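Since the discussion above moves between precision, recall, and F1-score, it may help to recall how the three are related. The sketch below computes the F1-score as the harmonic mean of precision and recall and checks it against the class 111 values reported in Table 5; the function name is an assumption, and because the table values are rounded to two decimals, agreement can only be expected up to rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Class 111 in Table 5: (precision, recall) before and after outlier removal.
f1_original = round(f1_score(0.86, 0.62), 2)
f1_outlier_removed = round(f1_score(0.84, 0.68), 2)
```

For class 111 this reproduces the tabulated F1-scores of 0.72 and 0.75. It also shows why the F1-score can rise even when precision falls slightly: the gain in recall outweighs the loss.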
The improvement in classification performance shows the effectiveness of the
autoencoder-based outlier detection model on Bangla handwritten character images.
Table 6 shows the number of images discarded from each of the seven chosen classes.
The greatest number of outliers has been removed from class 19, whereas the
smallest number has been removed from class 111. There is also a correlation between the

Table 5 Classifier evaluation after outlier exclusion

Class    Precision                Recall                   F1-Score
         Original    Outlier      Original    Outlier      Original    Outlier
         dataset     removed      dataset     removed      dataset     removed
                     dataset                  dataset                  dataset
19       0.95        0.95         0.98        0.96         0.96        0.96
69       0.91        0.89         0.94        0.94         0.92        0.92
76       0.90        0.94         0.89        0.89         0.89        0.92
84       0.95        0.95         0.92        0.93         0.93        0.94
87       0.97        0.98         0.97        0.96         0.97        0.97
97       0.88        0.93         0.95        0.94         0.92        0.93
111      0.86        0.84         0.62        0.68         0.72        0.75

Table 6 Outliers removed dataset details

Class    No. of images in the original dataset    No. of outliers removed
19       6180                                     272
69       5676                                     234
76       4446                                     185
84       5788                                     240
87       6136                                     257
97       4264                                     164
111      1028                                     42

number of outliers removed and the original size of each class: the more images a
class has, the more outliers have been removed.
In Fig. 9, a few representative inlier and outlier images from class 69 that are
detected by the model are presented. By looking at the images, one can identify that
the images in the right part of the figure are anomalous, while the images on the
left side are inliers. However, there are cases where outliers have not been accurately
detected, and inliers have been wrongly identified as outliers. Despite this, the overall
outlier detection scheme has been successful as it has improved the ResNet-50 model
performance.
Figure 10 further supports the effectiveness of the outlier detection model. The
inlier images of class 19 have been divided into four batches, and all the images
of each batch have been superimposed into a single image. Each batch consists of
approximately 1500 images. For comparison, the 272 outlier images detected by the
model have also been superimposed into a single image. It is apparent from the
figure that the superimposed inliers retain the inherent shape of the character
even with 1500 images, whereas only 272 outliers are enough to make the
corresponding superimposed image jumbled, which further validates the outlier
detector.
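One simple way to superimpose a batch of images is to take the pixel-wise mean. The text does not say how the superimposition in Fig. 10 was computed, so the averaging below is an assumption, as are the function name and the nested-list image representation.

```python
def superimpose(images):
    """Pixel-wise mean of equally sized images given as 2D lists of floats."""
    if not images:
        raise ValueError("at least one image is required")
    rows, cols = len(images[0]), len(images[0][0])
    return [
        [sum(img[r][c] for img in images) / len(images) for c in range(cols)]
        for r in range(rows)
    ]


# Two hypothetical 1 x 2 images: they agree on the first pixel only.
blended = superimpose([[[1.0, 0.0]], [[1.0, 1.0]]])
```

Pixels on which most images agree stay close to their original intensity, while pixels that vary from image to image are washed out. This is why a batch of consistent inliers keeps the shape of the character and a set of dissimilar outliers does not.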

Fig. 9 Inliers versus outliers in class 69

Fig. 10 Superimposed inliers versus superimposed outliers

4.1.2 Result of Balancing Dataset with DCGAN

Apart from the existence of outliers, the Ekush dataset is also imbalanced.
Five imbalanced classes (72, 76, 97, 110, and 111) have been selected and
their training sets have been balanced using DCGAN-generated images. No
generated image has been added to either the validation or the test set. The test
accuracy of ResNet-50 has improved from 97.63% to 97.92% after adding synthesized
images to the training set. Moreover, Table 7 shows that almost all the evaluation
metrics improve through the use of DCGAN-generated images; class 111 in particular
improves markedly. The exception is class 72, where the precision on the
DCGAN-balanced dataset drops by 2 percentage points (and the F1-score by 1 point)
relative to the original dataset. This means that when the

Table 7 Classifier evaluation after applying DCGAN

Class    Precision                 Recall                    F1-Score
         Original    Balanced      Original    Balanced      Original    Balanced
         dataset     dataset       dataset     dataset       dataset     dataset
                     with DCGAN                with DCGAN                with DCGAN
72       0.99        0.97          0.93        0.94          0.96        0.95
76       0.90        0.97          0.89        0.94          0.89        0.96
97       0.88        0.95          0.95        0.96          0.92        0.96
110      0.97        0.99          0.92        0.93          0.94        0.96
111      0.86        0.93          0.62        0.88          0.72        0.90

classifier predicts that an image belongs to class 72, it is correct less often
than with the original dataset. A possible reason for the lower precision of class
72 is noisy images generated in the DCGAN experiment. The drop is small, however,
and the corresponding recall has increased, which means that the classifier finds
more of the images that actually belong to the class than it did with the original
dataset. There are three classes to which both the outlier detection model and
DCGAN have been applied. For all three classes, ResNet-50 with DCGAN-generated
images has outperformed the ResNet-50 trained on the outlier-removed dataset. The
reason for this improvement is that the DCGAN model has been trained on those
individual classes after the removal of outliers, which has produced good-quality
images. The overall performance indicates that adding the proposed DCGAN-generated
images to the real dataset can improve the classification result.
Figure 11 shows a comparison of original images and DCGAN-generated images.
Without the labels, it is difficult to distinguish between the original and
synthesized images in the figure, which suggests that the DCGAN has generated
good-quality images. For class 111, however, the generated images have not been up
to the mark because fewer images were available to train the DCGAN. Using this
generative adversarial network has helped us to tackle the class imbalance problem.
Excluding the five classes to which the DCGAN has been applied, the average
training set size was nearly 4575 images per class, whereas those five classes had
only 2389 training images per class on average. One class, class 111, had only 770
training instances, which led the classifier to achieve only a 72% F1-score. After
balancing only the training set with 3594 synthesized images, the F1-score has
improved to 90%.
The changes in the training sample sizes are illustrated in Fig. 12. In classes 110
and 111, more than 3000 synthesized images have been added, and in the other three
classes, around 1000. For four of these five classes, the classifier performance in
terms of the F1-score has improved. Moreover, the overall performance of the
ResNet-50 classifier trained on the balanced dataset has been better than that of
the one trained on the original dataset. This supports the applicability of DCGAN
for generating synthesized Bangla handwritten character images.

Fig. 11 Some original images versus some DCGAN generated images

Fig. 12 Training size before versus training size after applying DCGAN

4.2 Overfitting Handling

Training and validation accuracy and loss are illustrated in Figs. 13, 14, and 15. On
both the original dataset and the outlier removed dataset, the ResNet-50 model
fits well for predicting handwritten characters, as illustrated in Figs. 13 and 14.
Fig. 15, in which the model is applied to the Ekush dataset after outliers are
removed and DCGAN is used, shows one exception. Apart from the validation loss at
one epoch, the model predicts well, as the training and validation accuracy and
loss are close to each other. Additionally, in the learning
curve of each approach, the training and the validation loss are initially high and

Fig. 13 Accuracy and loss of ResNet-50 on the original dataset

Fig. 14 Accuracy and loss of ResNet-50 on outlier removed dataset

then gradually decrease together, indicating that the model is not overfitting.
Though there is a slight gap between the training and validation curves, it is too
small to be considered overfitting. However, the validation loss at epoch 88 has
increased to about 1.6, which is relatively high compared to other epochs. This
spike at epoch 88 may be due to noise in the dataset: for this particular batch of
images, the model is unable to predict the classes correctly. This type of spike
does not exist for the outlier-removed dataset or the original dataset, which
suggests that there is some noise in the DCGAN-generated images. It is also
possible that, because the training and validation sets are split randomly, this
particular batch contains the noisiest images.

4.3 Comparison with State-of-the-Art

Outlier elimination has not previously been reported on the Ekush dataset; to the
best of our knowledge, we are the first to experiment on the outlier-removed Ekush
dataset. The authors of [22] only applied

Fig. 15 Accuracy and loss of ResNet-50 on outlier removed and DCGAN applied dataset

DCGAN to enlarge the Ekush dataset, but no classification was performed on the
generated images. A comparative analysis of the current work with others that used
only the Ekush dataset is given in Table 8. Our ResNet-50 model on the original
dataset has achieved 97.63% accuracy on the test dataset (Table 4), which is higher
than all the listed works except EkushNet [4] and the ResNet-50 of [37]. Shibly et
al. [37] achieved the best test accuracy of 98.68% on the Ekush dataset, but that
has been obtained through an ensemble of ten CNN models. Their highest performance
with a single CNN model has been 97.81% using ResNet-50, which both of our proposed
methods exceed. Our work has also achieved better performance than an ensemble [11]
and deep CNN techniques [10, 20] applied to the same dataset. Although we have
applied outlier removal to only seven classes and DCGAN to only five classes, our
two approaches have outperformed the other works listed in Table 8. The improvement
is minor, however, as only some of the 122 classes of the Ekush dataset have been
considered in our study. Nevertheless, the results indicate that our proposed
outlier removal and DCGAN approaches are capable of improving the classification
performance.

Table 8 Performance comparison with the state of the art

Work references    Methods                                                          Number of classes    Test accuracy (%)
[4]                CNN                                                              122                  97.73
[10]               Deep CNN (Bengali handwritten alphabets of Ekush dataset)        50                   95.00
[37]               ResNet-50                                                        122                  97.81
[20]               Deep CNN                                                         122                  95.05
[11]               Stacked generalization ensemble method                           122                  96.72
Proposed method    Outlier Removal + ResNet-50                                      122                  97.95
Proposed method    DCGAN + ResNet-50                                                122                  97.92

5 Discussion

Eliminating outliers, applying DCGAN, and comparing the character recognition
performance of these two approaches is a new experiment on the Ekush dataset.
ResNet-50 is one of the most popular models and can achieve very good results on
the Ekush dataset, as in [37]. In addition to mitigating the vanishing gradient
problem, the ResNet-50 model can achieve low error rates. Moreover, by applying
skip connections, it can bypass layers that do not benefit the output [50]. The
results have shown that ResNet-50 performs better than the widely used CNN models.
Outlier detection is very beneficial if images are likely to be found in the wrong
classes. The result analysis has shown that the test accuracy as well as the
precision, recall, and F1-scores have improved after applying outlier detection to
seven classes of the Ekush dataset. There is also an improvement in the overall
classification result for the 122 classes. Moreover, outlier detection and
elimination on three classes (76, 97, 111) help our DCGAN to generate good-quality
images. However, certain classes chosen for this outlier detection approach have a
smaller volume of data, so training the model with this limited data has reduced
the precision. The outcome could be better if outlier detection were applied to
the whole dataset.
Using DCGAN-generated images as an augmentation technique on the outlier-removed
dataset has improved the test accuracy by 0.29 percentage points over the original
Ekush dataset. DCGAN has not only increased the size of the dataset but also
created variant images that add more information to the original dataset. In our
study, only five classes have been augmented by the DCGAN approach, and the number
of generated images is only enough to bring the training set of these classes close
to four thousand images. The whole dataset still has imbalanced classes besides
the chosen ones. Even with a small number of generated images, the study has shown
an improvement in the classification result. If we could generate more images for
these classes, the accuracy might improve further. However, as mentioned in [23],
we should be careful not to generate too many images, as this may degrade the
performance.

6 Conclusion

Handwritten character recognition is a widely known research problem. This study
adopts a two-fold approach on one of the largest Bangla handwritten datasets, namely
the Ekush dataset, with the ResNet-50 classifier. In the first approach, outliers
are detected and eliminated, which has achieved a test accuracy of 97.95%. In the
second approach, DCGAN is used to generate images for the dataset, which has
achieved an accuracy of 97.92%. However, the results could be improved further if
the adopted approaches were applied to the whole dataset. Because of the limited
computing resources, we

have taken only a few classes of the Ekush dataset for our experiments. Despite this,
the results obtained from the adopted approaches are superior to those of the
majority of the related works. In the future, other Bangla handwritten character
datasets may also be used to evaluate the efficacy of these methods. In addition,
other classifier models, such as VGG-16, Xception, DenseNet, and AlexNet, can also
be explored with these two proposed methods.

Data Availability Statement


All the codes and the dataset can be accessed at the following repositories.
Tanzina Akter Tani and Shibly, Moynuddin Ahmed (2022): Codes. figshare.
Journal contribution. https://doi.org/10.6084/m9.figshare.18933470.
Tanzina Akter Tani and Shibly, Moynuddin Ahmed (2022): Dataset. figshare.
Dataset. https://doi.org/10.6084/m9.figshare.18931760.
DCGAN generated images: http://doi.org/10.6084/m9.figshare.14754309.

References

1. Yuan, A., Bai, G., Jiao, L., & Liu, Y. (2012). Offline handwritten English character recognition
based on convolutional neural network. In Proceedings 10th IAPR International Workshop on
Document Analysis Systems, DAS 2012 (pp. 125–129). https://doi.org/10.1109/DAS.2012.61
2. Kimura, F., Wakabayashi, T., Tsuruoka, S., & Miyake, Y. (1997). Improvement of handwritten
Japanese character recognition using weighted direction code histogram. Pattern Recognition,
30(8), 1329–1337. https://doi.org/10.1016/S0031-3203(96)00153-7
3. Ciresan, D. C., Meier, U., & Schmidhuber, J. (2012). Transfer learning for Latin and Chinese
characters with deep neural networks. In Proceedings of the international joint conference on
neural networks (pp. 1–6). https://doi.org/10.1109/IJCNN.2012.6252544
4. Azad Rabby, A. K. M. S., Haque, S., Abujar, S., & Hossain, S. A. (2018). Ekushnet: Using
convolutional neural network for Bangla handwritten recognition. Procedia Computer Science,
143, 603–610. https://doi.org/10.1016/j.procs.2018.10.437
5. Ahmed, S., et al. (2019). Hand sign to bangla speech: A deep learning in vision based system
for recognizing hand sign digits and generating bangla speech.
https://doi.org/10.2139/ssrn.3358187
6. Manisha, N., Sreenivasa, E., & Krishna, Y. (2016). Role of offline handwritten character
recognition system in various applications. International Journal of Computer Applications.
https://doi.org/10.5120/ijca2016908349
7. Rahman, Md. M., Akhand, M. A. H., Islam, S., Chandra Shill, P., & Hafizur Rahman, M. M.
(2015). Bangla handwritten character recognition using convolutional neural network.
International Journal of Image, Graphics and Signal Processing, 7(8), 42–49.
https://doi.org/10.5815/ijigsp.2015.08.05
8. Ghosh, T., Abedin, M. H. Z., Al Banna, H., Mumenin, N., & Abu Yousuf, M. (2021).
Performance analysis of state of the art convolutional neural network architectures in Bangla
handwritten character recognition. Pattern Recognition and Image Analysis, 31(1), 60–71.
https://doi.org/10.1134/S1054661821010089
9. Chowdhury, R. R., Hossain, M. S., ul Islam, R., Andersson, K., & Hossain, S. (2019). Bangla
handwritten character recognition using convolutional neural network with data augmentation.
In 2019 Joint 8th international conference on informatics, electronics & vision (ICIEV) and
2019 3rd international conference on imaging, vision & pattern recognition (icIVPR)
(pp. 318–323). https://doi.org/10.1109/ICIEV.2019.8858545

10. Ahmed, S., Tabsun, F., Reyadh, A. S., Shaafi, A. I., & Shah, F. M. (2019). Bengali handwritten
alphabet recognition using deep convolutional neural network. In 5th International conference
on computer, communication, chemical, materials and electronic engineering, IC4ME2 2019.
https://doi.org/10.1109/IC4ME247184.2019.9036572
11. Shibly, M. M. A., Tisha, T. A., & Ripon, S. H. (2021). Stacked generalization ensemble method
to classify Bangla handwritten character. In Proceedings of international conference on
sustainable expert systems. Lecture Notes in Networks and Systems 176.
https://doi.org/10.1007/978-981-33-4355-9_46
12. Mamun, M. R., Al Nazi, Z., & Yusuf, M. S. (2018). Bangla handwritten digit recognition
approach with an ensemble of deep residual networks. In International conference on bangla
speech and language processing, ICBSLP 2018 (pp. 21–22).
https://doi.org/10.1109/ICBSLP.2018.8554674
13. Goodfellow, I., et al. (2014). Generative adversarial nets. Advances in Neural Information
Processing Systems, 27.
14. Basu, S., Das, N., Sarkar, R., Kundu, M., Nasipuri, M., & Basu, D. K. (2009). A hierarchical
approach to recognition of handwritten Bangla characters. Pattern Recognition, 42(7),
1467–1484. https://doi.org/10.1016/j.patcog.2009.01.008
15. Bhowmik, T. K., Ghanty, P., Roy, A., & Parui, S. K. (2009). SVM-based hierarchical
architectures for handwritten Bangla character recognition. International Journal on Document
Analysis and Recognition, 12(2), 97–108. https://doi.org/10.1007/s10032-009-0084-x
16. Bhattacharya, U., Gupta, B. K., & Parui, S. K. (2007). Direction code based features for
recognition of online handwritten characters of Bangla. In Proceedings of the international
conference on document analysis and recognition, ICDAR, 2007.
https://doi.org/10.1109/ICDAR.2007.4378675
17. Chowdhury, R. R., Hossain, M. S., Ul Islam, R., Andersson, K., & Hossain, S. (2019). Bangla
handwritten character recognition using convolutional neural network with data augmentation.
In 2019 Joint 8th international conference on informatics, electronics and vision, ICIEV 2019
and 3rd international conference on imaging, vision and pattern recognition, icIVPR 2019 with
international conference on activity and behavior computing, ABC 2019 (pp. 318–323).
https://doi.org/10.1109/ICIEV.2019.8858545
18. Shopon, M., Mohammed, N., & Abedin, M. A. (2017). Bangla handwritten digit recognition
using autoencoder and deep convolutional neural network. In IWCI 2016-2016 International
Workshop on Computational Intelligence. https://doi.org/10.1109/IWCI.2016.7860340
19. Shopon, M., Mohammed, N., & Abedin, M. A. (2017). Image augmentation by blocky artifact
in deep convolutional neural network for handwritten digit recognition. In IEEE international
conference on imaging, vision and pattern recognition, icIVPR 2017 (pp. 1–6).
https://doi.org/10.1109/ICIVPR.2017.7890867
20. Mashrukh Zayed, M., Neyamul Kabir Utsha, S. M., & Waheed, S. (2021). Handwritten bangla
character recognition using deep convolutional neural network: Comprehensive analysis on
three complete datasets. Advances in Intelligent Systems and Computing. https://doi.org/10.
1007/978-981-33-4673-4_7
21. Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised representation learning with deep
convolutional generative adversarial networks. In 4th International conference on learning
representations, ICLR 2016-conference track proceedings.
22. Haque, S., Shahinoor, S. A., Rabby, A. K. M. S. A., Abujar, S., & Hossain, S. A. (2018).
OnkoGan: Bangla handwritten digit generation with deep convolutional generative adversarial
networks. In Recent Trends in image processing and pattern recognition, second international
conference, {RTIP2R} 2018, Solapur, India, 21–22 Dec 2018, Revised Selected Papers, Part
{III}, 2018, vol. 1037 (pp. 108–117). https://doi.org/10.1007/978-981-13-9187-3_10
23. Jha, G., & Cecotti, H. (2020). Data augmentation for handwritten digit recognition using
generative adversarial networks. Multimed Tools and Applications. https://doi.org/10.1007/s11
042-020-08883-w
24. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of
retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions
of Electrical Engineering, 44(1), 505–518. https://doi.org/10.1007/s40998-019-00213-7
25. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain
tumor classification. Applied Sciences, 10(14), 4915. https://doi.org/10.3390/app10144915
26. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural
network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy
Systems, 43(2), 1827–1833. https://doi.org/10.3233/JIFS-219283
27. Roy, S. S., et al. (2022). L2 regularized deep convolutional neural networks for fire detec-
tion. Journal of Intelligent & Fuzzy Systems, 43(2), 1799–1810. https://doi.org/10.3233/JIFS-
219281
28. Reddy, A. S. B., & Juliet, D. S. (2019). Transfer learning with ResNet-50 for malaria cell-image
classification. In International Conference on Communication and Signal Processing (ICCSP)
(pp. 945–949). https://doi.org/10.1109/ICCSP.2019.8697909
29. Rezende, E., Ruppert, G., Carvalho, T., Ramos, F., & de Geus, P. (2017). Malicious software
classification using transfer learning of ResNet-50 deep neural network. In Proceedings of
the 16th IEEE international conference on machine learning and applications, ICMLA 2017
(pp. 1011–1014). https://doi.org/10.1109/ICMLA.2017.00-19
30. Alif, M. A. R., Ahmed, S., & Hasan, M. A. (2017). Isolated Bangla handwritten character recog-
nition with convolutional neural network. In 2017 20th International conference of computer
and information technology (ICCIT) (pp. 1–6).
31. Alom, M. Z., Sidike, P., Hasan, M., Taha, T. M., & Asari, V. K. (2018). Handwritten Bangla char-
acter recognition using the state-of-the-art deep convolutional neural networks. Computational
Intelligence and Neuroscience. https://doi.org/10.1155/2018/6747098
32. Khan, M. M., Uddin, M. S., Parvez, M. Z., & Nahar, L. (2022). A squeeze and excitation
ResNeXt-based deep learning model for Bangla handwritten compound character recognition.
Journal of King Saud University Computer and Information Sciences, 34(6), 3356–3364. https:/
/doi.org/10.1016/j.jksuci.2021.01.021
33. Rabby, A. K. M. S. A., Haque, S., Islam, M. S., Abujar, S., & Hossain, S. A. (2019). Ekush:
A multipurpose and multitype comprehensive database for online off-line Bangla handwritten
characters. Communications in Computer and Information Science. https://doi.org/10.1007/
978-981-13-9187-3_14
34. Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., & Basu, D. K. (2012). CMATERdb1:
A database of unconstrained handwritten Bangla and Bangla-English mixed script document
image. International Journal on Document Analysis and Recognition. https://doi.org/10.1007/
s10032-011-0148-6
35. Biswas, M., et al. (2017). BanglaLekha-Isolated: A multi-purpose comprehensive dataset
of handwritten Bangla isolated characters. Data in Brief . https://doi.org/10.1016/j.dib.2017.
03.035
36. Alom, Z., Sidike, P., Taha, T. M., & Asari, V. K. (2017). Handwritten bangla digit recognition
using deep learning, p. 1712.
37. Shibly, M. M. A., Tisha, T. A., Tani, T. A., & Ripon, S. (2021). Convolutional neural network-
based ensemble methods to recognize Bangla handwritten character. PeerJ Computer Science,
7, 1–30. https://doi.org/10.7717/peerj-cs.565
38. Alom, M. Z., Sidike, P., Hasan, M., Taha, T. M., & Asari, V. K. (2017). Handwritten bangla
character recognition using the state-of-art deep convolutional neural networks, p.1712.
39. Sikder, M. F. (2020). Bangla handwritten digit recognition and generation. In: Proceedings of
international joint conference on computational intelligence (pp. 547–556).
40. Rahman, M. S. (2016). Towards optimal convolutional neural network parameters for
bengali handwritten numerals recognition. In 19th international conference on computer and
information technology (ICCIT) (pp. 431–436).
41. Nishat, Z. K., & Shopon, M. (2019). Synthetic class specific Bangla handwritten character
generation using conditional generative adversarial networks. In 2019 International conference
on bangla speech and language processing (ICBSLP 2019). https://doi.org/10.1109/ICBSLP
47725.2019.201475
42. Chaudhuri, B. B. (2006). A complete handwritten numeral database of Bangla-A major Indic
script. In 10th international workshop on frontiers of handwriting recognition (IWFHR), La
Baule, France.
43. Alam, S., Reasat, T., Doha, R. M., & Humayun, A. I. (2018). NumtaDB-assembled Bengali
handwritten digits, pp 1–4.
44. Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative neural
networks. AIChE Journal, 37(2), 233–243. https://doi.org/10.1002/aic.690370209
45. Bank, D., Koenigstein, N., & Giryes, R. (2020). Autoencoders. In Machine learning: Methods
and applications to brain disorders (pp. 193–208). https://doi.org/10.1016/B978-0-12-815739-
8.00011-0
46. Alqahtani, H., Kavakli-Thorne, M., & Kumar, G. (2021). Applications of generative adversarial
networks (GANs): An updated review. Archives of Computational Methods in Engineering,
28(2), 525–552. https://doi.org/10.1007/s11831-019-09388-y
47. Haque, S., Shahinoor, S. A., Rabby, A. K. M. S. A., Abujar, S., & Hossain, S. A. (2019).
OnkoGan: Bangla handwritten digit generation with deep convolutional generative adversarial
networks. Communications in Computer and Information Science. https://doi.org/10.1007/978-
981-13-9187-3_10
48. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. Preprint at arXiv
arXiv:1412.6980.
49. Theckedath, D., & Sedamkar, R. R. (2020). Detecting affect states using VGG16, ResNet50 and
SE-ResNet50 networks. SN Computer Science. https://doi.org/10.1007/s42979-020-0114-9
50. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778).
https://doi.org/10.1109/CVPR.2016.90
Deep Learning-Based Approaches Using
Feature Selection Methods for Automatic
Diagnosis of COVID-19 Disease
from X-Ray Images

Burak Taşci

1 Introduction

The novel coronavirus (COVID-19) pandemic created worldwide chaos in a very short
time. As of July 2021, over 206 million official cases had been reported worldwide,
and the number of deaths due to COVID-19 had exceeded 4 million [1]. Many countries
have developed policies to cope with the pandemic and minimize its effects. In
particular, Turkey is among the few countries that set an example to the world
through early measures and social isolation rules. Early action is vital for COVID-19
and similar pandemics: if cases are detected early, the patients can be isolated so
that uninfected individuals remain safe. Science and technology contribute greatly
to the precautionary policies implemented for this purpose. One of the most important
contributions is predicting how the pandemic will evolve. Two main approaches appear
in this context. The first comprises statistical approaches and mathematical models.
The second comprises artificial intelligence-based approaches, which have received
more attention in recent years.
In the literature, there are various approaches to disease detection from
biomedical images based on machine learning and deep learning methods [2–8].
Javaheri et al. [9] tried to detect COVID-19, community-acquired pneumonia (CAP),
and other diseases from 89,145 images collected from 5 different hospitals using
BCDU-Net (U-Net). They reported 91.66% accuracy, 87.5% sensitivity, 95% AUC, and
94% specificity. Rehmen et al. [10] used CT and X-ray images of 200 COVID19(+),
200 healthy, 200 bacterial pneumonia, and 200 viral pneumonia cases in their study.
Using the ResNet101 transfer learning method,
the reported results were 98.75%, 97.5%, 96.43%, and 100% accuracy, sensitivity,

B. Taşci (B)
Fırat University Vocational School of Technical Sciences Elazığ, Elazığ, Turkey
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis,
Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_2
precision, and specificity, respectively. JavadiMoghaddam et al. [11] proposed a deep
learning model called Wavelet CNN-4, which consists of a wavelet layer, four
convolution layers, and a Squeeze-and-Excitation block in the coupling layer. They
compared the proposed model with pre-trained models such as VGG11, ResNet18, ResNet50,
and Inception-v3. The proposed model achieved 99.03% accuracy. Chen et al. [12]
tried to detect COVID-19 and other diseases from 35,355 images using U-Net++.
The obtained results were 98.85% accuracy, 94.34% sensitivity, 99.16% specificity,
88.37% precision, and 99.4% AUC. Wu et al. [13] used CT images of 368 COVID19(+)
cases and 127 cases of other diseases in their study. Using the ResNet50 transfer
learning method, the reported results were 76% accuracy, 81.1% sensitivity, 61.5%
AUC, and 81.9% specificity. Mobiny et al. [14] used CT images consisting of 349
COVID19(+) and 397 COVID19(-) images in their study. A GAN, rescaling, and cropping
were applied as preprocessing, and the resulting images were used with the DECAPS
and DECAPS + Peekaboo architectures. The DECAPS + Peekaboo method reached 87.6%
accuracy, 84.3% sensitivity, 87.1% F1-score, and 85.2% specificity. Balaha et al.
[15] proposed a hybrid learning and optimization approach based on pre-trained
models to detect COVID-19. The Harris Hawks Optimization (HHO) algorithm was used
to optimize the hyperparameters. They performed data augmentation by combining
three publicly available data sets. The Weighted Summation Method (WSM) was used
as an evaluation metric to compare combinations of models, and the best accuracy,
99.33%, was obtained with VGG19. Li et al.
[16] proposed an automated deep learning framework, COVNet, to identify COVID-19
accurately from chest CT. A chest CT data set of 4356 images was reported to be
used to build the models. In distinguishing COVID-19 patients from other pneumonia
patients, the model obtained a sensitivity of 87% and an Area Under the Curve
(AUC) value of 0.95. He et al. [17] used CT images consisting of 349 COVID19(+)
and 397 COVID19(-) images in their study. The Self-Trans method was used as
preprocessing. Using the DenseNet-169 transfer learning method, the results reached
were 85% F1-score, 94% AUC, and 86% accuracy. Ahamed et al. [18] used data sets
consisting of chest X-ray and CT images to train their proposed model. Images were
preprocessed and enlarged before entering the proposed ResNet50V2 model. Extra
layers were added to the base model, together with regularization and fine-tuning.
They classified the images into two-class, three-class, and four-class categories,
both with and without preprocessing. The model achieved 99.01% and 83.6% accuracy
for the three-class category with and without preprocessing, respectively. Pathak
et al. [19] used 413 COVID19(+) and 439 normal or pneumonia CT images in their study.
ResNet50 feature extraction was used as preprocessing, and a CNN was used for
classification. With the CNN, the results reached were 93.01% accuracy, 91.45%
sensitivity, 94.77% specificity, and 95.18% precision. Shi et al. [20] applied a
machine learning algorithm, Random Forest (RF), to screen for COVID-19. CT images
of 2685 patients were used to evaluate the models in that study. Evaluated with
fivefold cross-validation, the model achieved accuracy, sensitivity, and
specificity of 87.9%, 90.7%, and 83.3%, respectively.
The following are the primary contributions of this study:

• The suggested model utilizes the classification capabilities of features derived
from the pre-trained deep architectures AlexNet and ResNet101.
• The study examines the Chi-square, NCA, mRMR, and ReliefF feature selection
algorithms in order to reduce the number of features obtained from pre-trained
deep neural networks and to identify the most effective deep features. The AlexNet
and ResNet101 features that give the highest results are combined, and the mRMR
feature selection algorithm is applied to the combined features. In the experimental
studies, a highly successful diagnostic model was obtained by using these selected
and effective features for chest X-ray image classification.
• Deep features were obtained using pre-trained CNN networks, and those features
were used to optimize the parameters of the best SVM classifier. This method
achieved the highest performance, an accuracy of 98.21%, in the classification of
chest X-ray images.
The remainder of the chapter is organized as follows: the material and method are
presented in the second section, the experimental studies and results in the third
section, and the discussion in the fourth section.

2 Material and Method

2.1 Methodology

Using previously trained network models, an efficient method for detecting
COVID-19 with a high degree of accuracy is proposed in this study. Figure 1
depicts the workflow of the approach. Preprocessing techniques are first applied
to the X-ray images; their primary purpose is to improve classification
performance. To highlight the point regions in the X-ray images and reduce the
overall number of gray tones, the image gradient was computed with the Sobel
operator. In the second step, marker-controlled watershed segmentation (MCWS) was
used to segment the points in the gradient images. In the last step, feature
extraction was performed with 13 pre-trained models. The extracted features were
reduced in number using the Chi-square, NCA, mRMR, and ReliefF feature selection
methods. The selected features obtained from the pre-trained networks were given
to 13 different classifiers. The highest performance was observed with AlexNet and
ResNet101, so these two architectures were reused for feature extraction. The FC8
layer of the AlexNet model and the FC1000 layer of the ResNet101 model each provide
1000 features. In the proposed method, feature extraction was carried out in both
the training and testing phases. In total, (1000 + 1000) 2000 features were reduced to 200

Fig. 1 Framework of the proposed approach. Panels: dataset (Covid-19, Pneumonia, Normal); pre-processing (gradient, watershed, resize); 1000 features extracted from each of 13 pre-trained networks (AlexNet fc8, EfficientNet B0 MatMul, GoogleNet loss3-classifier, Inception ResNet-v2 predictions, Inception v3 predictions, ResNet-18 fc1000, ResNet-50 fc1000, DenseNet-201 fc1000, VGG16 fc8, VGG19 fc8, MobileNetv2 Logits, NASNet-Large predictions, ResNet-101 fc1000), each followed by ReliefF, NCA, mRMR, and Chi-square feature selection; combined features; mRMR feature selection algorithm; SVM.


features by applying the mRMR feature selection method to the combined feature set.
In the last step, the selected features were classified again with 13 different
classifiers. The highest performance was obtained with SVM.
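
The fusion step described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `fuse_and_select` and the `ranking` argument (a list of feature indices ordered from most to least valuable, as a ranking method such as mRMR would produce) are hypothetical and are not taken from the study.

```python
def fuse_and_select(features_a, features_b, ranking, n_keep=200):
    """Concatenate two per-image feature lists and keep the top-ranked columns.

    features_a, features_b: one feature vector per image (e.g. 1000 values
    each from two networks). ranking: feature indices into the fused vector,
    best first (hypothetical input, e.g. produced by a ranking method).
    """
    # Fuse: 1000 + 1000 values per image become one 2000-value vector.
    fused = [list(a) + list(b) for a, b in zip(features_a, features_b)]
    # Keep only the n_keep best-ranked feature columns.
    keep = ranking[:n_keep]
    return [[row[idx] for idx in keep] for row in fused]
```

The reduced vectors returned here would then be passed to a classifier; the classifier itself is outside the scope of this sketch.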

2.1.1 Preprocessing

The gradient method is applied to the input images. Gradient magnitudes and
directions are calculated from the directional gradients.
The watershed method is usually applied to the gradient of the image. Using the
8 neighboring points around each point in the image, the steepest and roughest
directions in the image are detected [21]. Points with a minimum height in the image
are marked with individual identifiers. Using the gradient information in the image,
the descending regions are followed at certain rates. The watershed method associates
every pixel with its respective minimum point [22].
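
The gradient step can be illustrated with a short sketch. It assumes the standard 3 × 3 Sobel kernels and a grayscale image stored as a list of rows; leaving the border pixels at zero is an illustrative choice, and none of the names below come from the study. The watershed step itself is not shown.

```python
import math

# Standard 3x3 Sobel kernels for the horizontal (x) and vertical (y) derivatives.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_gradient(image):
    """Return (magnitude, direction) maps for a 2-D grayscale image.

    Border pixels stay at zero because the 3x3 window does not fit there
    (an assumption made only to keep the sketch short).
    """
    rows, cols = len(image), len(image[0])
    magnitude = [[0.0] * cols for _ in range(rows)]
    direction = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    pixel = image[r + dr][c + dc]
                    gx += SOBEL_X[dr + 1][dc + 1] * pixel
                    gy += SOBEL_Y[dr + 1][dc + 1] * pixel
            magnitude[r][c] = math.hypot(gx, gy)   # gradient magnitude
            direction[r][c] = math.atan2(gy, gx)   # gradient direction in radians
    return magnitude, direction
```

A watershed segmentation would then treat the magnitude map as a height surface, as described in the paragraph above.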

2.1.2 Feature Selection Algorithms

Feature selection, in short, builds a feature vector that is smaller and more useful
than the original feature vector while remaining equivalent to it, by choosing a
subset of the class-related features obtained from the deep learning models.

Neighborhood Component Analysis (NCA)

NCA is a feature weighting method that can be used to select the optimum subset of
features by maximizing an objective function that evaluates classification accuracy
over the training data. To obtain the weight vector $w$ for the features of the
samples $x_i$, the approach optimizes a nearest-neighbor classifier [23]. Within
the NCA framework, a reference point $x_j$ is chosen for each sample $x_i$. The
closer the two samples are, the higher the probability that $x_j$ is selected as
the reference point for $x_i$. Closeness is measured with the weighted distance
$D_w$, which is given in Eq. 1.

$$D_w(x_i, x_j) = \sum_{m=1}^{r} w_m^2 \, |x_{im} - x_{jm}| \tag{1}$$

$w_m$ is the weight assigned to the $m$th feature. The relationship between the
probability $P_{ij}$ and the weighted distance $D_w$ is established with a kernel
function $k$ that returns large values for small $D_w$. $P_{ij}$ is defined by the
following equation:

$$P_{ij} = \frac{k(D_w(x_i, x_j))}{\sum_{l=1,\, l \neq i}^{n} k(D_w(x_i, x_l))} \tag{2}$$

with $P_{ii} = 0$. The kernel function is defined as $k(z) = \exp(-z/\sigma)$. The
parameter $\sigma$ is the kernel width, and it affects the probability that sample
$x_j$ is selected as the reference point. The probability that $x_i$ is classified
correctly is given in Eq. 3, where $Y_{ij}$ takes the value 1 if $x_i$ and $x_j$
have the same class label and 0 otherwise.

$$P_i = \sum_{j=1,\, j \neq i}^{n} P_{ij} Y_{ij} \tag{3}$$
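
To make Eqs. 1 to 3 concrete, the following sketch evaluates them for a fixed weight vector. It is a hedged illustration only: the function names are hypothetical, the kernel is the exponential kernel stated above, and the optimization of the weights that NCA performs is deliberately left out.

```python
import math


def weighted_distance(x_i, x_j, w):
    """Eq. 1: D_w(x_i, x_j) = sum over m of w_m^2 * |x_im - x_jm|."""
    return sum(w_m ** 2 * abs(a - b) for w_m, a, b in zip(w, x_i, x_j))


def reference_probabilities(X, w, sigma=1.0):
    """Eq. 2: P[i][j] is the probability that x_j is the reference of x_i."""
    n = len(X)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Kernel k(z) = exp(-z / sigma); the diagonal is excluded (P_ii = 0).
        kernel = [
            0.0 if j == i
            else math.exp(-weighted_distance(X[i], X[j], w) / sigma)
            for j in range(n)
        ]
        total = sum(kernel)
        for j in range(n):
            P[i][j] = kernel[j] / total
    return P


def correct_classification_probability(P, y):
    """Eq. 3: P_i = sum over j != i of P_ij * Y_ij, Y_ij = 1 if y_i == y_j."""
    n = len(y)
    return [
        sum(P[i][j] for j in range(n) if j != i and y[i] == y[j])
        for i in range(n)
    ]
```

With all weights set to zero every distance vanishes, so each sample picks any other sample with equal probability; informative weights concentrate the probability on same-class neighbors.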

ReliefF

One of the best-known feature selection approaches is the Relief algorithm, which
can produce accurate and useful estimates of feature quality. It does so by
assigning weights to the features. If a feature is useful, the nearest samples of
the same class are expected to be closer to one another along that feature than
the nearest samples of any other class [24]. The feature weights are determined by
solving a convex optimization problem. However, the Relief algorithm can only
handle two-class problems and cannot process incomplete data. The ReliefF method,
an enhanced version of the Relief algorithm, was proposed to solve these problems
as well as additional difficulties; it is more robust and can cope with noisy and
incomplete data. ReliefF works as follows. First, a sample Ri is selected at
random. Then the k nearest neighbors from the same class, called Hj, and the k
nearest neighbors from each of the other classes, called Mj(C), are selected.
Depending on the values of Ri, Hj, and Mj(C), the weight w[A] is updated for every
feature A. Feature weights range from −1 to +1, and the largest positive values
indicate the most important features. This process is repeated a number of times
determined by the user. The diff function calculates the difference, that is, the
distance, between two samples along a feature. Its calculation depends on whether
the feature is nominal or numeric. Let I1 and I2 be samples and A be a feature;
diff is calculated as in Eq. 4 for nominal features and as in Eq. 5 for numeric
features. Using k nearest neighbors increases the robustness of the algorithm
against noisy data. The value of k can be set by the user, but if k is chosen as 1,
the algorithm becomes sensitive to noisy data. In many studies k was chosen as 10,
although trying different values of k can be more useful for examining the
importance levels of the features. In general, choosing k too small leads to
similarly poor results.
$$\mathrm{diff}(A, I_1, I_2) = \begin{cases} 0, & I_1 = I_2 \\ 1, & I_1 \neq I_2 \end{cases} \tag{4}$$

$$\mathrm{diff}(A, I_1, I_2) = \frac{|I_1 - I_2|}{\max(A) - \min(A)} \tag{5}$$
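
The diff functions of Eqs. 4 and 5 and the weight update described above can be sketched as follows. Several assumptions are made for illustration: all features are numeric, every sample is visited once instead of sampling at random (so the result is deterministic), misses are weighted by class priors as in the commonly described ReliefF formulation, and all function names are hypothetical.

```python
def diff_nominal(a, b):
    """Eq. 4: 0 if the two nominal values match, 1 otherwise."""
    return 0.0 if a == b else 1.0


def diff_numeric(a, b, lo, hi):
    """Eq. 5: absolute difference scaled by the value range of the feature."""
    return abs(a - b) / (hi - lo) if hi > lo else 0.0


def relieff_weights(X, y, k=10):
    """Simplified ReliefF for numeric features; returns one weight per feature."""
    y = list(y)
    n, n_features = len(X), len(X[0])
    lo = [min(row[a] for row in X) for a in range(n_features)]
    hi = [max(row[a] for row in X) for a in range(n_features)]
    prior = {c: y.count(c) / n for c in set(y)}

    def diffs(i, j):
        return [diff_numeric(X[i][a], X[j][a], lo[a], hi[a])
                for a in range(n_features)]

    def nearest(i, candidates):
        # k nearest candidates under the summed per-feature diff.
        return sorted(candidates, key=lambda j: sum(diffs(i, j)))[:k]

    w = [0.0] * n_features
    for i in range(n):  # assumption: one deterministic pass over all samples
        hits = nearest(i, [j for j in range(n) if j != i and y[j] == y[i]])
        for j in hits:  # nearby same-class samples should look similar
            for a, d in enumerate(diffs(i, j)):
                w[a] -= d / (n * len(hits))
        for c in prior:
            if c == y[i]:
                continue
            misses = nearest(i, [j for j in range(n) if y[j] == c])
            scale = prior[c] / (1.0 - prior[y[i]])
            for j in misses:  # nearby other-class samples should look different
                for a, d in enumerate(diffs(i, j)):
                    w[a] += scale * d / (n * len(misses))
    return w
```

A feature that separates the classes ends up with a positive weight, while a feature unrelated to the class labels drifts toward zero or below.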

Chi Square Test

The chi-square test is a univariate filter method. It works on categorical
variables and detects the relationships and dependencies between them. The
chi-square test has two steps. In the first step, the chi-square statistic of the
observed values is calculated with respect to the expected values. In the second
step, the obtained statistic is compared with a chosen threshold value and a
decision is made accordingly. The features are scored according to the chi-square
statistic, and the features with the best scores are used. The chi-square statistic
is obtained using Eqs. 6–8. In Eq. 6, I is the number of intervals and J is the
number of classes. Nij is the number of samples in the ith interval that belong to
the jth class. Eij, given in Eq. 7, is the expected number of samples in the ith
interval and the jth class when the two variables are independent. Finally, d,
given in Eq. 8, is the degrees of freedom of the chi-square distribution used for
the test statistic [25, 26].

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(N_{ij} - E_{ij})^2}{E_{ij}} \tag{6}$$

$$E_{ij} = \frac{N_i N_j}{N} \tag{7}$$

$$d = (I - 1)(J - 1) \tag{8}$$
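
As a worked illustration of Eqs. 6 to 8, the sketch below computes the statistic and the degrees of freedom from a table of counts. It assumes that the feature has already been discretized into intervals and that every expected count is positive; the function name is hypothetical.

```python
def chi_square(table):
    """Return (chi2, dof) for an I x J table of counts N_ij.

    Rows are feature intervals, columns are classes.
    """
    row_totals = [sum(row) for row in table]          # N_i
    col_totals = [sum(col) for col in zip(*table)]    # N_j
    n = sum(row_totals)                               # N
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n    # Eq. 7
            chi2 += (observed - expected) ** 2 / expected   # Eq. 6
    dof = (len(table) - 1) * (len(table[0]) - 1)            # Eq. 8
    return chi2, dof
```

For the table [[10, 20], [20, 10]] every expected count is 15, so the statistic is 4 × 25 / 15 ≈ 6.67 with one degree of freedom; a feature whose intervals are independent of the class gives a statistic of 0.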

Minimum Redundancy Maximum Relevance Feature Selection (mRMR)

The mRMR algorithm is an entropy-based feature selection algorithm proposed by
Peng et al. in 2005 [27]. It is a filter algorithm that selects the features most
strongly associated with the class labels of the data to be classified, while
keeping the selected features minimally redundant with one another. The algorithm
uses mutual information to measure the dependence between two features or between
a feature and the class labels [28]. In essence, mRMR ranks all features from the
most valuable to the least valuable and leaves it to the user to decide how many
features to use for the classification problem. It should therefore be regarded as
a feature ranking algorithm rather than a feature selection algorithm.
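
The ranking behavior described above can be sketched with a greedy procedure. The sketch assumes discrete features and uses the difference form of the criterion (relevance minus mean redundancy); this is one common variant and is not necessarily the exact formulation used in the study. The function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[va] * count_b[vb]))
        for (va, vb), c in count_ab.items()
    )


def mrmr_rank(columns, labels):
    """Rank feature columns from most to least valuable.

    Each step picks the feature with the highest relevance to the labels
    minus its mean mutual information with the already selected features.
    """
    relevance = [mutual_information(col, labels) for col in columns]
    remaining = list(range(len(columns)))
    selected = []
    while remaining:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because the function returns a full ranking, the user still chooses how many of the leading features to keep, for example the first 200.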
2.1.3 Pre-trained Networks

Transfer learning is defined as a learning structure in which the features obtained
by deep learning models developed for specific purposes are used as inputs to
other machine learning methods. In this study, the deep learning models AlexNet,
EfficientNet B0, GoogleNet, Inception ResNet-v2, Inception-v3, DenseNet201,
NASNet-Large, MobileNetV2, ResNet18, ResNet50, ResNet101, VGG16, and VGG19 were
used. The number of layers, depth, number of parameters, and image input size of
these networks are given in Table 1.

AlexNet

Deep learning pioneers Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton
developed the network that became known as AlexNet [29]. This deep convolutional
neural network has a total of 25 layers: 5 convolution layers, 3 max-pooling
layers, 2 dropout layers, 3 fully connected layers, 7 ReLU layers, 2 normalization
layers, a softmax layer, an input layer, and a classification (output) layer. The
image fed to the input layer of AlexNet is 227 × 227 × 3 in size. Classification
takes place in the final layer, which outputs the class predicted for the input
image.

Table 1 Deep learning networks used in the study

| Pre-trained model | Layers | Depth | Number of parameters (million) | Image input size |
|---|---|---|---|---|
| AlexNet | 25 | 8 | 61.0 | 227 × 227 |
| EfficientNet B0 | 290 | 82 | 5.3 | 224 × 224 |
| GoogleNet | 144 | 22 | 7.0 | 224 × 224 |
| Inception ResNet-v2 | 825 | 164 | 55.9 | 299 × 299 |
| Inception v3 | 316 | 48 | 23.9 | 299 × 299 |
| DenseNet201 | 708 | 201 | 20.0 | 224 × 224 |
| NASNet-Large | 1243 | - | 88.9 | 331 × 331 |
| MobileNetV2 | 154 | 53 | 3.5 | 224 × 224 |
| ResNet-18 | 71 | 18 | 11.7 | 224 × 224 |
| ResNet-50 | 177 | 50 | 25.6 | 224 × 224 |
| ResNet-101 | 347 | 101 | 44.6 | 224 × 224 |
| VGG16 | 41 | 16 | 138.0 | 224 × 224 |
| VGG19 | 47 | 19 | 144.0 | 224 × 224 |
DenseNet201

In a DenseNet (Densely Connected Convolutional Network), each layer is connected
to every subsequent layer in a feed-forward fashion. Each layer takes as input the
feature maps of all the layers that came before it and passes its own feature maps
on to all the layers that come after it [30]. DenseNet topologies have the
advantage of strengthening feature propagation and reducing the number of
parameters by permitting feature reuse [31]. For example, the DenseNet-121 design
is composed of four dense blocks, three transition layers, and 121 layers in total
(117 convolution, 3 transition, and 1 classification); the DenseNet201 used in
this study follows the same design with a depth of 201.
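
The dense connectivity pattern can be sketched in a few lines. This is a toy illustration, not a real network: feature maps are flat lists of numbers, a "layer" is any function that maps its concatenated inputs to a fixed number of new values, and the names `dense_block` and `make_layer` are hypothetical.

```python
def dense_block(x, layers):
    """Run a dense block: each layer sees all earlier outputs concatenated."""
    features = [x]
    for layer in layers:
        # Concatenate the block input and every earlier layer output.
        concatenated = [value for f in features for value in f]
        features.append(layer(concatenated))
    # The block output is the concatenation of everything produced so far.
    return [value for f in features for value in f]


def make_layer(growth_rate, seen_sizes):
    """Build a toy layer that emits `growth_rate` values (the mean of its input)."""
    def layer(inputs):
        seen_sizes.append(len(inputs))  # record the input width for inspection
        return [sum(inputs) / len(inputs)] * growth_rate
    return layer
```

With 4 input values and three toy layers that each add 2 values, the layers receive 4, 6, and 8 inputs and the block outputs 10 values, which shows how the width grows with each layer while earlier features are reused.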

MobileNetV2

MobileNet designs are built on a modular architecture that allows for the
development of both shallow and deep neural networks. The architecture has two
basic global hyperparameters that trade off latency against accuracy. Depending on
the constraints of the problem, these hyperparameters allow the model builder to
select an appropriately sized model for the application.

Nasnet-Large

NASNet-Large is a 1243-layer convolutional neural network trained on more than
one million images from the ImageNet collection. The network can classify images
into one thousand object categories, including animals, balloons, and flowers. As a
result, the network has learned rich feature representations for a wide range of
image types. The network expects input images of 331 × 331 pixels.

EfficientNet B0

EfficientNet, a CNN family developed by Google in 2019, provides significant
improvements in accuracy and efficiency (performance). The scaling approach
presented in that study is also applicable to other CNN models. EfficientNet-B0 is
the baseline network, developed using AutoML MNAS [32]. EfficientNet-B0 consists of
290 layers. The image fed to the input layer of EfficientNet B0 is 224 × 224 × 3
in size, as listed in Table 1.

GoogleNet

GoogLeNet took first place in the ImageNet 2014 image classification competition
with a success rate of 93.33%. The GoogLeNet architecture consists of 144 layers
and showed that, given large data sets, classification performance can be improved
by increasing the number of layers. The image fed to the input layer of GoogLeNet
is 224 × 224 × 3 in size. To avoid the computational overload caused by large
images, it applies filters of several sizes, such as 1 × 1, 3 × 3, and 5 × 5, at
the same stage. Unlike other architectures, it processes these filters in parallel
rather than stacking the layers it creates, because stacking was considered to
bring drawbacks such as increased memory use and longer computation time [25].

Inception ResNet-V2

The Inception-ResNet-V2 architecture combines residual connections with a new
version of the Inception architecture. The network makes efficient use of residual
connections [33], and its feature extraction performance is quite good. In this
architecture, residual units are added to each Inception module to prevent the
degradation of the network gradient that usually accompanies an increase in the
number of layers. The Inception ResNet-v2 architecture consists of 825 layers. The
image fed to the input layer of Inception ResNet-v2 is 299 × 299 × 3 in size.

Inception V3

The Inception architecture emerged with the GoogLeNet model. The GoogLeNet model,
proposed by Szegedy et al. (2015), tries to keep the computational cost constant
while increasing the depth and width of the network. For this purpose, the model
uses the Inception concept, in which the outputs of convolution filters of
different sizes are combined [34]. The Inception-v3 architecture consists
of 316 layers. The image to be placed in the input layer of Inception-v3 measures
299 × 299 × 3.

ResNet-18

The ResNet-18 pre-trained model, which provides rich features, was trained on
more than one million images from the ImageNet dataset with an input size of
224 × 224. Although it has only 71 layers and a depth of 18, it has been reported
to give successful and faster results compared with some deeper models [35].

ResNet-50
The ResNet architecture differs from other architectures in its micro-architecture
module. It may be preferable to pass directly to a later layer, ignoring the
transformation applied by some intermediate layers. By allowing this in the ResNet
architecture, the performance rate was increased to higher levels.
The ResNet-50 architecture consists of a network of 177 layers. The depth of the
network is 50. In addition to the layers themselves, the architecture specifies
how the layers are connected to one another [36].

ResNet-101
The ResNet-101 structure has 347 layers and a depth of 101. The skip connection
(bypass) between layers in ResNet is referred to as a ResBlock. Even if nothing is
learned in a given layer, the ResBlock makes the model more robust by passing the
information from the previous layer on to the next layer. The ResBlock thereby
addresses the vanishing gradient problem. Gradient descent is used as the
optimization algorithm. The input layer dimensions of ResNet-101 are
224 × 224 × 3 [36].
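
The skip connection described above can be illustrated with a minimal sketch in plain Python. It is not the ResNet implementation: the names `dense_relu` and `residual_block`, the toy three-element input, and the weights are assumptions introduced only for this example. The sketch shows that a block whose layers have learned nothing (all-zero weights) still passes its input through unchanged.

```python
def dense_relu(x, weights, bias):
    """Toy stand-in for the learned layers F(x): a linear map followed by ReLU."""
    out = []
    for row, b in zip(weights, bias):
        s = sum(w * v for w, v in zip(row, x)) + b
        out.append(max(0.0, s))
    return out


def residual_block(x, weights, bias):
    """Return F(x) + x: the learned transformation plus the skip connection."""
    fx = dense_relu(x, weights, bias)
    return [f + v for f, v in zip(fx, x)]


x = [1.0, -2.0, 3.0]

# A layer that has learned nothing (all-zero weights) gives F(x) = 0,
# so the block output equals its input.
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3
identity_out = residual_block(x, zero_w, zero_b)

# A layer with non-zero weights adds a correction on top of the input.
half_w = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
learned_out = residual_block(x, half_w, zero_b)

print(identity_out)
print(learned_out)
```

Because the input is always added back, the gradient has a direct path through the addition, which is the usual explanation for why such blocks ease the training of very deep networks.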

VGG16
The VGG16 model consists of a total of 41 layers; 16 of them have learnable
weights, and the remainder include ReLU and pooling layers. The learnable layers
comprise thirteen convolutional and three fully connected layers. Similar to
AlexNet, max pooling layers follow the convolutional layers, and the VGG16 model
employs 3 × 3 filters with a stride of 1 pixel in all convolutional layers. Max
pooling is performed with a 2 × 2 filter and a stride of 2. To extract feature
vectors, the activations of the first and second fully connected layers (fc6, fc7)
were utilized. The fc6 and fc7 output vectors each contain 4096 features. Training
utilizes 224 × 224 RGB images [37].
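
The effect of the 3 × 3 convolutions and 2 × 2 pooling on the image size can be traced with the standard output-size formula, floor((n + 2p − f) / s) + 1. The sketch below is illustrative only. The padding of 1 pixel, the grouping of the thirteen convolutions into five blocks (2, 2, 3, 3, 3) and the 512 output channels of the last block are taken from the original VGG paper [37] rather than from this chapter, and the function names are hypothetical.

```python
def conv_output_size(n, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


# Number of 3 x 3 convolutions in each of the five VGG16 blocks
# (assumption from the VGG paper: thirteen convolutions in total).
VGG16_BLOCKS = [2, 2, 3, 3, 3]


def vgg16_feature_map_sizes(input_size=224):
    """Return the spatial size after each block's max pooling layer."""
    sizes = []
    n = input_size
    for convs in VGG16_BLOCKS:
        for _ in range(convs):
            # 3 x 3 filter, stride 1, padding 1: size is preserved
            n = conv_output_size(n, kernel=3, stride=1, padding=1)
        # 2 x 2 max pooling, stride 2: size is halved
        n = conv_output_size(n, kernel=2, stride=2, padding=0)
        sizes.append(n)
    return sizes


sizes = vgg16_feature_map_sizes(224)
# Length of the vector fed to fc6, assuming 512 channels in the last block
flattened = sizes[-1] * sizes[-1] * 512

print(sizes)
print(flattened)
```

Under these assumptions a 224 × 224 input is reduced to a 7 × 7 map before the first fully connected layer.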

VGG19
The Visual Geometry Group (VGG) at the University of Oxford developed the VGG19
network. It has 19 weight layers, 16 of which are convolutional and 3 of which are
fully connected, together with 5 max pooling layers and 1 softmax layer. The input
to this network is an image with dimensions (224, 224, 3). The network has
approximately 144 million trainable parameters. Filters of size 3 × 3 with a
stride of one pixel were employed so that the whole image is covered [37].
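
The figure of roughly 144 million parameters can be checked by counting weights and biases layer by layer. This is a worked sketch, not code from the study: the channel widths of the 16 convolutional layers and the sizes of the three fully connected layers are assumptions taken from the original VGG paper [37], with 1000 output classes as in ImageNet.

```python
# (input channels, output channels) of the 16 convolutional layers of VGG19,
# as described in the VGG paper (assumption, not stated in this chapter).
VGG19_CONV = (
    [(3, 64), (64, 64)]
    + [(64, 128), (128, 128)]
    + [(128, 256)] + [(256, 256)] * 3
    + [(256, 512)] + [(512, 512)] * 3
    + [(512, 512)] * 4
)

# (inputs, outputs) of the 3 fully connected layers; 7 * 7 * 512 is the
# flattened feature map for a 224 x 224 input.
VGG19_FC = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]


def conv_params(c_in, c_out, kernel=3):
    """Weights plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


conv_total = sum(conv_params(c_in, c_out) for c_in, c_out in VGG19_CONV)
fc_total = sum(fc_params(n_in, n_out) for n_in, n_out in VGG19_FC)
total = conv_total + fc_total

print(len(VGG19_CONV), conv_total, fc_total, total)
```

Most of the parameters sit in the fully connected layers, which is why features taken from those layers are high-dimensional and benefit from feature selection.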

2.1.4 Support Vector Machine

SVM is a machine learning model developed by Vapnik and Chervonenkis in 1995. It
is used mainly in classification, and also in clustering and regression problems.
In recent years it has been one of the most successful machine learning algorithms
for solving classification problems. The purpose of the SVM model is to find the
hyperplane that separates the classes of the target variable from each other
in the most appropriate way [38].

2.1.5 K-Nearest Neighbors(K-NN)

Although the k-NN classifier is a simple type of classifier, it is one of the
classifiers that gives good results. It is called “simple” because it does not
require a training step, a feature that distinguishes it from other classifiers:
the training data are used directly during classification. Given a test sample,
its k nearest neighbors in the training set are found and the number of neighbors
belonging to each class is counted. The sample is assigned to the class with the
largest number of neighbors [39]. The distance used by the k-NN classifier can be
computed with several formulas, given in Eqs. 9–11. In the Minkowski distance
equation, choosing k = 1 gives the Manhattan distance and choosing k = 2 gives
the Euclidean distance.

$$\text{Euclidean} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{9}$$

$$\text{Manhattan} = \sum_{i=1}^{n} |x_i - y_i| \tag{10}$$

$$\text{Minkowski} = \left( \sum_{i=1}^{n} |x_i - y_i|^k \right)^{1/k} \tag{11}$$
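
The distance measures and the voting rule can be sketched as follows. This is a minimal illustration in plain Python, not the classifier used in the study: the function names, the toy two-dimensional training points and the labels are assumptions. To avoid confusion with the exponent k of the Minkowski distance, the number of neighbors is called `n_neighbors` here.

```python
from collections import Counter


def minkowski(x, y, k):
    """Minkowski distance; k = 1 gives Manhattan, k = 2 gives Euclidean."""
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)


def knn_predict(train_x, train_y, sample, n_neighbors=3, k=2):
    """Assign the sample to the majority class among its nearest neighbors."""
    # No training step: distances to the stored training data are computed
    # directly at classification time.
    dists = sorted(
        (minkowski(point, sample, k), label)
        for point, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:n_neighbors])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups of points
train_x = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
train_y = ["A", "A", "A", "B", "B", "B"]

manhattan = minkowski((0, 0), (3, 4), k=1)
euclidean = minkowski((0, 0), (3, 4), k=2)
pred_near_a = knn_predict(train_x, train_y, (0.5, 0.5))
pred_near_b = knn_predict(train_x, train_y, (5.5, 5.5))

print(manhattan, euclidean, pred_near_a, pred_near_b)
```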

2.1.6 Decision Trees

Decision trees allow the rapid processing of data. They classify data according to
the values of their features. For this purpose, some features are presented to the
algorithm as inputs and some as outputs. The decision tree is then used to
determine which input values lead to a given result in the output feature. One of
the methods used to create a model is the ensemble bagged trees (EBT) method.
To increase the prediction accuracy of individual learning algorithms, ensemble
approaches combine several learners. They are a linear combination of different
models that produces better prediction outcomes without increasing
complexity significantly. Bagging and boosting are two of the
most widely used ensemble methods. While bagging reduces the error variance
of the base learners, boosting specifically reduces their bias [40, 41].
Fig. 2 COVID-X-Ray scan dataset sample images (panels: Covid-19, Pneumonia, Normal)

2.2 Dataset

The dataset consists of 1061 X-ray images labeled by radiologists. The dataset was
edited after being downloaded from the Kaggle website [42, 43]. The X-ray images
belong to three classes: COVID-19, Pneumonia and Normal. There are 361 COVID-19,
500 Pneumonia and 200 Normal chest X-ray images in the dataset. The COVID-19 cases
in the dataset consist of chest X-ray images of 200 male and 161 female patients.
The mean age of the patients is over 45. The images range in height from 143
to 1637 pixels (average 491 pixels) and in width from 76 to 1225 pixels (average
383 pixels). Figure 2 shows example X-ray scans of COVID-19, Normal and
Pneumonia patients in the dataset.

3 Performance Measurement Metrics

The success of machine learning classifiers was determined by the agreement
between the predicted class labels and the actual class values. Labeling data with
a positive true class value as positive was referred to as true positive (TP),
while labeling it as negative was referred to as false negative (FN); labeling
data with a negative true class value as negative was referred to as true negative
(TN), while labeling it as positive was referred to as false positive (FP). For
the suggested method, performance measurement metrics were computed from the TP,
TN, FP, and FN counts in the confusion matrix. Accuracy, sensitivity, specificity,
precision, and F-score were used as performance measures, computed with the
following equations.

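$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{13}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{14}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{15}$$

$$\text{F-score} = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{16}$$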
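
Equations 12–16 can be applied to a three-class confusion matrix by treating one class as positive and the other two as negative. The sketch below is a hedged illustration, not the study's code. The matrix is read from the ResNet50 + AlexNet panel of Fig. 3, and the placement of the off-diagonal cells (6 COVID-19 images predicted as Normal, 13 Normal images predicted as COVID-19) is an assumption based on that reading. Only the COVID-19 class is evaluated here; function names are hypothetical.

```python
def one_vs_rest_counts(cm, idx):
    """TP, TN, FP, FN for class idx; rows are true classes, columns predictions."""
    total = sum(sum(row) for row in cm)
    tp = cm[idx][idx]
    fn = sum(cm[idx]) - tp
    fp = sum(row[idx] for row in cm) - tp
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Eqs. 12-16, returned as fractions between 0 and 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Class order: COVID-19, Normal, Pneumonia (cell placement is an assumption)
cm = [
    [355, 6, 0],
    [13, 187, 0],
    [0, 0, 500],
]

covid = metrics(*one_vs_rest_counts(cm, 0))
covid_percent = {name: round(100 * value, 2) for name, value in covid.items()}
print(covid_percent)
```

With this reading of the matrix, the COVID-19 values agree with those reported for the combined network in Table 2.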

4 Experimental Studies

The MATLAB environment was used to obtain the experimental results in this study.
The experiments were run on an all-in-one computer with an i7 processor,
16 GB of RAM, and a 4 GB graphics card. The images in the data set were resized
to 224 × 224, 227 × 227, 299 × 299 and 331 × 331, and classification was
performed. In the study, the convolutional neural network models AlexNet,
EfficientNet B0, GoogleNet, Inception ResNet-v2, Inception-v3, DenseNet201,
MobileNetV2, NASNet-Large, ResNet18, ResNet50, ResNet101, VGG16 and VGG19 were
used. Chi-square, NCA, mRMR and ReliefF feature selection methods were used. A
total of 2000 features were extracted, 1000 from the FC8 layer of AlexNet
and 1000 from the FC1000 layer of Resnet101. These features were reduced to
200 features with the mRMR feature selection method. The 200 features were then
given to 13 different classifiers. In this study, it was observed that the
highest performance was obtained with SVM. Figure 3 shows the confusion matrices
of the configurations that reached the highest accuracy for each of the 13
pre-trained networks and for the combined network. The ResNet50 + AlexNet network
with the Cubic SVM classifier and mRMR feature selection had the best accuracy,
98.21%, and the Inception ResNet-v2 network with the Cubic SVM classifier and NCA
feature selection had the lowest accuracy among them, 95.00%.
Figure 4 shows the accuracy values of the pre-trained networks according
to the classifiers and feature selection methods.
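
The fusion and selection step can be sketched on toy data. The example below is a simplified stand-in and not the method used in the study: it concatenates two small feature sets and applies a greedy minimum-redundancy, maximum-relevance rule in which absolute Pearson correlation replaces the mutual information of the original mRMR criterion [27]. The function names, the six toy samples and the 2 + 2 features (in place of 1000 + 1000) are all assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def fuse(features_a, features_b):
    """Concatenate the per-sample feature vectors of two networks."""
    return [row_a + row_b for row_a, row_b in zip(features_a, features_b)]


def greedy_mrmr(samples, labels, n_select):
    """Greedy selection maximizing relevance minus mean redundancy."""
    n_features = len(samples[0])
    columns = [[row[j] for row in samples] for j in range(n_features)]
    relevance = [abs(pearson(col, labels)) for col in columns]
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < n_select:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                abs(pearson(columns[j], columns[s])) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy features (hypothetical): network A yields a useful feature and an exact
# rescaled copy of it; network B yields a complementary feature and noise.
labels = [0, 0, 0, 1, 1, 1]
net_a = [[1, 2], [2, 4], [3, 6], [5, 10], [6, 12], [7, 14]]
net_b = [[5, 3], [3, 1], [1, 2], [9, 2], [7, 3], [5, 1]]

fused = fuse(net_a, net_b)
selected = greedy_mrmr(fused, labels, n_select=2)
print(selected)
```

In this toy case the redundant copy is skipped in favor of the complementary feature from the second network, which is the behavior that motivates selecting features after fusion rather than per network.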
Fig. 3 Confusion matrices with the highest accuracy. Rows are the true class and
columns are the predicted class. In every panel all 500 Pneumonia images were
predicted as Pneumonia, and no COVID-19 or Normal image was predicted as Pneumonia.

Panel                                    COVID-19 →  COVID-19 →  Normal →   Normal →
                                         COVID-19    Normal      COVID-19   Normal
AlexNet-Cubic SVM                        350         11          27         173
DenseNet-Cubic SVM-NCA                   352         9           27         173
Efficient B0-Cubic SVM-mRMR              347         14          23         177
GoogleNet-Cubic SVM-NCA                  361         0           36         164
Inception ResNet-v2-Cubic SVM            340         21          32         168
Inception-v3-Cubic SVM-Chi2              352         9           32         168
MobileNetV2-Subspace KNN-NCA             345         16          25         175
NasNet Large-Cubic SVM-ReliefF           349         12          27         173
Resnet-18-Cubic SVM                      347         14          28         172
ResNet101-Cubic SVM-NCA                  345         16          15         185
ResNet50-Quadratic Discriminant-NCA      351         10          32         168
VGG16-Cubic SVM-mRMR                     342         19          19         181
VGG19-Cubic SVM-NCA                      359         2           44         156
ResNet50+AlexNet-Cubic SVM-mRMR          355         6           13         187


Fig. 4 Graphs of accuracy values of pre-trained networks according to classifiers and feature selections
(panels: AlexNet, DenseNet201, EfficientNet B0, GoogleNet, Inception ResNet-v2, Inception v3, MobileNetV2,
Nasnet-Large, Resnet-18, Resnet-50, Resnet-101, VGG16, VGG19; x-axis: No selection, Chi2, mRMR, NCA, ReliefF;
y-axis: accuracy (%); legend: Ensemble Subspace KNN, Quadratic Discriminant, Bilayered Neural Network,
Medium Gaussian SVM, Ensemble Bagged Trees, Weighted KNN, Narrow Neural Network, Cubic SVM,
Wide Neural Network, Quadratic SVM, Ensemble Boosted Trees, Fine Tree, Fine KNN)

For the AlexNet network, the Cubic SVM classifier had the highest accuracy
(96.42%) and the Medium Gaussian SVM classifier with mRMR feature selection had
the lowest (89.2%). For the DenseNet-201 network, the Cubic SVM classifier with
NCA feature selection had the highest accuracy (96.61%) and the Quadratic
Discriminant classifier with Chi2 feature selection had the lowest (89.6%). For
the EfficientNet B0 network, the Cubic SVM classifier with NCA feature selection
had the highest accuracy (96.51%) and the Fine Tree classifier with Chi2 feature
selection had the lowest (89.3%). For the GoogleNet network, the Cubic SVM
classifier with NCA feature selection had the highest accuracy (96.06%) and the
Quadratic SVM classifier with mRMR feature selection had the lowest (89.7%). For
the Inception ResNet-v2 network, the Cubic SVM classifier had the highest accuracy
(95.0%) and the Medium Gaussian SVM classifier with NCA feature selection had the
lowest (89.7%).
For the Inception v3 network, the Cubic SVM classifier with Chi2 feature selection
had the highest accuracy (96.14%) and the Bilayered Neural Network had the lowest
(89.2%). For the MobileNetV2 network, the Cubic SVM classifier with Chi2 feature
selection had the highest accuracy (96.14%) and the Quadratic Discriminant
classifier with ReliefF feature selection had the lowest (90.0%). For the
Nasnet-Large network, the Cubic SVM classifier with ReliefF feature selection had
the highest accuracy (96.32%) and the Medium Gaussian SVM classifier with ReliefF
feature selection had the lowest (89.7%). For the ResNet18 network, the Cubic SVM
classifier had the highest accuracy (96.04%) and the Quadratic Discriminant
classifier with ReliefF feature selection had the lowest (90.0%). For the ResNet50
network, the Cubic SVM classifier with NCA feature selection had the highest
accuracy (97.08%) and the Quadratic Discriminant classifier with NCA feature
selection had the lowest (90.1%). For the ResNet101 network, the Quadratic
Discriminant classifier with NCA feature selection had the highest accuracy
(96.04%) and the Fine Tree classifier with ReliefF feature selection had the
lowest (90.4%). For the VGG16 network, the Cubic SVM classifier with mRMR feature
selection had the highest accuracy (96.42%) and the Medium Gaussian SVM classifier
with NCA feature selection had the lowest (90.0%). For the VGG19 network, the
Quadratic Discriminant classifier with NCA feature selection had the highest
accuracy (95.66%) and the Medium Gaussian SVM classifier with NCA feature
selection had the lowest (89.3%).
Table 2 gives the sensitivity, specificity, precision and F-score results of the
classifiers used in the proposed method. For the Pneumonia class, the Accuracy,
Sensitivity, Specificity, Precision and F-Score metrics were all 100%. In the
COVID-19 class, for the Sensitivity metric, the GoogleNet network with the Cubic
SVM classifier and NCA feature selection had the best result (100%) and the
Inception Resnet-v2 network with the Cubic SVM classifier and NCA feature
selection had the worst (94.18%). For the Specificity metric, the ResNet50 +
AlexNet network with the Cubic SVM classifier and mRMR feature selection had the
best result (98.14%) and the VGG19 network with the Cubic SVM classifier and mRMR
feature selection had the worst (93.71%). For the Precision metric, the ResNet50 +
AlexNet network with the Cubic SVM classifier and mRMR feature selection had the
best result (96.47%) and the VGG19 network with the Cubic SVM classifier and mRMR
feature selection had the worst (89.08%). For the F-score metric, the ResNet50 +
AlexNet network with the Cubic SVM classifier and mRMR feature selection had the
best result (97.39%) and the Inception Resnet-v2 network with the Cubic SVM
classifier and NCA feature selection had the worst (92.77%).
In the Normal class, for the Sensitivity metric, the VGG19 network with the Cubic
SVM classifier and mRMR feature selection had the best result (99.77%) and the
GoogleNet network with the Cubic SVM classifier and NCA feature selection had the
worst (96.04%). For the Specificity metric, the Inception Resnet-v2 network with
the Cubic SVM classifier and NCA feature selection had the best result (98.95%)
and the VGG19 network with the Cubic SVM classifier and mRMR feature selection
had the worst (78.00%). For the Precision metric, the ResNet50 + AlexNet network
with the Cubic SVM classifier and mRMR feature selection had the best result
(98.50%) and the Inception Resnet-v2 network with the Cubic SVM classifier and NCA
feature selection had the worst (84.00%). For the F-score metric, the ResNet50 +
AlexNet network with the Cubic SVM classifier and mRMR feature selection had the
best result (98.90%) and the Inception-v3 network with the Cubic SVM classifier
and Chi2 feature selection had the worst (96.38%).

Table 2 Other performance metrics of classifiers

Model                                Class      Accuracy  Sensitivity  Specificity  Precision  F-Score
                                                (%)       (%)          (%)          (%)        (%)
AlexNet-Cubic SVM                    COVID-19   96.42     96.95        96.14        92.84      94.85
                                     Normal               98.72        86.50        96.92      97.81
                                     Pneumonia            100.00       100.00       100.00     100.00
DenseNet-201-Cubic SVM-NCA           COVID-19   96.61     97.51        96.14        92.88      95.14
                                     Normal               98.95        86.50        96.93      97.93
                                     Pneumonia            100.00       100.00       100.00     100.00
Efficient-B0-Cubic SVM-mRMR          COVID-19   96.51     96.12        96.71        93.78      94.94
                                     Normal               98.37        88.50        97.36      97.86
                                     Pneumonia            100.00       100.00       100.00     100.00
GoogleNet-Cubic SVM-NCA              COVID-19   96.06     100.00       94.86        90.93      95.25
                                     Normal               100.00       98.37        86.00      97.58
                                     Pneumonia            100.00       100.00       100.00     100.00
Inception Resnet-v2-Cubic SVM-NCA    COVID-19   95.00     94.18        95.43        91.40      92.77
                                     Normal               97.56        84.00        96.33      96.94
                                     Pneumonia            100.00       100.00       100.00     100.00
Inception-v3-Cubic SVM-Chi2          COVID-19   96.14     97.51        95.43        91.67      94.50
                                     Normal               96.14        98.95        84.00      96.38
                                     Pneumonia            100.00       100.00       100.00     100.00
MobileNetV2-Subspace KNN-NCA         COVID-19   96.14     95.57        96.43        93.24      94.39
                                     Normal               98.14        87.50        97.13      97.63
                                     Pneumonia            100.00       100.00       100.00     100.00
NasNet Large-Cubic SVM-ReliefF       COVID-19   96.32     96.68        96.14        92.82      94.71
                                     Normal               98.61        86.50        96.92      97.75
                                     Pneumonia            100.00       100.00       100.00     100.00
ResNet18-Cubic SVM                   COVID-19   96.04     96.12        96.00        92.53      94.29
                                     Normal               98.37        86.00        96.80      97.58
                                     Pneumonia            100.00       100.00       100.00     100.00
ResNet101-Cubic SVM-NCA              COVID-19   97.08     95.57        97.86        95.83      95.70
                                     Normal               98.14        92.50        98.26      98.20
                                     Pneumonia            100.00       100.00       100.00     100.00
ResNet50-Cubic SVM-NCA               COVID-19   96.04     97.23        95.43        91.64      94.35
                                     Normal               98.84        84.00        96.38      97.59
                                     Pneumonia            100.00       100.00       100.00     100.00
VGG16-Cubic SVM-mRMR                 COVID-19   96.42     94.74        97.29        94.74      94.74
                                     Normal               97.79        90.50        97.79      97.79
                                     Pneumonia            100.00       100.00       100.00     100.00
VGG19-Cubic SVM-mRMR                 COVID-19   95.66     99.45        93.71        89.08      93.98
                                     Normal               99.77        78.00        95.13      97.39
                                     Pneumonia            100.00       100.00       100.00     100.00
ResNet50 + AlexNet-Cubic SVM-mRMR    COVID-19   98.21     98.34        98.14        96.47      97.39
                                     Normal               99.30        93.50        98.50      98.90
                                     Pneumonia            100.00       100.00       100.00     100.00

5 Discussion

In this section, the performance criteria of studies with pre-trained models and
of the proposed method, consisting of accuracy, sensitivity and specificity, are
discussed. Evaluations in the literature are usually made on combined data sets.
Since the data sets and the evaluation criteria used in the studies differ, no
method can be said to be strictly superior to the others. The performance scores
of these methods are given in Table 3.
Abbas et al. [44] established a modified deep neural network for X-ray
images to distinguish COVID-19 cases more effectively. The model, which they
call DeTraC, includes three inner layers. This model was built with ResNet18
as its backbone and achieved 95.12% accuracy on the X-ray dataset. Wang et al.
[45] used 44 COVID-19 positive and 55 typical viral pneumonia CT images in their
study. As preprocessing, ROI extraction by visual inspection was performed. With
the applied M-inception algorithm, the results obtained were 82.9%, 81%, 84%, 77%,
and 90% for accuracy, sensitivity, F1-score, AUC, and specificity, respectively.
Alqudah et al. [46] used SVM, Random Forest, and CNN in their study. 95.2%
accuracy, 93.3% sensitivity, 100% specificity and 100% precision were achieved.
Hemdan et al. [47] proposed the COVIDXNET deep learning classifier architecture
for COVID-19 diagnosis using X-ray images. In addition, they validated
seven distinct DCNN models, such as VGG19 and DenseNet201, in their investigation.
They demonstrated that the VGG19 and DenseNet classifiers are superior.
Table 3 Literature studies and results

Ref                  Dataset                   Method                                 Accuracy (%)       Sensitivity  Specificity  Precision  F-Score
Abbas et al. [44]    COVID-19 image data       DeTraC-ResNet-18                       95.12              97.97%       91.87%       93.36%     –
                     collection [49]
Wang et al. [45]     The chest x-ray images    VGG16, VGG19, DenseNet201,             82.9               81.00%       77.00%       –          84.00%
                     (pneumonia) [42]          Inception_ResNet_V2, Inception_V3,
                                               Resnet50, MobileNet_V2, Xception
Alqudah et al. [46]  COVID-19 image data       SVM, Random Forest, CNN                95.20              93.30%       100.00%      100.00%    –
                     collection [49]
Hemdan et al. [47]   COVID-19 image data       VGG19, DenseNet201, InceptionV3,       90.00              –            –            83.00%     91.00%
                     collection [49]           ResNetV2, InceptionResNetV2,
                                               Xception, MobileNetV2
Narin et al. [48]    COVID-19 image data       ResNet50, InceptionV3,                 98.00              –            –            100.00%    98.00%
                     collection [49]           InceptionResNetV2
Proposed method      COVID-19 image dataset    AlexNet, EfficientNet B0, GoogleNet,   COVID-19 = 98.21   98.34%       98.14%       96.47%     97.39%
                     [42, 43]                  Inception ResNet-v2, Inception-v3,     Normal = 98.21     99.30%       93.50%       98.50%     98.90%
                                               DenseNet201, MobileNetV2,              Pneumonia = 98.21  100.00%      100.00%      100.00%    100.00%
                                               Nasnet-Large, ResNet18, ResNet50,
                                               ResNet101, VGG16, VGG19

Narin et al. [48] used deep CNN-based models to classify X-ray images for
COVID-19. Using chest X-ray radiographs, CNN-based models (InceptionResNetV2,
ResNet50, and InceptionV3) were utilized to detect people infected with coronavirus
pneumonia. According to the results of the experiments, 98.00% accuracy was
reached with the ResNet50 model.
The proposed approach reached a success rate of 98.21%. It reached a
100% success rate in the sensitivity and specificity criteria for the pneumonia
class. For the COVID-19 class, Sensitivity, Specificity, Precision and F-Score
values of 98.34%, 98.14%, 96.47%, and 97.39% were obtained, respectively.

6 Results

The rapid spread of the COVID-19 pandemic all over the world and its negative
effects on people clearly demonstrate the importance of detecting positive cases
in the early stages and of rapid and correct intervention. In this study, a
three-class data set consisting of X-ray images obtained during the COVID-19
epidemic was classified with the transfer learning method. Preprocessing
techniques were applied to the X-ray images to improve classification performance.
The Sobel gradient operator was used to highlight salient regions in the X-ray
images and reduce the number of gray tones. Chi-square, NCA, mRMR and ReliefF
feature selection methods were used. First, the results of 13 pre-trained models
were compared. Then, a total of 2000 features were selected from AlexNet and
Resnet101. These features were reduced to 200 features with the mRMR feature
selection method. The 200 features were given to 13 different classifiers. In this
study, the highest performance, 98.21%, was obtained with SVM after applying mRMR
feature selection to the combined ResNet50 + AlexNet model. The highest accuracy,
sensitivity, specificity, precision and F-score values for the COVID-19 class were
98.21% (ResNet50 + AlexNet with Cubic SVM), 100% (GoogleNet with Cubic SVM),
98.14% (ResNet50 + AlexNet with Cubic SVM), 96.47% (ResNet50 + AlexNet with Cubic
SVM) and 97.39% (ResNet50 + AlexNet with Cubic SVM), respectively. The proposed
approach shows that pre-trained CNN architectures and feature extraction methods
can be used together. In addition, this study confirmed that combining the weights
of the networks is more efficient than considering the performance of feature
selection methods separately. The major limitation of this study is that the
method used requires more powerful hardware if applied to larger datasets.

References

1. CoronaVirus Updates. (2022). https://www.worldometers.info/coronavirus/


2. Jalali, S. M. J., Ahmadian, M., Ahmadian, S., Hedjam, R., Khosravi, A., & Nahavandi, S.
(2022). X-ray image based COVID-19 detection using evolutionary deep learning approach.
Expert Systems with Applications, 201, 116942.
3. Dhiman, G., Chang, V., Kant Singh, K., & Shankar, A. (2022). Adopt: Automatic deep learning
and optimization-based approach for detection of novel coronavirus covid-19 disease using
x-ray images. Journal of Biomolecular Structure and Dynamics, 40(13), 5836–5847.
4. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D., Paraschiv, N., & Mohammadi-
Ivatloo, B. (2014). L2 regularized deep convolutional neural networks for fire detection. Journal
of Intelligent & Fuzzy Systems, 1–12.
5. Ravi, V., Narasimhan, H., Chakraborty, C., & Pham, T. D. (2022). Deep learning-based
meta-classifier approach for COVID-19 classification using CT scan and chest X-ray images.
Multimedia Systems, 28(4), 1401–1415.
6. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain
tumor classification. Applied Sciences, 10(14), 4915.
7. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of
retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions
of Electrical Engineering, 44(1), 505–518.
8. Samui, P., Roy, S. S., & Balas, V. E. (2017). Handbook of neural computation. Academic Press.
9. Javaheri, T., Homayounfar, M., Amoozgar, Z., Reiazi, R., Homayounieh, F., Abbas, E., Laali,
A., Radmard, A. R., Gharib, M. H., & Mousavi, S. A. J. (2021). CovidCTNet: An open-source
deep learning approach to diagnose covid-19 using small cohort of CT images. NPJ Digital
Medicine, 4(1), 1–10.
10. Rehman, A., Naz, S., Khan, A., Zaib, A., & Razzak, I. (2022). Improving coronavirus (COVID-
19) diagnosis using deep transfer learning. In Proceedings of international conference on
information technology and applications (pp. 23–37). Springer.
11. JavadiMoghaddam, S., & Gholamalinejad, H. (2021). A novel deep learning based method for
COVID-19 detection from CT image. Biomedical Signal Processing and Control, 70, 102987.
12. Chen, J., Wu, L., Zhang, J., Zhang, L., Gong, D., Zhao, Y., Chen, Q., Huang, S., Yang, M., &
Yang, X. (2020). Deep learning-based model for detecting 2019 novel coronavirus pneumonia
on high-resolution computed tomography. Scientific Reports, 10(1), 1–11.
13. Wu, X., Hui, H., Niu, M., Li, L., Wang, L., He, B., Yang, X., Li, L., Li, H., & Tian, J. (2020). Deep
learning-based multi-view fusion model for screening 2019 novel coronavirus pneumonia: A
multicentre study. European Journal of Radiology, 128, 109041.
14. Mobiny, A., Cicalese, P., Zare, S., Yuan, P., Abavisani, M., Wu, C., Ahuja, J., de Groot, P., & Van
Nguyen, H. (2020). Covid R-l detection using CT scans with detail-oriented capsule networks.
15. Balaha, H. M., El-Gendy, E. M., & Saafan, M. M. (2021). CovH2SD: A COVID-19 detection
approach based on Harris Hawks Optimization and stacked deep learning. Expert Systems with
Applications, 186, 115805.
16. Li, L., Qin, L., Xu, Z., Yin, Y., Wang, X., Kong, B., Bai, J., Lu, Y., Fang, Z., & Song, Q. (2020).
Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest
CT. Radiology.
17. He, X., Yang, X., Zhang, S., Zhao, J., Zhang, Y., Xing, E., & Xie, P. (2020). Sample-efficient
deep learning for COVID-19 diagnosis based on CT scans. medRxiv.
18. Ahamed, K. U., Islam, M., Uddin, A., Akhter, A., Paul, B. K., Yousuf, M. A., Uddin, S.,
Quinn, J. M., & Moni, M. A. (2021). A deep learning approach using effective preprocessing
techniques to detect COVID-19 from chest CT-scan and X-ray images. Computers in Biology
and Medicine, 139, 105014.
19. Pathak, Y., Shukla, P. K., Tiwari, A., Stalin, S., & Singh, S. (2020). Deep transfer learning
based classification model for COVID-19 disease. Irbm.

20. Shi, F., Xia, L., Shan, F., Song, B., Wu, D., Wei, Y., Yuan, H., Jiang, H., He, Y., & Gao, Y. (2021).
Large-scale screening to distinguish between COVID-19 and community-acquired pneumonia
using infection size-aware classification. Physics in Medicine & Biology, 66(6), 065031.
21. Tarabalka, Y., Chanussot, J., & Benediktsson, J. A. (2010). Segmentation and classification of
hyperspectral images using watershed transformation. Pattern Recognition, 43(7), 2367–2379.
22. Gauch, J. M. (1999). Image segmentation and analysis via multiscale gradient watershed
hierarchies. IEEE Transactions on Image Processing, 8(1), 69–79.
23. Yang, W., Wang, K., & Zuo, W. (2012). Neighborhood component feature selection for high-
dimensional data. Journal of Computers, 7(1), 161–168.
24. Robnik-Šikonja, M., & Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF
and RReliefF. Machine Learning, 53(1), 23–69.
25. Liu, H., Li, J., & Wong, L. (2002). A comparative study on feature selection and classification
methods using gene expression profiles and proteomic patterns. Genome Informatics, 13, 51–60.
26. McHugh, M. L. (2013). The chi-square test of independence. Biochemia Medica, 23(2), 143–
149.
27. Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of
max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 27(8), 1226–1238.
28. Ding, C., & Peng, H. (2005). Minimum redundancy feature selection from microarray gene
expression data. Journal of Bioinformatics and Computational Biology, 3(02), 185–205.
29. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). Imagenet classification with deep
convolutional neural networks. Communications of the ACM, 60(6), 84–90.
30. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected
convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 4700–4708).
31. Iandola, F., Moskewicz, M., Karayev, S., Girshick, R., Darrell, T., & Keutzer, K. (2014).
Densenet: Implementing efficient convnet descriptor pyramids. Preprint at arXiv:1404.1869.
32. Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural
networks. In International conference on machine learning, PMLR (pp. 6105–6114).
33. Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, inception-resnet
and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial
intelligence.
34. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke,
V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE
conference on computer vision and pattern recognition (pp. 1–9).
35. Ou, X., Yan, P., Zhang, Y., Tu, B., Zhang, G., Wu, J., & Li, W. (2019). Moving object detection
method via ResNet-18 with encoder–decoder structure in complex scenes. IEEE Access, 7,
108152–108160.
36. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778)
37. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image
recognition. Preprint at arXiv:14091556.
38. Vapnik, V. (1999). The nature of statistical learning theory. Springer science & business media.
39. McRoberts, R. E., Tomppo, E. O., Finley, A. O., & Heikkinen, J. (2007). Estimating areal
means and variances of forest attributes using the k-Nearest Neighbors technique and satellite
imagery. Remote Sensing of Environment, 111(4), 466–480.
40. Bühlmann, P. (2012). Bagging, boosting and ensemble methods. In Handbook of computational
statistics. Springer, pp 985–1022.
41. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
42. COVID-19 chest xray. (2022). https://www.kaggle.com/bachrr/covid-chest-xray
43. Chest X-Ray Images (Pneumonia). (2022). Retrieved from https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
44. Abbas, A., Abdelsamea, M. M., & Gaber, M. M. (2021). Classification of COVID-19 in chest
X-ray images using DeTraC deep convolutional neural network. Applied Intelligence, 51(2),
854–864.
45. Wang, S., Kang, B., Ma, J., Zeng, X., Xiao, M., Guo, J., Cai, M., Yang, J., Li, Y., & Meng,
X. (2021). A deep learning algorithm using CT images to screen for Corona Virus Disease
(COVID-19). European Radiology, 31(8), 6096–6104.
46. Alqudah, A. M., Qazan, S., Alquran, H., Qasmieh, I. A., & Alqudah, A. (2020). COVID-2019
detection using X-ray images and artificial intelligence hybrid systems. Biomedical Signal and
Image Analysis and Project.
47. Hemdan, E. E.-D., Shouman, M. A., & Karar, M. E. (2020). Covidx-net: A framework of deep
learning classifiers to diagnose covid-19 in x-ray images. Preprint at arXiv:200311055.
48. Narin, A., Kaya, C., & Pamuk, Z. (2021). Automatic detection of coronavirus disease (covid-19)
using x-ray images and deep convolutional neural networks. Pattern Analysis and Applications,
24(3), 1207–1220.
49. Cohen, J. P., Morrison, P., Dao, L., Roth, K., Duong, T. Q., & Ghassemi, M. (2020). Covid-19
image data collection: Prospective predictions are the future. Preprint at arXiv:200611988.
Image Captioning Using Deep Transfer
Learning

Tapan Kumar Das

1 Introduction

Generating a textual description of an image is an easy task for a human being; however,
for a machine to explain an image, it needs computer vision to interpret the image
and NLP to describe it [1]. Hence, in order to generate a caption automatically
for a particular photograph, the system must be trained to recognise the
content of the image and then to express that content in natural language
[2]. With the advent of deep learning methods, especially for image feature extraction
and processing [3], this problem has been addressed rapidly.
Deep learning techniques such as the convolutional neural network (CNN) are widely
used for image processing tasks because of their ability to deal with millions of underlying
features [4]. CNN techniques have proved effective for a
variety of medical image processing tasks, e.g. COVID-19 lung CT scans [5], MRI images
for brain tumor diagnosis [6, 7], retinal blood vessels [8], angiograms [9], chest X-rays
[10] and many more.
On seeing the picture depicted in Fig. 1, some of us might say “A Little
is talking brown guiding grassy”, some may say “Little boy is playing with toys”
and yet others might say “A little boy is designing the house”. All
of these observations are valid, and a few additional captions are also possible.
Producing them requires no special training or effort from a human being;
a machine, however, cannot arrive at an appropriate description
merely by glancing at the image.

T. K. Das (B)
School of Information Technology and Engineering, Vellore Institute of Technology,
Vellore 632014, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis,
Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_3

Fig. 1 A sample image

This study of caption generation for images is significant in the following respects.
• The experiments are based on transfer learning coupled with a Convolutional Neural
Network (CNN).
• We aim to boost model performance by making subtle changes to the
block diagram.
• The objective is to produce semantically and syntactically sound captions for the input
images by using phrases, instead of words, as the elementary units.
Motivation
This problem is immensely useful in real-world applications. A few
applications to which this study is relevant are listed below:
• Self-driving cars: Automatically and promptly captioning the scene
around the car could help make a self-driving system more autonomous.
• Aid to the blind: A product that guides blind persons
walking on the road would meet an important need. This is possible by converting
the surrounding scene into text and then the text into voice.
• Google image search: Like Google search, image search may become popular if an
image can first be transformed into a caption and the underlying text can then
be searched.

2 Related Studies

Different techniques exist for image captioning; they are typically retrieval based or template
based. Recently, deep learning based captioning has become very popular owing to the quality
and appropriateness of the textual descriptions it produces. Deep learning based attention
mechanisms also deliver promising results in captioning [11]. Most
models are encoder-decoder based, and LSTM and bidirectional
LSTM networks are used as the decoder in most systems [12]. Similarly,
VGG16 and ResNet50 are employed for encoding because of their effectiveness in
vectorising images [13].
A few studies that have used deep learning for image
processing and text description are summarised in Table 1.

Table 1 Contemporary studies on caption generation using deep learning method

Studies | Objective | Methodology | Result
Chen et al. [14] | Mapping between images and their textual descriptions | Generating the caption using the recurrent neural network | Capable of generating novel captions
Sharma et al. [15] | Image captioning by integrating visual and knowledge | The methodologies used are LSTM, CNN, and external knowledge from external source | Extracted classes belong to the True class in general
You et al. [16] | Image captioning with semantic attention | Combines both top-down and bottom-up strategies | State-of-the-art performance as compared to standard benchmarks
Rampal et al. [17] | Image captioning using neural network compression | VGG16 or ResNet50 as an encoder, LSTM as decoder and Flickr8k dataset | As compared to uncompressed model, achieves a 73.1% reduction in model size, and 7.7% increase in BLEU score
Arnav et al. [18] | Image captioning using deep learning | Two input streams are merged and passed to an LSTM layer | Results show 5 sentences, generated using a beam size of 5 along with average log probability of the sequence of words
Yao et al. [19] | Boosting image captioning with attributes | Devised the CNN plus RNN architecture to generate descriptions | Examines image representations and high-level attributes
Singh et al. [20] | Image captioning using Artificial Intelligence | CNN, RNN-LSTM | Discussed the various algorithms like CNN, RNN, LSTM
Wang et al. [21] | Image captioning | CNN and two separate LSTM networks | Achieved high performance

3 Methodology

We used a combined CNN-RNN model to extract features from the images and
text. We then used an evaluation metric to check the accuracy of the proposed model,
and tracked the performance of the model at each epoch through the error rate.
We use a top-down approach and transfer learning to extract the features,
train the model, and obtain accurate captions for the images. In fact,
transfer learning is applied twice in our model: InceptionV3 extracts features
from the images, and GloVe extracts features from the text/captions.
Finally, we test the model on held-out test images to measure its accuracy.
The detailed methodology consists of the following steps:
• Data collection.
• Data cleaning and pre-processing.
• Pre-processing yields a vocabulary of 1652 unique words
from the training dataset. We employed the InceptionV3 transfer learning model.
• We encoded all the training and testing images that are input to our
model.
• After removing the stop-words during data cleaning, we have 7578 words
in our vocabulary.
• We also used a transfer learning model (GloVe) to extract features from our
pre-processed text data.
• We then built and trained our network/model. Finally, we evaluated its performance
on the test data.
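
The cleaning and vocabulary steps listed above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the chapter's own code: the function names, the small stop-word set, and the `min_count` threshold are all hypothetical, since the chapter does not specify them.

```python
import string
from collections import Counter

# Hypothetical stop-word set; the chapter does not list the one it used.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "with"}


def clean_caption(caption):
    """Lower-case, strip punctuation, drop non-alphabetic tokens and stop-words."""
    table = str.maketrans("", "", string.punctuation)
    tokens = caption.lower().translate(table).split()
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]


def build_vocabulary(captions, min_count=1):
    """Return the sorted list of words seen at least `min_count` times."""
    counts = Counter(tok for cap in captions for tok in clean_caption(cap))
    return sorted(w for w, c in counts.items() if c >= min_count)


# Two of the example captions quoted in the Introduction.
captions = [
    "A little boy is playing with toys.",
    "A little boy is designing the house!",
]
vocab = build_vocabulary(captions)
```

Raising `min_count` is one common way to shrink a vocabulary by discarding rare words; whether the chapter applied such a threshold is not stated.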

3.1 Dataset

We utilised the Flickr8k dataset, which contains around 8000 images, of which
6000 images are used for training the model, 1000 images for validating the model
and the remaining 1000 images for testing the model in order to determine its
performance. Each image has five captions (Fig. 2).
Figure 3 exhibits a few sample images from the Flickr8k dataset.
As Fig. 4 shows, each individual image has five different captions.
The Flickr8k dataset is loaded from the repository, and the data is then pre-processed by
removing extra whitespace, punctuation, and other distractions. For encoding, a CNN
is used. The input image is fed to the CNN to extract the features. After the features
are processed by a series of layers, the last hidden state of the CNN is connected to
the decoder. In this framework, an RNN serves as the decoder, which performs language
modelling up to the word level. A schematic diagram of the encoder-decoder based
image captioning process is shown in Fig. 2.

Fig. 2 Encoder-decoder based image captioning process

Fig. 3 Sample images in the dataset

Fig. 4 Caption for the images

3.2 Inception Model for Images

Here we used the pretrained Inception V3 model to extract features from the
images. Inception V3 is a widely used image recognition model that has been shown to attain
greater than 78.1% accuracy on the standard ImageNet dataset.
The architecture of Inception V3 is depicted in Fig. 5.

Fig. 5 Inception V3 architecture diagram

The encoding and decoding processes, together with the detailed layers of those models
and the parameters involved, are shown in Figs. 6 and 7 respectively.
A summary of the caption model, showing the total number of parameters trained by the
proposed model and the detailed network layers, is given in Fig. 8.

Fig. 6 Encoding the model summary



Fig. 7 Decoding the vectored model summary

4 Result

The main objective is to predict the caption for an image. For prediction, we applied
a predictive model based on deep learning. We focussed mainly on
how well the model predicts a suitable caption for a given image in
the dataset.
To evaluate the quality of the generated text, we used BLEU (Bilingual Evaluation
Understudy), which matches each generated text against a set of reference
texts composed by humans. It yields a score that reflects the overall
quality of the generated text. We achieved a BLEU score of 0.645 on the
dataset considered.
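
To make the matching principle behind BLEU concrete, the following is a minimal sentence-level sketch in plain Python. It assumes uniform n-gram weights, no smoothing, and pre-tokenised word lists; the chapter does not state which n-gram order or implementation produced its score, so the function names and the `max_n` default here are illustrative assumptions only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "a little boy is playing with toys".split()
score_identical = bleu(reference, [reference])
```

A caption identical to one of its references scores 1, and a caption that shares no words with any reference scores 0; real captions fall in between.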

4.1 Sample Output

To test the effectiveness of our model, we ran it on a few
images from the Flickr8k dataset; the output captions obtained are shown in Figs. 9,
10, 11, 12 and 13.

Fig. 8 Caption model summary



Fig. 9 Sample output for image 1

Fig. 10 Sample output for image 2



Fig. 11 Sample output for image 3

Fig. 12 Sample output for image 4



Fig. 13 Sample output for image 5

5 Conclusion

In this chapter, we carried out the image captioning task by integrating two deep
learning techniques, i.e. CNN and RNN. For training the encoder-decoder model, we
used the Flickr8k dataset. The trained model achieved state-of-the-art performance when
tested with unseen images of the dataset. The efficiency of content-based image retrieval
is assessed by the quality of the textual description of the image. Image caption
generation can widen the scope of application areas such as medicine, security and
other fields where the underlying image conveys a great deal and has some implicit meaning.
Moreover, the image captioning framework can automate and promote the annotation
of images on a large scale, which can lead on to video captioning and video dialog.

References

1. Sharma, H., Agrahari, M., Singh, S. K., Firoj, M., & Mishra, R. K. (2020). Image captioning:
A comprehensive survey. In 2020 International Conference on Power Electronics & IoT
Applications in Renewable Energy and its Control (PARC) (pp. 325–328). IEEE.
2. Stefanini, M., Cornia, M., Baraldi, L., Cascianelli, S., Fiameni, G., & Cucchiara, R. (2022).
From show to tell: a survey on deep learning-based image captioning. IEEE Transactions on
Pattern Analysis and Machine Intelligence.
3. Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). A comprehensive survey of
deep learning for image captioning. ACM Computing Surveys (CsUR), 51(6), 1–36.
4. Chohan, M., Khan, A., Mahar, M. S., Hassan, S., Ghafoor, A., & Khan, M. (2020). Image
captioning using deep learning: A systematic. Image, 11(5).

5. Tiwari, R. S., Das, T. K., Srinivasan, K., & Chang, C. Y. (2022). Conceptualising a channel-
based overlapping CNN tower architecture for COVID-19 identification from CT-scan images.
Scientific Reports, 12(1), 1–15.
6. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain
tumor classification. Applied Sciences, 10(14), 4915.
7. Das, T. K., Roy, P. K., Uddin, M., Srinivasan, K., Chang, C. Y., & Syed-Abdul, S. (2021). Early
tumor diagnosis in brain MR images via deep convolutional neural network model. Computers,
Materials and Continua, 68(2), 2413–2429.
8. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of
retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions
of Electrical Engineering, 44(1), 505–518.
9. Roy, S. S., Hsu, C., Samaran, A., Goyal, R., Pande, A., et al. (2023). Vessels segmentation
in angiograms using convolutional neural network: A deep learning based approach. CMES-
Computer Modeling in Engineering & Sciences, 136(1), 241–255.
10. Das, T. K., Chowdhary, C. L., & Gao, X. Z. (2020). Chest X-ray investigation: a convolutional
neural network approach. Journal of Biomimetics, Biomaterials and Biomedical Engineering,
45, 57–70. Trans Tech Publications Ltd.
11. Zohourianshahzadi, Z., & Kalita, J. K. (2022). Neural attention for image captioning: Review
of outstanding methods. Artificial Intelligence Review, 55(5), 3833–3862.
12. Wang, C., Yang, H., Bartz, C., & Meinel, C. (2016). Image captioning with deep bidirectional
LSTMs. In Proceedings of the 24th ACM International Conference on Multimedia (pp. 988–
997).
13. Rampal, H., & Mohanty, A. (2020). Efficient CNN-LSTM based image captioning using neural
network compression. Preprint retrieved from arXiv:2012.09708.
14. Chen, X., & Zitnick, C. L. (2014). Learning a recurrent visual representation for image caption
generation. Preprint retrieved from arXiv:1411.5654.
15. Sharma, H., & Jalal, A. S. (2020). Incorporating external knowledge for image captioning using
CNN and LSTM. Modern Physics Letters B, 34(28), 2050315.
16. You, Q., Jin, H., Wang, Z., Fang, C., & Luo, J. (2016). Image captioning with semantic attention.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4651–
4659).
17. Rampal, H., & Mohanty, A. (2020). Efficient CNN-LSTM based image captioning using neural
network compression. Preprint retrieved from arXiv:2012.09708.
18. Arnav, J. H., & Pulkit, M. (2018). Image captioning using deep learning.
19. Yao, T., Pan, Y., Li, Y., Qiu, Z., & Mei, T., (2017). Boosting image captioning with attributes.
In Proceedings of the IEEE International Conference on Computer Vision (pp. 4894–4902).
20. Singh, Y. P., Ahmed, S. A. L. E., Singh, P., Kumar, N., & Diwakar, M. (2021). Image captioning
using artificial intelligence. In Journal of Physics: Conference Series (Vol. 1854, No. 1,
p. 012048). IOP Publishing.
21. Wang, C., Yang, H., Bartz, C., & Meinel, C. (2016). Image captioning with deep bidirectional
LSTMs. In Proceedings of the 24th ACM International Conference on Multimedia (pp. 988–
997).
Vehicle Over Speed Detection System

K. Ganesan, N. S. Manikandan, and Vijayan Sugumaran

1 Introduction

Every year, many people die all across the world, and vehicle accidents are one of the
most common causes of death. Accidents not only kill people, but also injure a
large number of them. Among the several causes of accidents, high vehicle speed
is the most important, so high-speed vehicles must be managed.
Accordingly, different government organisations, academic institutions, and automobile
manufacturers have begun various studies and projects to lower the likelihood of
accidents and provide safety to passengers and drivers. Several researchers have
used different kinds of mechanisms to detect vehicle over-speed on highways, such
as VANET technology connected to a cloud server [1], a video-based specific region of
interest (ROI) [2], and speed prediction based on electronic toll collection data [3]. To
manage high-speed vehicles on the highway, the Tamil Nadu government planned
to install an over-speed detection device at the toll plaza. Figure 1 depicts a block
diagram of over-speed detection at a toll plaza. This architecture is made up of a
vehicle detection system, a common cloud server that is linked to an RTO server,
and an over-speed detection system.

K. Ganesan (B)
Professor, Higher Academic Grade, School of Information Technology and Engineering, Vellore
Institute of Technology (VIT), Vellore 632014, Tamil Nadu, India
e-mail: [email protected]
N. S. Manikandan
Senior System Architect, TIFAC-CORE Automotive Infotronics, Vellore Institute of Technology,
Vellore 632014, Tamil Nadu, India
V. Sugumaran
Distinguished Professor of Management Information Systems, School of Business
Administration, Oakland University, Rochester, MI, USA

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis,
Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_4

Fig. 1 Block diagram of the proposed system for high speed detection

Neural computation [6] and deep learning [7] enable state-of-the-art applications
in many domains, including vehicle detection from
satellite images [4], tumour extraction from medical images [5], and many others.
Vehicle detection and licence plate detection and recognition play a major role
in the initial stages of the vehicle over-speed detection system. In particular, the system must
detect Indian vehicles and extract Indian licence plate information. For
Indian vehicle detection at the toll plaza, Rajput et al. [8] used YOLO3 object
detection and classification to detect and classify vehicles. They
classified six types of vehicles, each of which can be charged a separate toll.
Their findings at the toll plaza showed an average recall of 86.3% and a precision of
94.1%.
An important role of this system is to locate and extract vehicle license plate
information. Many researchers have used deep learning techniques [9–15]; some have
used image processing techniques [16, 17]. However, locating and
extracting Indian vehicle license plates is often complicated by improper positioning,
damage, occlusion, or unauthorized fonts [18]. For Indian vehicle
license plate localization and data extraction, Jagtap et al. [19] proposed a novel method
for detecting license plates with various font styles on vehicles. It
relies on adaptive image segmentation in conjunction with Artificial Neural Network
(ANN) character recognition. The approach combines morphological operations
with horizontal and vertical edge histograms to accomplish plate localization
and character segmentation. To recognize characters, a two-layer feed-forward
back-propagation ANN is used. The results show an overall accuracy of 89.5%. To deal
with Indian license plate irregularities, Ravirathinam
et al. [20] built a pipeline using a number of cutting-edge Faster Regional Convolutional Neural
Networks to address the Indian situation effectively in a variety of scenarios. Because there
is no publicly accessible dataset for Indian licence plates, they created a balanced
dataset using frames from videos and images from mobile devices, accounting for
all the irregularities. Their pipeline achieved an overall total correctness of 88.5%
and a partial correctness of 10% for Indian plates. The overall correctness increased
to 91% with the addition of a new heuristics system. The accuracy of licence plate
detection for all kinds of vehicles was 94.98%. The extracted license plate
information is sometimes incorrect. For OCR correction in chaotic Indian traffic videos with
complicated licence plate patterns, Singh et al. [18] proposed a modular framework.
The plate readings are produced by a cutting-edge deep learning model that was trained
on video frames. Because the model reads text from videos rather than single images, the
framework uses multi-frame consensus to generate suggestions. Their
human-interactive framework uses an object detector and a tracker to first separate
the multi-vehicle videos into multiple clips, each of which contains a single vehicle,
to aid in the correction process. The framework then offers
recommendations for a single vehicle using multi-frame consensus. The user is then given
interactive suggestions that show only certain extracted clips, allowing them to
quickly and easily verify or correct the predictions. This high-quality output can be
used to update a sizable database continuously for surveillance, which will improve
the accuracy of deep models in difficult real-world scenarios.
On the cloud platform side, Khan et al. [21] proposed an IoT-based system that uses two
detection points with surveillance cameras to measure the average speed between them.
To enforce speed limits, the measured data is sent to the cloud
for additional processing. Entry and exit points are used to detect any uncertainty
in a particular area. For instance, a car that fails to reach the exit point after
passing through the entry point can be flagged. The system is made up of a
mobile phone application and a web network that exchange real-time data, including
information about passing vehicles such as entry time, pictures, and license plate
registration numbers. Such a system has the advantages of requiring little human
involvement, requiring fewer speed guns to be installed, and monitoring vehicles
even when they are not in the camera’s field of view.
The speed limit between two toll gates is determined by traffic density or government
traffic rules and regulations. However, the roads between the toll plazas are
generally curvy and have speed limits. The majority of cloud-based vehicle over-speed
detection systems are unaware of road curvature. For extracting data
about horizontal curves from road GIS maps, Li et al. [22] present a fully automated
method. Their methodology does four things: (a) every curve in the selected
road’s surface layers is identified, regardless of the type of curve;
(b) each curve is automatically classified as either simple or compound; (c) the
radius, degree of curvature, and length of each simple curve, and the radius of each
compound curve, are automatically determined; and (d) curve characteristics and layers
are automatically created in the GIS for all detected curves. The technique correctly
identified 96.7% of curves and computed their geometric information.
However, the existing road curvature extraction method does not account for curvature noise
or for curvature in hilly terrain.
Thus, existing over-speed detection systems have some gaps, such as not being
aware of the curvature of the highway and of curvature noise. To
bridge these gaps, the proposed system includes the following features:
• The YOLO object detection model has been proposed for vehicle detection and
vehicle type extraction.
• An image processing technique is used to locate and extract licence plates from
detected vehicle images.
• The information on the localised licence plate is extracted using the CRNN deep
learning text extraction model.
• The proposed curvature aware travel time estimation model calculates the travel
time between two toll plazas, and the cloud-based system detects over-speed of
vehicles.
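
The idea behind the last feature can be illustrated with a small sketch. It assumes, purely for illustration, that the route between two toll plazas is split into segments with known lengths and speed limits, and that a vehicle is flagged when its measured travel time is shorter than the minimum legal travel time. The function names, the segment representation, and the example route are hypothetical and are not taken from the chapter.

```python
def min_legal_travel_time_h(segments):
    """Shortest legal travel time in hours.

    `segments` is an iterable of (length_km, speed_limit_kmh) pairs,
    e.g. straight stretches and curves with their own limits.
    """
    return sum(length / limit for length, limit in segments)


def is_over_speed(entry_time_h, exit_time_h, segments):
    """Flag a vehicle whose measured travel time beats the legal minimum."""
    actual = exit_time_h - entry_time_h
    return actual < min_legal_travel_time_h(segments)


# Hypothetical route: 40 km of straight road at 100 km/h, 5 km of curves at 50 km/h.
route = [(40.0, 100.0), (5.0, 50.0)]
legal_minimum = min_legal_travel_time_h(route)  # 0.4 h + 0.1 h
```

Ignoring the lower limits on curves would give a shorter legal minimum (45 km at 100 km/h is 0.45 h), so a curvature-unaware system would fail to flag some vehicles that sped through the curves.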
The remainder of this paper is arranged as follows: Sect. 2 describes
the vehicle detection and license plate extraction system, which is subdivided into
vehicle detection and type classification, license plate localization, and license plate
recognition, and the travel time estimation and over-speed detection system, which is
further subdivided into the new curve finding method, curve speed limit
database creation, curve aware travel time estimation, and vehicle over-speed
detection. Section 3 discusses the results of vehicle detection, license plate
localization and text extraction, the new curve finding method, curve aware travel time
estimation, and vehicle over-speed detection. Finally, Sect. 4 provides the conclusion
and future work.

2 Proposed Model

Figure 2 depicts the proposed system’s architecture. This system has been subdivided
into three subsystems. The first subsystem detects vehicles at toll gates and extracts
license plate information as well as vehicle type. The second subsystem uses a road
curvature extraction module, a curve aware speed limitation module, and a curvature
aware travel time estimator to characterize the curvature aware journey time between
two toll gates. Over speed detection is the third and final subsystem. It is made up of
a toll gate system and a common cloud server infrastructure. The three subsystems
are briefly described below.

Fig. 2 The architecture of the proposed system

2.1 Vehicle Detection and License Plate Extraction System

This system consists of three modules: vehicle detection and vehicle type classifica-
tion, license plate localization, and license plate recognition. Details of each of these
modules are provided below.

2.1.1 Vehicle Detection and Type Classification: YOLO

YOLO [23] applies a single CNN to the entire image and divides the image into an M × M grid.
For each grid cell, B bounding boxes and their associated
confidence scores are predicted. The class confidence score evaluates these bounding
boxes using the formula given below.
Class confidence score = conditional class probability × box confidence score.
It measures confidence in both the classification and the localization. The
mathematical definitions are as follows:

$$\text{box confidence score} \equiv \Pr(\text{object}) \cdot \text{IoU}$$
$$\text{conditional class probability} \equiv \Pr(\text{class}_i \mid \text{object})$$
$$\text{class confidence score} \equiv \Pr(\text{class}_i) \cdot \text{IoU}$$

so that

$$\Pr(\text{class}_i) \cdot \text{IoU} = \Pr(\text{object}) \cdot \text{IoU} \times \Pr(\text{class}_i \mid \text{object}) \tag{1}$$

where $\Pr(\text{object})$ denotes the likelihood that an object is present in the box, and
IoU is the intersection over union between the predicted box and the
ground truth. The probability that an object belongs to a given class i, given its
presence, is $\Pr(\text{class}_i \mid \text{object})$. The probability that an object belongs to
a given class i is $\Pr(\text{class}_i)$.
YOLO resizes an input image to 448 × 448 pixels. The image is then passed
through a convolutional network, yielding a 7 × 7 × 30 tensor. The tensor
holds (1) the coordinates of each bounding box rectangle and (2) the probability
distribution over all classes for which the system has been trained. Predictions whose
confidence score (probability) is below 30% are eliminated.
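
The scoring in Eq. (1) and the 30% cut-off can be illustrated with a few lines of Python. This is a sketch under assumptions: boxes are given as corner coordinates, and the helper names `iou`, `class_confidence`, and `keep_detections` are hypothetical rather than part of any YOLO implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_confidence(p_object, iou_value, p_class_given_object):
    """Eq. (1): box confidence score times conditional class probability."""
    box_confidence = p_object * iou_value
    return box_confidence * p_class_given_object


def keep_detections(scores, threshold=0.30):
    """Drop every class label whose confidence score is below the threshold."""
    return {label: s for label, s in scores.items() if s >= threshold}


predicted, truth = (0, 0, 2, 2), (1, 0, 3, 2)
overlap = iou(predicted, truth)  # intersection 2, union 6
kept = keep_detections({"car": 0.36, "bus": 0.12})
```

Because the class confidence score is a product of three values in [0, 1], a box passes the threshold only if it is likely to contain an object, is well localised, and is confidently classified.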
To calculate the loss, YOLO uses the sum-squared error between the predictions and
the ground truth. The loss function has three parts: the classification loss; the
localization loss, which is the error between the predicted bounding box and the ground truth;
and the confidence loss, which includes a separately weighted term for the boxes that do not
contain any object at all. The overall formula is:


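$$
\begin{aligned}
&\lambda_{coord} \sum_{i=0}^{s^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(x_i - \hat{x}_i\right)^2 + \left(y_i - \hat{y}_i\right)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{s^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{s^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(c_i - \hat{c}_i\right)^2 + \lambda_{noobj} \sum_{i=0}^{s^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(c_i - \hat{c}_i\right)^2 \\
&+ \sum_{i=0}^{s^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned} \tag{2}
$$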

2.1.2 License Plate Localization

Finding the location of the license plate in the vehicle image is a critical task.
Grayscale conversion, thresholding, and morphological operations such as dilation
and erosion are used to localize plates. A Canny edge detector is used to detect the license
plate edges, and the located license plate is cropped from the vehicle image [24].
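
The thresholding and morphological steps can be sketched on a toy image in plain Python. This is an illustrative assumption, not the chapter's pipeline: it uses a 3 × 3 square structuring element on nested lists, omits the Canny edge detection step, and replaces contour analysis with a simple bounding box. A real system would normally use an image processing library. All function names here are hypothetical.

```python
def threshold(gray, t):
    """Binarise a 2-D list of grey levels: 1 where pixel >= t, else 0."""
    return [[1 if p >= t else 0 for p in row] for row in gray]


def _morph(binary, combine):
    """Apply a 3x3 square structuring element; out-of-image pixels are ignored."""
    h, w = len(binary), len(binary[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [binary[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(combine(window))
        out.append(row)
    return out


def dilate(binary):
    return _morph(binary, max)


def erode(binary):
    return _morph(binary, min)


def bounding_box(binary):
    """Return (x_min, y_min, x_max, y_max) of all foreground pixels, or None."""
    points = [(x, y) for y, row in enumerate(binary) for x, v in enumerate(row) if v]
    if not points:
        return None
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


# Toy 7 x 9 image: a bright "plate" on a dark background with one dark speck inside it.
gray = [[200 if 2 <= y <= 4 and 2 <= x <= 6 else 10 for x in range(9)] for y in range(7)]
gray[3][4] = 20
binary = threshold(gray, 128)
closed = erode(dilate(binary))  # dilation followed by erosion fills the speck
plate_box = bounding_box(closed)
```

Dilation followed by erosion (a morphological closing) fills small dark gaps such as characters or dirt inside the plate region, so that the plate appears as one solid blob that can be cropped.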

2.1.3 License Plate Recognition: CRNN

The CRNN [25], made up of a CNN, a bi-directional LSTM, and a CTC layer, can be
viewed as an encoder-decoder structure. The CNN acts as a feature sequence encoder that
creates image feature sequences. Character sequences are produced by a decoder
made up of the bi-directional LSTM and CTC layers.

The CNN sets the input image’s width and height to (W × 32)/H and 32 pixels,
respectively, where W and H are the original image’s width and height, in order to maintain
the original aspect ratio. Because characters are tall and thin, with a height greater than
their width, the CNN uses a 2 × 1 stride rather than a 2 × 2 stride in the later pooling
layers. As a result, each point of the final feature map corresponds to a thin, tall
receptive field in the original image. The input image is downsampled using two pooling
layers with a 2 × 2 stride and three pooling layers with a 2 × 1 stride. The final dimension
of the feature map is b × 1 × [(W × 8)/H] × C, where b is the batch size, 1 is the
height, (W × 8)/H is the width, and C represents the number of channels. The structure
of the CNN used for feature extraction is displayed in Table 1.
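
The size bookkeeping above can be checked with a short calculation. The sketch below follows the five pooling stages listed in Table 1 and assumes integer (floor) division for all sizes, which the chapter does not specify; the function name is hypothetical.

```python
def crnn_feature_map_size(width, height):
    """Return (feature_height, feature_width) after the five pooling stages."""
    # Resize to a fixed height of 32 while preserving the aspect ratio.
    h, w = 32, (width * 32) // height
    # (vertical stride, horizontal stride) of each pooling layer in Table 1.
    pool_strides = [(2, 2), (2, 2), (2, 1), (2, 1), (2, 1)]
    for stride_v, stride_h in pool_strides:
        h, w = h // stride_v, w // stride_h
    return h, w


# A 200 x 50 plate crop is resized to 128 x 32, then pooled down to 32 columns of height 1.
size = crnn_feature_map_size(200, 50)
```

For this example the width is (200 × 8)/50 = 32, which agrees with the (W × 8)/H expression: the height is halved five times (32 to 1), while the width is halved only twice.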
The CRNN decoder is composed of the bi-directional LSTM layer and the CTC layer.
The bi-directional LSTM receives its input from the column vectors of the feature
map, whose width is (W × 8)/H. Its output is a probability matrix of size
[(W × 8)/H] × C that gives the probability of each character for each column vector.
Here C is the number of character labels, which comprise the 26 English uppercase
letters, the 26 English lowercase letters, and a space. The likelihood of a label
sequence is determined by applying the CTC layer to the output of the bi-directional
LSTM. During training, this likelihood is the conditional probability defined in the
CTC layer, and its negative log-likelihood serves as the loss function for training
the network. The CTC layer sums the probabilities of all paths that map to the true
label sequence. The paths ‘hee-ll-o’ and ‘hh-ee-ll-oo’ (where ‘-’ signifies a space) eliminate

Table 1 CNN network of CRNN

| Layers | CNN network | Output size |
|---|---|---|
| Conv1 | (3 × 3 conv) × 6 | 32 × [(W × 32)/H] |
| Connection layer | ReLU, 1 × 1 conv, dropout | 32 × [(W × 32)/H] |
| | 2 × 2 average pool, stride 2 × 2 | 16 × [(W × 16)/H] |
| Conv2 | (3 × 3 conv) × 6 | 16 × [(W × 16)/H] |
| Connection layer | ReLU, 1 × 1 conv, dropout | 16 × [(W × 16)/H] |
| | 2 × 2 average pool, stride 2 × 2 | 8 × [(W × 8)/H] |
| Conv3 | (3 × 3 conv) × 6 | 8 × [(W × 8)/H] |
| Connection layer_1 | ReLU, 1 × 1 conv, dropout | 8 × [(W × 8)/H] |
| | 2 × 1 average pool, stride 2 × 1 | 4 × [(W × 8)/H] |
| Conv4 | (3 × 3 conv) × 6 | 4 × [(W × 8)/H] |
| Connection layer_1 | ReLU, 1 × 1 conv, dropout | 4 × [(W × 8)/H] |
| | 2 × 1 average pool, stride 2 × 1 | 2 × [(W × 8)/H] |
| Conv5 | (3 × 3 conv) × 6 | 2 × [(W × 8)/H] |
| Connection layer_1 | ReLU, 1 × 1 conv, dropout | 2 × [(W × 8)/H] |
| | 2 × 1 average pool, stride 2 × 1 | 1 × [(W × 8)/H] |

duplicates and spaces to show the label sequence ‘helo’. At test time, the
recognition result is the character sequence with the highest probability.
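The collapsing rule described above (merge repeated characters, then drop the ‘-’ symbol) can be sketched as follows. The helper name `ctc_collapse` is hypothetical and the snippet only illustrates the path-to-label mapping, not the CTC loss itself.

```python
from itertools import groupby


def ctc_collapse(path, blank="-"):
    """Map a CTC path to its label sequence.

    Consecutive duplicates are merged first, then the blank symbol is removed,
    so a blank between two identical characters keeps both of them.
    """
    return "".join(ch for ch, _ in groupby(path) if ch != blank)


if __name__ == "__main__":
    for path in ("hee-ll-o", "hh-ee-ll-oo", "hel-lo"):
        print(path, "->", ctc_collapse(path))
```

Both example paths from the text collapse to the same label sequence, which is why the CTC layer sums their probabilities.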

2.2 Travel Time Estimating and Over-Speed Detection System

This system has four modules: road curvature identification, curve speed restriction
declaration, curve aware travel time computation, and vehicle overspeed detection.
They are described in detail below.

2.2.1 New Curve Detection Algorithm

A path from the source to the destination is constructed. The path, as shown
in Fig. 3a, is made up of a series of segment points (S1 to S9), with consecutive
points connected by straight lines. (Note: In India, all vehicles drive on the left
side [26].) The equations for detecting a curve from the sequence of segment points
are shown below. Before calculating the radius of curvature, we must first calculate
the great-circle distance between two points using the ‘Haversine’ formula.

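$$ \mathrm{val} = \sin^2(\Delta\phi/2) + \cos\phi_1 \times \cos\phi_2 \times \sin^2(\Delta\lambda/2) \tag{3} $$

$$ \mathrm{res} = 2 \times \tan^{-1}\sqrt{\frac{\mathrm{val}}{1 - \mathrm{val}}} \tag{4} $$

$$ \mathrm{dst} = ER \cdot \mathrm{res} \tag{5} $$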

Fig. 3 a Google map road segment points b Identified curve



where ϕ is latitude, λ is longitude, and ER is the radius of the Earth
(ER = 6,371 km). Let the distance from segment point S1 to S2 be a, from S2 to S3
be b, and from S1 to S3 be c. The radius of the circle through the three points is
then

$$ \mathrm{radius} = \frac{a \times b \times c}{\sqrt{(a + b + c) \times (b + c - a) \times (c + a - b) \times (a + b - c)}} \tag{6} $$
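Equations (3) to (6) can be combined into a small sketch. The function names, the use of metres, and the choice to return infinity for collinear points are assumptions made for illustration; `math.atan2` is used in place of the plain inverse tangent of Eq. (4) because it gives the same value while avoiding a division by zero.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # ER = 6,371 km, expressed in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    val = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    res = 2 * math.atan2(math.sqrt(val), math.sqrt(max(0.0, 1 - val)))
    return EARTH_RADIUS_M * res


def curve_radius_m(s1, s2, s3):
    """Radius of the circle through three (lat, lon) segment points, Eq. (6)."""
    a = haversine_m(*s1, *s2)
    b = haversine_m(*s2, *s3)
    c = haversine_m(*s1, *s3)
    prod = (a + b + c) * (b + c - a) * (c + a - b) * (a + b - c)
    if prod <= 0:  # collinear points: treat as a straight line
        return math.inf
    return (a * b * c) / math.sqrt(prod)


if __name__ == "__main__":
    # Three points forming a right-angle bend with legs of roughly 111 m.
    print(curve_radius_m((0.0, 0.0), (0.0, 0.001), (0.001, 0.001)))
```

For a right-angle bend the circumscribed circle has a radius of half the hypotenuse, so the printed value should be close to 79 m.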

According to the Indian Roads Congress [27, 28], a vehicle can travel at a speed
of 70 to 80 km/h on a curve of 1000 m radius on Indian highways [26]. We therefore
assume that the maximum radius of an Indian road curve is 1000 m. Applying Eq. (6)
to segment points S1, S2, and S3 in Fig. 3a gives a radius of more than 1000 m,
because the three points lie almost on a straight line. We therefore move on to the
next three adjacent segment points, S2, S3, and S4. Their radius is less than
1000 m, because they form a curve. The radius value of these three segment points
is recorded in the radius list, and the procedure is repeated for subsequent
segment points until the final set of segment points along the path is reached.
In Fig. 3a, segment points S3–S9 yield six curve radii, R1 to R6, which are saved
in the radius list. The average curve radius (R1 to R6) of the segment points S2 to
S9 is then calculated. The curve detected on the path of Fig. 3a is shown in
Fig. 3b (see Algorithm 1).
The curve list keeps track of the curve's starting segment point S2, ending segment
point S8, mid-segment point S6, and computed average curve radius. The method is
applied to the whole route between the origin and destination, and the curves found
and their attributes (curve starting point, curve ending point, curve mid-point, and
average curve radius) are saved in the curve list. Figure 4 shows a route whose
source location is on a highway and whose destination location is in mountainous
(hilly) terrain. The curves on the highway are always large. A single curve that is
1000 m long, as seen in Fig. 4 (top red solid circle), is a good example. The
mountainous landscape features several hairpin bends. These curves have a radius of
50 to 150 m, and some of them are 500 m long. Because the assumed maximum curve
radius of roads in India is 1000 m, multiple hairpin curves are merged into a single
curve, as seen in Fig. 4 (bottom two red solid circles).

Algorithm 1: Finding curves on a path

Input: GPS segment points, required radius
Output: collection of identified curves and their attributes
1: Function curve_detection(list_segment_points, required_radius_meter)
2:   radius_count = 0
3:   collected_radius = 0
4:   for each three consecutive segment points (S1, S2, S3) in list_segment_points do
5:     a = distance between segment points S1 and S2
6:     b = distance between segment points S2 and S3
7:     c = distance between segment points S1 and S3
8:     det_curve_radius = (a * b * c) /
9:       sqrt((a + b + c) * (b + c - a) * (c + a - b) * (a + b - c))
10:    if det_curve_radius <= required_radius_meter then
11:      curve_length = curve_length + (a + b)
12:      collected_radius = collected_radius + det_curve_radius
13:      radius_count = radius_count + 1
14:    else if radius_count > 0 then
15:      radius = collected_radius / radius_count
16:      append [radius, curve_length] to List_curve and reset the counters
17:  return List_curve
18: end Function
19:
20: for radius in [1000, 500, 250, 125, 75] do
21:   list_segment_points = curve_detection(list_segment_points, radius)

To solve this problem, a recursive curve detection technique is applied. In this
technique, a single detected curve that actually contains several curves is broken
up into smaller curves. The recursive curve detection step (lines 20 and 21 of
Algorithm 1) takes the curves already discovered on the path and recorded in the
curve list as its input. Because one or more hairpin curves may fall within a
500 m radius, the technique uses a sequence of curve radii of 500, 250, 150, and
50 m. As a result, one large single curve is divided into smaller curves, as seen
in Fig. 5a. Figure 5b shows the curve-finding result for the twin tunnel road in
Mumbai.
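A single pass of the curve detection idea can be sketched as below. This is not the authors' implementation: it assumes the segment points have already been projected to planar (x, y) coordinates in metres, and the names `detect_curves` and `circumradius` are hypothetical. Consecutive triples whose radius is at or below the threshold are grouped into one curve, and the mean radius and path length of each group are recorded.

```python
import math


def circumradius(p1, p2, p3):
    """Radius of the circle through three planar points, Eq. (6)."""
    a, b, c = math.dist(p1, p2), math.dist(p2, p3), math.dist(p1, p3)
    prod = (a + b + c) * (b + c - a) * (c + a - b) * (a + b - c)
    return math.inf if prod <= 0 else (a * b * c) / math.sqrt(prod)


def _make_curve(points, start, end, radii):
    """Summarise one detected curve spanning points[start..end]."""
    length = sum(math.dist(points[k], points[k + 1]) for k in range(start, end))
    return {"start": start, "end": end,
            "mean_radius": sum(radii) / len(radii), "length": length}


def detect_curves(points, max_radius_m):
    """Group consecutive point triples with radius <= max_radius_m into curves."""
    curves, start, end, radii = [], None, None, []
    for i in range(len(points) - 2):
        r = circumradius(points[i], points[i + 1], points[i + 2])
        if r <= max_radius_m:
            if start is None:
                start = i
            end = i + 2
            radii.append(r)
        elif start is not None:
            curves.append(_make_curve(points, start, end, radii))
            start, end, radii = None, None, []
    if start is not None:  # a curve that runs to the end of the path
        curves.append(_make_curve(points, start, end, radii))
    return curves


if __name__ == "__main__":
    straight = [(float(x), 0.0) for x in range(0, 600, 100)]
    arc = [(500 + 100 * math.sin(math.radians(t)), 100 - 100 * math.cos(math.radians(t)))
           for t in range(15, 105, 15)]
    print(detect_curves(straight + arc, 150.0))
```

Running the same function again on each detected curve with a smaller threshold would mimic the recursive refinement described above, although the exact radius sequence to use is left to the text.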
Finding the curve direction (bearing) is based on the following calculation.

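$$ \mathrm{Bearing}(\theta) = \tan^{-1}\left(\frac{\sin\Delta\lambda \times \cos\phi_2}{\cos\phi_1 \times \sin\phi_2 - \sin\phi_1 \times \cos\phi_2 \times \cos\Delta\lambda}\right) \tag{7} $$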

where ϕ is latitude, λ is longitude.
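A minimal sketch of Eq. (7) follows. The function name `bearing_deg` is an assumption, and `math.atan2` is used instead of a plain inverse tangent so that the quadrant is resolved and the result can be normalised to the range 0 to 360 degrees.

```python
import math


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlam)
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0


if __name__ == "__main__":
    print(bearing_deg(0.0, 0.0, 0.0, 1.0))   # due east
    print(bearing_deg(0.0, 0.0, 1.0, 0.0))   # due north
```

Comparing the bearings of successive segments indicates whether a curve turns left or right.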


There is sometimes ‘noise’ in the path: the lines that connect the segment points
are not uniform, as shown by the yellow circle in Fig. 6b. This noise can be
eliminated by the proposed curve detection method. The next section describes the
process of creating the curve speed limit database.

Fig. 4 Many types of curve in different terrain

Fig. 5 a Result of proposed curve finding algorithm b Curve in tunnel road (Twin tunnel Mumbai)

2.2.2 Curve Speed Limit Database Creation

The database of curve speed restrictions on Indian roads is based on the Indian
Roads Congress (IRC) publications [27, 28]. Following these IRC publications,
Table 2 relates the design speed limit to the curve radius. The radius range is
established based on the super-elevation of the curve.

Fig. 6 a Noisy curve b Curve noise eliminated

Table 2 Database of curve speed restriction

| S.No | Range of curve radius (m) | Designed speed limit (km/h) |
|---|---|---|
| 1 | 50–100 | 20 |
| 2 | 70–150 | 25 |
| 3 | 100–200 | 30 |
| 4 | 180–320 | 35 |
| 5 | 280–650 | 40 |
| 6 | 470–1100 | 60 |
| 7 | 700–1400 | 80 |
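A lookup over Table 2 might be sketched as follows. The radius ranges in the table overlap, and the chapter does not say how a radius that falls into two ranges is resolved; this sketch assumes the lower (more conservative) speed limit is chosen. The names `CURVE_SPEED_LIMITS` and `curve_speed_limit_kmh` are hypothetical.

```python
# (minimum radius in m, maximum radius in m, design speed limit in km/h), from Table 2
CURVE_SPEED_LIMITS = [
    (50, 100, 20),
    (70, 150, 25),
    (100, 200, 30),
    (180, 320, 35),
    (280, 650, 40),
    (470, 1100, 60),
    (700, 1400, 80),
]


def curve_speed_limit_kmh(radius_m):
    """Return the speed limit for a curve radius, or None if no range matches.

    Assumption: when ranges overlap, the lowest matching limit is returned.
    """
    matches = [speed for low, high, speed in CURVE_SPEED_LIMITS if low <= radius_m <= high]
    return min(matches) if matches else None


if __name__ == "__main__":
    for radius in (60, 90, 500, 1200, 2000):
        print(radius, curve_speed_limit_kmh(radius))
```

Radii outside every range return `None`, which a caller could treat as "use the declared road speed".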

2.2.3 Curve Aware Travel Time Estimation

This module computes the travel time between two toll gates, as described by the
following equations.

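$$ S_{rd} = T_{rd} - C_{rd} \tag{8} $$

$$ S_t = \frac{S_{rd}}{d_s} \tag{9} $$

$$ C_{t1} = \frac{C_{1d}}{C_{1s}}, \quad C_{t2} = \frac{C_{2d}}{C_{2s}}, \quad \cdots, \quad C_{tn} = \frac{C_{nd}}{C_{ns}} $$

$$ C_{at} = S_t + C_{t1} + C_{t2} + \cdots + C_{tn} \tag{10} $$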

where Srd is the straight road distance, obtained by subtracting the total distance
of curved road Crd from the toll road distance Trd. The time taken to travel on the
straight road only, St, is the straight road distance Srd divided by the declared
speed ds. Here, Ct1, Ct2, ..., Ctn are the times to travel over each curve, computed
as the curve distances C1d, C2d, ..., Cnd divided by the corresponding curve speed
restrictions C1s, C2s, ..., Cns. Finally, the curve-aware travel time Cat is obtained
by adding the travel time on the straight road St to every curve travel time Ct1,
Ct2, ..., Ctn.
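Equations (8) to (10) translate directly into a few lines of code. The function name and the units (kilometres, km/h, hours) are assumptions for illustration, and the numbers in the usage example are made up rather than taken from the chapter's results.

```python
def curve_aware_travel_time_h(toll_road_km, declared_speed_kmh, curves):
    """Curve-aware travel time in hours, following Eqs. (8)-(10).

    curves is a list of (curve distance in km, curve speed limit in km/h) pairs.
    """
    curve_km = sum(distance for distance, _ in curves)             # Crd
    straight_km = toll_road_km - curve_km                          # Srd, Eq. (8)
    straight_h = straight_km / declared_speed_kmh                  # St, Eq. (9)
    curve_h = sum(distance / speed for distance, speed in curves)  # Ct1 + ... + Ctn
    return straight_h + curve_h                                    # Cat, Eq. (10)


if __name__ == "__main__":
    # Hypothetical 100 km road at 100 km/h with two curves.
    hours = curve_aware_travel_time_h(100.0, 100.0, [(10.0, 40.0), (5.0, 20.0)])
    print(f"{hours * 60:.0f} minutes")
```

With no curves the function reduces to distance divided by declared speed, so the curve-aware time is never shorter than the declared time when the curve limits are below the declared speed.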

2.2.4 Vehicle Over Speed Detection System

Figure 7 depicts the block diagram of the over-speed detection system. At a toll
gate, every booth has an over-speed detection system. This system is connected to a
camera situated outside the toll booth, which focuses on the vehicle to detect the
vehicle type and extract the license plate information. It is also connected to a
cloud server, which stores vehicle information with timestamps and is in turn
connected to the RTO server.
Beforehand, the system downloads from the cloud server the vehicle information and
entry timestamps recorded at the previous toll gate, along with the RTO information
of each vehicle. When a vehicle enters the toll booth, the system detects the
vehicle type, extracts the license plate, adds the current timestamp, and checks the
vehicle against the downloaded information. If there is no match, the vehicle is
considered new, and its information is added to the cloud server with the current
timestamp. If the vehicle matches the downloaded information, the system checks for
over-speed, which is calculated using the following formulas.

$$ V_{tt} = C_{bt} - P_{bt} \tag{11} $$

$$ O_s = \left(V_{tt} < C_{at}\right) \tag{12} $$

where Vtt is the vehicle travelled time, calculated by subtracting the previous toll
booth timestamp Pbt from the current toll booth timestamp Cbt. The over-speed flag
Os indicates whether the vehicle travelled time is less than the curve-aware travel
time Cat declared for the toll gate (Eq. 10). If over-speed is detected, the vehicle
is entered into the violator database and fined by the field inspector.

Fig. 7 Over speed detection module
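The check in Eqs. (11) and (12) amounts to comparing two durations. The sketch below uses Python's standard `datetime` types; the function name `is_over_speed` and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def is_over_speed(previous_booth_time, current_booth_time, curve_aware_time):
    """Return True if the vehicle arrived sooner than the curve-aware travel time.

    previous_booth_time, current_booth_time: datetime objects (Pbt, Cbt)
    curve_aware_time: timedelta (Cat from Eq. 10)
    """
    travelled = current_booth_time - previous_booth_time  # Vtt, Eq. (11)
    return travelled < curve_aware_time                   # Os, Eq. (12)


if __name__ == "__main__":
    entry = datetime(2023, 1, 1, 10, 0)
    limit = timedelta(minutes=33)
    print(is_over_speed(entry, datetime(2023, 1, 1, 10, 30), limit))
    print(is_over_speed(entry, datetime(2023, 1, 1, 10, 40), limit))
```

A vehicle that covers the stretch in 30 minutes against a 33-minute curve-aware time is flagged, while one that takes 40 minutes is not.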

3 Result

The testbed for the vehicle over-speed detection system was set up at two toll
plazas in Tamil Nadu, India: Pallikonda and Ranipet. The results of this testbed are
described below.

3.1 Vehicle Detection and License Plate Localization and Text Extraction

Using the camera outside the toll booth, which focuses on vehicles as they move
through in line, YOLO detected and classified the type of each vehicle. The YOLO
result is shown in Figs. 8 and 9 at the top left (Vehicle detection) and bottom left
(Detected vehicle type). The detected vehicle image is then sent to the license
plate localization process described in Sect. 2.1.2. The output of this localization
is shown at the bottom right (Number plate cropped) of Figs. 8 and 9.
The cropped license plate image is then fed into the CRNN text recognition
algorithm, which produces the extracted text of the license plate, as shown at the
bottom right (Predicted num) of Figs. 8 and 9. The output of this vehicle detection
and license plate extraction system, namely the vehicle type and the recognized
license plate text, is sent to the vehicle over-speed detection module.

Fig. 8 Car detection and license plate information extraction

Fig. 9 Truck detection and license plate information extraction

3.2 New Curve Detection Algorithm

The detected curves are assessed using the Type 1 error, Type 2 error, and Type 2
error ratio (TIIR) metrics [22]. A Type 2 error occurs when the detected curve
extends beyond the ground truth curve by an additional segment. A Type 1 error
occurs when a curve is either not detected at all or the detected curve misses 25,
50, or 75% of the ground truth curve.
Figure 10a, b, and c show a 25% Type 1 curve identification error, a 50% Type 1
curve identification error, and a Type 2 error, respectively. Type 1 errors are the
risky ones.
The Type 2 error ratio is given by TIIR = m/n, where m is the number of Type 2
errors and n is the number of ground truth curves.
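As a quick illustration of the ratio, the sketch below computes TIIR for given counts. The function name is hypothetical, and the example counts are used only to show the arithmetic.

```python
def type2_error_ratio(type2_errors, ground_truth_curves):
    """TIIR = m / n, with m Type 2 errors and n ground truth curves."""
    if ground_truth_curves <= 0:
        raise ValueError("the number of ground truth curves must be positive")
    return type2_errors / ground_truth_curves


if __name__ == "__main__":
    # 9 Type 2 errors over 48 ground truth curves
    print(round(type2_error_ratio(9, 48), 2))
```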
Table 3 displays the Type 1 and Type 2 errors, TIIR, actual and predicted curve
numbers, the overall distance between source and destination, performance delay,
types of curves predicted, and noise-corrected curve numbers for locations in India
(rows 1 to 5), France (row 6), and the United States (row 7). The road segment data
from Google Maps has the same form all over the world, so the proposed method can
extract curves from road segments anywhere in the world. One minor distinction is
that vehicles in India travel on the left side of the road, whereas vehicles in
France and the United States drive on the right side. As a result, the starting and
ending points of a curve depend on the direction of travel and vary from country to
country. For each of the seven rows of Table 3, the source-to-destination Google
Maps road segment data was collected and the starting and ending point of each curve
was manually identified; this ground truth data was compared with the curves
recognized by the proposed model. In India, a highway road (row 1), a hilly terrain
road (row 2), a university road (row 3), and a tunnel road (row 5) were tested. Row
4 of Table 3 shows the curves observed on a route that starts on a highway and ends
in hilly terrain. The proposed method can also extract curves from a tunnel road;
Fig. 5b shows the result for row 5, the twin tunnel in Mumbai city.

Fig. 10 a 25% type 1 curve identification error b 50% type 1 curve identification error c type 2
error
Noise is the result of a GPS segment being drawn incorrectly over an existing
segment. This form of noise can be corrected using the proposed approach, as shown
in Fig. 6b. However, such noise frequently misrepresents a straight road as a
curving road. Because a hilly terrain road includes more noise segments than a road
segment in the plains, this kind of error is classified as a Type 2 error. It is not
dangerous, because the driver is merely alerted to a curve that does not exist. By
contrast, failing to detect a curve with a radius of less than 60 m would be
dangerous, because this type of road includes sharp or blind bends and is prone to
accidents. This type of curve was successfully identified using the proposed method.
The final column (column 11) in Table 3 displays the number of curves predicted at
the specified location by the existing method [22]. The existing method cannot
remove curve noise, and therefore it declares the noise to be a curve.

3.3 Curve Aware Travel Time Estimation

Table 4 details the study of the proposed curve-aware travel time estimation.
Column 2 shows the latitude and longitude of the source and destination toll plazas,
column 3 gives the toll plaza names and the type of highway, column 4 lists the
distance between the two toll plazas, column 5 lists the number of curves on the
highway as found by the method of Sect. 2.2.1, column 6 lists the declared speed for
cars and trucks, column 7 displays the declared reaching times for cars and trucks
between the toll plazas, and column 8 displays the curve-aware reaching times
obtained with the method of Sect. 2.2.3.

3.4 Vehicle Over Speed Detection System

Ten booth systems and one server system are used at each toll gate in the vehicle
over-speed detection system architecture. As shown in Fig. 11a, the GUI of each toll
booth system is connected to a camera to automatically gather the vehicle's type and
license plate number. This information can then be updated or corrected by a toll
booth attendant. When the attendant clicks the check button, these details are sent
to the local server system, which uses its local database to check the vehicle
number and type. If this is the vehicle's first entry, the server records the
vehicle details and a timestamp (date and time) in the local database and pushes the
data to the next toll gate via a cloud server. Details of a vehicle that has just
entered the Pallikonda toll gate are shown in Fig. 11b. If the vehicle has
previously registered at a toll gate, its information is already stored locally, so the local server system calculates
Table 3 Curve detection research and analysis

| S.no | Source and destination | Total distance | No. of actual curves | No. of predicted curves | Type 1 error | Type 2 error | No. of noise on path corrected | Type of detected curve | TIIR | Predicted no. of curves by ref. [22] |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 12.932459, 79.138573 & 13.047688, 80.081534 | 113 km | 18 | 19 | 2 (1–25%, 1–50%) | 1 | 2 | Simple curve = 17, compound = 2, reverse = 0, sharp = 0 | 0.05 | 20 |
| 2 | 12.600237, 78.596748 & 12.593029, 78.631709 | 13 km | 48 | 48 | 1–25% | 9 | 33 | Simple curve = 33, compound = 0, reverse = 2, sharp = 13 | 0.19 | 93 |
| 3 | 12.968459, 79.155885 & 12.971857, 79.163570 | 1.5 km | 6 | 6 | 0 | 1 | 8 | Simple curve = 2, compound = 0, reverse = 1, sharp = 3 | 0.17 | 15 |
| 4 | 12.931632, 79.135380 & 12.593029, 78.631709 | 91.5 km | 64 | 64 | 1–25% | 10 | 33 | Simple curve = 44, compound = 0, reverse = 2, sharp = 18 | 0.15 | 98 |
| 5 | 19.059460, 72.913796 & 18.949226, 72.840700 | 10.7 km | 5 | 5 | 0 | 1 | 0 | Simple curve = 5, compound = 0, reverse = 2, sharp = 18 | 0.05 | 5 |
| 6 | 47.082396, 3.929467 & 47.136291, 4.016592 | 12.5 km | 40 | 45 | 4 (2–25%, 2–50%) | 6 | 7 | Simple curve = 28, compound = 5, reverse = 6, sharp = 6 | 0.13 | 52 |
| 7 | 37.202597, −87.010464 & 37.312765, 86.614955 | 46.9 km | 24 | 25 | 3 (1–25%, 2–50%) | 3 | 2 | Simple curve = 21, compound = 0, reverse = 2, sharp = 2 | 0.12 | 30 |

Table 4 The study of proposed curve aware travel time estimation

| S.No | Source to destination | Location information | Distance | No. of curves | Declared speed (km/h) | Declared reaching time (in minutes) | Curve-aware reaching time |
|---|---|---|---|---|---|---|---|
| 1 | 12.544464, 78.201390 to 12.006492, 78.080849 | Highway (Krishnakiri to Thopur toll gate) | 66.1 km | 6 | 100 (car), 80 (truck) | 39 (car), 49 (truck) | 42 (car), 54 (truck) |
| 2 | 12.006198, 78.080653 to 11.720234, 78.073370 | Mountain pass (Thopur to Omalur toll gate) | 37.7 km | 9 | 80 (car), 60 (truck) | 28 (car), 37 (truck) | 30 (car), 40 (truck) |
| 3 | 13.647111, 79.401600 to 13.672812, 79.351193 | Ghat road (Alibri to GNC toll gate) | 16.6 km | 110 | 40 (car), 40 (truck) | 28 (car), 28 (truck) | 35 (car), 43 (truck) |
| 4 | 13.672823, 79.351400 to 13.647667, 79.405564 | Ghat road (GNC to Alibri toll gate) | 17.1 km | 107 | 40 (car), 40 (truck) | 40 (car), 40 (truck) | 42 (car), 50 (truck) |
| 5 | 12.910950, 79.400920 to 12.905704, 78.951853 | Highway (Ranipet to Pallikonda toll gate) | 52.1 km | 12 | 100 (car), 80 (truck) | 31 (car), 39 (truck) | 33 (car), 42 (truck) |
| 6 | 12.905825, 78.951824 to 12.911158, 79.401029 | Highway (Pallikonda to Ranipet toll gate) | 51.8 km | 13 | 100 (car), 80 (truck) | 30 (car), 38 (truck) | 32 (car), 40 (truck) |

the average speed of the vehicle between the two toll gates, and if it exceeds the
permitted speed, a fine is assessed, as shown in Fig. 11c. If the vehicle is
travelling at normal speed, its entry is simply recorded with a timestamp on the
local server, as shown in Fig. 11d. The Google cloud platform, which is used to
store the vehicle data of each toll gate, is depicted in Fig. 12.
Figure 13a demonstrates how to search the Log cloud database. The log details can
be searched by date, time, vehicle number, or toll gate ID. Figure 13b shows how to
search the Violator cloud database. The violator details can be searched by vehicle
identification, date, time, or toll gate ID.

Fig. 11 a License plate no. and vehicle type entry b Vehicle entered to toll plaza first time c Vehicle
over-speed detected d Vehicle passed between toll plazas in normal speed

3.5 Discussion

In this test-bed, a total of 3552 vehicles passed through all toll booths of both
toll plazas during the two-hour test period. Figure 14 displays a bar graph of
vehicle passes broken down by booth. During the two hours of testing, two vehicles
received fines for exceeding the government-mandated speed limits of 100 km/h for
cars and 80 km/h for trucks. Many more vehicles would receive fines if the speed
limit were set using the curve-aware travel time estimation. According to Fig. 15,
13 to 14 vehicles would have been fined if the speed limit for cars were 90 km/h.
The Pallikonda and Ranipet toll plazas were used for two hours of testing during
the test-bed. The test site was overseen by the Vellore branch of the RTO, Tamil
Nadu government. Figure 16a depicts the presence of the RTO and inspector at the
Pallikonda toll plaza during the test-bed period. The experts testing the vehicle
over-speed detection system at the Ranipet toll plaza are depicted in Fig. 16b.

Fig. 12 Google cloud platform

Fig. 13 a RTO log search b Violator log search
Fig. 14 Analysis report: booth wise vehicle pass (bar chart; series: Pallikonda and Ranipet; booths b1 to b10)

Fig. 15 Suggestion to reduce vehicle speed (bar chart; series: rani -> pallik and pallik -> rani; categories: c/v/j 90, c/v/j 95, b/t 70, b/t 75)

Fig. 16 Field test at a Pallikonda and b Ranipet toll plazas



4 Conclusion

Highway traffic moving at excessive speed needs to be controlled. The proposed
vehicle over-speed detection system can be used to determine whether a vehicle
travelling between two toll plazas was travelling at an excessive speed. To this
end, a new curve-finding algorithm is proposed to determine the travel time of the
vehicle precisely, and this curve-aware travel time is used in the proposed vehicle
over-speed detection system. The Pallikonda and Ranipet toll plazas participated in
the real-time test-bed for a two-hour testing period under the direction of the RTO,
Tamil Nadu government. Two vehicles were found speeding and were fined. The system
is currently being tested at two plazas; in the future, it could be expanded to all
toll plazas. In the future, the camera-based license plate extraction module will
also be replaced by an RFID tag-based vehicle information extraction module; such
tags are currently used in every vehicle in India under the brand name FASTag.

References

1. Nayak, R. P., Sethi, S., & Bhoi, S. K. (2018). PHVA: A position based high speed vehicle detection algorithm for detecting high speed vehicles using vehicular cloud. In 2018 International Conference on Information Technology (ICIT). https://doi.org/10.1109/icit.2018.00054
2. Krishnakumar, B., Kousalya, K., Mohana, R., Vellingiriraj, E., Maniprasanth, K., & Krishnakumar, E. (2022). Detection of vehicle speeding violation using video processing techniques. In 2022 International Conference on Computer Communication and Informatics (ICCCI). https://doi.org/10.1109/iccci54379.2022.9740909
3. Zou, F., Ren, Q., Tian, J., Guo, F., Huang, S., Liao, L., & Wu, J. (2022). Expressway speed prediction based on electronic toll collection data. Electronics, 11(10), 1613. https://doi.org/10.3390/electronics11101613
4. Shen, J., Zhou, W., Liu, N., Sun, H., Li, D., & Zhang, Y. (2022). An anchor-free lightweight deep convolutional network for vehicle detection in aerial images. IEEE Transactions on Intelligent Transportation Systems.
5. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain tumor classification. Applied Sciences, 10(14), 4915.
6. Samui, P., Roy, S. S., & Balas, V. E. (Eds.). (2017). Handbook of neural computation. Academic Press.
7. Biswas, R., Vasan, A., & Roy, S. S. (2019). Dilated deep neural network for segmentation of retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions of Electrical Engineering, 1–14.
8. Rajput, S. K., Patni, J. C., Alshamrani, S. S., Chaudhari, V., Dumka, A., Singh, R., Rashid, M., Gehlot, A., & AlGhamdi, A. S. (2022). Automatic vehicle identification and classification model using the YOLOv3 algorithm for a toll management system. Sustainability, 14(15), 9163. https://doi.org/10.3390/su14159163
9. Wang, W., Yang, J., Chen, M., & Wang, P. (2019). A light CNN for end-to-end car license plates detection and recognition. IEEE Access, 7, 173875–173883. https://doi.org/10.1109/access.2019.2956357
10. Huang, Q., Cai, Z., & Lan, T. (2021). A new approach for character recognition of multi-style vehicle license plates. IEEE Transactions on Multimedia, 23, 3768–3777. https://doi.org/10.1109/tmm.2020.3031074

11. Seo, T., & Kang, D. (2022). A robust layout-independent license plate detection and recognition model based on attention method. IEEE Access, 10, 57427–57436. https://doi.org/10.1109/access.2022.3178192
12. Henry, C., Ahn, S. Y., & Lee, S. (2020). Multinational license plate recognition using generalized character sequence detection. IEEE Access, 8, 35185–35199. https://doi.org/10.1109/access.2020.2974973
13. Park, S., Yu, S., Kim, J., & Yoon, H. (2022). An all-in-one vehicle type and license plate recognition system using YOLOv4. Sensors, 22(3), 921. https://doi.org/10.3390/s22030921
14. Alam, N., Ahsan, M., Based, M. A., & Haider, J. (2021). Intelligent system for vehicles number plate detection and recognition using convolutional neural networks. Technologies, 9(1), 9. https://doi.org/10.3390/technologies9010009
15. Alghyaline, S. (2022). Real-time Jordanian license plate recognition using deep learning. Journal of King Saud University-Computer and Information Sciences, 34(6), 2601–2609. https://doi.org/10.1016/j.jksuci.2020.09.018
16. Raghunandan, K. S., Shivakumara, P., Jalab, H. A., Ibrahim, R. W., Kumar, G. H., Pal, U., & Lu, T. (2018). Riesz fractional based model for enhancing license plate detection and recognition. IEEE Transactions on Circuits and Systems for Video Technology, 28(9).
17. Dalarmelina, N. D., Teixeira, M. A., & Meneguette, R. I. (2019). A real-time automatic plate recognition system based on optical character recognition and wireless sensor networks for ITS. Sensors, 20(1), 55. https://doi.org/10.3390/s20010055
18. Singh, P., Patwa, B., Saluja, R., Ramakrishnan, G., & Chaudhuri, P. (2019). StreetOCRCorrect: An interactive framework for OCR corrections in chaotic Indian street videos. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). https://doi.org/10.1109/icdarw.2019.10036
19. Jagtap, J., & Holambe, S. (2018). Multi-style license plate recognition using artificial neural network for Indian vehicles. In 2018 International Conference on Information, Communication, Engineering and Technology (ICICET). https://doi.org/10.1109/icicet.2018.8533707
20. Ravirathinam, P., & Patawari, A. (2019). Automatic license plate recognition for Indian roads using Faster-RCNN. In 2019 11th International Conference on Advanced Computing (ICoAC). https://doi.org/10.1109/icoac48765.2019.246853
21. Khan, S. U., Alam, N., Jan, S. U., & Koo, I. S. (2022). IoT-enabled vehicle speed monitoring system. Electronics, 11(4), 614. https://doi.org/10.3390/electronics11040614
22. Li, Z., Chitturi, M., Bill, A., & Noyce, D. (2012). Automated identification and extraction of horizontal curve information from geographic information system roadway maps. Transportation Research Record: Journal of the Transportation Research Board, 2291, 80–92.
23. Horzyk, A., & Ergun, E. (2020). YOLOv3 precision improvement by the weighted centers of confidence selection. In 2020 International Joint Conference on Neural Networks (IJCNN). https://doi.org/10.1109/ijcnn48605.2020.9206848
24. Jayaraman, S., Esakkirajan, S., & Veerakumar, T. (2015). Digital image processing. Tata McGraw Hill publication, Indian Edition.
25. Shi, B., Bai, X., & Yao, C. (2017). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11), 2298–2304. https://doi.org/10.1109/tpami.2016.2646371
26. Bains, M. S., Bhardwaj, A., Arkatkar, S., & Velmurugan, S. (2013). Effect of speed limit compliance on roadway capacity of Indian expressways. Procedia-Social and Behavioral Sciences, 104, 458–467.
27. IRC: 73. (1980). Geometric design standards for rural (Non-urban) highways. Indian Roads Congress.
28. IRC: 38. (1988). Guidelines for design of horizontal curves for highways and design tables. Indian Roads Congress.
An Intelligent System for Video-Based Proximity Analysis

Sergey Antonov, Mikhail Bogachev, Pavel Leyba, Aleksandr Sinitca, and Dmitrii Kaplun

1 Introduction

Boosted recently by the COVID-19 pandemic, digital technologies have played an
increasingly significant role in the public-health response and in contact tracing
worldwide. Budd et al. [12] provide a comprehensive review of digital innovations
developed in response to COVID-19 worldwide, including legal, ethical and privacy
barriers to their implementation, as well as organizational and workforce
restrictions. The review covers technologies developed in response to five
public-health needs: epidemiological surveillance, rapid case identification,
control of community transmission, communication of essential medical information,
and clinical support [5].
Interrupting community transmission requires rapid tracing and quarantining of
contacts in order to prevent further transmission. Technologies supporting such
activities are largely based on proximity tracing [17], which is usually implemented
using smartphone apps [57, 59] and low-power Bluetooth technologies. Hossain et al.
[18] recently proposed a B5G framework that exploits the high throughput and low
latency of the modern 5G network standard to exchange chest X-ray [20] or CT scan

S. Antonov · D. Kaplun (B)


Department of Automation and Control Processes, Saint Petersburg Electrotechnical University
“LETI”, St. Petersburg 197022, Russia
e-mail: [email protected]
S. Antonov
e-mail: [email protected]
M. Bogachev · P. Leyba · A. Sinitca · D. Kaplun
Centre for Digital Telecommunication Technologies, Saint Petersburg Electrotechnical University
“LETI”, St. Petersburg 197022, Russia
e-mail: [email protected]
A. Sinitca
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 89
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis,
Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_5
90 S. Antonov et al.

images [41] for early instrumental detection of COVID-19, as well as the
development of a mass surveillance system to control and manage social distancing,
mask wearing, and body temperature monitoring. This approach belongs to the family
of AI-based integrated emergency response solutions that have attracted increasing
interest in recent years [40, 42, 44].
Privacy is one of the major concerns in this context, strongly limiting the appli-
cability of various solutions. As a prominent example, Norway has stopped using
the Smittestopp app and switched to the Bluetooth approach [60]. Several inter-
national frameworks with various systematic approaches to privacy preservation
are emerging, including Decentralized Privacy-Preserving Proximity Tracing [58],
the Pan-European Privacy-Preserving Proximity Tracing initiative [61] and the joint
Google–Apple framework [56].
A key limitation of contact-tracing apps such as those mentioned above is that they
require a large proportion of the population to use the app. In practice, their
effectiveness is strongly limited by smartphone ownership, user compliance,
and technical compatibility [12]. An alternative approach, which can be more
effective in a variety of scenarios, is proximity tracing based on video surveillance.
There are only a few works addressing video surveillance in the context of the COVID-19
pandemic. Punn et al. [38] propose a framework that uses the YOLO v3 object
detection model to detect humans and the Deepsort approach to distinguish between
them, tracking the identified persons according to their assigned IDs. The results of
the YOLO v3 model are further compared with other popular convolutional neural
network architectures, such as SSD (Single Shot Detector), R-CNN (Region-Based CNN)
and their modifications. Rezaei et al. [39] use a YOLOv4-based framework and inverse
perspective mapping to improve the accuracy of person identification, and thereby of
social distance tracking, in the presence of disturbance factors such as crowd occlusion,
partial visibility, and lighting variations. They also provide a risk assessment scheme
based on the statistical analysis of personalized movement trajectories and the rate of
social distancing violations.
As in the case of mobile apps for proximity tracing, any solution based on
video surveillance needs to address privacy concerns. In this paper we propose a
framework that builds on the ideas of object detection and trajectory analysis,
adopted from the literature on pedestrian tracking, and also integrates elements
that address privacy issues: a facial recognition system that maps
faces to anonymized IDs, and the construction of an anonymized potential spread
graph, which can be used in scenarios such as contact tracing and epidemiological
surveillance.
Now, more than two years after the onset of the pandemic, public attention is
increasingly shifting towards finding optimal exit strategies, including adaptation of
the technologies that were rapidly deployed earlier in the course of the pandemic,
and finding their place in the post-pandemic society. Here we show explicitly how
the proposed AI-based framework for proximity tracing based on video surveillance in
public places can be used in different scenarios, ranging from individual
contact tracing and epidemiological surveillance of crowds to improved planning of
public spaces.
An Intelligent System for Video-Based Proximity Analysis 91

The existing body of work, e.g., on automatic pedestrian behavior analysis, can be
adapted to this context [52]. These approaches usually employ various models for
object detection. However, the pandemic largely changed our vision of the goals
that have to be achieved in public space planning. There is compelling evidence
that various social distancing measures also reduced the spread of other infectious
diseases such as the common cold or flu. Even outside of the pandemic context, these
account for a loss of around 166 million working days in the U.S. alone, a figure that
nearly doubles when taking into account parents who skip work due to colds caught by
their children. Therefore, adapting the technologies widely used during the
current COVID-19 pandemic to reduce community transmission of other respiratory
diseases, such as the common cold and flu, could be advantageous for at least a partial
reduction of these losses.
The rest of the paper is organized as follows. Section 2 presents an overview of the
proposed framework and the corresponding video data processing pipeline. Section 3
focuses on the proximity networks, which can be used in a variety of scenarios to
address public-health needs. Section 4 describes the evaluation of our approach for
a series of videos captured by street surveillance cameras. Section 5 introduces
statistical quantities that are associated with the risks of community transmission and
discusses how they could be used for future improvements in public space planning
aiming at the reduction of community transmission risks in the post-pandemic society.

2 Overview of the Framework

A schematic overview of the proposed framework is presented in Fig. 1.


The proposed framework contains four main modules that are responsible for
person detection, distance calculation, face recognition and network construction,
respectively. The first three are repeated for each frame, while the last one
combines all information gathered from the entire scene.
Frames are fed into a trained convolutional neural network model “ssdlite
mobilenet v2 coco” for object detection. The output of the model contains the
coordinates of the bounding boxes of all persons detected in the frame. To find the actual
coordinates of each person in a frame relative to other people, the bounding box
coordinates are passed to the OpenCV computer vision library, which performs a bird’s eye
projection using the homography matrix. The next step is to link the coordinates with
individual person IDs by matching them with previous frames, and to update the
corresponding trajectories. To facilitate the linkage process, every time a position does
not appear in close proximity to the trajectory of an already identified person, a facial
recognition algorithm is applied to the cropped image for identification purposes. To
calculate the distances between people, a VP-tree is built and a modified nearest
neighbors algorithm is applied to this tree.
Once individual trajectories have been obtained from the video analysis for each
detected person (identified by a unique ID), the quantities of interest for contact
tracing are the groups of people that appear in close proximity to each person, as well
as the durations for which they remain in proximity.

Fig. 1 Diagram of workflow

3 Construction of Proximity Networks

3.1 People Detection and Accuracy Evaluation

A common approach to the detection of individual persons in video obtained from
fixed surveillance cameras is based on convolutional neural networks. There are
several approaches to neural network training, among them supervised, unsupervised
and reinforcement learning. While supervised learning generally
leads to superior accuracy and performance, it requires large amounts of data at
the learning stage. Datasets used for network training should contain the “ground
truth” information, including segmentation, localization and object classification,
typically summarized in the associated annotation files. Among multiple
variants, convolutional neural networks [27] should be noted as a common solution
for object detection.
The choice of a particular solution and its validation are largely based on accuracy
metrics such as Precision (Prc), Recall (Rec), Intersection Over Union (IoU) and
mean Average Precision (mAP). Figure 2 illustrates the IoU, a measure based on
the Jaccard Index that evaluates the overlap between the reference and the predicted
bounding boxes.
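The IoU described above can be sketched in a few lines. This is a minimal illustration, assuming axis-aligned boxes given as (xmin, ymin, xmax, ymax); the function name and the box layout are choices made for this example, not part of the original pipeline.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap rectangle (zero if the boxes do not overlap)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


reference = (0, 0, 10, 10)   # hypothetical ground-truth box
predicted = (5, 0, 15, 10)   # hypothetical detection
print(iou(reference, predicted))  # overlap 50, union 150, i.e. about 0.333
```

With the threshold of 0.5 used in the text, this hypothetical detection would not count as a match.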
To obtain the accuracy metrics, the IoU is next compared against a fixed threshold
Θ, which equals 0.5 in our example. When IoU < Θ, the decision is made in favor
of hypothesis H0; otherwise, the decision is made in favor of hypothesis H1. The
accuracy of the decision-making procedure is quantified by the true positive
(TP) rate, indicating the rate of decisions in favor of hypothesis H1 under the validity
of hypothesis H1, and by the false positive (FP) rate, indicating the rate of decisions
in favor of hypothesis H1 under the validity of hypothesis H0 (see, e.g., [48] and
references therein). Analogously, the false negative (FN) rate indicates the rate of
decisions in favor of hypothesis H0 under the validity of hypothesis H1. In a numerical
treatment, based on the above rates, one can estimate precision

$$\mathrm{Prc} = \frac{TP}{TP + FP} \tag{1}$$

and recall

$$\mathrm{Rec} = \frac{TP}{TP + FN}. \tag{2}$$

Fig. 2 Intersection Over Union (IoU), a measure based on the Jaccard Index that evaluates the overlap between the reference (indicated by green border) and the predicted (indicated by red border) bounding boxes

Similarly to the approach taken in [16], we also calculate the widely used detection
accuracy measure mAP, obtained as the area under the Prc × Rec curve. By definition,
both precision and recall are bounded between 0 and 1, and thus mAP is also bounded
between 0 and 1. It is common to estimate mAP from interpolated Prc × Rec curves

$$\mathrm{mAP} = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p_{\mathrm{interp}}(r), \tag{3}$$

where $p_{\mathrm{interp}}(r)$ is the interpolation of the Prc × Rec curve at the recall level $r$.
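As an illustration of Eqs. (1)-(3), the following sketch computes precision, recall and the 11-point interpolated average precision for a single class (with one class, such as "person", mAP coincides with this value). It assumes the PASCAL VOC convention in which the interpolated precision at recall level r is the maximum precision observed at any recall not below r; the function names and the toy detection list are hypothetical.

```python
import numpy as np


def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts, as in Eqs. (1) and (2)."""
    prc = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prc, rec


def average_precision_11pt(is_tp, n_ground_truth):
    """11-point interpolated average precision, as in Eq. (3).

    is_tp: booleans for detections sorted by descending confidence,
           True where the detection matched a reference box (IoU >= threshold).
    """
    is_tp = np.asarray(is_tp, dtype=bool)
    tp_cum = np.cumsum(is_tp)
    fp_cum = np.cumsum(~is_tp)
    precision = tp_cum / (tp_cum + fp_cum)
    recall = tp_cum / n_ground_truth
    total = 0.0
    for k in range(11):
        r = k / 10                      # recall levels 0, 0.1, ..., 1
        mask = recall >= r
        # Assumed interpolation rule: best precision at recall >= r, else 0
        total += precision[mask].max() if mask.any() else 0.0
    return total / 11.0


print(precision_recall(tp=8, fp=2, fn=4))
print(average_precision_11pt([True, False, True, True, False], n_ground_truth=4))
```

In the toy example, three recall levels reach precision 1, five reach 0.75 and the remaining three are never attained, so the result is 6.75 / 11.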

3.2 Finding Coordinates of Each Person

Next, before proceeding to walking trajectories, one has to transform from the
homogeneous (also known as projective) coordinates to the world coordinates
(corresponding to the bird’s eye view) by means of projective geometry techniques [31].
For simplification purposes, each detected object represented by a bounding box is
associated with its pivot point, resulting in a simplified transformation expressed by
3 × 3 matrices

$$\begin{bmatrix} x_i' \\ y_i' \\ w_i \end{bmatrix} = w \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \tag{4}$$

where $x$, $y$ are the coordinates of the plane.

In order to transform from homogeneous coordinates to world coordinates, one
has to divide the resulting coordinates by $w_i$. Accordingly, the procedure of finding
the location of each person in world coordinates can be expressed as

$$\begin{bmatrix} w_i x_i' \\ w_i y_i' \\ w_i \end{bmatrix} = H \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}, \tag{5}$$

where $H$ is the projection matrix, sometimes also referred to as the homography matrix,
which can be estimated using a number of approaches [33], such as direct linear
transformation (DLT) and robust estimation (RANSAC). Assuming that the pivot
point of each bounding box is located in the center of its lower edge, it can be
found as

$$x_i = x_{\min} + \frac{x_{\max} - x_{\min}}{2}, \tag{6}$$

$$y_i = y_{\min}, \tag{7}$$

where $(x_{\min}, x_{\max}, y_{\min}, y_{\max})$ are the bounding box coordinates. Thus, for a given
homography matrix, the transformation to the world coordinates can be expressed as

$$\begin{bmatrix} w_i x_i' \\ w_i y_i' \\ w_i \end{bmatrix} = H \begin{bmatrix} x_{\min} + \frac{x_{\max} - x_{\min}}{2} \\ y_{\min} \\ 1 \end{bmatrix}. \tag{8}$$
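The projection in Eqs. (5)-(8) amounts to one matrix-vector product followed by a division by the third component. The sketch below uses plain NumPy rather than the OpenCV routines mentioned in the text; the function names and the example homography (a pure scaling of 0.01 m per pixel) are hypothetical and chosen only to make the arithmetic easy to follow. In practice, H would come from a calibration procedure.

```python
import numpy as np


def pivot_point(box):
    """Pivot of a bounding box (xmin, xmax, ymin, ymax), following Eqs. (6)-(7)."""
    xmin, xmax, ymin, ymax = box
    return xmin + (xmax - xmin) / 2.0, ymin


def to_world(H, box):
    """Project the pivot point of a box to bird's eye (world) coordinates, Eq. (8)."""
    x, y = pivot_point(box)
    wx, wy, w = H @ np.array([x, y, 1.0])   # homogeneous result (w*x', w*y', w)
    return wx / w, wy / w                   # divide by w to leave homogeneous coordinates


# Hypothetical homography: a pure scaling of 0.01 m per pixel, for illustration only
H = np.array([[0.01, 0.0, 0.0],
              [0.0, 0.01, 0.0],
              [0.0, 0.0, 1.0]])
print(to_world(H, (100, 200, 300, 500)))  # pivot (150, 300) maps to about (1.5, 3.0)
```

A real homography also has non-zero entries in its last row, which is why the division by the third component cannot be skipped in general.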

3.3 Extracting Walking Trajectories

Constructing walking trajectories from locations obtained in individual video frames
requires linking the location points that correspond to the same
person observed in consecutive frames. The first step is commonly a search
for the nearest neighbor points. It is usually performed using one of several
algorithms, such as linear (full) search, search in kd-trees [35], search in BSP-trees [28],
LS-hashing [36], the method with keywords [50] and search in VP-trees [54]. As linear
search is computationally inefficient due to its O(n) complexity per query, alternative
algorithms are in focus. The LS-hashing algorithm is based on finding a simple hash
function that can be used instead of a direct comparison of point coordinates, resulting
in superior efficiency once such a function is known, although finding it
is not straightforward in many real-world scenarios. The idea
of the keyword algorithm is to store a list of objects with rarely observed
coordinates, which also limits its applicability. Therefore, in our case the remaining options,
namely the kd-tree, BSP-tree and VP-tree search algorithms, are of greater interest. In
the following, we focus on the VP-tree search, since it searches for other points
in a circular vicinity around the current pivot point, which is relevant to the contact
proximity analysis problem.

3.3.1 VP-Tree Construction

Like the majority of tree construction algorithms, building a hierarchical VP-tree is a
recursive procedure. In the first iteration of the algorithm, a vantage point is selected
and the average distance from this point to all other points is calculated. The input set
of points is divided into two subsets: a point is assigned to the inner (left) subtree
if its distance to the vantage point is less than the average,
and to the outer (right) subtree otherwise. The same operation is
repeated for each subtree. Thus, each node in the tree stores a vantage point and a radius
that separates the points of its inner and outer subtrees. The complexity of the tree
construction algorithm is O(n log n).
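The recursive construction can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the class and function names are hypothetical, the vantage point is picked at random (the text does not specify the selection rule), and the split uses the average distance as described above, whereas many VP-tree implementations split at the median to keep the tree balanced.

```python
import math
import random


class VPNode:
    def __init__(self, point, radius, inner, outer):
        self.point = point      # vantage point of this node
        self.radius = radius    # distance separating inner from outer points
        self.inner = inner      # subtree with points closer than radius
        self.outer = outer      # subtree with the remaining points


def build_vp_tree(points, rng=None):
    """Recursively build a VP-tree from a list of (x, y) tuples."""
    rng = rng or random.Random(0)   # fixed seed so the sketch is reproducible
    points = list(points)
    if not points:
        return None
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vantage, 0.0, None, None)
    dists = [math.dist(vantage, p) for p in points]
    radius = sum(dists) / len(dists)            # average distance, as in the text
    inner = [p for p, d in zip(points, dists) if d < radius]
    outer = [p for p, d in zip(points, dists) if d >= radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inner, rng), build_vp_tree(outer, rng))


def tree_points(node):
    """Collect all points stored in a (sub)tree; used here only for checking."""
    if node is None:
        return []
    return [node.point] + tree_points(node.inner) + tree_points(node.outer)


positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (5.0, 1.0), (2.0, 4.0)]
tree = build_vp_tree(positions)
print(tree.point, round(tree.radius, 3), len(tree_points(tree)))
```

Each recursive call removes the vantage point from the set, so the recursion terminates even when several points coincide.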

3.3.2 Finding Nearest Neighbor in VP-Tree

The algorithm for finding the nearest neighbor to the point x is also recursive. At
any given step, one focuses on a tree node that has a vantage point q and a radius r.
Let us assume that point x is located at some distance d from q. If d is below r, the
search recursively descends into the subtree of the node that contains the points closer
to the vantage point than the radius r. Otherwise, it descends into the
subtree of the node containing the points displaced from q further than the given radius
r. Upon reaching the final subtree, we perform a linear search among its points.
When constructing the trajectory of a single walker, x is obtained from
the coordinates of this person in the previous frame, and the desired nearest point gives
the coordinates of the same person in the current frame.
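A compact sketch of this search, used to link a walker's previous position to the closest detection in the current frame, is given below. It is an illustration under stated assumptions: the tree is stored as nested tuples, the first point of each subset serves as the vantage point, and a backtracking check is added (it is not part of the description above) so that the returned neighbor is always the exact nearest one. All names and coordinates are hypothetical.

```python
import math


def build(points):
    """Build a VP-tree as nested tuples (vantage, radius, inner, outer)."""
    if not points:
        return None
    vantage, rest = points[0], points[1:]
    if not rest:
        return (vantage, 0.0, None, None)
    dists = [math.dist(vantage, p) for p in rest]
    radius = sum(dists) / len(dists)            # average distance, as in the text
    inner = [p for p, d in zip(rest, dists) if d < radius]
    outer = [p for p, d in zip(rest, dists) if d >= radius]
    return (vantage, radius, build(inner), build(outer))


def nearest(node, x, best=None):
    """Return (distance, point) for the tree point closest to x."""
    if node is None:
        return best
    q, r, inner, outer = node
    d = math.dist(q, x)
    if best is None or d < best[0]:
        best = (d, q)
    # Descend first into the side of the node that contains x
    near, far = (inner, outer) if d < r else (outer, inner)
    best = nearest(near, x, best)
    # Backtracking check (an addition for exactness): the other side can hold
    # a closer point only if the boundary is nearer than the current best match
    if abs(d - r) < best[0]:
        best = nearest(far, x, best)
    return best


# Hypothetical bird's eye positions (in metres) detected in the current frame
current_frame = [(0.0, 0.0), (2.1, 1.0), (5.0, 5.0), (7.5, 2.0)]
tree = build(current_frame)
previous_position = (2.0, 1.2)      # one walker's position in the previous frame
distance, linked = nearest(tree, previous_position)
print(linked)  # (2.1, 1.0)
```

In a full pipeline one would also reject a link when the returned distance is implausibly large for a single frame step, which is the situation where the text falls back on face recognition.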

3.4 Finding People That Appear in Close Proximity

In the context of contact tracing, the next step is typically finding all points that
appear in close proximity, usually determined by a circular area of a certain radius
around each person, first for a given video frame corresponding to a single point in
time. Since the original VP-tree based search algorithm focuses on finding a single
nearest neighbor only, it should be generalized to search for potentially multiple
nearest neighbors within a circle of a given radius. In this context, there are several
possible situations:
1. The entire search area is included in the internal subtree.

$$d(q, s) + r < T, \tag{9}$$

where $d(q, s)$ is the distance from the center of the node to the search point, $r$ is the
search radius, and $T$ is the node radius, determining the border of the inner subtree. The
world scale of the distances between two bird’s eye view points is determined using
the size of the camera pixel obtained from the calibration procedure, and the distance
between two points is calculated as

$$d(p_i, p_j) = \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}. \tag{11}$$

If condition (9) is met, the search can continue in the internal subtree only.
2. The entire search area is included in the external subtree.

$$d(q, s) - r > T. \tag{12}$$

If this condition is met, the search can continue in the external subtree only.
3. The search area is distributed over both subtrees.
In this case, the search is performed in both subtrees. The complexity of searching for
nearest neighbors is O(log n).
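The three situations translate directly into a recursive range query. The sketch below is a hedged illustration, not the authors' code: the tree layout (nested tuples), the choice of the first point as vantage point and all names are assumptions. The search disc lies entirely in the inner region when d(q, s) + r < T, entirely in the outer region when d(q, s) − r > T, and otherwise both subtrees are visited.

```python
import math


def build(points):
    """Build a VP-tree as nested tuples (vantage, T, inner, outer)."""
    if not points:
        return None
    vantage, rest = points[0], points[1:]
    if not rest:
        return (vantage, 0.0, None, None)
    dists = [math.dist(vantage, p) for p in rest]
    T = sum(dists) / len(dists)                 # node radius: average distance
    inner = [p for p, d in zip(rest, dists) if d < T]
    outer = [p for p, d in zip(rest, dists) if d >= T]
    return (vantage, T, build(inner), build(outer))


def within_radius(node, s, r, found=None):
    """Collect all tree points whose distance to the search point s is at most r."""
    if found is None:
        found = []
    if node is None:
        return found
    q, T, inner, outer = node
    d = math.dist(q, s)
    if d <= r:
        found.append(q)
    if d + r < T:            # case 1: search disc entirely inside the inner region
        within_radius(inner, s, r, found)
    elif d - r > T:          # case 2: search disc entirely inside the outer region
        within_radius(outer, s, r, found)
    else:                    # case 3: the disc straddles the border, visit both
        within_radius(inner, s, r, found)
        within_radius(outer, s, r, found)
    return found


# Hypothetical bird's eye positions in metres; 2 m is the illustrative threshold
people = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0),
          (5.0, 1.0), (2.0, 4.0), (1.0, 1.0), (4.0, 4.0)]
tree = build(people)
print(sorted(within_radius(tree, (1.0, 1.0), 2.0)))
```

The query point itself is returned when it is stored in the tree, so a caller looking for the contacts of a given person would drop that entry from the result.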

3.5 Face Analysis and Recognition

Facial recognition is a long-studied problem that has attracted increasing attention in
recent years, leading to considerable advances in methodology and algorithm
development (see, e.g., [4] for a detailed review). One of the key issues regarding
the widespread use of facial recognition technologies is privacy, and thus
those methodological approaches that are capable of integrating anonymization in
a systematic way are favorable. In this work, detected faces are mapped to
anonymized IDs, which can then be stored in the system, allowing for identities to
be revealed in a controlled way.
Face analysis and recognition is generally a stepwise procedure,
including finding and selecting all faces in the images, their initial preprocessing
and alignment, identification of unique facial features, and their comparison against
a database of known people. The above procedure is typically implemented as a
pipeline, where all steps can be performed independently of each other, and thus
a particular technique can be chosen independently at each step from a
number of available solutions. Since there is a large body of recent work in this field,
we only provide a brief overview of the available solutions and their pros and cons.
Face detection techniques largely rely upon several well-established approaches.
In retrospect, the Viola-Jones method [53], while being one of the first widely
available and computationally efficient solutions, was characterized by relatively
high false detection rates, as well as the requirement of frontal facial images and low
robustness against occlusion. The Histogram of Oriented Gradients (HOG) method
[15] is based on analyzing the gradient of the binarized image, followed by its
segmentation into small segments and finding those where the arrangement of gradients is
close to a known facial image, often denoted as the HOG pattern. The key advantages
of this method are its computational efficiency, reasonable effectiveness
for slightly non-frontal images and moderate robustness against occlusions,
while its major drawback is the requirement of high resolution images, since it fails
with low resolution images due to discreteness effects.
In recent years, Multi-Task Cascaded Convolutional Networks (MTCNN) [55]
became one of the most popular solutions for finding faces in images based on
the DNN (Deep Neural Network) approach. After an initial image rescaling step, the
algorithm applies three consecutive networks: the Proposal Network (or P-Net) looks
for candidate facial regions, the Refine Network (or R-Net) filters the bounding boxes,
and finally the Output Network (or O-Net) focuses on the localization of facial
landmarks (such as eyes and mouth). Another recent and powerful alternative solution
is the MMOD algorithm introduced by Davis E. King and implemented in the Dlib library [23].
Since it appears to be one of the most accurate among the methods discussed above,
while also working well for different face orientations and even under substantial
occlusion, it has been chosen as the instrument used in this work.
However, it is also important to note that deep learning algorithms, while typically
outperforming other approaches in terms of accuracy, require considerably higher
computational resources, which may become a limiting factor for their application in
scenarios with limited resources, large amounts of data, or online analysis
requirements.
Face rescaling and alignment is an intermediate step between face detection and
face recognition. Common solutions are based on finding specific face landmarks
that can be used in the rescaling and alignment procedure as pivot points.
Face recognition techniques are also well developed. Early approaches were
largely based on such algorithms as Eigenface [21], Fisherface [3] and Local Binary
Patterns Histogram (LBPH) [46]. As these algorithms proved to have numerous
drawbacks, here we follow a more recent approach based on Convolutional Neural
Networks (CNN), which remain one of the most effective and reliable solutions to
date. Prominent examples include Google FaceNet [45], based on convolutional layers
learning face representations directly from the image. FaceNet was trained on the
Labeled Faces in the Wild (LFW) [19] dataset to achieve invariance to illumination,
pose, and other variable conditions. Another notable example is OpenFace [2].
In this work, we also used a neural network based solution implemented in the Dlib
library.
Finally, recognized faces should be associated with IDs of particular persons. This
is a typical problem for machine learning classification algorithms. If no matches are
found, a new ID is added to the database. In this work, we used a KNN classifier,
although many alternative classifiers would do the job.
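The association of face descriptors with anonymized IDs can be sketched as a nearest-neighbour lookup with a distance threshold. This is a simplified stand-in for the KNN classifier mentioned above, not the implementation used in this work: the class name, the single reference embedding per ID, the Euclidean metric and the threshold value of 0.6 are all assumptions made for illustration, and the three-dimensional vectors stand in for real face embeddings.

```python
import numpy as np


class AnonymizedGallery:
    """Maps face embeddings to anonymized integer IDs (1-nearest-neighbour rule)."""

    def __init__(self, max_distance=0.6):
        self.max_distance = max_distance    # assumed matching threshold
        self.embeddings = []                # one reference embedding per known ID

    def assign(self, embedding):
        """Return the ID of the closest known face, or register a new ID."""
        embedding = np.asarray(embedding, dtype=float)
        if self.embeddings:
            dists = np.linalg.norm(np.stack(self.embeddings) - embedding, axis=1)
            best = int(np.argmin(dists))
            if dists[best] <= self.max_distance:
                return best                 # match found: reuse the existing ID
        self.embeddings.append(embedding)   # no match: add a new ID to the database
        return len(self.embeddings) - 1


gallery = AnonymizedGallery()
# Toy 3-D vectors standing in for real face embeddings
for vector in ([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.9, 0.1, 0.0]):
    print(gallery.assign(vector))   # prints 0, 1, 0, 1
```

Only the integer IDs leave the gallery, which is the property the anonymized contact graph relies on; how the stored embeddings are protected is outside the scope of this sketch.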

4 Experiments

4.1 Combined Dataset for Neural Network Training

Next, we evaluated the approach using several sample videos recorded by surveillance
cameras in busy outdoor public places. For neural network training, we combined two
datasets that are among the most popular for training object detection algorithms,
PASCAL VOC [16] and COCO [25]. Although they differ in the amount
of annotation, both of them contain sufficient information to extract bounding boxes
around detected people. Figure 3 shows the histogram of the person count in images
for the resulting dataset, indicating that the majority of images contain a single
person, while a significant number of images contain up to twenty different people.

Fig. 3 Distribution of the number of people in images

4.2 Training a Neural Network Model

We used the combined dataset described above to train a convolutional neural
network model “ssdlite mobilenet v2 coco”, which is a lightweight version of SSD
(Single Shot MultiBox Detector) [26] based on the joint architecture of SSDLite
and MobileNetV2 [43], characterized by high object detection accuracy (evaluated
by mAP) and computational performance in various image analysis problems.
Figure 4 shows the loss function obtained during model training, while Fig. 5 shows
the average accuracy of object detection (mAP). Together, they indicate that the chosen
neural network model demonstrates high accuracy of object recognition.

4.3 Frame Processing

When processing a video frame, the system detects people using the previously trained
neural network model “ssdlite mobilenet v2 coco”.
After detecting people with the trained network, their bounding box coordinates
were subjected to the homography matrix based transformation and the nearest neighbor
search algorithm, followed by the face detection and recognition algorithms, as described
above. Figure 6 exemplifies a processed video frame with indicated bounding boxes,
where those appearing within close proximity (for an arbitrary 2 m threshold) are
shown in red, while others are shown in green. Figure 7 shows the corresponding
bird’s eye view for the same frame, using a similar color notation.
Another example is shown in Figs. 8 and 9, respectively.

Fig. 4 Evolution of the loss during model training

4.4 Contact Network Graph

In epidemiological contact tracing, an important quantity that strongly influences
the transmission risk is the duration of contact between each pair of individuals. The
corresponding framework for a given public space can be represented by a weighted
graph, where the nodes correspond to individual persons, while the weights of the
links between them represent contact durations. Figure 10 exemplifies a contact graph
for a representative short scene, where link weights represent contact durations in
seconds.
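A weighted contact graph of this kind can be accumulated frame by frame. The sketch below is a minimal illustration under stated assumptions: each frame is represented by the list of ID pairs found within the proximity radius, the frame rate is known, and the graph is stored as a plain dictionary keyed by sorted ID pairs. The function name and the toy input are hypothetical.

```python
from collections import defaultdict


def contact_graph(frames, fps):
    """Build {(id_a, id_b): contact duration in seconds} from per-frame proximity pairs.

    frames: iterable with one entry per video frame, each entry being an
            iterable of (id_a, id_b) pairs that are in close proximity.
    fps:    frame rate of the video, used to convert frame counts to seconds.
    """
    frame_counts = defaultdict(int)
    for pairs in frames:
        # Sort each pair so (1, 2) and (2, 1) refer to the same link,
        # and use a set so a pair is counted at most once per frame
        for edge in {tuple(sorted(pair)) for pair in pairs}:
            frame_counts[edge] += 1
    return {edge: count / fps for edge, count in frame_counts.items()}


# Hypothetical scene: four frames recorded at 2 frames per second
frames = [[(1, 2)], [(2, 1), (2, 3)], [(1, 2)], []]
print(contact_graph(frames, fps=2))  # {(1, 2): 1.5, (2, 3): 0.5}
```

The nodes of the graph are the anonymized IDs that appear in at least one pair; persons who never come close to anyone would have to be added separately if isolated nodes are needed.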
In order to reduce the risk of infection transmission in public spaces, it is essential
to reduce the duration of contacts. Alternatively, under the assumption that a contact
duration above a certain threshold is associated with an increased risk of infection
transmission, one can focus on reducing the number of links above a certain
threshold weight, i.e., the number of pairs of individuals that appear in close proximity
to each other for durations above a certain threshold value.

Fig. 5 Evolution of the mAP during model training

Fig. 6 Results of processing frame 1

Fig. 7 Bird’s eye view of frame 1

Fig. 8 Results of processing frame 2

Fig. 9 Bird’s eye view of frame 2

Fig. 10 Contact graph

5 Further Interpretation and Outlook Towards Adaptation to the Post-pandemic Society Goals

Now, more than two years after the onset of the pandemic, public attention
is increasingly shifting towards finding optimal exit strategies, including adaptation
of the technologies that were rapidly deployed earlier in the course of the
pandemic, and finding their place in the post-pandemic society. In the following, we
consider how the above solutions could be used in scenarios other than individual
contact tracing or epidemiological surveillance of crowds, for example, to improve
the planning of public spaces.
Planning of public spaces strongly affects the probability of congestion, the formation
of crowds and the organization of queues, which in turn largely determine the total
number of contacts that remain in close proximity above a certain duration. There
are a number of well-known mathematical models widely used to simulate collective
dynamics, from particle movement to walking trajectories. One of the simplest
models for the simulation of walking trajectories is a 2D random walk characterized by
random increments. In real-world settings, purely random increments are an unlikely
scenario due to inevitable interactions between walkers and stationary objects, as
well as between walkers and other walkers, leading to the adjustment of their
trajectories, and thus correlated and self-avoiding walks appear more relevant. For a recent
literature overview of the problem from a multidisciplinary perspective, we refer to
[30, 37, 51], as well as to several relevant special cases, including the presence of
obstacles [47] and compactness constraints [24], capable of representing typical features
of real-world public space settings.
In the following, we consider several short scenes, calculate statistics for the
quantities of interest and compare them against similar results for both uncorrelated
and correlated random walk models obtained by computer simulations.
Figure 11 shows the pairwise contact duration matrices representing the duration
of time each pair of individuals remains in close proximity, for a sample proximity
threshold value. To simplify the comparison between different scenes, as well as
between video analysis based and random walk simulation based results, we define
the proximity threshold as a certain quantile of the distance distribution for all walkers
that can be observed simultaneously within the scene. This kind of normalization is a
common approach to the comparison of datasets at different scales, see, e.g., [7]. In
this particular example, we have chosen TQ = 5, indicating that on average every
5th pairwise distance appears below the threshold.
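One way to read this definition is that the threshold is the 1/TQ quantile of all pairwise distances between walkers present in the same frame, pooled over the scene. The sketch below follows that reading, which is an interpretation of the text rather than a confirmed detail of the authors' procedure; the function names and the toy positions are hypothetical, and every frame is assumed to contain at least one detected person.

```python
import numpy as np


def pairwise_distances(positions):
    """Distances between all distinct pairs of points in one frame."""
    positions = np.asarray(positions, dtype=float)      # shape (n, 2)
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(positions), k=1)          # each pair counted once
    return dist[i, j]


def proximity_threshold(frames, tq):
    """Distance below which, on average, every tq-th pairwise distance falls."""
    pooled = np.concatenate([pairwise_distances(f) for f in frames])
    return float(np.quantile(pooled, 1.0 / tq))


# Hypothetical single-frame scene with five walkers on a line (positions in metres)
frames = [[(0, 0), (1, 0), (3, 0), (6, 0), (10, 0)]]
threshold = proximity_threshold(frames, tq=5)
print(threshold)  # approximately 2.8: two of the ten pairwise distances lie below it
```

Because the threshold is tied to the distance distribution of each scene, sparse and crowded scenes yield comparable fractions of close pairs, which is what makes the duration statistics comparable across scenes.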
For a statistical characterization of the contact graph properties, a straightforward
approach is to consider the distribution of contact durations obtained
for all possible pairs of individuals. Figure 12 shows the statistics for six different
short scenes, including complementary cumulative distribution functions (CCDFs)
indicating the probabilities that the inter-arrival times and the durations that each
individual remains within the scene exceed the function argument, as well as similar
quantities for the pairwise contact durations for all possible pairs of individuals, each
for three different threshold values, corresponding to TQ = 2, 5 and 10, respectively. The
figure shows that the normalized distributions, expressed in units of the average
contact duration obtained separately for each scene and each threshold value, tend
to follow a simple exponential.
This is generally an expected result, which is in good agreement with similar
quantities obtained by computer simulations of random walks characterized by the
same average inter-arrival times and durations (for simplicity, exponential
distributions of arrivals and durations within the scene have been considered). The
theoretical background behind this distribution is rather simple and can be explained
via an event-based concept, considering any pair of individuals following random
trajectories coming into proximity as a random event. In this simplest scenario,
these events constitute Poisson processes with parameters generally depending on
both inter-arrival and duration times, as well as on the average distances between
different walkers and the step sizes performed by a single walker in a given time unit
(e.g., one second). However, since the inter-event distribution for any Poisson process
decays exponentially with only one free parameter, namely the average value, dividing
each distribution by this average value results in a data collapse,
indicated by all curves following the same pattern close to a simple exponential with
unit average. Deviations from this simple theoretical scenario can be attributed
to discreteness and finite size effects. As one can see from the figure, these
deviations are comparable for the observational and for the simulated data, given that
the simulated data contain a similar number of frames, similar average inter-arrival
intervals and durations of individuals remaining within the scene, and thus also similar
total numbers of individual trajectories in the entire scene.

Fig. 11 Examples of pairwise contact duration matrices for six representative short scenes captured from a street video surveillance camera for TQ = 5. Matrix sizes are determined by the total number of individuals captured in each scene, with their total pairwise duration of proximity (in seconds) indicated by color

Fig. 12 Statistics for six different short scenes, including a CCDFs of inter-arrival times and b durations that each individual remains within the scene, as well as c distributions of the contact times for all possible pairs of individuals, each for three different thresholds TQ = 2, 5 and 10. Straight black lines show a simple exponential, while dashed colored curves show similar results for simulated random walks with parameters similar to those in the observational data (blue curves correspond to the absence of correlations, while red curves correspond to the long-range correlated random walks with Hurst exponent H = 1.5)
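The data collapse onto a unit-mean exponential can be checked numerically. The following sketch draws exponentially distributed durations with different averages, normalizes each sample by its own mean and compares the empirical CCDF with exp(−x). It illustrates only the idealized Poisson scenario discussed above, not the random walk simulations used for Fig. 12; the mean values, sample size, seed and function name are arbitrary choices for this example.

```python
import numpy as np


def normalized_ccdf(samples, grid):
    """Empirical P(X / mean(X) > g) for each g in grid."""
    z = np.asarray(samples, dtype=float)
    z = z / z.mean()                      # express durations in units of their average
    return np.array([(z > g).mean() for g in grid])


rng = np.random.default_rng(0)            # fixed seed, for reproducibility only
grid = np.array([0.5, 1.0, 2.0, 3.0])
for mean_duration in (2.0, 10.0, 50.0):   # hypothetical average durations in seconds
    durations = rng.exponential(mean_duration, size=100_000)
    deviation = np.abs(normalized_ccdf(durations, grid) - np.exp(-grid)).max()
    print(mean_duration, deviation)       # deviations stay small for every mean
```

With far fewer samples, as in a short scene, the same comparison shows visibly larger scatter in the tail, which is the finite size effect referred to in the text.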
However, in most real-world scenarios walking trajectories strongly deviate from
the simplest random model. Typical reasons for that are localization of the objects
of attraction (e.g. counters, doors, passages etc.), as well as obstacles (e.g. barriers,
billboards, kiosks etc.) in both indoor and outdoor public spaces, leading to the spatial
clustering of the walking trajectories. In addition, traffic regulations (e.g. revolving
doors, traffic lights at crosswalks etc.) lead to additional temporal clustering of the
walking trajectories. Among various models used to characterize motion from the
statistical physics viewpoint, long-range correlations appear the most relevant in the
context of human dynamics (for a recent and comprehensive review of literature
on the topic, we refer to [22]). To account for both spatial and temporal clustering,
two-dimensional long-range correlated fields seem to be a relevant model.
Recent data including our own results indicate that long-range correlations are
strongly associated with clustering of events, generally leading to heavy-tailed distri-
butions of both inter-event times and event durations, with the latter being crucial
for the contact proximity durations. The impact of long-range temporal correlations
on the event dynamics has been investigated both analytically [32] and numerically
[1, 14], indicating that the interval distributions between consecutive events in
a series broaden from a single exponential for the simplest Poisson process scenario
to a stretched exponential for linear long-term correlations, and finally converge to
a power-law decay for strong long-term correlations, especially in the presence of
nonlinear interactions in the system [6, 7]. Moreover, in recent years similar distribu-
tions of the inter-event times have been observed in a number of real-world complex
systems, ranging from bursty access patterns driven by user interactions in public
computer networks [6, 29, 34, 49] to various natural phenomena, e.g. in geophysics
[8, 13]. Finally, our recent data indicate that spatial long-range correlations lead to
the manifestations of similar laws in biological polymer structures [9–11].
Figure 13 exemplifies similar distributions obtained by computer simulations
for walks with random increments with Hurst exponent H = 0.5 and long-
range spatiotemporally correlated increments with Hurst exponent H = 1.5. The
figure shows explicitly that stronger spatio-temporal correlations lead to broader
contact duration distributions, indicating that a larger fraction of pairs of individuals
remain within the same proximity thresholds for longer times (depicted by a more
pronounced initial decay in the exceedance probability distributions), compared to
the random increments scenario. The figure also shows that, while some general
qualitative conclusions are possible based on these simulations, particular functional
forms of the distributions obtained for finite systems exhibit non-trivial shapes that
are determined by a complex interplay of correlations, discreteness and finite size
effects, and thus are determined not only by their asymptotic behaviors that could be
eventually derived from known theoretical assumptions, but also depend explicitly
on the system size.
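
As a minimal sketch of how contact duration statistics of this kind can be extracted from simulated trajectories, the following Python snippet generates uncorrelated two-dimensional random walks (the H = 0.5 case only) and collects the durations of uninterrupted pairwise contacts below a distance threshold. The function names, step size, box size, walker count and seed are illustrative assumptions rather than the settings used for Fig. 13, and long-range correlated increments would need a dedicated generator that is not shown here.

```python
import math
import random
from itertools import combinations


def simulate_walks(n_walkers, n_steps, step=1.0, box=20.0, seed=0):
    """Uncorrelated 2D random walks (illustrative parameters), clipped to a square box."""
    rng = random.Random(seed)
    walks = []
    for _ in range(n_walkers):
        x, y = rng.uniform(0, box), rng.uniform(0, box)
        path = []
        for _ in range(n_steps):
            # Independent Gaussian increments: no memory between steps.
            x = min(max(x + rng.gauss(0, step), 0.0), box)
            y = min(max(y + rng.gauss(0, step), 0.0), box)
            path.append((x, y))
        walks.append(path)
    return walks


def contact_durations(walks, threshold):
    """Lengths (in time steps) of uninterrupted runs with pairwise distance < threshold."""
    durations = []
    for a, b in combinations(walks, 2):
        run = 0
        for (xa, ya), (xb, yb) in zip(a, b):
            if math.hypot(xa - xb, ya - yb) < threshold:
                run += 1
            elif run:
                durations.append(run)
                run = 0
        if run:  # contact still ongoing at the end of the record
            durations.append(run)
    return durations


def exceedance(durations):
    """Empirical exceedance probability P(T >= t) for each observed duration t."""
    n = len(durations)
    return {t: sum(d >= t for d in durations) / n for t in sorted(set(durations))}


walks = simulate_walks(n_walkers=10, n_steps=500)
for tq in (2, 5, 10):
    d = contact_durations(walks, tq)
    print(tq, len(d), max(d, default=0))
```

Plotting the result of `exceedance` on a logarithmic vertical axis gives the kind of curve discussed above; with these toy settings the statistics are far too small to compare functional forms.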
As a remark, obtaining Fig. 13 required simulated datasets that contained 110
times more time steps and 11 times more individual walkers, altogether resulting
in ~10^3 times more walker positions, and potentially up to ~10^6 times more pairwise distances,

Fig. 13 Pairwise proximity duration distributions obtained by computer simulations for a random
walk with random increments with Hurst exponent H = 0.5 (blue curves) and long-range
spatiotemporally correlated increments with Hurst exponent H = 1.5 (red curves), respectively

compared to the observational video examples used in our study. Since obtaining
comparable statistics for different public places requires a considerable amount of
video analysis and computational effort, a more detailed analysis, including long-term
video analysis and fitting of the best correlated-walker model, aimed at a better
understanding of how public space planning affects both the spatio-temporal
correlation patterns of walking trajectories and the contact proximity distributions,
remains beyond the scope of this study and is left as an outlook for future
research directions.

6 Conclusion and Outlook

To summarize, digital technologies have played a major role in the global response to
COVID-19 since the onset of the pandemic, especially in the context of digital
epidemiological surveillance and contact tracing, and have proved their effectiveness in
real-world settings, being strongly associated with a number of success stories

leading to the rapid suppression of the community transmission and reduction of the
incidence rates.
While AI and machine learning techniques have been widely applied in web-based
epidemic information support tools and online case tracing, they have not yet
been fully explored in the context of proximity tracing and subsequent analysis for
better-informed planning of public spaces aimed at reducing the number of contacts
and contact durations.
In this paper, we have proposed a framework for proximity tracing based on video
surveillance. However, as with the use of mobile apps and Bluetooth, privacy
considerations cannot be emphasized enough for any approach to be of practical use.
This is one of the fundamental ideas in our framework, realized by using anonymized
IDs to identify individuals. Further exploring how privacy can be integrated in the
proposed solution is the most immediate future research direction. Other directions
include training other neural network models and comparing them to find the best
model. Trained models will be evaluated based on the above parameters, such as
mAP with a set IoU threshold of 0.5, the error of the trained model, and the number
of frames per second (FPS) spent on object detection. In addition, we will further
evaluate the approach using large datasets from crowded streets.
Now, more than two years after the onset of the pandemic, public attention
increasingly shifts towards finding optimal exit strategies, including adaptation of
these technologies and finding their place in the post-pandemic society. Looking
forward towards this goal, we also consider how proximity tracing based on
video surveillance in public places could be adapted to facilitate improved planning
of public spaces.

Acknowledgment The work of Sergey Antonov was supported by the Ministry of Science and
Higher Education of the Russian Federation “Goszadanie” No 075-01024-21-02 from 29.09.2021
(Project No. FSEE-2021-0014).

References

1. Altmann, E., & Kantz, H. (2005). Recurrence time analysis, long-term correlations, and extreme
events. Physical Review E, 71(5), 056106.
2. Amos, B., Ludwiczuk, B., & Satyanarayanan, M. (2016). Openface: A general-purpose face
recognition library with mobile applications. Technical report, CMU-CS-16–118, CMU School
of Computer Science.
3. Anggo, M., & Arapu, L. (2018). Face recognition using fisherface method. Journal of Physics:
Conference Series, 1028, 012119. https://doi.org/10.1088/1742-6596/1028/1/012119
4. Balaban, S. (2015). Deep learning and face recognition: the state of the art. In Biometric and
Surveillance Technology for Human and Activity Identification XII (vol. 9457, p. 94570B).
International Society for Optics and Photonics.
5. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of
retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions
of Electrical Engineering, 44(1), 505–518.

6. Bogachev, M., & Bunde, A. (2009). On the occurrence and predictability of overloads in
telecommunication networks. EPL (Europhysics Letters), 86(6), 66002.
7. Bogachev, M., Eichner, J., & Bunde, A. (2007). Effect of nonlinear correlations on the statistics
of return intervals in multifractal data sets. Physical Review Letters, 99(24), 240601.
8. Bogachev, M., Eichner, J., & Bunde, A. (2008). On the occurrence of extreme events in long-term
correlated and multifractal data sets. Pure and Applied Geophysics, 165, 1195–1207.
9. Bogachev, M., Kayumov, A., & Bunde, A. (2014). Universal internucleotide statistics in full
genomes: A footprint of the dna structure and packaging? PLoS ONE, 9(12), e112534.
10. Bogachev, M., Kayumov, A., Markelov, O., & Bunde, A. (2016). Statistical prediction of
protein structural, localization and functional properties by the analysis of its fragment mass
distributions after proteolytic cleavage. Scientific Reports, 6, 22286.
11. Bogachev, M., Markelov, O., Kayumov, A., & Bunde, A. (2017). Superstatistical model of
bacterial DNA architecture. Scientific Reports, 7, 43034.
12. Budd, J., Miller, B. S., Manning, E. M., Lampos, V., Zhuang, M., Edelstein, M., Rees, G.,
Emery, V. C., Stevens, M. M., Keegan, N., et al. (2020). Digital technologies in the public-health
response to covid-19. Nature Medicine, 1–10.
13. Bunde, A., Bogachev, M., & Lennartz, S.: Precipitation and river flow: Long-term memory
and predictability of extreme events. Extreme Events and Natural Hazards: The Complexity
Perspective, 139–152.
14. Bunde, A., Eichner, J., Havlin, S., & Kantelhardt, J. (2004). Return intervals of rare events
in records with long-term persistence. Physica A: Statistical Mechanics and its Applications,
342(1), 308–314.
15. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In 2005
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005)
(vol. 1, pp. 886–893). IEEE. https://doi.org/10.1109/cvpr.2005.177
16. Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A.
(2015). The pascal visual object classes challenge: A retrospective. International Journal of
Computer Vision, 111(1), 98–136.
17. Ferretti, L., Wymant, C., Kendall, M., Zhao, L., Nurtay, A., Abeler-Dörner, L., Parker, M.,
Bonsall, D., & Fraser, C. (2020). Quantifying sars-cov-2 transmission suggests epidemic control
with digital contact tracing. Science, 368(6491).
18. Hossain, M. S., Muhammad, G., & Guizani, N. (2020). Explainable ai and mass surveillance
system-based healthcare framework to combat covid-19 like pandemics. IEEE Network, 34(4),
126–132.
19. Huang, G. B., Ramesh, M., Berg, T., & Learned-Miller, E. (2007). Labeled faces in the wild: A
database for studying face recognition in unconstrained environments. Technical Report 07-49,
University of Massachusetts, Amherst.
20. Jalali, S. M. J., Ahmadian, M., Ahmadian, S., Hedjam, R., Khosravi, A., & Nahavandi, S.
(2022). X-ray image based COVID-19 detection using evolutionary deep learning approach.
Expert Systems with Applications, 201, 116942.
21. Jalled, F. (2017). Face recognition machine vision system using eigenfaces.
22. Karsai, M., Jo, H. H., Kaski, K., et al. (2018). Bursty human dynamics. Springer
23. King, D. E. (2015). Max-margin object detection
24. Lellouche, S., & Souris, M. (2020). Distribution of distances between elements in a compact
set. Stats, 3(1), 1–15.
25. Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., Perona, P., Ramanan,
D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. CoRR
abs/1405.0312
26. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., Berg, A. C. (2016). Ssd:
Single shot multibox detector (pp. 21–37). Lecture Notes in Computer Science.
https://doi.org/10.1007/978-3-319-46448-0_2
27. Li, Z., Yang, W., Peng, S., & Liu, F. (2020). A survey of convolutional neural networks: Analysis,
applications, and prospects

28. Maneewongvatana, S., & Mount, D. M. (2001). An empirical study of a new approach to nearest
neighbor searching. In Algorithm Engineering and Experimentation (pp. 172–187). Springer
Berlin Heidelberg. https://doi.org/10.1007/3-540-44808-x_14
29. Markelov, O., Nguyen, V., & Bogachev, M. (2017). Statistical modeling of the internet traffic
dynamics: To which extent do we need long-term correlations? Physica A: Statistical Mechanics
and its Applications, 485, 48–60.
30. Moltchanov, D. (2012). Distance distributions in random networks. Ad Hoc Networks, 10(6),
1146–1166.
31. Mundy, J. L., Zisserman, A., et al. (1992). Geometric invariance in computer vision (Vol. 92).
MIT press Cambridge.
32. Newell, G., & Rosenblatt, M. (1962). Zero crossing probabilities for gaussian stationary
processes. The Annals of Mathematical Statistics, 33(4), 1306–1313.
33. Nguyen, T., Chen, S.W., Shivakumar, S. S., Taylor, C. J., & Kumar, V. (2017). Unsupervised
deep homography: A fast and robust homography estimation model.
34. Nguyen, V., Markelov, O., Serdyuk, A., Vasenev, A., & Bogachev, M. (2018). Universal rank-
size statistics in network traffic: Modeling collective access patterns by zipf’s law with long-
term correlations. EPL (Europhysics Letters), 123(5), 50001.
35. Panigrahy, R. (2008). An improved algorithm finding nearest neighbor using kd-trees. Lecture
Notes in Computer Science, pp. 387–398. Springer Berlin Heidelberg.
https://doi.org/10.1007/978-3-540-78773-0_34
36. Pan, J., & Manocha, D. (2011). Fast gpu-based locality sensitive hashing for k-nearest neighbor
computation. In Proceedings of the 19th ACM SIGSPATIAL international conference on
advances in geographic information systems, GIS, pp. 211–220. Association for Computing
Machinery, New York, NY, USA. https://doi.org/10.1145/2093973.2094002
37. Pönisch, W., & Zaburdaev, V. (2018). Relative distance between tracers as a measure of
diffusivity within moving aggregates. The European Physical Journal B, 91(2), 1–7.
38. Punn, N. S., Sonbhadra, S. K., & Agarwal, S. (2020). Monitoring covid-19 social distancing
with person detection and tracking via fine-tuned yolo v3 and deepsort techniques.
39. Rezaei, M., & Azarmi, M. (2020). Deepsocial: Social distancing monitoring and infection risk
assessment in covid-19 pandemic. arXiv preprint arXiv:2008.11672
40. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D., Paraschiv, N. & Mohammadi-
Ivatloo, B. (2014). L2 regularized deep convolutional neural networks for fire detection. Journal
of Intelligent & Fuzzy Systems, 1–12.
41. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain
tumor classification. Applied Sciences, 10(14), 4915.
42. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural
network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy
Systems, 1–7.
43. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). Mobilenetv2: Inverted
residuals and linear bottlenecks.
44. Samui, P., Roy, S. S., & Balas, V. E. (Eds.). (2017). Handbook of neural computation. Academic
Press.
45. Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for
face recognition and clustering. In 2015 IEEE conference on computer vision and pattern
recognition (CVPR), pp. 815–823. https://doi.org/10.1109/CVPR.2015.7298682
46. Singh, S., Kaur, A., & Taqdir, A. (2015). A face recognition technique using local binary pattern
method. IJARCCE, 165–168. https://doi.org/10.17148/IJARCCE.2015.4340
47. Skliros, A., & Chirikjian, G. S. (2008). Position and orientation distributions for locally self-
avoiding walks in the presence of obstacles. Polymer, 49(6), 1701–1715.
48. Sokolova, A., Uljanitski, Y., Kayumov, A. R., & Bogachev, M. I. (2021). Improved online event
detection and differentiation by a simple gradient-based nonlinear transformation: Implications
for the biomedical signal and image analysis. Biomedical Signal Processing and Control, 66,
102470.

49. Tamazian, A., Nguyen, V., Markelov, O., & Bogachev, M. (2016). Universal model for collective
access patterns in the internet traffic dynamics: A superstatistical approach. EPL (Europhysics
Letters), 115(1), 10008.
50. Tao, Y., & Sheng, C. (2014). Fast nearest neighbor search with keywords. IEEE Transactions
on Knowledge and Data Engineering, 26, 878–888. https://doi.org/10.1109/TKDE.2013.66
51. Tejedor, V., Schad, M., Bénichou, O., Voituriez, R., & Metzler, R. (2011). Encounter distribution
of two random walkers on a finite one-dimensional interval. Journal of Physics A: Mathematical
and Theoretical, 44(39), 395005.
52. Vannoorenberghe, P., Motamed, C., Blosseville, J. M., & Postaire, J. G. (1997). Automatic
pedestrian recognition using real-time motion analysis. In International conference on image
analysis and processing (pp. 493–500). Springer.
53. Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features.
In Proceedings of the 2001 IEEE computer society conference on computer vision and pattern
recognition (CVPR 2001, vol. 1, pp. I–I). IEEE
54. Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general
metric spaces. In Proceedings of the fourth annual ACM-SIAM symposium on discrete
algorithms, SODA, pp. 311–321. Society for Industrial and Applied Mathematics, USA.
55. Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and alignment using
multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10), 1499–
1503. https://doi.org/10.1109/lsp.2016.2603342
56. Apple and google framework. https://www.apple.com/newsroom/2020/04/apple-and-google-partner-on-covid-19-contact-tracing-technology/
57. Covidsafe app, Australia. https://www.health.gov.au/resources/apps-and-tools/covidsafe-app
58. The dp-3t project. https://github.com/DP-3T/documents
59. Hamagen app, israel. https://govextra.gov.il/ministry-of-health/hamagen-app/download-en/
60. Norway halting smittestop app. https://www.amnesty.org/en/latest/news/2020/06/norway-covid19-contact-tracing-app-privacy-win/
61. Pepp-pt project. https://github.com/pepp-pt/pepp-pt-documentation/blob/master/PEPP-PT-high-level-overview.pdf
Deep Learning-Based Conjunctival
Melanoma Detection Using Ocular
Surface Images

Kanchon Kanti Podder, Mohammad Kaosar Alam, Zakaria Shams Siam,
Khandaker Reajul Islam, Proma Dutta, Adam Mushtak, Amith Khandakar,
Shona Pedersen, and Muhammad E. H. Chowdhury

1 Introduction

The eye is a crucial sensory organ and among the most intricate that humans
possess. It enables us to see objects and to perceive light, depth, and colour.
Conjunctival nevus [1], a relatively common disorder, has several distinct
clinical presentations [2]. Patients presenting with conjunctival lesions are
frequently encountered in routine clinical practice

Supplementary Information The online version contains supplementary material available at
https://doi.org/10.1007/978-981-99-3784-4_6.

K. K. Podder
Department of Biomedical Physics and Technology, University of Dhaka, Dhaka 1000,
Bangladesh
M. K. Alam · Z. S. Siam
Department of Electrical, Electronic and Systems Engineering, Universiti Kebangsaan Malaysia,
43600 Bangi, Malaysia
Z. S. Siam
Department of Electrical and Computer Engineering, Presidency University, Dhaka, Bangladesh
K. R. Islam · A. Khandakar · M. E. H. Chowdhury (B)
Department of Electrical Engineering, Qatar University, 2713 Doha, Qatar
e-mail: [email protected]
P. Dutta
Department of Electrical and Electronic Engineering, Chittagong University of Engineering and
Technology, Chittagong 4349, Bangladesh
A. Mushtak
Clinical Imaging Department, Hamad Medical Corporation, Doha, Qatar
S. Pedersen
Department of Basic Medical Sciences, College of Medicine, Qatar University, 2713 Doha, Qatar

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis,
Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_6

[3]. Conjunctival nevi can exhibit a range of benign or malignant characteristics
[4]. Conjunctival melanoma [1] is an uncommon but potentially fatal malignant growth
of the eye that develops from melanocytes located among the basal cells of the
conjunctival epithelium [5]. This uncommon tumour accounts for around 2% of all
eye tumours, 5% of ocular melanomas [6], and 0.25% of all melanomas [7].
Conjunctival melanoma is associated with mortality rates of at least 30% [8] and
demands costly treatment, while a late diagnosis is linked to a poor prognosis
[3, 5]. It often manifests as a sharply defined pigmented or coloured
conjunctival lesion; however, unusual cases with a variety of morphologies may
delay the diagnosis [9]. The condition may arise either from a nevus or from
acquired melanosis [10]. Reducing the mortality it causes requires prompt diagnosis
and practical means of detection, particularly in the many countries that face an
ageing population and insufficient healthcare resources. In a conventional
clinical examination, an ophthalmologist inspects the ocular surface under a slit
lamp to determine whether a patient has conjunctival melanoma, and a biopsy is
necessary to verify the diagnosis [3]. Such in-clinic investigations have, however,
been considerably disrupted by the COVID-19 outbreak [11].
Ophthalmologists therefore face significant difficulties in promptly identifying
conjunctival melanoma [3].
Medical imaging has already been greatly impacted by deep learning, and this
influence is expected to grow further [12, 13]. According to several experts, deep
learning will be a key factor in the future of medicine and a key instrument for
medical practice and research [14–18]. In medical image analysis, deep learning
methods have already demonstrated impressive, and frequently unprecedented,
performance in a wide range of low- and high-level image processing tasks,
including image classification, detection, segmentation, enhancement, denoising,
reconstruction, and registration [19–26]. Deep learning techniques that use digital
images of pathological lesions are considered useful for improving the detection of
skin malignancies [27, 28]. Although many deep learning studies have
concentrated on skin melanoma [29–32], the use of modern deep learning technology
to identify conjunctival melanoma remains underexplored. Because substantial
datasets with ground truth for conjunctival diseases are lacking, training
conventional deep neural networks to identify conjunctival melanoma is very difficult.
Very recently, deep learning techniques for identifying conjunctival melanoma from
ocular surface images were explored [3]. However, the dataset used in that work was
not well curated, and further research is required to improve the classification
performance. The goal of the current study is to examine contemporary deep learning
techniques for detecting conjunctival melanoma using a sizable, enhanced ocular
surface image dataset. Four classes of images, namely conjunctival melanoma,
melanosis or nevus, normal conjunctiva, and pterygium [33], have been used in the present

study. Considering the research gap in the classification of conjunctival
melanoma, this study makes the following contributions:
• A well-curated dataset for conjunctival melanoma, validated by medical experts,
is proposed.
• An effective and faster augmentation technique is proposed as an alternative to
CycleGAN-based augmentation [3] for enlarging a small conjunctival melanoma dataset.
• A high-performing deep learning model is proposed that can classify the
different eye conditions with high accuracy.
Additionally, we address the interpretability of our findings. This study
aims to verify the hypothesis that conjunctival lesions can be classified, and
conjunctival melanoma detected, from ocular surface images with the help
of deep learning. This investigation may facilitate the prompt identification of
conjunctival lesions.
The remainder of this study is organized as follows. The next sections describe
the materials and methods in detail. Afterwards, the findings are presented and
discussed. Finally, we conclude the study and outline potential future research.

2 Methodology

This study proposes a system in which an image of the eye taken with a smartphone
can be classified as normal or as showing one of several eye-related medical
conditions. The system comprises data collection, data cleaning and validation, CNN
training and evaluation, and visual interpretation. Figure 1 illustrates the step-by-step
workflow of the methodology proposed in this study.

2.1 Data Collection

The focus of this research was on analyzing the anterior segment using a deep
learning system and images of the eye’s surface. The preliminary melanoma dataset
on which our dataset is built was taken from [3]. That dataset comprised four
categories: Normal, Pterygium, Nevus, and Conjunctival melanosis. The medical
experts of our team identified some irrelevant and problematic images in the dataset
of [3]. Because ocular images of subjects with conjunctival anomalies are widely
available online and can be found through keyword searches (for example, “normal
conjunctiva”, “pterygium”, “conjunctival nevus”, “conjunctival melanosis”), we
removed the irrelevant images from the dataset proposed in [3] and added new
images. Expert physicians double-checked the data to make sure it was accurate and
valuable. The details of the original dataset and

Fig. 1 Depiction of methodology adopted in this study

Fig. 2 Dataset details before and after cleaning and validation

the proposed dataset in this study are illustrated in Fig. 2 and a sample representation
of the different classes in the dataset is available in Fig. 3.

2.2 Data Augmentation

We refer to the dataset proposed in this study as the “Four Class Dataset”.
A second dataset was derived from it. Here we

Fig. 3 A sample representation of the proposed dataset displaying images from class “Normal
Conjunctiva”, “Conjunctival Melanoma”, “Nevus”, and “Pterygium”

have a “Binary Class Dataset” where “Normal Conjunctiva” is categorized separately
from “Abnormal Conjunctiva” (which includes pterygium, nevus, and conjunctival
melanosis). Both datasets were divided into a training set, a validation set, and a test
set in proportions of 70%, 10%, and 20%, respectively (Fig. 4).
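
The 70/10/20 split can be sketched as below. This is only an illustration under stated assumptions: the helper `split_indices`, the fixed seed, and the choice to round the validation and test sizes down are ours, and the chapter does not say how fractional sizes were rounded. The two class sizes are the original sample counts listed for those classes in Table 2.

```python
import random


def split_indices(n, val_pct=10, test_pct=20, seed=0):
    """Shuffle indices 0..n-1 and split them into train/validation/test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = n * val_pct // 100    # integer arithmetic, rounds down (assumption)
    n_test = n * test_pct // 100
    val = idx[:n_val]
    test = idx[n_val:n_val + n_test]
    train = idx[n_val + n_test:]  # everything left over goes to training
    return train, val, test


# Splitting each class separately keeps the class proportions in every subset.
class_sizes = {"Conjunctival melanoma": 130, "Pterygium": 70}
for name, n in class_sizes.items():
    train, val, test = split_indices(n)
    print(name, len(train), len(val), len(test))
```

For class sizes that are not multiples of ten, a different rounding rule will shift one image between subsets, so the counts produced here need not match the published ones for every class.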
Due to the small size of the dataset, four augmentation techniques were employed to
enlarge the training set: random rotation, random affine transformation,
padding, and colour correction. Data augmentation is a proven way to counter
the problem of a small dataset, as shown in other publications [34–36].
The specifics of the four augmentation methods explored in this study are
provided in Table 1. Random augmentation was applied in this investigation
either as a single augmentation or as multiple augmentations.
Whether a single augmentation or multiple augmentations would be
Fig. 4 A representation of single and multiple augmentation techniques on an ocular surface image

Table 1 Augmentation techniques and ranges used in the training set of proposed datasets
Augmentation technique   Range
Random rotation          +20 to −20 degrees
Random affine            Degree = 0; Translate range = (0.05, 0.15); Scaling range = (0.9, 0.95)
Padding                  Range = (0, 10); Fill = (black, white); Mode = (‘Constant’, ‘Edge’)
Colour correction        Brightness = (0, 0.2); Contrast = (0, 0.2)

used was determined randomly in the augmentation model. If multiple augmentations
were chosen, the augmentation model then randomly decided which combination of
augmentations to use. Single and multiple augmentation techniques applied
to an image of the ocular surface are shown in Fig. 4.
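
The random selection logic described above can be sketched as follows. The snippet only decides which augmentations to apply and draws a rotation angle from the range in Table 1; it does not transform any image. The function names, the 50/50 chance of choosing a single augmentation, and the seed are illustrative assumptions, since the chapter does not state the probabilities used.

```python
import random

# The four augmentation techniques listed in Table 1.
AUGMENTATIONS = ["random_rotation", "random_affine", "padding", "colour_correction"]


def choose_augmentations(rng):
    """Pick one augmentation, or a random combination of two or more."""
    if rng.random() < 0.5:  # assumed 50/50 choice between single and multiple
        return [rng.choice(AUGMENTATIONS)]
    k = rng.randint(2, len(AUGMENTATIONS))  # size of the combination
    return rng.sample(AUGMENTATIONS, k)     # k distinct techniques


def sample_rotation_angle(rng):
    """Rotation angle drawn uniformly from the -20 to +20 degree range of Table 1."""
    return rng.uniform(-20.0, 20.0)


rng = random.Random(42)
for _ in range(5):
    print(choose_augmentations(rng), round(sample_rotation_angle(rng), 1))
```

In a full pipeline, each selected name would be mapped to the corresponding image transform with the parameter ranges from Table 1.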
In each of the two datasets, the size of the training set for each class was expanded
to three thousand samples by applying these four augmentation techniques. As the
validation and test sets were used for evaluating deep learning models in a real-world
setting, these two sets were left unchanged throughout the process. Table 2 contains
a description of the sizes of the datasets along with the augmentation [37] factors.

Table 2 The detailed description of proposed datasets. The curated dataset is validated by
expert doctors and the training samples are increased by an augmentation factor using different
augmentation techniques
Dataset             Class                  Original data  Validation  Testing  Training  Augmentation  Training set after
                                           samples        set         set      set       factor        augmentation
Binary class        Normal conjunctiva     125            13          25       87        34.48         3000
                    Abnormal conjunctiva   285            28          57       200       15            3000
Four class dataset  Normal conjunctiva     125            13          25       87        34.48         3000
                    Nevus                  85             8           17       60        50            3000
                    Pterygium              70             7           14       49        61.44         3000
                    Conjunctival melanoma  130            13          26       91        32.97         3000

2.3 Convolutional Neural Network (CNN) Based Classification Models

This project utilized state-of-the-art CNNs for classifying ocular surface images of
normal eyes and different eye conditions. Four CNN architectures, ResNet, DenseNet,
GoogLeNet, and EfficientNet, were used in this study with pre-trained weights. We
selected these architectures due to their efficacy in a previous publication [38]. All
four architectures had been trained on the large benchmark dataset “ImageNet” [39],
and the resulting pre-trained weights were reused in this study following the
well-known concept of transfer learning: the CNN models are initialized with the
pre-trained weights and then optimized during training on the ocular surface images.
Details of the trained CNN architectures are given below:

2.3.1 GoogLeNet

GoogLeNet, proposed in [40], is built on the Inception module. Its authors
proposed a wider and deeper Inception-based network, which achieved slightly better
performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014
competition. In the Inception module with dimensionality reduction used in
GoogLeNet, a 1 × 1 convolution is added before every 3 × 3 and 5 × 5 convolution.
The model is 22 layers deep (27 if pooling layers are also counted), with 9 Inception
modules stacked linearly. The last Inception module is connected
to a global average pooling layer. The detailed model architecture with convolutional
layers, pooling, and activations is available in [40].
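
To see why placing a 1 × 1 convolution in front of a larger convolution reduces cost, the weight counts of the two variants can be compared directly. The channel counts below (192 input channels, 16 reduced channels, 32 output channels) are assumed values chosen for illustration, and biases are ignored.

```python
def conv_weights(kernel, in_ch, out_ch):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_ch * out_ch


# Illustrative channel counts (assumed for this example).
in_ch, reduced_ch, out_ch = 192, 16, 32

# 5 x 5 convolution applied directly to all input channels.
direct = conv_weights(5, in_ch, out_ch)

# 1 x 1 reduction to fewer channels, followed by the 5 x 5 convolution.
bottleneck = conv_weights(1, in_ch, reduced_ch) + conv_weights(5, reduced_ch, out_ch)

print(direct)      # 153600
print(bottleneck)  # 15872
```

With these numbers the reduced path needs roughly a tenth of the weights of the direct 5 × 5 convolution.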

2.3.2 ResNet

The ResNet architecture proposed in [41] was designed to counter the
vanishing gradient problem in deeper CNN architectures. In a deep CNN,
the features of the earlier layers start vanishing from the network as it goes
deeper and more complex feature extractors are introduced. As a result, gradients
vanish; the residual connection in the ResNet architecture solves this
problem by implementing a skip connection that passes the features from earlier
layers to deeper layers. In this study, ResNet18, ResNet50 and ResNet152 were used.
The number that follows the name ResNet simply indicates the number of neural
network layers in that variant. So, in this ocular surface image classification
research, ResNet architectures with 18, 50, and 152 layers were utilized for
evaluation and comparison with the other CNN architectures.
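
The effect of a skip connection on gradient flow can be illustrated with a deliberately simplified scalar model. This is a toy sketch, not the ResNet architecture: each "layer" is a single multiplication by an assumed small weight, and the function name is ours.

```python
def chain_gradient(weights, residual):
    """Gradient of the output w.r.t. the input for a chain of scalar linear layers.

    Plain layer:    y = w * x      -> dy/dx = w
    Residual layer: y = x + w * x  -> dy/dx = 1 + w
    """
    grad = 1.0
    for w in weights:
        grad *= (1.0 + w) if residual else w
    return grad


weights = [0.1] * 18  # 18 toy layers with small weights (assumed values)
print(chain_gradient(weights, residual=False))  # shrinks towards zero
print(chain_gradient(weights, residual=True))   # does not vanish
```

Because the identity path contributes a constant 1 to each factor, the product no longer collapses when the individual layer derivatives are small.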

2.3.3 DenseNet

The authors in [42] observed that deeper CNN models are more accurate and efficient
to train when they contain shorter connections between layers close to the input and
layers close to the output. Based on this observation, they proposed DenseNet, which
connects each layer to every other layer in a feed-forward fashion. The
authors found that DenseNet has several benefits, including alleviation of the
vanishing-gradient problem and better feature propagation and reuse.
This type of connectivity achieved benchmark results on the
ImageNet dataset while also significantly reducing the number of parameters. Both
the DenseNet-161 and the DenseNet-201 architectures were utilized in this study;
their depths are 161 and 201 layers, respectively.

2.3.4 EfficientNet

CNNs such as VGGNet, ResNet, MobileNet, and SENet employ a variety of methods to
improve the accuracy of the network. These methods scale up at least one of three
dimensions of the network: width, depth, or input resolution. The authors in [43]
studied these scaling methods and integrated them into EfficientNet by proposing a
scaling mechanism that scales consistently across all of these dimensions.
EfficientNet_B7, a member of the EfficientNet family, achieved 84.3% top-1 accuracy on
ImageNet, and the pre-trained weights of this model were used in our ocular surface
image classification.
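The compound scaling idea can be sketched in a few lines. The base constants below (1.2, 1.1 and 1.15 for depth, width and resolution) are the values reported by Tan and Le [43] for the EfficientNet family; they are not stated in this chapter, and the function names are hypothetical.

```python
from typing import Tuple

# Base multipliers reported in [43]; quoted here as context, not taken from
# this chapter's experiments.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution


def compound_scale(phi: float) -> Tuple[float, float, float]:
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_factor(phi: float) -> float:
    """Approximate growth in FLOPs: depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2


if __name__ == "__main__":
    for phi in (0, 1, 2):
        d, w, r = compound_scale(phi)
        print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
              f"resolution x{r:.2f}, FLOPs x{flops_factor(phi):.2f}")
```

With these constants, increasing the coefficient by one roughly doubles the computational cost, which is the constraint under which the three dimensions are scaled together.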

2.4 Visualization Methods

How a CNN arrives at its decision is always an intriguing question. Visualization tools
developed over the years help to answer it by showing the rationale behind an inference
in a form that a human can follow, which in turn builds confidence in the CNNs’ outputs.
Among the various visualization tools, Grad-CAM [44] was chosen for this investigation
because it has shown promising performance in recent computer vision problems [45].
Gradient-weighted Class Activation Mapping uses the gradients of the class score with
respect to the feature maps of a final convolutional layer to produce a localization map
that highlights the image regions contributing to the decision. An advantage of Grad-CAM
over other visualization techniques is that it is applicable to a wide variety of CNN
architectures, with or without fully connected layers [45]. Because this study
classifies sensitive medical conditions, it was necessary to confirm through
visualization that the CNN model takes the region of interest into consideration, which
ultimately strengthens trust in the decision-making of the models. A discussion of the
Grad-CAM visualizations obtained in this ocular surface image classification is provided
at the end of the results section.

Table 3 Details of the hyper-parameters used to train all CNN models on the “Binary
Class” and “Four Class” datasets

Hyper-parameters              Details
Batch size                    4
Optimizer                     Adam
Loss function                 NLL Loss
Learning rate                 0.0001
Total epoch                   20
Epoch patience                6
Drop factor of learning rate  0.1
Maximum epoch stop            10
Stop criteria                 Loss
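The Grad-CAM computation described in this section can be sketched on precomputed arrays. This is a minimal pure-Python illustration, not the implementation used in the study: it assumes the feature maps and the gradients of the class score have already been extracted from the network, and the 2 × 2 values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def grad_cam(activations: List[Matrix], gradients: List[Matrix]) -> Matrix:
    """Combine K feature maps (K x H x W) into one H x W localization map."""
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial average of that channel's gradient map
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    # ReLU keeps only the regions that increase the class score
    cam = [[max(value, 0.0) for value in row] for row in cam]
    # Scale to [0, 1] so the map can be overlaid on the image as a heat map
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


if __name__ == "__main__":
    # Hypothetical feature maps and gradients for two channels
    acts = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
    grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
    for row in grad_cam(acts, grads):
        print(row)
```

In practice the map is upsampled to the input resolution before being overlaid on the ocular surface image; that step is omitted here.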

2.5 Experimental Setup

Five-fold cross-validation was used in the investigation of the “Binary Class” and
“Four Class” datasets. The PyTorch library and Python 3.7 were used in this study. The
Google Colab Pro platform with a 16 GB Tesla T4 GPU and 120 GB of high RAM was used for
training, validation, and testing. The hyper-parameters used for all investigations in
this study are given in Table 3.
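A five-fold split of the kind used here can be sketched as follows. This is an illustration only: the chapter does not state how the folds were drawn (for example, whether they were stratified by class), so the shuffling, the seed and the function name are assumptions.

```python
import random
from typing import List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs; each sample is tested once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # assumed: random, unstratified
    folds = [indices[i::k] for i in range(k)]     # k nearly equal-sized folds
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits


if __name__ == "__main__":
    for fold, (train_idx, test_idx) in enumerate(k_fold_indices(23), start=1):
        print(f"fold {fold}: {len(train_idx)} training, {len(test_idx)} test images")
```

Each model is then trained and evaluated once per fold, and the fold-wise results are aggregated.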

2.6 Evaluation Metrics

The performance of the CNN models was evaluated using overall accuracy, precision,
sensitivity/recall, F1 score, and specificity. Let α be the number of ocular surface
images predicted as true positive, γ the number predicted as false positive, δ the
number predicted as true negative, and θ the number predicted as false negative. The
specificity, precision, sensitivity/recall, F1 score, and overall accuracy are then
given by Eqs. (1–5).

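\[ \text{Specificity} = \frac{\delta}{\gamma + \delta} \tag{1} \]

\[ \text{Precision} = \frac{\alpha}{\alpha + \gamma} \tag{2} \]

\[ \text{Recall} = \frac{\alpha}{\alpha + \theta} \tag{3} \]

\[ \text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4} \]

\[ \text{Overall Accuracy} = \frac{\alpha + \delta}{\alpha + \gamma + \delta + \theta} \tag{5} \]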
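Eqs. (1–5) translate directly into code. The sketch below maps α, γ, δ and θ to true positive, false positive, true negative and false negative counts, and computes overall accuracy as the share of correct predictions, (α + δ) divided by the total. The counts in the example are hypothetical and are not results from this study.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Eqs. (1)-(5) with tp = alpha, fp = gamma, tn = delta, fn = theta."""
    specificity = tn / (fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "specificity": specificity,
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, for illustration only
    for name, value in classification_metrics(tp=90, fp=5, tn=95, fn=10).items():
        print(f"{name}: {100 * value:.2f}%")
```

For the four-class task these quantities are computed per class in a one-versus-rest fashion and then averaged; the averaging scheme is not specified in the text.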

The confusion matrix and ROC curve are important tools for evaluating the performance
of deep learning models on medical image classification. In this study, the confusion
matrix and ROC curve of each CNN model were examined to identify the best-performing
model by comparison with its counterparts.

3 Results

3.1 Binary Classification

The “Normal Conjunctiva” and “Abnormal Conjunctiva” classes form the binary
classification task, which was carried out using seven CNN models. The learning curves
of these seven CNNs are available in Supplementary Tables 1 to 7. All the learning
curves suggest that the models are well trained and show no sign of overfitting or
underfitting. Figure 5 displays the mean and standard deviation of the accuracies across
the five folds for these seven pre-trained CNN models. EfficientNet_B7 achieved the
highest mean accuracy and the lowest standard deviation in fold-wise accuracy.
GoogLeNet’s accuracy varied more across the five folds than that of EfficientNet_B7,
which indicates that its fold-wise performance was comparatively less consistent.
Table 4 presents the binary classification results of all the employed models, along
with the number of trainable parameters and the inference time of each model.
As can be seen from Table 4, EfficientNet_B7 was the best-performing of the seven CNN
architectures on every evaluation metric, with accuracy, precision, recall, F1 score and
specificity of 99.51%, 99.52%, 99.51%, 99.51% and 99.70%, respectively. EfficientNet_B7
is also the heaviest network in terms of the number of trainable parameters (more than
63 million). However, all the other models achieved very similar performance on the
evaluation metrics reported in Table 4.
Notably, GoogLeNet is the lightest of the networks in terms of trainable parameters
(only ~5.6 million), yet it achieved a classification performance comparable to that of
EfficientNet_B7 on the evaluation criteria. In terms of trainable parameters,
EfficientNet_B7 is about 11 times heavier than GoogLeNet, while its accuracy and
precision are 0.73 percentage points higher. ResNet18 had the shortest inference time
(around 2.16 ms) and also achieved very good classification performance (accuracy of
99.27%). Because the inference time of every network is below 0.04 s, all the employed
networks can be used in real-time applications.

Fig. 5 Representation of mean and standard deviation in the five-fold accuracy of all
models for binary classification

Table 4 Performance metrics of different CNN models in detection of “Normal
Conjunctiva” versus “Abnormal Conjunctiva” with five-fold cross-validation method in a
binary class dataset

Model            Trainable    Inference  Overall       Precision  Recall  F1 score  Specificity
                 parameters   time (s)   accuracy (%)  (%)        (%)     (%)       (%)
ResNet18         11,177,538   0.00216    99.27         99.27      99.27   99.26     98.78
ResNet50         23,512,130   0.00532    99.27         99.27      99.27   99.26     99.22
ResNet152        58,147,906   0.01557    98.78         98.78      98.78   98.78     98.76
GoogLeNet        5,601,954    0.00641    98.78         98.79      98.78   98.78     98.27
DenseNet161      26,476,418   0.01893    98.29         98.30      98.29   98.29     98.22
DenseNet201      18,096,770   0.02592    99.02         99.05      99.02   99.03     99.17
EfficientNet_B7  63,792,082   0.03334    99.51         99.52      99.51   99.51     99.70
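The comparisons drawn from Table 4 can be reproduced with simple arithmetic. The short check below uses only values copied from Table 4; the variable names are chosen for illustration.

```python
# Values copied from Table 4 (binary classification)
params = {"GoogLeNet": 5_601_954, "EfficientNet_B7": 63_792_082}
accuracy = {"GoogLeNet": 98.78, "EfficientNet_B7": 99.51}
precision = {"GoogLeNet": 98.79, "EfficientNet_B7": 99.52}
inference_s = {"ResNet18": 0.00216, "EfficientNet_B7": 0.03334}

# How much heavier EfficientNet_B7 is than GoogLeNet
param_ratio = params["EfficientNet_B7"] / params["GoogLeNet"]
# Differences in percentage points
accuracy_gap = accuracy["EfficientNet_B7"] - accuracy["GoogLeNet"]
precision_gap = precision["EfficientNet_B7"] - precision["GoogLeNet"]
# Inference time of the fastest model in milliseconds
resnet18_ms = inference_s["ResNet18"] * 1000

print(f"parameter ratio: {param_ratio:.2f}x")
print(f"accuracy gap:    {accuracy_gap:.2f} percentage points")
print(f"precision gap:   {precision_gap:.2f} percentage points")
print(f"ResNet18:        {resnet18_ms:.2f} ms per inference")
```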

Fig. 6 (a) ROC curve and (b) confusion matrix for the best-performing EfficientNet_B7
model, which has been trained and tested on binary class data. The confusion matrices
and the ROC curves of the other models can be found in the supplementary materials

The effectiveness of a model in distinguishing critical medical conditions from normal
cases can also be understood from ROC curves, AUC scores and confusion matrices.
Figure 6 shows the ROC curve and confusion matrix of the best-performing network,
EfficientNet_B7, for binary classification. The confusion matrices and ROC curves of the
other models used in binary classification can be found in Supplementary Figures (1–14).
Figure 6a plots the true positive rate (TPR) against the false positive rate (FPR) of
EfficientNet_B7 at different thresholds when classifying “Normal Conjunctiva” versus
“Abnormal Conjunctiva”. The area under the ROC curve (AUC) was close to 1.00, indicating
that EfficientNet_B7 classified the samples accurately across all classification
thresholds. The numbers of true positive, true negative, false positive, and false
negative cases for EfficientNet_B7 are shown in the confusion matrix in Fig. 6b. Only
one of the 285 test instances of the “Abnormal Conjunctiva” class across the five folds
was identified as “Normal Conjunctiva”. Overall, the performance of EfficientNet_B7 was
superior to that of the other CNNs.
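How an ROC curve and its AUC are obtained from a model's scores can be sketched as follows. This is a minimal pure-Python illustration rather than the evaluation code of the study; the scores and labels are hypothetical.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step, from the highest score to the lowest
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical "abnormal" probabilities and ground-truth labels (1 = abnormal)
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    labels = [1, 1, 0, 1, 0, 0]
    print(f"AUC = {auc(roc_points(scores, labels)):.3f}")
```

An AUC of 1.00 corresponds to scores that separate the two classes perfectly at some threshold, which is the situation approached in Fig. 6a.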

3.2 Multi-class Classification

The seven CNN models used in binary classification were also used in four-class
classification. The learning curves of these models are available in Supplementary
Tables 8 to 14 and show the trend of well-fitted models. Figure 7 illustrates the mean
and standard deviation of the accuracies of all models across five-fold
cross-validation on multi-class classification.
Multi-class classification of three ocular conditions and the normal condition from
ocular surface images presents significant challenges. EfficientNet_B7, a recently
developed and robust CNN, had the highest mean accuracy across the five folds (94.43%).
The standard deviation of its accuracy was small (±1.54), indicating steady fold-wise
performance, whereas other CNNs, such as DenseNet161, showed larger standard deviations.
Besides fold-wise accuracy, metrics such as overall accuracy, precision, recall, F1
score, and specificity are also important for understanding the behaviour of a deep
learning model. Table 5 presents the four-class classification results of all the CNN
models, along with the number of trainable parameters and the inference time of each
model.
The multi-class classification results in Table 5 exhibit greater variability than the
binary classification results (Table 4). Table 5 shows that EfficientNet_B7 had the
highest performance on all evaluation metrics, with accuracy, precision,
recall/sensitivity, F1 score and specificity of 94.42%, 94.55%, 94.42%, 94.43%, and
98.20%, respectively. DenseNet201 matched EfficientNet_B7 exactly in accuracy and
recall, but EfficientNet_B7 was slightly better in precision, F1 score, and specificity.
Although DenseNet161 and ResNet50 have more trainable parameters than GoogLeNet, the
lighter network still outperformed them by a small margin. ResNet18 once again had the
shortest inference time (approximately 2.26 ms), with a classification accuracy of
93.45%. In addition, the inference time of every model was less than 0.04 s, making them
suitable for use in real-time settings.

Fig. 7 Graphical representation of mean and standard deviation in the five-fold
accuracy of all the models on multi-class classification

Table 5 The performance metrics of different state-of-the-art CNN models in the
detection of conjunctival melanoma with a five-fold cross-validation method on a
four-class dataset

Model            Trainable    Inference  Overall       Precision  Recall  F1 score  Specificity
                 parameters   time (s)   accuracy (%)  (%)        (%)     (%)       (%)
ResNet18         11,178,564   0.00226    93.45         93.49      93.45   93.46     97.86
ResNet50         23,516,228   0.00539    91.02         91.22      91.02   91.01     96.91
ResNet152        58,152,004   0.01494    92.72         92.92      92.72   92.76     97.66
GoogLeNet        5,604,004    0.00627    91.75         91.76      91.75   91.72     96.98
DenseNet161      26,480,836   0.01789    91.02         91.34      91.02   91.1      96.90
DenseNet201      18,100,612   0.02269    94.42         94.43      94.42   94.42     98.15
EfficientNet_B7  63,797,204   0.03188    94.42         94.55      94.42   94.43     98.20

Fig. 8 (a) ROC curve and (b) confusion matrix for the best-performing EfficientNet_B7
model (the “others” class is labelled as the “abnormal” class) trained and tested on the
multi-class dataset
Figure 8 shows the ROC curve and confusion matrix of the best-performing model,
EfficientNet_B7, for multi-class classification. In Fig. 8a, the area under the ROC
curve is around 0.99, indicating close-to-perfect performance in multi-class
classification. Figure 8b shows the true positive (TP), true negative (TN), false
positive (FP), and false negative (FN) outcomes of the best-performing EfficientNet_B7
model. The true positive rates of EfficientNet_B7 for the normal, pterygium, nevus, and
melanoma classes are 0.98, 0.94, 0.94, and 0.91 (that is, 98%, 94%, 94%, and 91%),
respectively, which indicates the model’s strong capability in distinguishing the
classes. The confusion matrices and ROC curves of the other models can be found in
Supplementary Figures (15–28).

3.3 Comparative Analysis with Existing Literature

The proposed method of using data augmentation with pre-trained CNNs improved model
performance. The comparison between the previous literature [3] and the method proposed
in this study is given in Table 6. In multi-class (four-class) classification, the
proposed method improved accuracy by 13.42 percentage points and AUC by 0.036.
EfficientNet_B7 with image augmentation techniques thus outperformed the CycleGAN-based
image augmentation and MobileNetV2 approach reported in [3]. The proposed method also
outperformed the previous literature in binary classification, by 3.23 percentage points
in accuracy and 0.024 in AUC.

Table 6 Comparative analysis of the performance of the proposed method with counterpart
literature

Datasets      Reference        Technique                                        AUC    Accuracy (%)
Four class    Yoo et al. [3]   CycleGAN-based image augmentation, MobileNetV2   0.954  81
Four class    Proposed method  Dataset cleaning, inclusion of related images,   0.99   94.42
                               image augmentations, EfficientNet_B7
Binary class  Yoo et al. [3]   CycleGAN-based image augmentation, MobileNetV2   0.976  96.5
Binary class  Proposed method  Dataset cleaning, inclusion of related images,   1.00   99.73
                               image augmentations, EfficientNet_B7

3.4 Visualization Using Grad-CAM

Figures 9 and 10 show Grad-CAM-based visual interpretations of the best-performing
models on the “Binary Class” and “Four Class” datasets, respectively. This visual
representation makes the models’ prediction process easier to comprehend. Figure 9
provides a visual interpretation of EfficientNet_B7 and ResNet18, two of the
top-performing models in the “Binary Class” investigation. Both models predicted the
classes from the corresponding region of interest.
Because this study also categorizes three different medical conditions (nevus,
pterygium, and conjunctival melanoma) in addition to normal subjects, visual
interpretability is especially important in the “Four Class” investigation. Figure 10
displays the visual interpretation of the best-performing model, EfficientNet_B7, beside
that of DenseNet201. The visualizations for EfficientNet_B7 indicate that the features
learned from the relevant region of interest during training are key to the model’s
capacity to classify ocular surface images with the highest accuracy.

Fig. 9 Visual interpretation of ResNet18 and EfficientNet_B7 model predictions on the
“Binary Class” dataset

Fig. 10 Visual interpretation of DenseNet201 and EfficientNet_B7 model predictions on
the “Four Class” dataset

4 Conclusion

In conclusion, this study used state-of-the-art CNN models with data curation,
validation, and single and multiple augmentation techniques to classify ocular surface
images in two investigations (“Binary Class” and “Four Class”). EfficientNet_B7 was the
best-performing model, with 99.73% and 94.42% accuracy for the “Binary Class” and “Four
Class” investigations, respectively, using the methodology proposed in this study. The
results for both investigations outperformed previously published literature [3].
Moreover, this model showed a high sensitivity of 99.51% and 94.42% for the “Binary
Class” and “Four Class” investigations, respectively (Tables 4 and 5). The performance
of the best model, EfficientNet_B7, was also examined through Grad-CAM-based visual
interpretation, as this study involves the diagnosis of sensitive medical conditions
from ocular surface images. In future, the proposed model can be deployed on a server so
that it can produce predictions with visual interpretation for clinicians and patients.
Such a server-based deployment could support telemedicine facilities in remote areas and
help people in rural areas to have eye conditions diagnosed easily with visual
interpretation.

Funding This work was made possible by Qatar National Research Fund (QNRF) NPRP12S-
0227–190164 and International Research Collaboration Co-Fund (IRCC) grant: IRCC-2021–001.
The statements made herein are solely the responsibility of the authors.

References

1. Damato, B., & Coupland, S. E. (2008). Conjunctival melanoma and melanosis: a reappraisal
of terminology, classification and staging. Clinical & Experimental Ophthalmology, 36(8),
786–795.
2. Oellers, P., & Karp, C. L. (2012). Management of pigmented conjunctival lesions. The Ocular
Surface, 10(4), 251–263.
3. Yoo, T. K., Choi, J. Y., Kim, H. K., Ryu, I. H., & Kim, J. K. (2021). Adopting low-shot deep
learning for the detection of conjunctival melanoma using ocular surface images. Computer
Methods and Programs in Biomedicine, 205, 106086.
4. Shields, C. L., Fasiudden, A., Mashayekhi, A., & Shields, J. A. (2004). Conjunctival nevi:
clinical features and natural course in 410 consecutive patients. Archives of Ophthalmology,
122(2), 167–175.
5. Wong, J. R., Nanji, A. A., Galor, A., & Karp, C. L. (2014). Management of conjunctival
malignant melanoma: a review and update. Expert Review of Ophthalmology, 9(3), 185–204.
6. Isager, P., Engholm, G., Overgaard, J., & Storm, H. (2002). Uveal and conjunctival malignant
melanoma in Denmark 1943–97: observed and relative survival of patients followed through
2002. Ophthalmic Epidemiology, 13(2), 85–96.
7. Chang, A. E., Karnell, L. H., & Menck, H. R. (1998). The National Cancer Data Base report
on cutaneous and noncutaneous melanoma: A summary of 84,836 cases from the past decade.
Cancer: Interdisciplinary International Journal of the American Cancer Society, 83(8), 1664–
1678.
8. Larsen, A. C., Dahmcke, C. M., Dahl, C., Siersma, V. D., Toft, P. B., Coupland, S. E., et al.
(2015). A retrospective review of conjunctival melanoma presentation, treatment, and outcome
and an investigation of features associated with BRAF mutations. JAMA Ophthalmology,
133(11), 1295–1303.
9. Kao, A., Afshar, A., Bloomer, M., & Damato, B. (2016). Management of primary acquired
melanosis, nevus, and conjunctival melanoma. Cancer Control, 23(2), 117–125.
10. Damato, B., & Coupland, S. E. (2008). Conjunctival melanoma and melanosis: a reappraisal
of terminology, classification and staging. Clinical & Experimental Ophthalmology, 36(8),
786–795.
11. Hallak, J. A., Scanzera, A., Azar, D. T., & Chan, R. P. (2020). Artificial intelligence in ophthal-
mology during COVID-19 and in the post COVID-19 era. Current Opinion in Ophthalmology,
31(5), 447.
12. Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P.,
et al. (2018). Opportunities and obstacles for deep learning in biology and medicine. Journal
of The Royal Society Interface, 15(141), 20170387.
13. Topol, E. J. (2019). High-performance medicine: the convergence of human and artificial
intelligence. Nature Medicine, 25(1), 44–56.
14. DuBois, K. N. (2019). Deep medicine: How artificial intelligence can make healthcare human
again. Perspectives on Science and Christian Faith, 71(3), 199–201.
15. Rahman, T., Akinbi, A., Chowdhury, M. E., Rashid, T. A., Şengür, A., Khandakar, A., et al.
(2022). COV-ECGNET: COVID-19 detection using ECG trace images with deep convolutional
neural network. Health Information Science and Systems, 10(1), 1–16.
16. Rahman, T., Khandakar, A., Islam, K. R., Soliman, M. M., Islam, M. T., Elsayed, A., et al.
(2022). HipXNet: Deep learning approaches to detect aseptic loosening of hip implants using
X-ray images. IEEE Access, 10, 53359–53373.
17. Abir, F. F., Alyafei, K., Chowdhury, M. E., Khandakar, A., Ahmed, R., Hossain, M. M., et al.
(2022). PCovNet: A presymptomatic COVID-19 detection framework using deep learning
model using wearables data. Computers in Biology and Medicine, 147, 105682.
18. Chowdhury, M. H., Shuzan, M. N. I., Chowdhury, M. E., Reaz, M. B. I., Mahmud, S., Al Emadi,
N., et al. (2022). Lightweight end-to-end deep learning solution for estimating the respiration
rate from photoplethysmogram signal. Bioengineering, 9(10), 558.

19. Wang, G., Ye, J. C., Mueller, K., & Fessler, J. A. (2018). Image reconstruction is a new frontier
of machine learning. IEEE Transactions On Medical Imaging, 37(6), 1289–1296.
20. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical
image segmentation. In International Conference on Medical image computing and computer-
assisted intervention (pp. 234–241).
21. Haskins, G., Kruger, U., & Yan, P. (2020). Deep learning in medical image registration: A
survey. Machine Vision and Applications, 31(1), 1–18.
22. Karimi, D., Dou, H., Warfield, S. K., & Gholipour, A. (2020). Deep learning with noisy labels:
Exploring techniques and remedies in medical image analysis. Medical Image Analysis, 65,
101759.
23. Rahman, T., Chowdhury, M. E., Khandakar, A., Mahbub, Z. B., Hossain, M. S. A., Alhatou,
A., et al. (2022). BIO-CXRNET: A robust multimodal stacking machine learning technique
for mortality risk prediction of COVID-19 patients using chest x-ray Images and clinical data.
Neural Computing and Applications.
24. Tahir, A. M., Qiblawey, Y., Khandakar, A., Rahman, T., Khurshid, U., Musharavati, F., et al.
(2022). Deep learning for reliable classification of COVID-19, MERS, and SARS from chest
X-ray images. Cognitive Computation, 1–21.
25. Tahir, A. M., Chowdhury, M. E., Khandakar, A., Rahman, T., Qiblawey, Y., Khurshid, U.,
et al. (2021). COVID-19 infection localization and severity grading from chest X-ray images.
Computers in Biology and Medicine, 139, 105002.
26. Qiblawey, Y., Tahir, A., Chowdhury, M. E., Khandakar, A., Kiranyaz, S., Rahman, T., et al.
(2021). Detection and severity classification of COVID-19 in CT images using deep learning.
Diagnostics, 11(5), 893.
27. Pacheco, A. G. C., & Krohling, R. A. (2020). The impact of patient clinical information on
automated skin cancer detection. Computers in Biology and Medicine, 116, 103545.
28. Han, S. S., Park, G. H., Lim, W., Kim, M. S., Na, J. I., Park, I., et al. (2018). Deep neural networks
show an equivalent and often superior performance to dermatologists in onychomycosis diag-
nosis: Automatic construction of onychomycosis datasets by region-based convolutional deep
neural network. PloS one, 13(1), e0191493.
29. Bhimavarapu, U., & Battineni, G. (2022). Skin lesion analysis for melanoma detection using
the novel deep learning model fuzzy GC-SCNN. In Healthcare, p. 962.
30. Martin-Gonzalez, M., Azcarraga, C., Martin-Gil, A., Carpena-Torres, C., Jaen, P., & Health,
P. (2022). Efficacy of a deep learning convolutional neural network system for melanoma
diagnosis in a hospital population. International Journal of Environmental Research and Public
Health, 19(7), 3892.
31. Haenssle, H. A., Fink, C., Schneiderbauer, R., Toberer, F., Buhl, T., Blum, A., et al. (2018). Man
against machine: diagnostic performance of a deep learning convolutional neural network for
dermoscopic melanoma recognition in comparison to 58 dermatologists. Annals of Oncology,
29(8), 1836–1842.
32. Brinker, T. J., Hekler, A., Enk, A. H., Klode, J., Hauschild, A., Berking, C., et al. (2019). A
convolutional neural network trained with dermoscopic images performed on par with 145
dermatologists in a clinical melanoma image classification task. European Journal of Cancer,
111, 148–154.
33. Yin, G., Gendler, S., & Teichman, J. (2022). Ocular surface squamous neoplasia in a patient
following oral steroids for contralateral necrotising scleritis. BMJ Case Reports CP, 15(12),
e253300.
34. Rahman, T., Chowdhury, M. E., Khandakar, A., Mahbub, Z. B., Hossain, M. S. A., Alhatou,
A., et al. (2022). BIO-CXRNET: A robust multimodal stacking machine learning technique
for mortality risk prediction of COVID-19 patients using chest x-ray images and clinical data.
arXiv preprint arXiv:2206.07595.
35. Khandakar, A., Chowdhury, M. E., Reaz, M. B. I., Ali, S. H. M., Kiranyaz, S., Rahman, T.,
et al. (2022). A novel machine learning approach for severity classification of diabetic foot
complications using thermogram images. Sensors, 22(11), 4249.

36. Rahman, T., Khandakar, A., Islam, K. R., Soliman, M. M., Islam, M. T., Elsayed, A., et al.
(2022). HipXNet: Deep learning approaches to detect aseptic loosening of hip implants using
x-ray images. IEEE Access, 10, 53359–53373.
37. Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S., et al. (2020). Score-CAM: Score-
weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/
CVF conference on computer vision and pattern recognition workshops (pp. 24–25).
38. Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., et al. (2019).
Attention gated networks: Learning to leverage salient regions in medical images. Medical
Image Analysis, 53, 197–207.
39. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet
large scale visual recognition challenge. International Journal of Computer Vision, 115(3),
211–252.
40. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). Going deeper
with convolutions. In Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 1–9).
41. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
42. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convo-
lutional networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 4700–4708).
43. Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural
networks. In International conference on machine learning (pp. 6105–6114).
44. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-
cam: Visual explanations from deep networks via gradient-based localization. In Proceedings
of the IEEE international conference on computer vision (pp. 618–626).
45. Podder, K. K., Chowdhury, M. E., Tahir, A. M., Mahbub, Z. B., Khandakar, A., Hossain, M. S.,
et al. (2022). Bangla sign language (bdsl) alphabets and numerals classification using a deep
learning model. Sensors, 22(2), 574.
Plant Diseases Classification Using
Neural Network: AlexNet

Mohd Anas, Sanjiban Sekhar Roy, Kunwar S. Srivastava, and Jashabir Chakraborty

1 Introduction

Not so long ago, India was a predominantly agricultural country. Even today, there are
roughly 118 million farmers in the country [1]. One of the major issues that these
farmers and cultivators face is the range of diseases that affect their plants. This not
only exacerbates their economic problems but also affects their social lives, since
hours, and sometimes years, of hard work are lost. Several chemicals can be employed to
alleviate this problem. The major issue is diagnosis: unless farmers have a laboratory
in their vicinity, diseases are likely to be misidentified. Furthermore, the situation
may worsen, as diseases often spread to other farms. India has seen a large increase in
smartphone sales, coupled with the rise of the middle class. Various telecommunication
companies want a hold on this rising market, which has driven the cost of internet usage
to nearly zero. There are nearly 833 million internet users [2], which is equal to
59.28% of the population of India. In this chapter, our work is aimed at farmers and
cultivators who have smartphones with internet access; serving them could reduce food
loss in the country.
To help these farmers, David P. Hughes and Marcel Salathé created PlantVillage, an
open-access database of more than 50,000 images of healthy and diseased crops. This
database covers more than 150 crops and 1800 diseases. PlantVillage is also a community
of people who help each other by answering questions and identifying diseases from the
pictures attached to the questions. It is helpful, but it has the drawbacks stated above
[3]. In their paper, Hughes and Salathé describe the advantages of computer diagnostic
tools over human diagnosis. Not all the images in their dataset can be downloaded;
however, in April 2016, PlantVillage released a subset of the dataset for an image
classification challenge on CrowdAI [3].

M. Anas · S. S. Roy (B) · K. S. Srivastava
School of Computer Science and Engineering, Vellore Institute of Technology, Vellore 632014,
India
e-mail: [email protected]
J. Chakraborty
Mata Gujri College of Pharmacy, Mata Gujri University, Bihar, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis,
Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_7

2 Machine Learning and Deep Learning

In this section, we discuss machine learning and neural networks in detail.

2.1 History

Deep learning was long an underappreciated field for several reasons, such as the
absence of powerful GPUs, the lack of the required data, and limited scientific work. In
fact, deep learning is a term coined to attract interest in neural networks again. The
field has gone through three phases of development: it was known as cybernetics in the
1940s–1960s, connectionism in the 1980s–1990s, and deep learning from the late 2000s.
Its models are also known as artificial neural networks (ANNs) because their design is
inspired by biological neural networks [4].
The earliest neural network models were simple linear models. They were designed to
take inputs \(\{x_1, x_2, \dots, x_N\}\) at the input layer and associate them with an
output \(y\). The network would learn the weights \(\{w_1, w_2, \dots, w_N\}\) such that

\[ f(x, w) = y = x_1 w_1 + x_2 w_2 + \dots + x_N w_N \tag{1} \]

The McCulloch-Pitts neuron, the perceptron and ADALINE (adaptive linear element) were
some of these linear models. Although these models were very useful, they had
limitations; most importantly, they could not represent the XOR function. After this
limitation was discovered, neural networks lost popularity. A great deal of research
took place during the second phase, known as connectionism. The most important
development in this phase was the successful use of the backpropagation algorithm for
training.
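The XOR limitation of a single linear unit can be illustrated with a small search. The sketch below assumes a threshold unit with two weights and a bias term (the bias is an addition to the weighted sum in Eq. (1)) and tries every combination on a coarse grid; it is an illustration rather than a proof, although XOR is in fact not linearly separable for any choice of weights.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]


def linear_unit(x, w, b):
    """Threshold unit: outputs 1 when w1*x1 + w2*x2 + b is positive."""
    return 1 if x[0] * w[0] + x[1] * w[1] + b > 0 else 0


def find_weights(targets, grid):
    """Return the first (w1, w2, b) on the grid that reproduces `targets`."""
    for w1, w2, b in product(grid, repeat=3):
        if [linear_unit(x, (w1, w2), b) for x in INPUTS] == targets:
            return w1, w2, b
    return None


# Hypothetical search grid: -4.0 to 4.0 in steps of 0.5
GRID = [i / 2 for i in range(-8, 9)]

if __name__ == "__main__":
    print("AND:", find_weights(AND, GRID))
    print("XOR:", find_weights(XOR, GRID))
```

The search finds weights for AND but returns `None` for XOR, which is why multi-layer networks trained with backpropagation were needed.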
Algorithms such as backpropagation and LSTM are still popular. However, the popularity
of neural networks declined again because companies made unrealistic claims and then
failed to deliver on them. Meanwhile, various other machine learning models were
performing far better than neural networks, which further reduced their popularity. In
2006, Geoffrey Hinton trained a neural network called a deep belief network, which
sparked interest in neural networks again. By then, the world had more computational
power and more data. By 2012, deep learning had proved to be a useful state-of-the-art
technology in object detection, image classification and computer vision.
Plant Diseases Classification Using Neural Network: AlexNet 135

2.2 Machine Learning Basics

A learning program is said to learn from experience E on task T with respect to
performance measure P if its performance on T, as measured by P, improves with
experience E. A learning program produces a representation R (often called a
hypothesis h) of what it has learned. Another program can use R to perform T. A
learning program uses a learning algorithm A to produce R from E [4].

2.2.1 Capacity, Overfitting and Under Fitting

The main challenge in machine learning is that our trained model must perform
well on new data points. This ability to perform well on new data points is called
generalization. When we train a model on a dataset, we have an error measure known
as the training error. We want this error to be as low as possible. But in order to
have a working model, we also want our model to generalize well, which means
that our test error should be low [4, 5].
Take linear regression as an example. We train the model by minimizing the training
error, which is:

\[ \frac{1}{m^{(\mathrm{train})}} \left\lVert X^{(\mathrm{train})} w - Y^{(\mathrm{train})} \right\rVert_2^2 \tag{2} \]

Similarly, we want to minimize our test error, which would be:

\[ \frac{1}{m^{(\mathrm{test})}} \left\lVert X^{(\mathrm{test})} w - Y^{(\mathrm{test})} \right\rVert_2^2 \tag{3} \]
Two factors determine the performance of a machine learning model. The first is its
ability to make the training error small, and the second is its ability to reduce the
difference between the training and test errors.
Underfitting occurs when the model is not able to make the training error small, and
overfitting occurs when it cannot reduce the difference between the training error
and the test error.
In simpler words, when the model has not sufficiently learned the features, we call it
underfitting, whereas when the model memorizes the training examples instead of
learning general features from the data, we call it overfitting. We can control whether
a model is more likely to overfit or underfit by altering its capacity.
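
The effect of capacity can be illustrated with a small sketch. The toy problem below is an assumption made purely for illustration (a noisy sine target, 15 training points, polynomial degrees 1, 3 and 12); it is not part of the experiments in this chapter. It fits polynomials of increasing degree by least squares and reports the training and test errors in the sense of Eqs. (2) and (3).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical toy problem: noisy samples of a smooth target function
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + 0.2 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 12):
    # The polynomial degree plays the role of the model capacity
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(f"degree={degree:2d}  train={results[degree][0]:.4f}  "
          f"test={results[degree][1]:.4f}")
```

The training error can only decrease as the degree grows, because each lower-degree model is a special case of the higher-degree one. A low-degree fit typically leaves both errors high (underfitting), whereas a high-degree fit typically drives the training error towards zero while the gap to the test error widens (overfitting).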

2.2.2 Hyperparameters and Validation Set

Generally, machine learning algorithms have several settings that control the
behaviour of the training algorithm; these settings are called hyperparameters. We
usually do not learn hyperparameters on the training set, because doing so is not
appropriate: hyperparameters learned on the training set will almost always be
chosen to overfit. To solve this problem, we need another dataset, known as the
validation set. The validation set is taken from the training data but is not used
during the training process. It is used during and after training to estimate the
generalization (test) error, and we update the hyperparameters accordingly [4].
Typically, we take 80% of the training dataset for training and 20% for validation.
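
A minimal sketch of the 80/20 split described above is given below. The function name `train_validation_split`, the random seed and the dummy arrays are assumptions introduced for illustration only.

```python
import numpy as np

def train_validation_split(X, y, validation_fraction=0.2, seed=0):
    # Shuffle the example indices, then hold out a fraction for validation
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(round(validation_fraction * len(X)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Dummy data: 50 examples with 2 features each, labelled 0..49
X = np.arange(100, dtype=float).reshape(50, 2)
y = np.arange(50)

X_tr, y_tr, X_val, y_val = train_validation_split(X, y)
print(len(X_tr), len(X_val))
```

The held-out examples never take part in the weight updates, so the error measured on them can be used to compare hyperparameter settings.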

2.2.3 Gradient Descent and Stochastic Gradient Descent

Gradient descent and its variations are widely used in deep learning algorithms
[6]. It minimizes an error function:

\[ E_{\mathrm{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} e\big(h(x_n), y_n\big) \tag{4} \]

In order to compute the error or the gradient of the error, we have to evaluate the
hypothesis at every point in the sample. We move down the error surface in the
direction of the negative gradient. The procedure is iterative: we take one step at a
time, and one step is a full epoch, that is, a pass over all the examples at once.
The weight update in this case is:

\[ \Delta w = -\alpha \nabla E_{\mathrm{in}}(w) \tag{5} \]

where $\alpha$ is the step size (learning rate).

In stochastic gradient descent, instead of moving in the w space based on all the
examples, we move based on one example at a time.
$\nabla E_{\mathrm{in}}$ is based on all examples $(x_n, y_n)$. Because we will use another
method, we will call the standard gradient descent “batch” GD. In stochastic gradient
descent, we pick one example at a time and apply gradient descent to the error at that
point, $e(h(x_n), y_n)$, instead of to the whole dataset. Now, let us think of the
average direction along which we are going to be sent.
Average direction:

\[ \mathbb{E}_n\big[-\nabla e(h(x_n), y_n)\big] \tag{6} \]

If we take the gradient of the error measure that we are going to minimize, in this
case on just one example, and take its expected value over the choice of example, we
get an expression that matches the gradient of the error in Eq. (4), up to the minus
sign [4, 6, 7].
Average direction:

\[ \mathbb{E}_n\big[-\nabla e(h(x_n), y_n)\big] = \frac{1}{N} \sum_{n=1}^{N} -\nabla e\big(h(x_n), y_n\big) = -\nabla E_{\mathrm{in}}(w) \tag{7} \]

So, this is as if we are actually going in the direction we want, except that we only
use one example in each computation and then keep repeating. Each step therefore
points in the desired direction in expectation, and with time the noise averages out
and we move along the desired direction.
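
The claim about the average direction can be checked numerically. The sketch below assumes a linear hypothesis h(x) = w · x, the squared error of Eq. (11), synthetic data and a step size of 0.02; all of these are illustrative choices, not the setup used later in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))                    # synthetic inputs
w_true = np.array([1.0, -2.0, 0.5])            # hypothetical target weights
y = X @ w_true + 0.1 * rng.normal(size=N)      # noisy targets

def example_gradient(w, n):
    # Gradient of e = 0.5 * (h(x_n) - y_n)^2 for the linear hypothesis h(x) = w . x
    return (X[n] @ w - y[n]) * X[n]

def batch_gradient(w):
    # Gradient of E_in(w), the average of the per-example errors
    return X.T @ (X @ w - y) / N

# The average of the single-example directions equals the batch direction
w0 = np.zeros(d)
average_direction = -np.mean([example_gradient(w0, n) for n in range(N)], axis=0)
batch_direction = -batch_gradient(w0)
print(np.allclose(average_direction, batch_direction))

# Stochastic gradient descent: one randomly chosen example per step
alpha = 0.02
w = w0.copy()
for step in range(5000):
    n = rng.integers(N)
    w = w - alpha * example_gradient(w, n)
print(w)
```

Each individual step is noisy, but because the steps agree with the batch direction on average, the weights drift towards the values that minimize the error over the whole sample.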

2.2.4 Neural Network and Backpropagation

Suppose we denote the weights by $w_{ij}^{(l)}$, where $l$ indexes the layers
[7, 8] (Fig. 1):

\[ \text{Weight: } w_{ij}^{(l)}, \quad \begin{cases} 1 \le l \le L, & \text{layers} \\ 0 \le i \le d^{(l-1)}, & \text{inputs} \\ 1 \le j \le d^{(l)}, & \text{outputs} \end{cases} \tag{8} \]

We use tanh(s) as the activation function (Fig. 2):

\[ \theta(s) = \tanh(s) = \frac{e^{s} - e^{-s}}{e^{s} + e^{-s}} \tag{9} \]

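The output of unit $j$ in layer $l$ of the neural network is $x_j^{(l)}$:

\[ x_j^{(l)} = \theta\big(s_j^{(l)}\big) = \theta\!\left( \sum_{i=0}^{d^{(l-1)}} w_{ij}^{(l)} x_i^{(l-1)} \right) \tag{10} \]

Apply $x$ to the input layer, $x_1^{(0)}, \ldots, x_{d^{(0)}}^{(0)}$, and propagate forward
through the layers until $x_1^{(L)} = h(x)$.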
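
The forward computation of Eq. (10) can be sketched in a few lines. The layer sizes, the weight scale and the helper names `init_weights` and `forward` below are assumptions for illustration; the constant input $x_0 = 1$ carries the bias weight of each unit.

```python
import numpy as np

def init_weights(layer_sizes, rng):
    # One matrix per layer l, of shape (d^(l-1) + 1, d^(l)); row 0 holds the
    # bias weights w_0j that multiply the constant input x_0 = 1
    return [rng.normal(scale=0.5, size=(d_in + 1, d_out))
            for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(weights, x):
    # Returns the outputs x^(0), x^(1), ..., x^(L) of every layer
    xs = [np.asarray(x, dtype=float)]
    for W in weights:
        x_with_bias = np.concatenate(([1.0], xs[-1]))
        s = x_with_bias @ W          # s_j = sum_i w_ij * x_i
        xs.append(np.tanh(s))        # x_j = theta(s_j)
    return xs

rng = np.random.default_rng(0)
weights = init_weights([2, 3, 1], rng)    # 2 inputs, 3 hidden units, 1 output
xs = forward(weights, [0.5, -1.0])
h = xs[-1][0]                             # h(x) = x_1^(L)
print(h)
```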

2.2.5 Applying Stochastic Gradient Descent

We take one example at a time, apply it to the network, and adjust the weights of
the network in the direction of the negative gradient of the error on that example;
picking the example at random is what makes the method stochastic [7].
All the weights $w = \{w_{ij}^{(l)}\}$ determine $h(x)$.
The error on example $(x_n, y_n)$ is:

\[ e\big(h(x_n), y_n\big) = e(w) = \frac{1}{2}\big(h(x_n) - y_n\big)^2 \tag{11} \]

So, to implement SGD, all we have to do is compute the gradient $\nabla e(w)$:

\[ \nabla e(w): \quad \frac{\partial e(w)}{\partial w_{ij}^{(l)}} \quad \text{for all } i, j, l \tag{12} \]

Fig. 1 A multi-layer perceptron

Fig. 2 Graph for tanh(x) activation function



Fig. 3 Backpropagation: phase I

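All we have to do is compute this for every i, j, l and then move the entire weight
vector along the negative gradient (Fig. 3).
We can evaluate $\partial e(w) / \partial w_{ij}^{(l)}$ using the chain rule:

\[ \frac{\partial e(w)}{\partial w_{ij}^{(l)}} = \frac{\partial e(w)}{\partial s_j^{(l)}} \times \frac{\partial s_j^{(l)}}{\partial w_{ij}^{(l)}}, \quad \text{where } \frac{\partial e(w)}{\partial s_j^{(l)}} = \delta_j^{(l)} \text{ and } \frac{\partial s_j^{(l)}}{\partial w_{ij}^{(l)}} = x_i^{(l-1)} \tag{13} \]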

Now let us find δ for the final layer. When we computed the outputs, we started
with x at the first layer and propagated it forward until we reached the output. For δ
we go the other way: if we know δ for the final layer, we can use it to find δ for the
previous layers by propagating backwards, hence the name backpropagation.

\[ \frac{\partial e(w)}{\partial s_j^{(l)}} = \delta_j^{(l)}, \quad \text{for the final layer } l = L \text{ and } j = 1 \]

So,

\[ \frac{\partial e(w)}{\partial s_1^{(L)}} = \delta_1^{(L)} \quad \text{and} \quad e(w) = e\big(h(x_n), y_n\big) \tag{14} \]

Fig. 4 Backpropagation: phase II

e(w) is the error measure. The input is passed through each layer until we reach the
output, $h(x_n)$, which is compared with the target output $y_n$:

\[ e(w) = e\big(x_1^{(L)}, y_n\big) \tag{15} \]

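Suppose we are using the squared error of Eq. (11); then (Fig. 4)

\[ e(w) = \frac{1}{2}\big(x_1^{(L)} - y_n\big)^2 \]

\[ x_1^{(L)} = \theta\big(s_1^{(L)}\big); \quad \theta'(s) = 1 - \theta^2(s) \text{ for } \tanh \tag{16} \]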

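\[ \delta_i^{(l-1)} = \frac{\partial e(w)}{\partial s_i^{(l-1)}} = \sum_{j=1}^{d^{(l)}} \frac{\partial e(w)}{\partial s_j^{(l)}} \times \frac{\partial s_j^{(l)}}{\partial x_i^{(l-1)}} \times \frac{\partial x_i^{(l-1)}}{\partial s_i^{(l-1)}} \tag{17} \]

where

\[ \frac{\partial e(w)}{\partial s_j^{(l)}} = \delta_j^{(l)}, \quad \frac{\partial s_j^{(l)}}{\partial x_i^{(l-1)}} = w_{ij}^{(l)}, \quad \frac{\partial x_i^{(l-1)}}{\partial s_i^{(l-1)}} = \theta'\big(s_i^{(l-1)}\big) \]

so that

\[ \delta_i^{(l-1)} = \sum_{j=1}^{d^{(l)}} \delta_j^{(l)} \times w_{ij}^{(l)} \times \theta'\big(s_i^{(l-1)}\big) \]

and, since $\theta'(s) = 1 - \theta^2(s)$ for tanh,

\[ \delta_i^{(l-1)} = \Big(1 - \big(x_i^{(l-1)}\big)^2\Big) \sum_{j=1}^{d^{(l)}} w_{ij}^{(l)} \delta_j^{(l)} \tag{18} \]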

2.2.6 Backpropagation Algorithm

1. Initialize all weights $w_{ij}^{(l)}$ at random.
2. For t = 0, 1, 2, … do
3. Pick n from {1, 2, …, N}
4. Forward: compute all $x_j^{(l)}$
5. Backward: compute all $\delta_j^{(l)}$
6. Update the weights, with learning rate $\eta$:

\[ w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta\, x_i^{(l-1)} \delta_j^{(l)} \]

7. Iterate to the next step until it is time to stop.
8. Return the final weights $w_{ij}^{(l)}$.
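
The steps above can be sketched as follows for a tanh network with a single output and the squared error of Eq. (11). This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the layer sizes, the learning rate of 0.1, the fixed number of steps and the XOR toy data (with targets ±1) are all hypothetical choices.

```python
import numpy as np

def init_weights(layer_sizes, rng):
    # Step 1: random weights; row 0 of each matrix holds the bias weights
    return [rng.normal(scale=0.5, size=(d_in + 1, d_out))
            for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(weights, x):
    # Step 4: compute all x^(l); a constant 1 is prepended to every layer
    # except the output layer
    xs = [np.concatenate(([1.0], x))]
    for l, W in enumerate(weights):
        out = np.tanh(xs[-1] @ W)
        is_last = (l == len(weights) - 1)
        xs.append(out if is_last else np.concatenate(([1.0], out)))
    return xs

def gradients(weights, x, y):
    xs = forward(weights, x)
    # Final-layer delta for e = 0.5 * (x^(L) - y)^2 with theta'(s) = 1 - theta(s)^2
    delta = (xs[-1] - y) * (1.0 - xs[-1] ** 2)
    grads = [None] * len(weights)
    # Step 5: propagate delta backwards through the layers
    for l in range(len(weights) - 1, -1, -1):
        grads[l] = np.outer(xs[l], delta)      # de/dw_ij = x_i^(l-1) * delta_j^(l)
        if l > 0:
            hidden = xs[l][1:]                 # drop the constant unit
            delta = (1.0 - hidden ** 2) * (weights[l][1:] @ delta)
    return grads

def train(weights, X, Y, eta=0.1, steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    for t in range(steps):                     # step 2
        n = rng.integers(len(X))               # step 3
        for W, g in zip(weights, gradients(weights, X[n], Y[n])):
            W -= eta * g                       # step 6 (in-place update)
    return weights                             # step 8

# Toy usage: the XOR function with targets in {-1, +1}
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([-1.0, 1.0, 1.0, -1.0])
weights = train(init_weights([2, 4, 1], np.random.default_rng(0)), X, Y)
for x, y in zip(X, Y):
    print(x, y, forward(weights, x)[-1][0])
```

A fixed step count stands in for "until it is time to stop"; in practice a stopping rule based on the validation error would be a natural alternative.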

2.3 Convolution Neural Network

A convolutional neural network (CNN) is a special kind of neural network. It is so
named because it uses convolution in at least one of its layers. It is widely used in
computer vision tasks such as image segmentation and classification, among others
[9, 10].

2.3.1 Convolution

In mathematics and engineering, convolution is a mathematical operation on two
functions. It is defined as the integral of the product of the two functions after one
is reversed and shifted:

\[ s(t) = \int x(a)\, w(t - a)\, da \]

\[ s(t) = (x * w)(t) \tag{19} \]

Convolution is denoted by an asterisk (*).
In deep learning, the function x is known as the input and the function w is known
as the kernel.
Convolution leverages three important ideas that help a machine learning
system: sparse interactions, parameter sharing and equivariant representations.
Additionally, convolution provides a means for working with inputs of variable size.
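
For sampled signals the integral in Eq. (19) becomes a sum over the index a. The sketch below is an illustrative discrete version; the function name `convolve_1d` and the example arrays are assumptions, and the result is compared against NumPy's own `np.convolve`.

```python
import numpy as np

def convolve_1d(x, w):
    # Discrete form of Eq. (19): s(t) = sum_a x(a) * w(t - a).
    # Writing b = t - a, every product x(a) * w(b) contributes to s(a + b).
    s = np.zeros(len(x) + len(w) - 1)
    for a in range(len(x)):
        for b in range(len(w)):
            s[a + b] += x[a] * w[b]
    return s

x = np.array([1.0, 2.0, 3.0, 4.0])   # input
w = np.array([0.5, 0.5])             # kernel: a two-point moving average
s = convolve_1d(x, w)
print(s)
print(np.allclose(s, np.convolve(x, w)))
```

The same two kernel weights are reused at every position of the input (parameter sharing), and each output value depends on only two neighbouring inputs (sparse interactions).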

2.3.2 Pooling

A typical layer of a convolutional network has three stages: a convolution stage, an
activation function such as ReLU, and a pooling stage. A pooling layer modifies the
output of the net by replacing each region of its input with a summary statistic of
that region. It performs downsampling along the height and width dimensions. The
most commonly used pooling layer is max pooling.
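
The last two stages can be sketched on a small feature map. The 4 × 4 array, the 2 × 2 non-overlapping window and the function names below are assumptions for illustration; ReLU is taken to be the element-wise maximum of 0 and the input.

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x)
    return np.maximum(0.0, x)

def max_pool_2d(feature_map, size=2):
    # Non-overlapping max pooling: each size x size region is replaced by its maximum
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    trimmed = feature_map[:h_out * size, :w_out * size]
    blocks = trimmed.reshape(h_out, size, w_out, size)
    return blocks.max(axis=(1, 3))

# Hypothetical output of a convolution stage
raw = np.array([[ 1.0, -3.0,  2.0,  0.0],
                [ 4.0,  2.0, -1.0,  1.0],
                [ 0.0,  0.0,  5.0, -6.0],
                [-1.0,  2.0,  7.0,  8.0]])

activated = relu(raw)
pooled = max_pool_2d(activated, size=2)
print(pooled)
```

Here the 4 × 4 map is reduced to 2 × 2, halving both the height and the width.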

2.3.3 ReLU

The rectified linear unit (ReLU) is an activation function defined as

f (x) = max(0, x) (20)

Convolutional nets were some of the first working deep networks trained with
backpropagation. It is not fully clear why convolutional networks succeeded when
general backpropagation networks were considered to have failed.

2.4 Various Deep Learning Libraries

There are several deep learning libraries to choose from. Some popular ones are:

2.4.1 Theano

Theano is a Python-based framework developed by the LISA group, led by
Yoshua Bengio, at the University of Montreal [11].

2.4.2 Torch

Torch is a deep learning framework developed by Ronan Collobert, Clement Farabet
and Koray Kavukcuoglu [12].

2.4.3 Caffe

Caffe is a deep learning library, written in C++ with a Python interface, developed
by Yangqing Jia at the Berkeley Vision and Learning Center. The biggest advantage
of Caffe is the number of pre-trained networks that can be downloaded from its
Model Zoo [13].

2.4.4 Tensorflow

TensorFlow is an open-source software library for machine learning across a
range of tasks. It was created by Google to meet its need for systems capable of
building and training neural networks that detect and decipher patterns and
correlations.

2.4.5 Deep Learning 4J

Deeplearning4j is an open-source, distributed deep learning framework for the Java
and Scala programming languages. It supports a variety of neural network
architectures such as feedforward, recurrent, and convolutional networks, and enables
deployment of models on GPUs, CPUs, and embedded devices [14].

3 Experimental Work and Results

In this section, we discuss the model used and the experimental results.

3.1 Dataset

The dataset on CrowdAI consists of 54,309 images for training the neural network.
It covers 14 different crop species and includes 17 fungal diseases, 4 bacterial
diseases, 2 mold diseases, 2 viral diseases, 1 disease caused by a mite, and 12 crop
species that are visibly healthy. This means that there are 38 classes of images.
The 14 crop species are: Apple, Blueberry, Cherry, Corn, Grape, Orange, Peach,
Bell Pepper, Potato, Raspberry, Soybean, Squash, Strawberry, and Tomato (Fig. 5).
Fig. 5 shows 38 images, each corresponding to a different class [3].
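
The class count follows directly from the figures quoted above, as the short check below shows. The dictionary keys are labels chosen here for readability and are not identifiers from the dataset.

```python
# Number of disease classes per cause, as listed in the dataset description
disease_classes = {"fungal": 17, "bacterial": 4, "mold": 2, "viral": 2, "mite": 1}
healthy_classes = 12   # crop species with a visibly healthy class

total_classes = sum(disease_classes.values()) + healthy_classes
print(total_classes)
```

The 26 disease classes and the 12 healthy classes together give the 38 output classes used for the final layer of the network.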

3.2 Data Pre-processing

Since we are fine-tuning AlexNet, we have to make sure that the images are exactly
the same size as those originally used to train it. AlexNet was trained on images of
size 256 × 256 pixels with a central crop of 227 × 227 pixels. This means that we
have to resize all the images of the PlantVillage dataset. Instead of reading images
straight from the disk, we store them in LMDB, which is a high-performance
embedded transactional database. While Caffe does support reading images directly
from the disk, using LMDB as the data store has

Fig. 5 The 38 different disease classes of leaves

quite significant performance gains. Finally, we compute the mean of all the
images, which is used in both the training and testing processes. After updating
the LMDB store references, adjusting the parameters in the configuration files,
and changing the hyperparameters in the solver configuration file, we train the model.
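
The mean subtraction and central crop can be sketched with NumPy alone. The sketch below is an assumption-laden illustration rather than the Caffe pipeline itself: random arrays stand in for images that have already been resized to 256 × 256, the resizing and LMDB steps are not shown, and the function name `center_crop` is hypothetical.

```python
import numpy as np

def center_crop(image, crop=227):
    # Cut a crop x crop window from the middle of an (H, W, C) image
    h, w = image.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    return image[top:top + crop, left:left + crop]

rng = np.random.default_rng(0)
# Stand-in for 8 RGB images already resized to 256 x 256 pixels
images = rng.integers(0, 256, size=(8, 256, 256, 3)).astype(np.float32)

mean_image = images.mean(axis=0)    # per-pixel mean over the whole set
batch = np.stack([center_crop(img - mean_image) for img in images])
print(mean_image.shape, batch.shape)
```

Subtracting the same mean image at training and test time keeps the inputs centred in the same way in both phases.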

3.3 Architecture

In 2012, Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton submitted a
convolutional network called AlexNet to the ImageNet ILSVRC challenge. The ILSVRC
challenge, also known as the ImageNet challenge, was held annually; participants had
to build a model that could classify millions of images into 1000 object classes.
They won the challenge that year, and in the following years the winner was always a
variation of a CNN (Fig. 6).
The input layer in AlexNet is formed by the raw pixel values obtained from the
image, and the final layer gives a probability distribution across all the classes. The
intermediate layers use a “processed version” of the output of the previous layer as

Fig. 6 Architecture of AlexNet

their input, and over the whole training period they learn to activate against more and
more complex features depending on how deep they are in the overall architecture.
Neural networks such as AlexNet are computationally very expensive to train;
training on the ImageNet dataset can take days to weeks. Fortunately, the features
learnt by the earlier layers are very generic in nature, and thus can be reused on a new
dataset with totally different classes. This approach is known as transfer learning or
fine-tuning. In transfer learning, we take a pre-trained model, keep its learnt
weights, modify the final fully connected layers, and then train on the new dataset.
This gives us better results. Our PlantVillage dataset has 38 classes instead of the
1000 classes of ImageNet, so we have to change the num_output parameter of the last
fully connected layer in the Caffe training configuration file [3, 8, 15, 16].
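
The effect of changing the number of outputs can be quantified with a short calculation. It assumes the standard AlexNet layout, in which the last fully connected layer receives a 4096-dimensional input; the function name below is introduced only for this illustration.

```python
def fully_connected_params(n_inputs, n_outputs):
    # One weight per input-output pair plus one bias per output
    return n_inputs * n_outputs + n_outputs

FC7_OUTPUTS = 4096   # assumed width of the layer feeding the final classifier

original_layer = fully_connected_params(FC7_OUTPUTS, 1000)   # ImageNet classes
replaced_layer = fully_connected_params(FC7_OUTPUTS, 38)     # PlantVillage classes
print(original_layer, replaced_layer)
```

Only this replaced layer has to be learned from scratch; all earlier layers start from the pre-trained weights.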

3.4 Results

If the data are pre-processed and the files are correctly configured, training the
model is straightforward. While training, we maintain a log file in order to
understand the training process; the log file can also be used to generate graphs.
Training the model for 2000 iterations took roughly 2 h (Fig. 7).
We can see the development of three performance measures: training loss, test
loss and test accuracy. The training and test losses decreased significantly, from
nearly 1 to 0.1, and the accuracy on the test dataset was around 91.3%. The two most
important factors to be considered in transfer learning are the size of the new dataset
and its similarity to the original dataset. If the new dataset is small and similar to
the original dataset, there is a high chance that the model will overfit. If we have a
large dataset, fine-tuning may work, given that both datasets are similar
[17–27].

Fig. 7 Training curve for accuracy and loss with 2000 iterations

4 Conclusion

In conclusion, deep learning in the form of image classification can provide a
budget-friendly and efficient solution to the problem of plant diseases affecting
farmers and cultivators. Otherwise, farmers would need well-equipped labs to
determine the disease. AlexNet obtains 98 to 99% accuracy on the training set and
91.3% accuracy on the test set. In the future, we would like to employ different deep
learning models and perform different types of augmentation.

References

1. Agarwal, K. (2021). Indian agriculture’s enduring question: Just how many farmers does the
country have?. The Wire. Retrieved, 22.
2. BBC. (2023, January 23). India media guide. BBC News.
https://www.bbc.com/news/world-south-asia-12557390
3. Hughes, D., & Salathé, M. (2015). An open access repository of images on plant health to
enable the development of mobile disease diagnostics. arXiv preprint arXiv:1511.08060.
4. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Book in preparation for
MIT Press. http://www.deeplearningbook.org
5. Jabbar, H., & Khan, R. Z. (2015). Methods to avoid over-fitting and under-fitting in
supervised machine learning (comparative study). Computer Science, Communication and
Instrumentation Devices, 70, 163–172.
6. Robbins, H., & Monro, S. (1951). A stochastic approximation method. The annals of
mathematical statistics, 400–407.

7. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine
Learning, 2, 1–127. Also published as a book. Now Publishers.
8. Hecht-Nielsen, R. (1992). Theory of the backpropagation neural network. In Neural networks
for perception (pp. 65–93). Academic Press.
9. Roy, S. S., Awad, A. I., Amare, L. A., Erkihun, M. T., & Anas, M. (2022). Multimodel phishing
URL detection using LSTM, bidirectional LSTM, and GRU models. Future Internet, 14(11),
340.
10. O’Shea, K., & Nash, R. (2015). An introduction to convolutional neural networks. arXiv
preprint arXiv:1511.08458.
11. Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., Ballas, N., Bastien,
F., Bayer, J., Belikov, A., Belopolsky, A., Bengio, Y., Bergeron, A., Bergstra, J., Bisson, V.,
Snyder, J. B., Bouchard, N., Boulanger-Lewandowski, N., Bouthillier, X., de Brébisson, A.,
… Zhang, Y. (2016). Theano: A python framework for fast computation of mathematical
expressions. arXiv e-prints, arXiv-1605.
12. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z.,
Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani,
A., Chilamkurthy, S., Steiner, B., Fang, L., ... Chintala, S. (2019). Pytorch: An imperative style,
high-performance deep learning library. Advances in neural information processing systems, 32.
13. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Guadarrama, S. &
Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings
of the 22nd ACM international conference on Multimedia (pp. 675–678).
14. Gibson, A., Nicholson, C., Patterson, J., Warrick, M., Black, A. D., Kokorin, V., ... & Eraly, S.
(2016). Deeplearning4j: Distributed, opensource deep learning for Java and Scala on hadoop
and spark. Towards Data Science.
15. Fei Fei, L., Karpathy, A., Johnson, J. CS231N–Stanford University
16. Krizhevsky, A., Sutskever, I., Hinton, G. E. ImageNet classification with deep convolutional
neural networks. University of Toronto.
17. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D., Paraschiv, N., Mohammadi-
Ivatloo, B. (2014). L2 regularized deep convolutional neural networks for fire detection. Journal
of Intelligent & Fuzzy Systems, (Preprint), 1–12.
18. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022) Deep convolutional neural
network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy
Systems Preprint, 1–7.
19. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain
tumor classification. Applied Sciences, 10(14), 4915.
20. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of
retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions
of Electrical Engineering, 44(1), 505–518.
21. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural
network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy
Systems, (Preprint), 1–7.
22. Deep learning research should be encouraged for diagnosis and treatment of antibiotic
resistance of microbial infections in treatment associated emergencies in hospitals.
23. Lee, K. C., Roy, S. S., Samui, P., & Kumar, V. (Eds.). (2020). Data analytics in biomedical
engineering and healthcare. Academic Press.
24. Samui, P., Roy, S. S., & Balas, V. E. (Eds.). (2017). Handbook of neural computation. Academic
Press.
25. Forecasting stock price by hybrid model of cascading multivariate adaptive regression splines
and deep neural network
26. Roy, S. S., & Taguchi, Y. H. (2021). Identification of genes associated with altered gene
expression and m6A profiles during hypoxia using tensor decomposition based unsupervised
feature extraction. Scientific reports, 11(1), 1–18.
27. Ali, M., Magdon-Ismail, M., Lin, H. T. Learning from Data-Abu. https://amlbook.com/
Hyperspectral Images: A Succinct
Analytical Deep Learning Study

L. Sandeep Kumar, G. K. Panda, and B. K. Tripathy

1 Introduction

Since the advent of imaging spectroscopy in the 1980s, hyperspectral images (HIs)
have been acquired for their ability to resolve fine spectral detail, which supports a
diverse range of applications. These include remote-sensing-based environmental,
atmospheric and ocean observation [66], meteorological applications, military uses
[37], geological exploration and mining [53], crop, vegetation and food analysis, and
biomedical fields [56]. In addition to having high spectral and spatial resolution,
HIs have many bands and abundant information because they cover ultraviolet,
visible, near-infrared, and mid-infrared wavelengths. This opens avenues of research
in HI-based image correction [77], noise reduction [40], transformation [48],
dimensionality reduction, and classification [8].
For machine learning (ML) based methods to process HIs, many labeled samples are
needed for training. Early research in this regard focused on HI classification
methods based on spectral information, such as support vector machines [72], random
forests, neural networks [20, 67], and polynomial logistic regression [45]. An HI
represents the image as a “hypercube” (x, y, λ), in which the first two dimensions
indicate the spatial coordinates and the third indicates the spectral bands. As a
result, each pixel represents a pattern with as many attributes as there are bands.
Given the large number of bands in HIs, the high volume of data to be processed
motivates reducing the dimensionality and minimizing the computational complexity
L. S. Kumar
Biju Patnaik University of Technology, Rourkela, Odisha, India
G. K. Panda
MITS School of Biotechnology, Utkal University, Bhubaneswar, Odisha 765017, India
B. K. Tripathy (B)
School of Information Technology and Engineering, VIT, Vellore, Tamil Nadu 632014, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 149
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis,
Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_8
150 L. S. Kumar et al.

in many real-life HI applications. To reduce the dimensionality of HIs, numerous
methods based on feature extraction and feature selection have been proposed. Some
prominent methods include principal component analysis (PCA) [32], independent
component analysis (ICA) [33, 71], and linear discriminant analysis (LDA) [17].
Deep learning has excellent capabilities in image processing, and in recent years its
success in image classification, target detection and other fields has sparked its
wider use. A number of deep learning network models are available to improve the
performance of HI processing, such as the convolutional neural network (CNN), the
deep belief network (DBN) [24], and the recurrent neural network (RNN). In addition,
to resolve the problem of poor classification results due to a lack of training
samples, a tensor-based classification model [51, 52] was proposed, and experiments
revealed that when the number of training samples is small, this method outperforms
support vector machines and deep learning. In the first part of our discussion, one of
our primary goals is to enhance the accuracy of the classification. We use a
hyperspectral image of the Sundarban mangrove area acquired by the Sentinel-2
satellite (Fig. 1). The input images, which comprise 12 bands, are processed with
spatial analysis using a DL-based 3D CNN. Principal component analysis (PCA) is
used in deriving 3D patches of the Sundarban satellite image. The process achieves
96% classification accuracy.
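
The combination of PCA and 3D patches can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: a random 20 × 20 × 12 array stands in for the satellite image, and the choice of 3 principal components, a 5 × 5 window, zero padding at the borders and the function names are all hypothetical.

```python
import numpy as np

def pca_reduce_bands(cube, n_components):
    # Treat every pixel as a vector of band values and project it onto the
    # leading principal directions of the centred pixel matrix
    rows, cols, bands = cube.shape
    pixels = cube.reshape(-1, bands).astype(float)
    pixels -= pixels.mean(axis=0)
    _, _, vt = np.linalg.svd(pixels, full_matrices=False)
    reduced = pixels @ vt[:n_components].T
    return reduced.reshape(rows, cols, n_components)

def extract_patches(cube, window=5):
    # One window x window x bands patch centred on every pixel,
    # with zero padding at the image borders
    margin = window // 2
    padded = np.pad(cube, ((margin, margin), (margin, margin), (0, 0)),
                    mode="constant")
    rows, cols, bands = cube.shape
    patches = np.empty((rows * cols, window, window, bands))
    k = 0
    for r in range(rows):
        for c in range(cols):
            patches[k] = padded[r:r + window, c:c + window]
            k += 1
    return patches

rng = np.random.default_rng(0)
cube = rng.random((20, 20, 12))                  # stand-in for a 12-band image
reduced = pca_reduce_bands(cube, n_components=3)
patches = extract_patches(reduced, window=5)
print(reduced.shape, patches.shape)
```

Each patch keeps the spatial neighbourhood of a pixel together with its reduced spectral components, which is the kind of volumetric input a 3D CNN operates on.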
The remainder of the chapter is organized as follows. In Sect. 2 we review the
related research. In Sect. 3 we highlight some concepts of hyperspectral images.
Section 4 gives an overview of deep learning and CNNs. Section 5 presents an
empirical study of 3D-CNN classification on HIs of the Sundarban mangrove region. In
Sect. 6, we present a hybrid-MSSN model, validate it experimentally on three HI
datasets and discuss the outcomes.

Fig. 1 Bhitarkanika mangrove (Source Google Maps): a Binary image b Grayscale image c RGB
image
Hyperspectral Images: A Succinct Analytical Deep Learning Study 151

2 Related Research

In the classification of HIs, the spectral dimension (Fig. 2) helps in identifying
the significant variations of reflectance between image pixels, which change
with wavelength [38]. One study [31] observed that classification accuracy drops
dramatically once the number of spectral bands increases beyond a certain value.
Since a majority of spectral bands are redundant, taking all bands into
consideration degrades the model’s performance. Dimension reduction techniques
[28, 57] are used to identify such unnecessary bands without compromising the
image’s information content. The modified broken stick rule for HIs [3] makes a
notable contribution to dimension reduction. In the majority of cases, the reduced
band features alone are insufficient for object identification, which necessitates
discriminative spatial features. According to a study [19], neighbouring pixels in
HIs tend to belong to the same class; hence, using an HI’s spatial features along
with its spectral features is an intuitive motivation for effective classification.
Feature extraction methodologies such as the gray level co-occurrence matrix
[44, 54], the stationary wavelet transform (SWT) [43, 73], discrete wavelet
transforms [10, 22] and morphological profiles [4, 55] have been used in many
real-world applications.
Neural network-based techniques have been implemented to tackle many complex
problems of remote sensing [67]. DL techniques have become extremely popular in
recent years, with several real-life applications such as the study of gene
characteristics [25], text-based image retrieval [60], audio signal classification [9],
image processing [2], health care analysis [36], measuring confidence in interviewees
[61], face mask detection [64], classification of skin cancer [63] and computer vision
[70]. In general
Fig. 2 Image dimension: a Hyperspectral image b RGB image



it has influenced the research in AI in a major way. Some such studies are presented
in [1, 69]. The application of deep learning (DL) to ANNs has led to the development
of deep neural networks (DNNs) [6]. Some of the DL algorithms used in HI
classification are stacked autoencoders (SAEs) [5] and deep belief networks (DBNs)
[30]. Convolutional neural networks (CNNs) [21, 49] are also used in HI classification
[11]. CNNs have wide applications, such as MRI segmentation [68], diabetic
retinopathy [58], the study of Big Data [7], classification of pests [16], COVID-19
detection [62] and weather classification [23]. Some innovative approaches, such as
2D-CNN [50], 3D-CNN [26, 46], spectral-spatial LSTMs [80], SSUN [76] and SSRN [79],
have also been employed in HI classification. The literature shows that a 2D-CNN
alone is not able to generate discriminative features for classification [59], whereas
a 3D-CNN is found to be suitable for volumetric samples. However, the latter falls
short in generating discriminative features for classes that have textural similarity
across several spectral bands. To address these shortcomings, the HybridSN model [59]
was proposed, which combines 2-D and 3-D convolution layers to generate
discriminative spectral and spatial features. The MCNNCP model [78] also achieves
promising accuracy using solutions based on 3-D and 2-D convolution layers.
DL methods have achieved noteworthy performance in the domains of visual
information processing and AI. Some special DNNs, such as gated recurrent unit
networks, are used for detecting toxicity [39], and Wide ResNet is used for age and
gender estimation [14]. This approach pioneered the practical, automatic extraction of
hierarchical deep features for an HI. Such models consider an image to be organized
into hierarchical components such as pixels, edges, parts and objects. In contrast to
shallow handcrafted features, end-to-end deep features are capable of representing
more abstract and complex shapes in the image. They perform well even in
circumstances where there are rapid regional changes in an image.
Conventional image classification presumes that the data are uniformly distributed
across the classes; when this does not hold, as in the case of HIs, classifiers tend
to favour samples belonging to the majority classes, leading to a class imbalance
problem. Hence, special measures are needed to tackle the imbalanced characteristics
of HI classification [65]. The studies in [29, 47, 74] present data augmentation,
pixel pairing and automatic allocation of unlabeled samples, respectively, and
demonstrate their efficacy in HI classification. The studies in [27, 41–43] are
modeled on recent novel concepts: SWT and CNN; decomposition and deep residual nets;
3D, 2D and depthwise separable 1D CNNs; CNN with Grey Wolf optimization; and
1D-EWT and 3D-CNN.
The following are some of the outcomes from the literature that motivate the
models proposed in the following sections:
1. Build a DL model that addresses hierarchical feature extraction.
2. Perform and learn with limited training data.
3. Demonstrate minimal information loss due to dimension reduction, convolution
and max pooling operations.
4. Address the issues of vanishing gradients, computational time, class
imbalance and tolerance to noise.

3 Hyperspectral Images

To start with the fundamental concepts of a digital image, we can interpolate it


into the form of binary, grayscale, color and Hyperspectral images. Binary images
consist with 0 and 1’s to represent black and white respectively and occupy in a
2-D matrix (r-rows. c-columns). Grayscale digital images range from 0 to 255 to
represent white to black with intermediate levels of gray-scale. As per the biological
aspects of human cone cells (eye) to render environment colours, combinations of
RGB-scales (red, green, blue) are digitized into (r-rows × c-columns) × 3 channels.
These RGB coloration is based on the reflected light from objects fall under separate
wavelengths (long, medium, short for red, green and blue respectively) in the visible
spectrum (perceived by human eyes) of the electromagnetic radiation.
Beyond the visible spectrum there are many wavelengths that carry valuable information
which the human eye cannot perceive. Formally, a spectral image is similar to an RGB
colour image but has many channels, which together describe the spatial and spectral
information. A multispectral image consists of n band images, where each band records
the light intensity at its wavelength (the bands are not necessarily spread over a
contiguous wavelength range). A λ-band hyperspectral image consists of λ grey-scale
images stacked on top of each other, where each band records the light intensity at its
wavelength over a contiguous wavelength range (r rows × c columns × λ bands).
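The four image types above differ mainly in the shape of the array that stores them. The following is a minimal NumPy sketch; the sizes r, c and n_bands are illustrative values chosen here, not taken from any dataset in this chapter.

```python
import numpy as np

r, c = 4, 5        # illustrative spatial size (rows, columns)
n_bands = 12       # illustrative number of spectral bands (lambda)

rng = np.random.default_rng(0)

binary = rng.integers(0, 2, size=(r, c), dtype=np.uint8)      # values in {0, 1}
gray = rng.integers(0, 256, size=(r, c), dtype=np.uint8)      # values in 0..255
rgb = rng.integers(0, 256, size=(r, c, 3), dtype=np.uint8)    # three colour channels
hsi = rng.random((r, c, n_bands))                             # r x c x lambda cube

# One pixel of the cube is a 1-D spectral vector with one value per band.
spectrum = hsi[2, 3, :]
# One band of the cube is an ordinary grey-scale image.
band_0 = hsi[:, :, 0]
```

Slicing along the last axis gives a per-pixel spectrum, while slicing a single band gives a 2-D image; both views are used by the spectral and spatial models discussed later.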

4 Deep Learning and CNN

The idea behind Deep Learning (DL) is to train computers to model complex functions so
that they learn from experience and classify and recognize data or images much as a
human brain does. The convolutional neural network (CNN), a type of artificial neural
network (ANN), is widely used for image and object recognition (processing images,
analyzing videos, and detecting obstacles in autonomous vehicles). There have been
phenomenal developments in ANN-based methods for DL-based classification and
object/image recognition. Three core DL layer types (dense, convolution and output)
offer learning-based HI solutions in supervised, semi-supervised or unsupervised
models.
HI-based DL models are being developed for many classification and object
identification purposes using these three designs. The suitability of each design for
an application depends on the availability of labeled HI data. Specifically, if the HI
model maps labeled datasets to the ground truth, the supervised design is used. To
extract or unveil properties of HI data from unlabeled datasets, the unsupervised design
is used, and when only a small portion of the HI data is labeled, the semi-supervised
design is preferred. Further, convolutional neural networks (CNNs), in contrast to deep
feedforward neural networks (DNNs) and autoencoders (AEs),
154 L. S. Kumar et al.

play a vital role in many HI-intensive applications. In a high-dimensional recognition
or prediction system, the convolution layers of a CNN are specifically oriented to learn
local patterns from images or sequences of images. CNN models for HI classification
(feed-forward, one direction) generally follow three simple operational steps. First,
the input layer reads the input image and converts it into an array of pixels. The data
then pass through multiple hidden layers, where feature extraction is carried out by
convolution, followed by pooling and rectified linear units (ReLU) as needed. Finally,
object classification takes place in the fully connected layer and the label is assigned
at the output layer. The most general form of a CNN groups convolutional and pooling
layers into modules; however, variants of this grouping are possible.
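The three steps can be traced in a few lines of NumPy. This is a minimal sketch under stated assumptions: a single-channel 8 × 8 toy image, one random 3 × 3 kernel, 2 × 2 max pooling and six output classes; the weights are random and untrained, so only the data flow and array shapes are meaningful.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel 'valid' convolution (cross-correlation, as in most DL libraries)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def relu(a):
    return np.maximum(a, 0.0)

def max_pool_2x2(a):
    h, w = a.shape[0] // 2 * 2, a.shape[1] // 2 * 2   # drop an odd border row/column
    a = a[:h, :w]
    return a.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
image = rng.random((8, 8))                 # step 1: input layer, image as pixel array
kernel = rng.standard_normal((3, 3))
n_classes = 6

# step 2: hidden layers (convolution -> ReLU -> pooling): 8x8 -> 6x6 -> 3x3
features = max_pool_2x2(relu(conv2d_valid(image, kernel)))

# step 3: fully connected layer and softmax output layer
flat = features.ravel()
weights = rng.standard_normal((n_classes, flat.size))
probs = softmax(weights @ flat)
predicted = int(np.argmax(probs))
```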
From an HI-based research point of view, the ten most popular deep learning
architectures are Convolutional Neural Networks (CNNs), Long Short-Term Memory
Networks (LSTMs), Recurrent Neural Networks (RNNs), Generative Adversarial
Networks (GANs), Radial Basis Function Networks (RBFNs), Multilayer Perceptrons
(MLPs), Self-Organizing Maps (SOMs), Deep Belief Networks (DBNs),
Restricted Boltzmann Machines (RBMs) and Autoencoders.

4.1 HI Based Deep Feature Selection

In HIs with high spectral resolution, the information in each pixel is generally
represented as a one-dimensional spectral vector. A 1D-CNN model identifies specific
features (from the pool of spectral information) of the HI through such pixels for
further classification. Put simply, a 1D-CNN takes labeled HI data as input, processes
it with the class labels during training, updates the network weights iteratively using
stochastic gradient descent algorithms, and outputs a class for each pixel. Convolution
operations on 1-D feature vectors are performed using the 1-D convolution kernel
defined in Eq. 1.
$$v_{l,j}^{x} = f\left(\sum_{m}\sum_{h=0}^{H_l-1} k_{l,j,m}^{h}\, v_{(l-1),m}^{(x+h)} + \mathrm{bias}_{l,j}\right) \tag{1}$$
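The 1-D convolution of Eq. 1 can be written out directly as loops. The sketch below is illustrative only: the array layout (maps first, positions last) and the choice of tanh as the default activation f are assumptions made here, and the toy spectrum and kernel are invented for the example.

```python
import numpy as np

def conv1d_layer(prev_maps, kernels, bias, f=np.tanh):
    """Eq. 1 with prev_maps of shape (M, X), kernels (J, M, H) and bias (J,)."""
    n_out, n_in, h_size = kernels.shape
    out_len = prev_maps.shape[1] - h_size + 1      # 'valid' positions only
    out = np.empty((n_out, out_len))
    for j in range(n_out):
        for x in range(out_len):
            # sum over input maps m and kernel offsets h
            out[j, x] = np.sum(kernels[j] * prev_maps[:, x:x + h_size]) + bias[j]
    return f(out)

spectrum = np.array([[1.0, 2.0, 3.0, 4.0]])       # one input map (M = 1), 4 bands
kernel = np.array([[[1.0, 0.0, -1.0]]])           # J = 1, M = 1, H = 3

linear = conv1d_layer(spectrum, kernel, np.zeros(1), f=lambda a: a)
activated = conv1d_layer(spectrum, kernel, np.zeros(1))
```

With the identity activation the two valid positions give 1 − 3 and 2 − 4, which makes the index arithmetic of Eq. 1 easy to check by hand.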

The 2-D CNN (Eq. 2) uses a 2-D convolution kernel (filter) to perform the convolution
operation on a 2-D matrix.

$$\mathrm{map}_{l,j}^{x,y} = f\left(\sum_{m}\sum_{h=0}^{H_l-1}\sum_{w=0}^{W_l-1} k_{l,j,m}^{h,w}\, \mathrm{map}_{(l-1),m}^{(x+h)(y+w)} + \mathrm{bias}_{l,j}\right) \tag{2}$$

To perform a convolution operation on 3-D data, 3-D CNNs use 3-D convolution
kernels (Ri refers to the size of each kernel along the third dimension). As the main
objective is to extract the low-level features contained in the HIs, we apply a 3-D
filter to the input image, which generates a cube or cuboid in the 3-D volume space. In
3-D convolution, the same 3-D kernel is applied to overlapping 3-D cubes in the input to
extract the features. Max pooling, dropout, batch normalization and flatten operations
are generally used to route the multi-scale feature maps generated by each 3-D
convolution layer.
$$\mathrm{map}_{i,j}^{x,y,z} = \tanh\left(\sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{i,j,m}^{p,q,r}\, \mathrm{map}_{(i-1),m}^{(x+p)(y+q)(z+r)} + \mathrm{bias}_{i,j}\right) \tag{3}$$
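A direct, unoptimized reading of the 3-D convolution in Eq. 3 is sketched below. The array layout (maps first, then the three spatial/spectral axes) and the toy all-ones input are assumptions made for illustration; a practical model would use a library layer rather than explicit loops.

```python
import numpy as np

def conv3d_layer(prev_maps, kernels, bias):
    """Eq. 3 with prev_maps (M, X, Y, Z), kernels (J, M, P, Q, R) and bias (J,)."""
    n_out, _, p_size, q_size, r_size = kernels.shape
    _, x_len, y_len, z_len = prev_maps.shape
    out_shape = (n_out, x_len - p_size + 1, y_len - q_size + 1, z_len - r_size + 1)
    out = np.empty(out_shape)
    for j in range(n_out):
        for x in range(out_shape[1]):
            for y in range(out_shape[2]):
                for z in range(out_shape[3]):
                    # the same kernel slides over overlapping 3-D cubes of the input
                    cube = prev_maps[:, x:x + p_size, y:y + q_size, z:z + r_size]
                    out[j, x, y, z] = np.sum(kernels[j] * cube) + bias[j]
    return np.tanh(out)

patch = np.ones((1, 3, 3, 3))            # one input map, 3 x 3 x 3 toy cube
kernels = np.ones((1, 1, 2, 2, 2))       # one 2 x 2 x 2 kernel
feature_maps = conv3d_layer(patch, kernels, np.zeros(1))
```

Each output value here is tanh(8), because every 2 × 2 × 2 window of ones sums to 8 before the activation.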

Despite the considerable advantages of Deep Neural Networks (DNNs), some observed
limitations include (a) difficulty in accommodating a large number of input features
when the first hidden layer is small, (b) a steep increase in the number of weights when
a large number of input features feed a large first hidden layer, and (c) the vanishing
gradient problem when the number of layers is large: the gradient is high at neurons
near the output and comparatively low near the inputs. The Spinal Fully Connected
Layer (SFCN), or SpinalNet [34], is modelled on the human somatosensory system and
offers solutions to these issues of conventional DNNs. SFCN is based on gradual input,
local output with probable global influence, and reconfiguration of weights during
training. The architecture [34] of the model is shown in Fig. 3.

Fig. 3 SpinalNet (Source: [34])
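The idea of gradual input can be sketched as a forward pass in NumPy. This is an illustration following the general description of SpinalNet in [34], not the model used in this chapter: the function name spinal_fc_forward, the four hidden layers, the layer width of 8 and the random untrained weights are all assumptions. Each hidden layer sees one half of the input plus the previous hidden layer's output, and the output layer sees all hidden outputs together.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def spinal_fc_forward(x, n_layers=4, layer_width=8, n_classes=6, seed=0):
    """Forward pass of a spinal fully connected block with random (untrained) weights."""
    rng = np.random.default_rng(seed)
    half = x.size // 2
    halves = (x[:half], x[half:2 * half])      # the input is fed in gradually, half at a time
    hidden_outputs = []
    prev = np.empty(0)
    for i in range(n_layers):
        # gradual input: alternate halves, concatenated with the previous layer's output
        layer_in = np.concatenate([halves[i % 2], prev])
        w = rng.standard_normal((layer_width, layer_in.size)) * 0.1
        prev = relu(w @ layer_in)
        hidden_outputs.append(prev)
    # every hidden layer contributes directly to the output layer
    merged = np.concatenate(hidden_outputs)
    w_out = rng.standard_normal((n_classes, merged.size)) * 0.1
    logits = w_out @ merged
    e = np.exp(logits - logits.max())
    return e / e.sum(), merged

x = np.arange(10.0)
probs, merged = spinal_fc_forward(x)
```

Because each hidden layer receives only half of the input, the first-layer weight matrix stays small even for long feature vectors, which is the property limitations (a) and (b) above refer to.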



4.2 HI Based Optimization

In HI-based classification, optimizer algorithms are used during the learning process
of the neural network. The main purpose of these algorithms is to minimize the
difference between the expected and actual values by adjusting or updating the weights,
so that the network makes the most accurate predictions.
Gradient descent is one of the most prominent methods adopted in image classification
research to optimize a deep neural network. Gradient descent may be classified into
three basic variants according to the amount of data used per update: batch gradient
descent, stochastic gradient descent (SGD), and mini-batch gradient descent. In
addition to SGD, adaptive moment estimation (Adam), AdaDelta, root mean square
propagation (RMSProp), Nesterov, AdaMax and Nadam [65, 79] are also used in many
applications.
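The three gradient descent variants differ only in how many samples contribute to each weight update. The sketch below shows this on a toy least-squares problem; the data, learning rate and epoch count are illustrative assumptions, and the targets are noise-free so that all three variants can recover the same weights.

```python
import numpy as np

def gradient_descent(X, y, batch_size, lr=0.05, epochs=200, seed=0):
    """Minimise the mean squared error of a linear model with (mini-)batch updates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / idx.size
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((64, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w                                     # noise-free targets

w_batch = gradient_descent(X, y, batch_size=64)    # batch GD: all samples per update
w_sgd = gradient_descent(X, y, batch_size=1)       # SGD: one sample per update
w_mini = gradient_descent(X, y, batch_size=16)     # mini-batch GD
```

Batch updates are the smoothest but the most expensive per step, single-sample updates are the cheapest but the noisiest, and mini-batches sit between the two.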
Adaptive Moment Estimation (Adam) is a replacement optimization algorithm for SGD
for training deep learning models [75], which combines the capabilities of both
RMSProp and AdaGrad [35]. This optimizer needs little memory and tuning, can handle
sparse gradients on very noisy problems, and achieves impressive convergence speed and
mean absolute error; it is among the most preferred optimizers, including in
hyperspectral image analysis. The mean and uncentered variance (1st and 2nd moments)
of the gradient are estimated as:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,\varepsilon_t \tag{4}$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\varepsilon_t^{2}$$

$$w_{t+1} = w_t + \Delta w_t, \quad \text{where } \Delta w_t = -\eta\,\frac{m_t}{\sqrt{v_t + \epsilon}} \tag{5}$$

The relative contribution of the past history and the present gradient is controlled
through the decay rates (the hyper-parameters β1 and β2). Here w_t is the parameter
being updated, η is the initial learning rate, ε_t represents the gradient at time t,
m_t is the exponential average of the gradient, v_t is the exponential average of the
square of the gradient, and ϵ is a small constant that avoids division by zero.
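The Adam update can be sketched for a single scalar parameter in a few lines. This is a minimal illustration, not a reference implementation: it keeps exponential averages of the gradient and of its square and divides one by the square root of the other, it omits the bias-correction terms of the original Adam formulation, and the function name adam_minimise, the quadratic test function and all hyper-parameter values are assumptions.

```python
import math

def adam_minimise(grad_fn, w0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Scalar Adam-style update without bias correction."""
    w, m, v = w0, 0.0, 0.0
    for _ in range(steps):
        g = grad_fn(w)                            # gradient at the current step
        m = beta1 * m + (1.0 - beta1) * g         # exponential average of the gradient
        v = beta2 * v + (1.0 - beta2) * g * g     # exponential average of its square
        w = w - lr * m / math.sqrt(v + eps)       # scaled parameter update
    return w

# Toy objective f(w) = (w - 3)^2 with gradient 2 (w - 3); the minimum is at w = 3.
w_star = adam_minimise(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

The division by the running root-mean-square of the gradient makes the step size largely independent of the gradient's scale, which is one reason the optimizer needs little tuning.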

5 3D-Convolutional Neural Network Based HI Classification on Sentinel-2 Satellite Data of Sundarban Mangrove Regions

Sundarban is one of the largest mangrove areas in the world, stretching from India to
Bangladesh across the delta formed by the rivers Brahmaputra, Meghna and Padma in
the Bay of Bengal. It comprises around 106 islands and supports rich biodiversity

Fig. 4 HI/classification map: a Sundarban b Indian pines [13] c PaviaU [13]

(Fig. 5). It is home to a wide range of wildlife, including endangered species.

5.1 Dataset Description

We have used hyperspectral images (HIs) of the Sundarban mangrove region. The 12-band
images were collected from Sentinel-2 satellite imagery (COAH, [12]).
Remote-sensing satellite images contain more than three bands and therefore carry a
more diverse set of information about a geographical location than general images
(3 bands: red, green and blue). With more data in the form of bands, we can understand
and analyze a location more effectively. The image in Fig. 4a represents a satellite
image cube that contains R rows, C columns and B bands. As stated above, the input
Sentinel-2 HIs for the experiment comprise 12 bands (coastal aerosol, Near Infra-Red
(NIR), Short Wave Infra-Red (SWIR) and RGB), with wavelengths ranging from 0.443 to
2.190 micrometers and a spatial resolution of 10–60 m. Using the COAH tool, HIs with
less than one percent cloud, filtered with the cloud cover map, were selected as input
for image analysis (Fig. 5).

5.2 Experimental Setup and Hyper-Parameters

The experiment on the Sundarban satellite HI data was run on the Google Colab Pro™
cloud platform with a graphics processing unit (GPU). Using Python libraries and
methods such as rasterio, loadmat and EarthPy, the input HIs were stacked and compared
against six major classes. These are:

Fig. 5 Satellite data: Sundarban mangrove a Composite image b Ground truth image c 12-band
HI visualization

Barren land (BL): land devoid of vegetation, or sand dunes; River (RV): river bodies;
Dense Mangrove (DM): mangrove forest with dense canopy cover; Open Mangrove (OM):
mangrove forest with open canopy and mudflats with very little mangrove cover;
Agriculture (AG): active agricultural practice; and Human habitat (HUM): human
habitation, often under the canopy shade of non-mangrove plants.
Principal Component Analysis (PCA) and the TensorFlow-based Keras package for
Python are used to extract 3D patches (containing the true classes) and to split the
reduced high-dimensional input in a 0.7 : 0.3 ratio (on a scale of 1) for encoding.
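The preprocessing described above (PCA on the spectral axis, patch extraction around labelled pixels, then a 0.7 : 0.3 split) can be sketched with NumPy alone. The function names pca_reduce and extract_patches, the toy cube, the use of label 0 for unlabelled pixels, zero padding at the borders and the 5 × 5 window are all assumptions made for this illustration; the chapter's own pipeline uses Keras and is not reproduced here.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project every pixel spectrum onto the leading principal components."""
    rows, cols, bands = cube.shape
    flat = cube.reshape(-1, bands).astype(float)
    flat = flat - flat.mean(axis=0)                       # centre each band
    _, _, vt = np.linalg.svd(flat, full_matrices=False)   # rows of vt: principal axes
    return (flat @ vt[:n_components].T).reshape(rows, cols, n_components)

def extract_patches(cube, labels, window):
    """Cut a window x window x bands patch around every labelled pixel."""
    margin = window // 2
    padded = np.pad(cube, ((margin, margin), (margin, margin), (0, 0)))  # zero padding
    patches, targets = [], []
    for r in range(cube.shape[0]):
        for c in range(cube.shape[1]):
            if labels[r, c] == 0:          # assumption: 0 marks unlabelled pixels
                continue
            patches.append(padded[r:r + window, c:c + window, :])
            targets.append(labels[r, c] - 1)
    return np.stack(patches), np.array(targets)

rng = np.random.default_rng(0)
cube = rng.random((10, 10, 12))                  # toy cube: 10 x 10 pixels, 12 bands
labels = rng.integers(0, 7, size=(10, 10))       # toy ground truth: 0 plus six classes

reduced = pca_reduce(cube, n_components=3)
patches, targets = extract_patches(reduced, labels, window=5)

# Random 0.7 : 0.3 split of the labelled patches.
order = rng.permutation(len(targets))
n_train = int(0.7 * len(targets))
train_idx, test_idx = order[:n_train], order[n_train:]
x_train, y_train = patches[train_idx], targets[train_idx]
x_test, y_test = patches[test_idx], targets[test_idx]
```

Each patch keeps the labelled pixel at its centre, so the label of a patch is the class of its centre pixel.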
Next, the patches were processed by a 3D-CNN with Convolution, Dropout and Dense
layers and 1,204,098 trainable parameters. The model was tried with the six optimizers
discussed in Sect. 4.2 and the best one was selected (here, Adam). The TensorBoard,
EarlyStopping and ModelCheckpoint callbacks were used, respectively, to keep track of
the learning logs during every batch, to monitor a learning metric in order to avoid
overfitting, and to save epoch-level checkpoints. The classification accuracy for the
input HI is shown below.
To handle the unbalanced classes and to minimize the loss in the training and
validation of HI patches, the categorical cross-entropy (CCE) loss is used (Fig. 6).
CCE is defined as $CCE = -\sum_{i}^{C} t_i \log f(S_i)$ with
$f(S_i) = e^{S_i} / \sum_{j}^{C} e^{S_j}$, where C is the set of classes, $t_i$ are the
ground truths and $S_i$ is the CNN score for class i, passed through the softmax
activation function f. The data were augmented using random horizontal and vertical
flips. After tuning, early stopping used monitor = 'val_loss' and restore_best_weights
= True, the batch size was set to (1024 × 6), and the optimizer used was Adam with
CCE.
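The softmax and CCE definitions translate directly into NumPy. The sketch below is illustrative; the function names and the example scores are assumptions, and the maximum score is subtracted before exponentiation purely for numerical stability, which does not change the softmax value.

```python
import numpy as np

def softmax(scores):
    """f(S_i) = exp(S_i) / sum_j exp(S_j), computed row-wise."""
    shifted = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(one_hot, scores):
    """CCE = -sum_i t_i log f(S_i) for one-hot ground truths t."""
    return -np.sum(one_hot * np.log(softmax(scores)), axis=-1)

scores = np.array([[0.0, 0.0, 0.0],      # uninformative scores
                   [5.0, 0.0, 0.0]])     # confident score for class 0
truth = np.array([[1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0]])

losses = categorical_cross_entropy(truth, scores)
```

For three equal scores the loss equals log 3, and it falls towards zero as the score of the true class grows relative to the others.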

Fig. 6 Training, testing: Sundarban mangrove a Accuracy b Loss

6 A Novel Deep Learning Hybrid-MSSN Architecture for Hyperspectral Image Classification

Scientists often combine two or more types of architecture instead of relying on a
single approach (hybrid models), which can give better results when dealing with
complex problems. In other words, hybrid models are a class of methods that integrate
the advantages of different models in the same system. The following sections describe
the methodology for classifying hyperspectral images (HIs) from three HI datasets.

6.1 Architecture of Hybrid-MSSN Model

The architecture of our HI-based deep classification model is presented in Fig. 7.
The model uses multi-scale CNNs and a spinal fully connected network (SFCN). We use
3D-CNNs for joint spectral and spatial feature extraction and a 2D-CNN for spatial
feature learning. First, the model is initialized with the high-dimensional,
satellite-band-based HI, which carries the high-spectral features. We use principal
component analysis (PCA) to filter the unnecessary bands, de-correlate and reduce the
spectral dimension without compromising the HI's information content.

6.2 Dataset Description

We have used the following three popular HI-datasets [13] to validate our Hybrid-
MSSN model (Table 1).

Fig. 7 Proposed architecture of our model



Table 1 Description of experimental HI-datasets

| HI dataset | HI-captured source | HI description | Ground truth description |
|---|---|---|---|
| Indian Pines (IPD) | North-western Indiana (AVIRIS sensor) | Spatial dimension 145 × 145 × 200 (rows, columns, filtered bands) | Classes: 16; patches: 21,025; land coverage (forest): 66%; land coverage (farming): 33%; crop coverage (corn, soybeans); highways (dual lane): 1; railway line: 1; houses/structures/roads |
| Salinas (SD) | Salinas Valley, California (AVIRIS sensor) | Spatial dimension 512 × 217 × 20 (rows, columns, filtered bands) | Classes: 16; patches; land coverage (bare soils); land coverage (vineyard fields); land coverage (farming); crop coverage (vegetables) |
| Pavia University (PUD) | Pavia, northern Italy (ROSIS sensor) | Spatial dimension 610 × 610 × 103 (rows, columns, filtered bands) | Classes: 09; patches |

6.3 Experimental Setup and Result Analysis

The detailed process of the model is outlined in Algorithm 1. The experimental setup
is based on the Google Colaboratory Pro cloud platform with Python, Jupyter notebooks
and GPUs. Keras, a deep learning library, was used to validate the model.

In deep CNN classification, as the network becomes deeper, the spatial dimensions of
the feature maps shrink sharply, which results in a loss of information. In
conventional designs the FC layer is usually attached to the deepest convolutional (or
pooling) layer, so the network depends heavily on global data, which leads to high
computation time. To overcome these issues, we use both shallow and deep convolution
features [76] to account for the complexity of HIs, where distinct items are likely to
have varying scales, and we use the Spinal Fully Connected network (SpinalNet, [34])
instead of the dense layer.
For the experiments, we first use the PCA transformation to extract the r most
informative spectral bands (r = 30 for IPD, r = 15 for PUD and r = 15 for SD) as per
the Modified Broken Stick Rule (MBR) [3]. With iterative noise filtration we obtain HI
cubes with a reduced dimension of (13 × 13 × r), which is in line with the findings in
[59].
The HI cubes were further divided into two groups with distinct training and testing
samples (Fig. 8). One group uses 10 percent of the samples for training and 90 percent
for testing, and the other group uses 30 and 70 percent, to compensate for the problem
of class imbalance. Table 2 presents the classification outcomes on the three datasets
with oversampling.
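The chapter does not state which oversampling method was applied to the weak classes. As one common possibility, the sketch below shows plain random oversampling, in which samples of each minority class are repeated at random until every class matches the largest one; the function name random_oversample and the toy data are assumptions made for illustration.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Repeat minority-class samples at random until all classes are equally large."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # draw the missing samples with replacement from the same class
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

X = np.arange(20.0).reshape(10, 2)           # ten toy samples with two features
y = np.array([0] * 8 + [1] * 2)              # class 1 is the weak class

X_bal, y_bal = random_oversample(X, y)
balanced_counts = np.bincount(y_bal)
```

Oversampling should be applied to the training split only, so that repeated samples never leak into the test set.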
The approach also achieves impressive results on all 3D patches of the three datasets
without oversampling; for instance, the accuracies on the three datasets with a
(13 × 13) patch size are presented in Table 3. We use accuracy measures such as Overall

Fig. 8 Testing: IPD, SD, PUD a Overall accuracy b Average accuracy c Kappa score

Table 2 Classification performance: with oversampling

| HI dataset | Window size (3D patch) | Train:test ratio (10:90) | Train:test ratio (30:70) |
|---|---|---|---|
| IPD | 9 × 9 | 99.956 ± 0.01 | 99.958 ± 0.01 |
| IPD | 11 × 11 | 99.967 ± 0.02 | 99.986 ± 0.01 |
| IPD | 13 × 13 | 99.967 ± 0.02 | 100.000 ± 0.01 |
| SD | 9 × 9 | 99.934 ± 0.002 | 99.981 ± 0.002 |
| SD | 11 × 11 | 100.000 ± 0.001 | 100.000 ± 0.001 |
| SD | 13 × 13 | 100.000 ± 0.000 | 100.000 ± 0.000 |
| PUD | 9 × 9 | 99.989 ± 0.002 | 99.981 ± 0.002 |
| PUD | 11 × 11 | 99.994 ± 0.002 | 100.000 ± 0.002 |
| PUD | 13 × 13 | 100.000 ± 0.001 | 100.000 ± 0.001 |

Accuracy (OA), Average Accuracy (AA), Kappa value (KA) and Class-wise accuracy
to evaluate the model.
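All four measures can be derived from a confusion matrix. The following is a minimal NumPy sketch using the standard definitions of overall accuracy, average (mean per-class) accuracy and Cohen's kappa; the function name, the toy labels and the assumption that every class occurs at least once in the ground truth are made here for illustration.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Return OA, AA, kappa and the class-wise accuracies."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)            # rows: true class, columns: prediction
    total = cm.sum()
    oa = np.trace(cm) / total                     # fraction of correctly labelled samples
    per_class = np.diag(cm) / cm.sum(axis=1)      # accuracy within each true class
    aa = per_class.mean()
    # agreement expected by chance, from the row and column marginals
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa, per_class

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])

oa, aa, kappa, per_class = classification_scores(y_true, y_pred, n_classes=3)
```

In this toy case OA and AA are both 0.75 while kappa is 0.6, which shows how kappa discounts the agreement that would be expected by chance.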
We use the first category of 10 percent training samples for model validation.
With $2^6$, $2^7$ and $2^8$ filters (3 × 3 × 3 dimension) in the first, second and
third 3D-convolution layers respectively, we adopt the 'ReLU' activation function. In
the model, each 3D convolution layer is followed by 3D max pooling with a pooling size of

Table 3 Classification performance: without oversampling

| Data set | Accuracy measure | 20:80 (%) | 20:80 Time (s) | 30:70 (%) | 30:70 Time (s) | 80:20 (%) | 80:20 Time (s) |
|---|---|---|---|---|---|---|---|
| IPD | OA | 98.65 ± 0.05 | 79.74 | 99.27 ± 0.03 | 85.38 | 99.65 ± 0.01 | 144.71 |
| IPD | AA | 98.69 ± 0.20 | 02.21 | 98.07 ± 0.12 | 02.77 | 99.43 ± 0.02 | 0.718 |
| IPD | K | 98.47 ± 0.10 | | 99.15 ± 0.03 | | 99.61 ± 0.02 | |
| SD | OA | 99.20 ± 0.05 | 184.22 | 99.58 ± 0.03 | 138.38 | 99.95 ± 0.01 | 213.21 |
| SD | AA | 99.32 ± 0.20 | 07.59 | 99.68 ± 0.12 | 06.67 | 99.92 ± 0.02 | 01.97 |
| SD | K | 99.10 ± 0.10 | | 99.53 ± 0.00 | | 99.94 ± 0.02 | |
| PUD | OA | 99.84 ± 0.05 | 59.18 | 99.84 ± 0.03 | 86.62 | 99.99 ± 0.01 | 206.87 |
| PUD | AA | 99.70 ± 0.20 | 06.57 | 99.65 ± 0.12 | 05.85 | 99.99 ± 0.02 | 01.37 |
| PUD | K | 99.79 ± 0.10 | | 99.80 ± 0.03 | | 99.99 ± 0.02 | |

2 and a dropout ratio of 0.5. The 2D-convolution layer in the model has 256 filters
(3 × 3 dimension) and a dropout ratio of 0.25. In all SFCNs (1–5), the layer width is
set to 256 and the half width is set to half of the layer width rounded to an integer;
both settings play a significant role. We use the Adam optimizer with the categorical
cross-entropy loss function (Fig. 9). The learning rate and decay were set to 0.001 and
1e−06 respectively. The model is trained over 20 epochs with a batch size of 256.
The model is compared with four published methods (Fig. 10): EMP-SVM [18],
MCNN-CP [78], 2D-CNN [50] and hybrid-SN [59].
The performance of the model is also investigated (Table 4) by repeating the
experiments with data that contain noise, with and without weak-class oversampling,
and with different spatial sizes and train-test ratios (Fig. 11).
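The three noise types of Table 4 can be simulated as sketched below. The chapter does not define the parameters v and a, so this sketch assumes that v acts as a noise variance (var) and a as the fraction of corrupted values (amount); the function names and the toy cube are likewise hypothetical.

```python
import numpy as np

def add_gaussian(cube, var, rng):
    """Additive zero-mean Gaussian noise with the given variance."""
    return cube + rng.normal(0.0, np.sqrt(var), cube.shape)

def add_speckle(cube, var, rng):
    """Multiplicative noise: each value is perturbed in proportion to itself."""
    return cube + cube * rng.normal(0.0, np.sqrt(var), cube.shape)

def add_salt_pepper(cube, amount, rng):
    """Set a fraction `amount` of values to the minimum (pepper) or maximum (salt)."""
    noisy = cube.copy()
    mask = rng.random(cube.shape)
    noisy[mask < amount / 2] = cube.min()
    noisy[mask > 1 - amount / 2] = cube.max()
    return noisy

rng = np.random.default_rng(0)
cube = rng.random((4, 4, 3))                 # toy cube standing in for an HI patch

gaussian = add_gaussian(cube, 0.01, rng)
speckle = add_speckle(cube, 0.01, rng)
clean_copy = add_salt_pepper(cube, 0.0, rng)     # amount 0 leaves the data unchanged
salt_pepper = add_salt_pepper(cube, 0.5, rng)
```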

Fig. 9 Epochs and training/validation loss a–c 10% Training IPD, SD, PUD d–f 30% Training IPD,
SD, PUD

Fig. 10 Class-wise classification accuracy: Training sample (T.S.) with oversample (O.S.) a & b
IPD c & d SD e & f PUD

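Table 4 Accuracy of datasets with noise

| Data set | Accuracy (%) | Speckle v = 10 | Speckle v = 30 | Speckle v = 50 | Gaussian v = 10 | Gaussian v = 30 | Gaussian v = 50 | Salt & Pepper a = 0 | Salt & Pepper a = 0.5 | Salt & Pepper a = 1 |
|---|---|---|---|---|---|---|---|---|---|---|
| IPD | OA | 98.68 | 99.60 | 99.75 | 98.87 | 96.54 | 97.59 | 99.95 | 99.60 | 95.56 |
| IPD | AA | 99.09 | 98.50 | 98.62 | 99.26 | 96.54 | 97.59 | 99.91 | 99.50 | 97.12 |
| IPD | K | 98.49 | 99.55 | 99.72 | 98.71 | 96.55 | 98.44 | 99.94 | 99.55 | 94.95 |
| SD | OA | 99.94 | 96.91 | 99.38 | 99.88 | 99.94 | 99.85 | 99.83 | 99.52 | 97.16 |
| SD | AA | 99.89 | 96.17 | 99.33 | 99.70 | 99.88 | 99.80 | 99.79 | 99.16 | 93.85 |
| SD | K | 99.93 | 96.56 | 99.31 | 99.87 | 99.93 | 99.83 | 99.81 | 99.47 | 96.83 |
| PUD | OA | 99.97 | 99.92 | 99.76 | 99.27 | 99.10 | 95.89 | 100.00 | 99.73 | 99.53 |
| PUD | AA | 99.94 | 99.75 | 99.70 | 99.56 | 98.31 | 93.77 | 100.00 | 99.38 | 99.49 |
| PUD | K | 99.96 | 99.90 | 99.69 | 99.88 | 98.80 | 94.58 | 100.00 | 99.64 | 99.38 |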

7 Conclusions and Future Scope

This chapter addresses basic issues related to satellite imaging and hyperspectral
classification techniques. In the first part of the experimental analysis, we used a
Sentinel-2 satellite image of the Sundarban mangrove and classified the land cover with
respect to six ground-truth labels with comparatively better accuracy. Further, to
address identified issues such as training-size limitation, computational time and
classification performance under noise, we adopted a combined 3D-2D DL approach for
the generation of hierarchical, discriminative deep spectral-spatial features and HI
classification. A multi-scale feature learning technique is employed in the framework,
which increases the ability of the model to classify objects of diverse shapes even
after the information loss caused by the convolution mechanism. The use of the
SpinalNet model enhances the accuracy and controls the error. Experimental results
demonstrate that our model can classify with a limited number of training samples, and
thus avoids the need for oversampling, and performs well even

Fig. 11 Classification maps: a–c Ground truth of IPD, SD and PUD d–f Our model (with 30%
training) of IPD, SD and PUD g–i Our model (with 10% training and oversampling) of IPD, SD
and PUD

in the presence of speckle, Gaussian and salt-and-pepper noise (Table 4). Evaluated on
three benchmark datasets, the model gives consistent and competitive values for Overall
Accuracy (OA), Average Accuracy (AA) and Kappa Accuracy (KA) compared with four other
state-of-the-art models. Being a supervised classification model, it is best used on
labeled hyperspectral datasets and is most suitable for applications based on land
cover mapping, agriculture and global climate.

References

1. Adate, A., Arya, D., Shaha, A., & Tripathy, B. K. (2020). Impact of deep neural learning on
artificial intelligence research. In S. Bhattacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy
(Eds.), Deep Learning Research and Applications (pp. 69–84). De Gruyter Publications.
https://doi.org/10.1515/9783110670905-004
2. Adate, A., & Tripathy, B. K. (2018). Deep learning techniques for image processing. In S.
Bhattacharyya, H. Bhaumik, A. Mukherjee & S. De (Eds.), Machine Learning for Big Data
Analysis (pp. 69–90). De Gruyter. https://doi.org/10.1515/9783110551433-00357
3. Bajorski, P. (2010). Investigation of virtual dimensionality and broken stick rule for hyperspec-
tral images. In 2010 2nd Workshop on Hyperspectral Image and Signal Processing: Evolution
in Remote Sensing (pp. 1–4).
4. Benediktsson, J. A., Palmason, J. A., & Sveinsson, J. R. (2005). Classification of hyperspec-
tral data from urban areas based on extended morphological profiles. IEEE Transactions on
Geoscience and Remote Sensing, 43(3), 480–491.
5. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., et al. (2007). Greedy layer-wise training
of deep networks. Advances in neural information processing systems, 19, 153.
6. Bhattacharyya, S., Snasel, V., Hassanian, A. E., Saha, S., & Tripathy, B. K. (2020). Deep
learning research with engineering applications. De Gruyter Publications. ISBN: 3110670909,
9783110670905. https://doi.org/10.1515/9783110670905
7. Bhardwaj, P., Guhan, T., & Tripathy, B. K. (2021). Computational biology in the lens of CNN.
In S. S. Roy & Y.-H. Taguchi (Eds.), Handbook of Machine Learning Applications for
Genomics (Chapter 5), Studies in Big Data (vol. 103). ISBN: 978-981-16-9157-7
8. Binol, H. (2018). Ensemble learning based multiple kernel principal component analysis for
dimensionality reduction and classification of hyperspectral imagery. Mathematical Problems
in Engineering, 2018, 14. Article ID 9632569.
9. Bose, A., & Tripathy, B. K. (2020). Deep learning for audio signal classification. In S.
Bhattacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy (Eds.), Deep Learning Research and
Applications (pp. 105–136). De Gruyter Publications.
https://doi.org/10.1515/9783110670905-00660
10. Bruce, L. M., Li, J., & Huang, Y. (2002). Automated detection of subpixel hyperspectral targets
with adaptive multichannel discrete wavelet transform. IEEE Transactions on Geoscience and
Remote Sensing, 40(4), 977–980.
11. Chen, Y., Lin, Z., Zhao, X., Wang, G., & Gu, Y. (2014). Deep learning-based classification of
hyperspectral data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote
Sensing, 7(6), 2094–2107.
12. COAH: Copernicus Open Access Hub. https://scihub.copernicus.eu
13. Grupo de Inteligencia Computacional. (2014). Hyperspectral remote sensing scenes.
http://www.ehu.eus/ccwintco/index.php
14. Debgupta, R., Chaudhuri, B. B., Tripathy, B. K. (2020). A wide ResNet-based approach for
age and gender estimation in face images. In A. Khanna, D. Gupta, S. Bhattacharyya, V.
Snasel, J. Platos, A. Hassanien (Eds.), International Conference on Innovative Computing and
Communications, Advances in Intelligent Systems and Computing (vol. 1087, pp. 517–530).
Springer. https://doi.org/10.1007/978-981-15-1286-5_44
15. Deepa, P., & Thilagavathi, K. (2015). Feature extraction of hyperspectral image using prin-
cipal component analysis and folded-principal component analysis. In 2015 2nd International
Conference on Electronics and Communication Systems (ICECS) (pp. 656–660).
16. Dharmasastha, K. N. S., Banu, K. S., Kalaichevlan, G., Lincy, B., & Tripathy, B. K. (2022).
Classification of pest in tomato plants using CNN. In M. N. Mohanty, S. Das, M. Ray, B. Patra
(Eds.), Meta Heuristic Techniques in Software Engineering and Its Applications. METASOFT
2022. Artificial Intelligence-Enhanced Software and Systems Engineering (vol. 1). Springer.
https://doi.org/10.1007/978-3-031-11713-8_6
17. Du, Q. (2007). Modified fisher’s linear discriminant analysis for hyperspectral imagery. IEEE
Geoscience and Remote Sensing Letters, 4(4), 503–507.

18. Fauvel, M., Benediktsson, J. A., Chanussot, J., & Sveinsson, J. R. (2008). Spectral and spatial
classification of hyperspectral data using svms and morphological profiles. IEEE Transactions
on Geoscience and Remote Sensing, 46(11), 3804–3814.
19. Fauvel, M., Tarabalka, Y., Benediktsson, J. A., Chanussot, J., & Tilton, J. C. (2012). Advances
in spectral-spatial classification of hyperspectral images. Proceedings of the IEEE, 101(3),
652–675.
20. Fu, A., Ma, X., & Wang, H. (2018). Classification of hyperspectral image based on hybrid
neural networks. In IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing
Symposium (pp. 2643–2646).
21. Fukushima, K., & Miyake, S. (1982). Neocognitron: A self-organizing neural network model
for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets
(pp. 267–285). Springer.
22. Ghasemzadeh, A., & Demirel, H. (2016) Hyperspectral face recognition using 3d discrete
wavelet transform. In 2016 Sixth International Conference on Image Processing Theory, Tools
and Applications (IPTA) (pp. 1–4).
23. Ghiya, A. S., Vijay, V., Ranganath, A., Chaturvedi, P., Tripathy, B. K., & Banu, K. S. (2021).
Weather classification: Image embedding using convolutional autoencoder and predictive
analysis using stacked generalization. In ANTIC conference. BHU.
24. Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., & Lew, M. S. (2016). Deep learning for visual
understanding: A review. Neurocomputing, 187, 27–48.
25. Gupta, P., Bhachawat, S., Dhyani, K., & Tripathy, B. K. (2021). A study of gene characteristics
and their applications using deep learning. In S. S. Roy & Y.-H. Taguchi (Eds.), Handbook of
Machine Learning Applications for Genomics (Chapter 4), Studies in Big Data (vol. 103).
ISBN: 978-981-16-9157-7
26. Hamida, A. B., Benoit, A., Lambert, P., & Amar, C. B. (2018). 3-d deep learning approach
for remote sensing image classification. IEEE Transactions on geoscience and remote sensing,
56(8), 4420–4434.
27. Harikiran, J., Ladi, S. K., Panda, G. K., Dash, R., Ladi, P. K. (2020). Hyperspectral image
classification bi-dimensional empirical mode decomposition and deep residual networks. In
2020 International Conference on Artificial Intelligence and Signal Processing (AISP) (pp.1–
6).
28. Harsanyi, J. C., & Chang, C.-I. (1994). Hyperspectral image classification and dimensionality
reduction: An orthogonal subspace projection approach. IEEE Transactions on geoscience and
remote sensing, 32(4), 779–785.
29. Haut, J. M., Paoletti, M. E., Plaza, J., Plaza, A., & Li, J. (2019). Hyperspectral image clas-
sification using random occlusion data augmentation. IEEE Geoscience and Remote Sensing
Letters, 16(11), 1751–1755.
30. Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief
nets. Neural computation, 18(7), 1527–1554.
31. Hughes, G. (1968). On the mean accuracy of statistical pattern recognizers. IEEE transactions
on information theory, 14(1), 55–63.
32. Imani, M., & Ghassemian, H. (2014). Principal component discriminant analysis for feature
extraction and classification of hyperspectral images. In 2014 Iranian Conference on Intelligent
Systems (ICIS) (pp. 1–5).
33. Jayaprakash, C., Damodaran, B. B., Sowmya, V., & Soman, K. P. (2018). Dimensionality
reduction of hyperspectral images for classification using randomized independent component
analysis. In 2018 5th International Conference on Signal Processing and Integrated Networks
(SPIN) (pp. 492–496)
34. Kabir, H. M. D., Abdar, M., Jalali, S. M. J., Khosravi, A., Atiya, A.F., Nahavandi, S., &
Srinivasan, D. (2020). SpinalNet: Deep neural network with gradual input
35. Kathuria, A. (2018). Intro to optimization in deep learning: Momentum, RMSProp and Adam.
https://blog.paperspace.com/intro-to-optimization-momentum-rmsprop-adam/

36. Kaul, D., Raju, H., & Tripathy, B. K. (2022). Deep learning in healthcare. In D. P. Acharjya,
A. Mitra, N. Zaman (Eds.), Deep Learning in Data Analytics: Recent Techniques, Practices and
Applications, Studies in Big Data (vol. 91, pp. 97–115). Springer.
https://doi.org/10.1007/978-3-030-75855-4_6
37. Ke, C. (2017). Military object detection using multiple information extracted from hyperspec-
tral imagery. In 2017 International Conference on Progress in Informatics and Computing
(PIC) (pp. 124–128).
38. Khan, M.J., Khan, H.S., Yousaf, A., Khurshid, K., & Abbas, A. (2018). Modern trends in
hyperspectral image analysis: A review. IEEE Access. 6, 14118−14129
39. Kumar, V., & Tripathy, B. K. (2020). Detecting toxicity with bidirectional gated recurrent unit
networks. In V. Bhateja, S. Satapathy, Y. D. Zhang, V. Aradhya (Eds.), Intelligent Computing
and Communication. ICICC 2019. Advances in Intelligent Systems and Computing (vol. 1034).
Springer. https://doi.org/10.1007/978-981-15-1084-7_57
40. Kwon, H., Hu, X., Theiler, J., Zare, A, & Gurram, P. (2013). Algorithms for multispectral
and hyperspectral image analysis. Journal of Electrical and Computer Engineering, 2013, 2.
Article ID 908906
41. Ladi, S. K., Panda, G. K., Dash, R., et al. (2022). A novel grey wolf optimisation based CNN
classifier for hyperspectral image classification. Multimed Tools Appl, 81, 28207–28230.
42. Ladi, S. K., Panda, G. K., Dash, R. et al. (2022). A novel strategy for classifying spectral-spatial
shallow and deep hyperspectral image features using 1D-EWT and 3D-CNN. Earth science
informatics
43. Ladi, S. K., Dash, R., Panda, G. K., Ladi, P. K., & Dhupar, R. (2019). Hyperspectral image
classification using swt and cnn. In 2019 International Conference on Information Technology
(ICIT) (pp. 172–177).
44. Li, C., Zuo, H., Fan, T. (2017). Hyperspectral image classification based on gray level co-
occurrence matrix and local mean decomposition. In 2017 4th International Conference on
Systems and Informatics (ICSAI) (pp. 1219–1223).
45. Li, J., Bioucas-Dias, J. M., & Plaza, A. (2010). Semisupervised hyperspectral image segmen-
tation using multinomial logistic regression with active learning. IEEE Transactions on
Geoscience and Remote Sensing, 48(11), 4085–4098.
46. Li, Y., Zhang, H., & Shen, Q. (2017). Spectral–spatial classification of hyperspectral imagery
with 3d convolutional neural network. Remote Sensing, 9(1), 67.
47. Li, W., Wu, G., Zhang, F., & Du, Q. (2017). Hyperspectral image classification using deep
pixel-pair features. IEEE Transactions on Geoscience and Remote Sensing, 55(2), 844–853.
48. Ma, Y., Li, R., Yang, G., Sun, L., & Wang, J. (2018). A research on the combination strategies
of multiple features for hyperspectral remote sensing image classification. Journal of Sensors,
2018, 14. Article ID 7341973.
49. Maheswari, K., Shaha, A., Arya, D., Tripathy, B. K., & Rajkumar, R. (2020). Convolutional
neural networks: A bottom-up approach. In S. Bhattacharyya, A. E. Hassanian, S. Saha, &
B. K. Tripathy (Eds.), Deep Learning Research with Engineering Applications (pp. 21–50). De
Gruyter Publications. https://doi.org/10.1515/9783110670905-002
50. Makantasis, K., Karantzalos, K., Doulamis, A., & Doulamis, N. (2015). Deep supervised
learning for hyperspectral data classification through convolutional neural networks. In 2015
IEEE International Geoscience and Remote Sensing Symposium (IGARSS) (pp. 4959–4962).
51. Makantasis, K., Doulamis, A. D., Doulamis, N. D., & Nikitakis, A. (2018). Tensor-based
classification models for hyperspectral data analysis. IEEE Transactions on Geoscience and
Remote Sensing, 56(12), 6884–6898.
52. Makantasis, K., Doulamis, A., Doulamis, N., Nikitakis, A., & Voulodimos, A. (2018). Tensor-
based nonlinear classifier for highorder data analysis. In 2018 IEEE International Conference
53. Notesco, G., Dor, E. B., & Brook, A. (2014). Mineral mapping of makhtesh ramon in israel
using hyperspectral remote sensing day and night LWIR images. In 2014 6th Workshop on
Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS) (pp. 1–
4).
54. Pesaresi, M., Gerhardinger, A., & Kayitakire, F. (2008). A robust built-up area presence index
by anisotropic rotation-invariant textural measure. IEEE Journal of selected topics in applied
earth observations and remote sensing, 1(3), 180–192.
55. Pesaresi, M., & Benediktsson, J. A. (2001). A new approach for the morphological segmentation
of high-resolution satellite imagery. IEEE transactions on Geoscience and Remote Sensing,
39(2), 309–320.
56. Pike, R., Lu, G., Wang, D., Chen, Z. G., & Fei, B. (2016). A minimum spanning forest-based
method for noninvasive cancer detection with hyperspectral imaging. IEEE Transactions on
Biomedical Engineering, 63(3), 653–663.
57. Plaza, A., Martínez, P., Plaza, J., & Pérez, R. (2005). Dimensionality reduction and classification
of hyperspectral image data using sequences of extended morphological transformations. IEEE
Transactions on Geoscience and remote sensing, 43(3), 466–479.
58. Prabhavathy, P., Tripathy, B.K., Venkatesan, M. (2022). Analysis of diabetic retinopathy detec-
tion techniques using CNN Models. In: S. Mishra, H. K. Tripathy, P. Mallick, K. Shaalan
(Eds.), Augmented Intelligence in Healthcare: A Pragmatic and Integrated Analysis. Studies
in Computational Intelligence (vol. 1024). Springer, https://doi.org/10.1007/978-981-19-107
6-0_6
59. Roy, S. K., Krishna, G., Dubey, S. R., & Chaudhuri, B. B. (2020). Hybridsn: Exploring 3-d-2-d
cnn feature hierarchy for hyperspectral image classification. IEEE Geoscience and Remote
Sensing Letters, 17(2), 277–281.
60. Singhania, U., & Tripathy, B. K. (2021). Text-based image retrieval using deep learning. In
Encyclopedia of Information Science and Technology (5th ed., p. 11). https://doi.org/10.4018/
978-1-7998-3479-3.ch007
61. Rungta, R. K., Jaiswal, P., Tripathy, B. K. (2022). A deep learning based approach to measure
confidence for virtual interviews. In A. K. Das et al. (Eds.), Proceedings of the 4th International
Conference on Computational Intelligence in Pattern Recognition (CIPR) (pp. 278–291). CIPR
2022, LNNS 480.
62. Sihare, P., Khan, A. U., Bardhan, P., & Tripathy, B. K. (2022). COVID-19 detection using
deep learning: A comparative study of segmentation algorithms. In A. K. Das et al. (Eds.),
Proceedings of the 4th International Conference on Computational Intelligence in Pattern
Recognition (CIPR) (pp. 1–10). CIPR 2022, LNNS 480.
63. Jain, S., Singhania, U., Tripathy, B.K., Abouel, E. N., Aboudaif, M. K., & Ali, K. K. (2021).
Deep learning based transfer learning for classification of skin cancer. Sensors (Basel), 21(23),
8142 https://doi.org/10.3390/s21238142. (IF:4.35)
64. Surya, Y. S., Geetha Rani, K. T., & Tripathy, B. K. (2022). Social distance monitoring and face
mask detection using deep learning. In: J. Nayak, H. Behera, B. Naik, S. Vimal, D. Pelusi (Eds.),
Computational Intelligence in Data Mining. Smart Innovation, Systems and Technologies (vol.
281). Springer. https://doi.org/10.1007/978-981-16-9447-9_36
65. Sun, T., Jiao, L., Feng, J., Liu, F., & Zhang, X. (2015). Imbalanced hyperspectral image clas-
sification based on maximum margin. IEEE Geoscience and Remote Sensing Letters, 12(3),
522–526.
66. Teng, M. Y., Mehrubeoglu, R., King, S. A., Cammarata, K., & Simons, J. (2013). Investigation
of epifauna coverage on seagrass blades using spatial and spectral analysis of hyperspectral
images. In 2013 5th Workshop on Hyperspectral Image and Signal Processing: Evolution in
Remote Sensing (WHISPERS) (pp. 1–4).
67. Tripathy, B. K., & Anuradha, J. (2015). Soft computing-advances and applications. Cengage
Learning publishers. ASIN: 8131526194, ISBN-109788131526194
68. Tripathy, B. K., Parikh, S., Ajay, P., & Magapu, C. (2022). Brain MRI segmentation techniques
based on CNN and its variants, (Chapter-10). In J. Chaki (Ed.), Brain Tumor MRI Image
Segmentation Using Deep Learning Techniques (pp. 161−182). Elsevier publications. https://
doi.org/10.1016/B978-0-323-91171-9.00001-6
69. Tripathy, B. K., & Adate, A. (2021). Impact of deep neural learning on artificial intelligence
research, Chapter-8. In D. P. Acharjya et al (Ed.), Springer publications.
70. Voulodimos, A. (2018). Deep learning for computer vision: a brief review. Computational
Intelligence and Neuroscience, 2018, 13. Article ID 7068349.
71. Wang, & Chang, C. I. (2006). Independent component analysis based dimensionality reduction
with applications in hyperspectral image analysis. In IEEE Transactions on Geoscience and
Remote Sensing (vol. 44, no. 6, pp. 1586–1600).
72. Wang, X., & Feng, Y. (2008). New method based on support vector machine in classification
for hyperspectral data. In 2008 International Symposium on Computational Intelligence and
Design (pp. 76–80)
73. Wang, Y., & Cui, S. (2014). Hyperspectral image feature classification using stationary wavelet
transform. In 2014 International Conference on Wavelet Analysis and Pattern Recognition
(pp. 104–108)
74. Wu, Y., Mu, G., Qin, C., Miao, Q., Ma, W., & Zhang, X. (2020). Semi-supervised hyperspectral
image classification via spatial-regulated self-training. Remote Sensing, 12(1)
75. Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., & Woo, W.C. (2015). Convolu-
tional LSTM network: A machine learning approach for precipitation nowcasting. In Proceed-
ings of the 28th International Conference on Neural Information Processing Systems (Vol. 1,
pp. 802–810).
76. Xu, Y., Zhang, L., Du, B., & Zhang, F. (2018). Spectral–spatial unified networks for hyper-
spectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 56(10),
5893–5909.
77. Zhang, X., Zhang, A., & Meng, X. (2015). Automatic fusion of hyperspectral images and laser
scans using feature points. Journal of Sensors, 2015, 9. Article ID 415361
78. Zheng, J., Feng, Y., Bai, C., & Zhang, J. (2021). Hyperspectral image classification using mixed
convolutions and covariance pooling. IEEE Transactions on Geoscience and Remote Sensing,
59(1), 522–534.
79. Zhong, Z., Li, J., Luo, Z., & Chapman, M. (2018). Spectral–spatial residual network for
hyperspectral image classification: A 3-d deep learning framework. IEEE Transactions on
Geoscience and Remote Sensing, 56(2), 847–858
80. Zhou, F., Hang, R., Liu, Q., & Yuan, X. (2019). Hyperspectral image classification using
spectral-spatial lstms. Neurocomputing, 328, 39–47.
Chest X-Ray Image Classification
of Pneumonia Disease Using EfficientNet
and InceptionV3

Neel Ghoshal, Mohd Anas, and Sanjiban Sekhar Roy

1 Introduction

Pneumonia is a respiratory infection that affects the lungs. It causes inflammation in the
lungs and fluid buildup in the air sacs, leading to breathing difficulties and accompanying
cardiovascular effects. Pneumonia is considered to be the single largest infectious cause of
death in children worldwide and accounts for a substantial share of the estimated 5.9 million
annual deaths among children under 5 years old [1]. Chest X-Rays and other radiography
methods have long been used in medicine for diagnosing and treating conditions such as
cancer, infections, emphysema and pneumonia. The analysis of X-Ray outputs is generally
carried out in person by expert radiologists. In recent times, the number of cases requiring
chest X-Rays has increased substantially [2], so radiologists now have to devote more time to
this task. Expertise is required because the structures visible in the lung are highly
detailed, and subtle characteristics have to be interpreted together before they point
towards a general illness category. With this growing volume of Chest X-Ray data to be
processed manually, time delays, higher costs and diagnostic errors become more likely, all
of which medical institutions need to avoid. In this chapter, we propose an automated
medical image diagnosis system that gives radiologists and staff an alternative, convenient
way to process and analyze data efficiently with little manual work. For our problem
statement, we have used two Convolutional Neural Network (CNN) based algorithms to classify
Chest X-Ray scans for pneumonia. CNN based algorithms work well on this image classification
problem because of their inherent ability to reduce the dimensionality of the data and to
process it efficiently with accurate results [3]. These advantages stem from the layers of
the network and their tasks: the Convolution Layer, which breaks the image down into smaller
sub-parts to produce an efficient, lower-dimensional representation; the Pooling Layer, which
takes the output of the convolution layer as input and reduces the dimensionality further;
and the Fully Connected Layer, the final layer, in which the network learns which sub-parts
are necessary for the classification problem at hand.

N. Ghoshal · M. Anas · S. S. Roy (B)
School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis,
Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_9
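
As a rough illustration of the three layer types described in the introduction, the
following NumPy sketch passes a toy image through one convolution, one max pooling step and
one fully connected layer. The function names, the edge filter, the image size and the
random weights are illustrative assumptions and are not taken from the models used in this
chapter.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a kernel over a 2-D image (no padding, stride 1).

    Like most deep learning libraries, this computes a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    trimmed = feature_map[:h, :w]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def dense_sigmoid(features, weights, bias):
    """Fully connected layer with a sigmoid output for a binary decision."""
    z = features @ weights + bias
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
image = rng.random((8, 8))                  # toy 8x8 grayscale "scan"
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])       # hypothetical vertical-edge filter

conv_out = conv2d_valid(image, kernel)      # 8x8 -> 6x6 feature map
pooled = max_pool2d(conv_out)               # 6x6 -> 3x3 after pooling
flat = pooled.ravel()                       # 9 features for the dense layer
weights = rng.normal(size=flat.shape[0])    # untrained, random weights
probability = dense_sigmoid(flat, weights, 0.0)
```

Each stage shrinks the representation (64 pixels, then 36 values, then 9), which is the
dimensionality reduction referred to above. In a trained network the kernel and the dense
weights would be learned from data rather than fixed or drawn at random.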

2 Literature Survey

To date, several proposals and advances have addressed similar, specific medical diagnosis
problems. CNNs and Deep Neural Networks have allowed researchers to build sophisticated
models for medical conditions including pneumonia, tuberculosis, Covid-19, lung cancer and
many more [4].
Researchers in their respective fields employ many different techniques and methodologies
for medical diagnosis tasks, among them Convolutional Neural Networks, Transfer Learning,
Image-Level Prediction, Segmentation Networks, Localization Networks, Image Generation
Networks and Domain Adaptation Networks [4].
For example, Crosby et al. used deep CNNs to distinguish between binary labelled chest
radiograph data [5]. Deep Learning has also been employed to detect foreign objects in chest
radiographs using similar data [6]. Generative Adversarial Networks have been deployed for
organ segmentation and bone suppression tasks in Chest X-Rays [7]. Transfer learning based
image classifier models have been studied by Showkat et al. for the detection of Covid-19
pneumonia [8]. Hirata et al. used Deep Learning techniques to detect elevated pulmonary
artery wedge pressure from standard Chest X-Ray data [9]. The research community working on
these tasks has established a foothold for CNNs in computer vision problems of this kind: in
2015 and 2016, more than 300 papers on applications of deep learning were published in
workshops, conferences, journals and special issues in this domain [10].

Fig. 1 Three samples from normal and pneumonia classes

3 Dataset

The dataset used to train our proposed models was obtained from Kaggle and is named
“Chest X-Ray Images (Pneumonia)”. It consists of 5863 images used as samples, each of which
carries a binary label marking the datapoint as either ‘normal’ or ‘pneumonia’. Because the
label is binary, the proposed models are tasked with detecting the presence of pneumonia in
an image, not with identifying specific types of pneumonia such as bacterial or viral. The
images in the dataset are formatted X-Ray images of the lungs (Fig. 1).
About 27% of the images in the dataset are normal lung X-Rays, and the remaining images
correspond to pneumonia (Fig. 2).
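
The class split stated above can be turned into approximate image counts, and from there
into per-class weights that compensate for the imbalance. The sketch below is only an
illustration: the counts are derived from the rounded 27% figure, and the chapter does not
state that class weighting was used in the experiments, so the weighting step is an assumed,
commonly used option.

```python
def class_counts(total_images, normal_fraction):
    """Approximate per-class counts from the total and the stated fraction."""
    normal = round(total_images * normal_fraction)
    return {"normal": normal, "pneumonia": total_images - normal}

def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

counts = class_counts(5863, 0.27)          # figures quoted in this section
weights = balanced_class_weights(counts)   # minority class gets the larger weight

for label in counts:
    print(f"{label}: {counts[label]} images, weight {weights[label]:.3f}")
```

With these weights, each class contributes equally to a weighted loss in aggregate, so the
smaller ‘normal’ class is not drowned out by the larger ‘pneumonia’ class.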

4 Data Pre-processing

Fig. 2 Image category distribution in dataset

For data pre-processing, all individual images are converted to grayscale and a Gaussian blur
is applied to them. Converting the images to grayscale helps tune the dataset for the image
classification task by turning each pixel into a single value that represents the intensity
of light. Gaussian blur is applied to reduce the noise and redundant data present in the
pixels. It works by smoothing the edges and boundaries of objects, which softens the
transitions between boundaries and suppresses fine-scale noise around the objects. Image
erosion is also applied to the images: the erosion function reduces or removes pixels on
object boundaries, and the number of pixels affected depends on the characteristics of the
individual image. The Canny edge detection algorithm, as implemented in OpenCV, is also
used; it reduces noise, finds the intensity gradient of the image and suppresses unwanted
pixels (Figs. 3, 4 and 5).
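
The pre-processing steps above can be sketched in plain NumPy as follows. This is a minimal
illustration under stated assumptions and not the authors' pipeline: the kernel sizes, the
sigma value and the toy image are hypothetical, and the last function computes only the
Sobel intensity gradient, which is one stage of Canny edge detection and not the full
algorithm. In OpenCV the corresponding calls are `cv2.cvtColor`, `cv2.GaussianBlur`,
`cv2.erode` and `cv2.Canny`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def to_grayscale(rgb):
    """Weighted sum of the R, G, B channels (ITU-R BT.601 luma weights)."""
    return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114

def gaussian_kernel(size=5, sigma=1.0):
    """Normalised 2-D Gaussian kernel; size is assumed to be odd."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def _windows(image, size):
    """All size x size neighbourhoods, with edge padding to keep the shape."""
    padded = np.pad(image, size // 2, mode="edge")
    return sliding_window_view(padded, (size, size))

def gaussian_blur(image, size=5, sigma=1.0):
    """Smooth the image by averaging each neighbourhood with Gaussian weights."""
    return np.einsum("ijkl,kl->ij", _windows(image, size), gaussian_kernel(size, sigma))

def erode(image, size=3):
    """Grayscale erosion: each pixel becomes the minimum of its neighbourhood."""
    return _windows(image, size).min(axis=(2, 3))

def gradient_magnitude(image):
    """Sobel gradient magnitude (the intensity-gradient stage of Canny only)."""
    sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    win = _windows(image, 3)
    gx = np.einsum("ijkl,kl->ij", win, sobel_x)
    gy = np.einsum("ijkl,kl->ij", win, sobel_x.T)
    return np.hypot(gx, gy)

# Toy input: a bright square on a dark background (hypothetical data).
rgb = np.zeros((16, 16, 3))
rgb[4:12, 4:12, :] = 1.0

gray = to_grayscale(rgb)
blurred = gaussian_blur(gray)
eroded = erode(gray)
edges = gradient_magnitude(blurred)
```

On the toy image, erosion shrinks the bright square by one pixel on every side, while the
gradient magnitude is close to zero inside the square and largest along its blurred border.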

Fig. 3 Grayscale conversion and gaussian blurring


Fig. 4 Image erosion

Fig. 5 Canny edge detection


5 Proposed Model

We have used two distinct models for this classification problem: the EfficientNet model
and the InceptionV3 model. Both of these models are based on Convolutional Neural Networks
(CNNs).

5.1 EfficientNet

EfficientNet is an architecture built on a method of model scaling in Convolutional Neural
Networks. It uniformly scales the dimensions of depth, width and resolution using a compound
coefficient. What distinguishes this architecture is that it does not scale these factors
arbitrarily; instead, it uses a fixed set of scaling coefficients to scale the network
width, depth and resolution uniformly. Using this technique, its creators surpassed the
accuracy of almost all high-performing convolutional network models of the time, while
simultaneously achieving better efficiency.
Conventional model scaling starts from (a) a baseline model and applies (b) width scaling,
(c) depth scaling or (d) resolution scaling individually, whereas the EfficientNet model
uses a method known as compound scaling, which combines all of these techniques into one
hybrid structure (Figs. 6 and 7).
In deriving the compound scaling method, it was observed that network depth should be
increased for higher resolution images, so that larger receptive fields can capture features
that span more pixels in bigger images, and that network width should correspondingly also
be increased when the resolution is higher, in order to capture the fine-grained patterns
present in the images [11]. The compound scaling method employed by the EfficientNet model
uses a coefficient ϕ to uniformly scale the width, depth and resolution of the neural
network.
The equations for the same are:

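\[
\begin{aligned}
\text{Depth: } & d = a^{\phi} \\
\text{Width: } & w = b^{\phi} \\
\text{Resolution: } & r = c^{\phi} \\
\text{s.t. } & a \cdot b^{2} \cdot c^{2} \approx 2, \\
& a \ge 1,\; b \ge 1,\; c \ge 1
\end{aligned}
\]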

where a, b, c are constants that are determined by a small grid search.

Here, ϕ is a user-specified coefficient that controls how many more resources are available
for model scaling, while a, b and c specify how to assign the extra resources to network
depth, width and resolution respectively [11].
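
To make the scaling rule concrete, the sketch below evaluates the multipliers for a few
values of ϕ. The defaults a = 1.2, b = 1.1 and c = 1.15 are the grid-search values reported
for the baseline network in [11]; the function name and the returned dictionary layout are
illustrative assumptions. Since the cost of a convolutional network grows roughly with
depth × width² × resolution², the constraint a · b² · c² ≈ 2 means that the compute cost
roughly doubles each time ϕ increases by one.

```python
def compound_scaling(phi, a=1.2, b=1.1, c=1.15):
    """Return the depth, width and resolution multipliers for coefficient phi.

    a, b, c default to the constants reported in [11]; other values can be
    passed in as long as they are >= 1.
    """
    if min(a, b, c) < 1:
        raise ValueError("a, b and c must all be >= 1")
    return {
        "depth": a ** phi,
        "width": b ** phi,
        "resolution": c ** phi,
        # Approximate growth in compute relative to the baseline network.
        "relative_cost": (a * b ** 2 * c ** 2) ** phi,
    }

baseline = compound_scaling(0)   # phi = 0 leaves the baseline unchanged
for phi in range(4):
    factors = compound_scaling(phi)
    print(phi, {name: round(value, 3) for name, value in factors.items()})
```

The printed `relative_cost` column stays close to 2 raised to the power ϕ, which is how a
single coefficient can be tied to a compute budget.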

Fig. 6 Baseline network with connecting layers

Fig. 7 Compound scaled network with connecting layers

The EfficientNet architecture is the baseline network for implementing a framework that
employs the above criteria and characteristics.

5.2 InceptionV3

InceptionV3 is an image recognition model that has achieved state-of-the-art accuracy on
image-related tasks. It builds upon the base architecture of the InceptionV1 model, which
consists of multiple filters arranged in parallel layers instead of the purely sequential
deep layers of a typical CNN model. Each module of a basic Inception model is made of 4
parallel layers: 1×1, 3×3 and 5×5 convolutions and a 3×3 max pooling layer.
The InceptionV3 model consists of building blocks including (a) convolutions, (b) average
pooling, (c) max pooling, (d) concatenations, (e) dropouts and (f) Softmax (Fig. 8).
Building upon the InceptionV1 model, it factorizes large convolutions into smaller ones,
i.e. it replaces a large filter with a stack of smaller filters for more effective
processing. The model also uses spatial factorization into asymmetric convolutions, in which
an n×n convolution is replaced by a 1×n convolution followed by an n×1 convolution, which
allows for higher efficiency in processing [12]. The model makes use of auxiliary
classifiers, which in essence act as regularizers, and parallel stride blocks are used to
reduce the grid size efficiently while avoiding a representational bottleneck.
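
The efficiency gain from the two factorizations can be checked by counting weights. The
sketch below is an illustration only: it assumes equal input and output channel counts,
ignores bias terms, and uses a hypothetical channel width of 64.

```python
def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in one convolution layer (bias terms ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels

channels = 64  # hypothetical channel count, the same for input and output

# Factorization into smaller convolutions: one 5x5 versus two stacked 3x3.
one_5x5 = conv_weights(5, 5, channels, channels)
two_3x3 = 2 * conv_weights(3, 3, channels, channels)
saving_smaller = 1 - two_3x3 / one_5x5

# Asymmetric factorization: one 3x3 versus a 1x3 followed by a 3x1.
one_3x3 = conv_weights(3, 3, channels, channels)
asymmetric = conv_weights(1, 3, channels, channels) + conv_weights(3, 1, channels, channels)
saving_asymmetric = 1 - asymmetric / one_3x3

print(f"5x5 -> two 3x3: {saving_smaller:.0%} fewer weights")
print(f"3x3 -> 1x3 + 3x1: {saving_asymmetric:.0%} fewer weights")
```

Both replacements cover the same receptive field as the filter they replace (5×5 and 3×3
respectively), so the saving comes without shrinking the region each output value can see.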

Fig. 8 Input layer and output layer dimensions for InceptionV3 model

6 Experimental Outcome and Analysis

6.1 InceptionV3

Figure 9 shows the training and validation accuracy curves for the InceptionV3 model, which
was trained for 15 epochs. The peak accuracy achieved by the model is 92.93%. The training
accuracy rises gradually with small fluctuations, which result from the fine tuning of the
model's prediction confidence values, until it reaches its peak and declines slightly
thereafter. The validation accuracy follows a similar curve until it drops to an extremely
low value, after which it stabilizes, reflecting how sensitive the accuracy is to changes in
the model parameters.
The loss function for the model, shown in Fig. 10, declines steeply at first and then settles
at its lowest value in a stable manner. The validation loss curve does not fall as steeply;
it passes through a sudden high peak partway through training, after which it stabilizes and
ends close to the final value of the training loss curve.
Fig. 9 Accuracy curve of inception model

Fig. 10 Loss value curve of inception model

These results indicate a strong benchmark for pneumonia diagnosis using CNN based
algorithms. Compared with other models for similar tasks, the model performs better and is
at the same time more efficient, owing to the architectural features of the baseline
Inception models described in Sect. 5.2.

6.2 EfficientNet

Figure 11 shows the training and validation accuracy curves for the EfficientNet model,
which was trained for 10 epochs. The peak accuracy achieved by the model is 95.39%. The
training accuracy increases steeply after the first epoch and then rises gradually and
stably until it reaches its peak value after the last epoch. The validation accuracy follows
a similar curve until it drops to an extremely low value, increases steeply in the
subsequent epoch, and drops to an extremely low value again two epochs later.
Fig. 11 Accuracy curve of EfficientNet model

The loss function for the model, shown in Fig. 12, declines initially and reaches its lowest
value with only negligible ups and downs along the curve. The validation loss curve shows a
steep initial decline similar to that of the training loss. It reaches its lowest value in
the following steps, but then suddenly increases sharply, decreases in the following epoch,
and increases substantially again after 2 more epochs.
These results likewise indicate a strong benchmark for pneumonia diagnosis using CNN based
algorithms. Compared with other models for similar tasks, the model performs better and is
at the same time more efficient and customizable, owing to the scaling method built into the
baseline EfficientNet models described in Sect. 5.1.

7 Discussion

Fig. 12 Loss value curve of EfficientNet model

Radiologists, clinicians and staff working on detecting and treating pneumonia and related
conditions are constrained by time, by the frequency and volume of data to be processed, and
by the expertise required. Classifiers already exist for other medical diagnosis tasks,
including breast cancer detection [13], and CNNs have recently been used in Brain Tumour
Classification [14]. Almost all of these constraints can be eased to a significant extent
through machine learning and neural network based models. At the same time, it must be noted
that the final diagnosis and the inferences drawn from it should ultimately be made by a
trained professional; for now, these classification models exist only to aid clinicians and
trained experts in streamlining their tasks. Limitations of a model like this include the
difficulty of explaining the achieved metrics and the reasons behind them, and the inability
to characterize key indicators of a subtype of the general illness, which could necessitate
alternative remedies and extend to a combination of multiple disorders either causing or
caused by pneumonia. The accuracies achieved in this chapter can be improved further by
incorporating a larger dataset or by developing more specific, custom models based
exclusively on X-Ray diagnostics. Another route to improvement is to incorporate the
patient's medical history as a feature variable in the dataset. Furthermore, data
augmentation techniques can be identified and incorporated in future models to achieve
higher output metrics [15–30].
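
As an illustration of what such data augmentation could look like, the NumPy sketch below
applies a small random translation and a random brightness change to a grayscale image with
values in [0, 1]. The chapter does not specify any augmentation scheme, so the choice of
transformations, the parameter values and the function names are all assumptions.

```python
import numpy as np

def random_shift(image, max_shift, rng):
    """Translate the image by up to max_shift pixels, filling gaps with zeros."""
    h, w = image.shape
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    shifted = np.zeros_like(image)
    # Region of the source image that remains visible after the shift.
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    top, left = max(0, dy), max(0, dx)
    shifted[top:top + src.shape[0], left:left + src.shape[1]] = src
    return shifted

def random_brightness(image, max_delta, rng):
    """Scale intensities by a random factor and clip back to [0, 1]."""
    factor = rng.uniform(1.0 - max_delta, 1.0 + max_delta)
    return np.clip(image * factor, 0.0, 1.0)

def augment(image, rng, max_shift=4, max_delta=0.1):
    """Apply a random shift followed by a random brightness change."""
    return random_brightness(random_shift(image, max_shift, rng), max_delta, rng)

rng = np.random.default_rng(42)
scan = rng.random((64, 64))                    # stand-in for a grayscale X-Ray
augmented = [augment(scan, rng) for _ in range(4)]
```

Each call produces a slightly different variant of the same image, so the effective size of
the training set grows without collecting new scans. Transformations for medical images
should be chosen with care so that they do not alter clinically relevant structure.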

8 Conclusion

In this chapter, we have discussed the experimental usage, outcomes and use-cases of the
EfficientNet and InceptionV3 models for the medical diagnosis of pneumonia via Chest X-Rays.
We have achieved peak accuracies of 95.39% with EfficientNet and 92.93% with InceptionV3 at
a comparatively low computational cost. The discussed frameworks can therefore be highly
beneficial in the medical diagnosis of the disease and can come in handy for the medical
practitioners and radiologists working on this problem. Further refinement of the approaches
and methodologies is expected to have a positive impact on this cause and to pave the way
for further improvements.

References

1. Yadav, K. K., & Awasthi, S. (2016). The current status of community-acquired pneumonia
management and prevention in children under 5 years of age in India: A review. Therapeutic
Advances in Infectious Disease, 3(3–4), 83–97.
2. Çallı, E., Sogancioglu, E., van Ginneken, B., van Leeuwen, K. G., & Murphy, K. (2021). Deep
learning for chest X-ray analysis: A survey. Medical Image Analysis, 72, 102125.
3. Li, Q., Cai, W., Wang, X., Zhou, Y., Feng, D. D., & Chen, M. (2014). Medical image clas-
sification with convolutional neural network. In 13th International Conference on Control
Automation Robotics & Vision (ICARCV), Singapore, pp. 844–848. https://doi.org/10.1109/
ICARCV.2014.7064414
4. Çallı, E., Sogancioglu, E., van Ginneken, B., van Leeuwen, K. G., & Murphy, K. (2021).
Deep learning for chest X-ray analysis: A survey. Medical Image Analysis, 72, 102125. ISSN
1361-8415 https://doi.org/10.1016/j.media.2021.102125
5. https://www.spiedigitallibrary.org/journals/journal-of-medical-imaging/volume-7/issue-1/
016501/Deep-convolutional-neural-networks-in-the-classification-of-dual-energy/https://doi.
org/10.1117/1.JMI.7.1.016501.short?SSO=1
6. Deshpande, H., Harder, T., Saalbach, A., Sawarkar, A., Buelow, T. (2020). Detection of foreign
objects in chest radiographs using deep learning. In IEEE 17th International Symposium on
Biomedical Imaging Workshops (ISBI Workshops). Iowa City, IA, USA, pp. 1–4. https://doi.
org/10.1109/ISBIWorkshops50223.2020.9153350
7. Eslami, M., Tabarestani, S., Albarqouni, S., Adeli, E., Navab, N., & Adjouadi, M. (2020).
Image-to-images translation for multi-task organ segmentation and bone suppression in chest
X-ray radiography. IEEE Transactions on Medical Imaging, 39(7), 2553–2565. https://doi.org/
10.1109/TMI.2020.2974159
8. Showkat, S., & Qureshi, S. (2022). Efficacy of transfer learning-based resnet models in chest
x-ray image classification for detecting COVID-19 pneumonia. Chemometrics and Intelligent
Laboratory Systems, 224, 104534.
9. Hirata, Y., Kusunose, K., Tsuji, T., Fujimori, K., Kotoku, J. I., & Sata, M. (2021). Deep learning
for detection of elevated pulmonary artery wedge pressure using standard chest x-ray. Canadian
Journal of Cardiology, 37(8), 1198–1206.
10. Greenspan, H., Summers, R. M., & van Ginneken, B. (2016). Deep learning in medical imaging:
Overview and future promise of an exciting new technique. IEEE Transactions on Medical
Imaging, 35(5), 1153–1159.
11. Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural
networks. In International Conference on Machine Learning (pp. 6105–6114). PMLR.
12. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception
architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (pp. 2818–2826).
13. Mittal, D., Gaurav, D., & Sekhar Roy, S. (2015). An effective hybridized classifier for breast
cancer diagnosis. In 2015 IEEE International Conference on Advanced Intelligent Mecha-
tronics (AIM), Busan, Korea (South), pp. 1026–1031. https://doi.org/10.1109/AIM.2015.722
2674
14. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain
tumor classification. Applied Sciences 10(14):4915. https://doi.org/10.3390/app10144915
15. Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep
learning. Journal of Big Data, 6, 60.
16. Roy, S. S., Hsu, C., Samaran, A., Goyal, R., Pande, A., et al. (2023). Vessels segmentation
in angiograms using convolutional neural network: A deep learning based approach. CMES-
Computer Modeling in Engineering & Sciences, 136(1), 241–255.
17. Turki, T., & Roy, S. S. (2022). Novel hate speech detection using word cloud visualization and
ensemble learning coupled with count vectorizer. Applied Sciences, 12(13), 6611.
18. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D., Mohammadi-Ivatloo, B.,
et al. (2014). L2 regularized deep convolutional neural networks for fire detection. Journal of
Intelligent & Fuzzy Systems, 1–12.
19. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural
network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy
Systems, 1–7.
20. Forecasting stock price by hybrid model of cascading multivariate adaptive regression splines
and deep neural network.
21. Bose, A., Hsu, C. H., Roy, S. S., Lee, K. C., Mohammadi-Ivatloo, B., & Abimannan, S. (2021).
Forecasting stock price by hybrid model of cascading multivariate adaptive regression splines
and deep neural network. Computers and Electrical Engineering, 95, 107405.
22. Roy, S. S., & Taguchi, Y. H. (2021). Identification of genes associated with altered gene
expression and m6A profiles during hypoxia using tensor decomposition based unsupervised
feature extraction. Scientific Reports, 11(1), 1–18.
23. Roy, S. S., & Samui, P. (2021). Predicting longitudinal dispersion coefficient in natural streams
using minimax probability machine regression and multivariate adaptive regression spline.
International Journal of Advanced Intelligence Paradigms, 19(2), 119–127.
24. Marques, G., Agarwal, D., & de la Torre, I. (2020). Automated medical diagnosis of COVID-19
through EfficientNet convolutional neural network. Applied Soft Computing, 96, 106691.
25. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of
retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions
of Electrical Engineering, 44(1), 505–518.
26. Roy, S. S., Samui, P., Nagtode, I., Jain, H., Shivaramakrishnan, V., & Mohammadi-Ivatloo,
B. (2020). Forecasting heating and cooling loads of buildings: A comparative performance
analysis. Journal of Ambient Intelligence and Humanized Computing, 11(3), 1253–1264.
27. Roy, S. S., Chopra, R., Lee, K. C., Spampinato, C., & Mohammadi-Ivatloo, B. (2020).
Random forest, gradient boosted machines and deep neural network for stock price
forecasting: A comparative analysis on South Korean companies. International Journal of Ad
Hoc and Ubiquitous Computing, 33(1), 62–71.
28. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural
network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy
Systems, 1–7.
29. Chakraborty, C., Bhattacharya, M., Sharma, A. R., Roy, S. S., Islam, M. A., Chakraborty,
S., Dhama, K., et al. (2022). Deep learning research should be encouraged for diagnosis and
treatment of antibiotic resistance of microbial infections in treatment associated emergencies
in hospitals. International Journal of Surgery (London, England), 105, 106857.
30. Lee, K. C., Roy, S. S., Samui, P., & Kumar, V. (Eds.). (2020). Data analytics in biomedical
engineering and healthcare. Academic Press.
Detection of Cancer Using Deep
Learning Techniques

Apoorv Singh, Arjunaditya, and B. K. Tripathy

1 Introduction

Cancer is a dreaded disease that poses a serious threat to human society. According
to data provided by the World Health Organisation, cancer accounted for 13% of all
fatalities in 2018 [1]. In the coming years it is predicted to rank among the
deadliest diseases in the world: as projected, 12 million individuals are likely
to be affected by cancer in 2030, and the number of cancer cases is expected to rise
dramatically in the next few years. Experts, specialists, and medical professionals are
developing new methods to combat cancer, but it is well recognized that this battle is
quite challenging [2–4].
The computer-supported evaluation of medical images by technicians is referred to
as interpretation. Diagnostic ultrasound images, however, present the physician with
a large volume of data and require thorough analysis in a short amount of time.
These imaging processes include high-energy electromagnetic radiation. Digital images
are analyzed by computer-assisted methods to detect the presence or absence of cancer
in the early stages [5].
Analysis of medical images using computer tools supports medical professionals
in interpreting the medical information inherent in the images. On the other hand,
diagnosing ultrasound images using specific imaging processes such as high-intensity
electromagnetic radiation requires the doctor to handle a significant quantity of data
and to analyze it thoroughly in a short amount of time. Digital

A. Singh
School of Electronics Engineering, VIT, Vellore, TN 632014, India
Arjunaditya
School of Computer Science and Engineering, VIT, Vellore, TN 632014, India
B. K. Tripathy (B)
School of Information Technology and Engineering, VIT, Vellore, TN 632014, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis,
Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_10
photographs analyzed by computer-assisted methods can be used to detect the
presence or absence of the disease at an early stage. Early cancer detection is
therefore the top goal for saving lives. To find and diagnose cancer in its early stages,
many visual examinations and manual methods are used. Because manual analysis of
medical images requires considerable time and expertise, computerized systems for
disease diagnosis have been proposed to improve the efficiency of medical image
interpretation [5].
Developments in AI and machine learning (ML) have progressed rapidly in
recent years, and their rise in the fields of computer vision, image processing, and
computer-assisted diagnosis is eye-catching [6]. Some of these applications use
traditional machine learning techniques such as Support Vector Machines (SVM),
decision trees, K-Nearest Neighbour (KNN), and back propagation [7]. Figure 1
illustrates the overall relationship among AI, ML and their components. An Artificial
Neural Network (ANN) has an input layer, an output layer, and a number of hidden
layers of neurons, depending on the requirements of the application. The input layer
accepts attributes in the form of input data. The weights associated with the
connections are used to compute the total input to each hidden-layer node, and an
activation function is then applied to obtain that node's output. This process
is repeated layer after layer until it reaches the output layer, which generates the
final outputs [8]. The resulting increase in prediction accuracy aids clinicians in
planning patients' treatments and in reducing the emotional and physical challenges
caused by sickness. An important aspect supporting clinical researchers is the growing
number of diagnoses made using the latest AI technology. Computer engineers and
health scientists can now successfully diagnose patients by using multi-factor
analysis, classical logistic regression, and AI-assisted analysis. This
is made possible by theoretical and technical advancements in computer programs
and statistics. These estimates are much more accurate than the experimental
estimates. Recently, researchers have started to develop new models to predict and
detect cancer using AI. These models are crucial for increasing the precision of cancer
survival and sensitivity estimations [3].
However, for the detection and management of cancer to be effective, the diagnosis
must be made in the earliest stages of the illness. Diagnosing cancer early is the most
important factor in preserving the lives of many individuals [9]. For this form of
cancer diagnosis, visual examination and manual techniques are typically used.
Interpreting medical imagery takes a lot of effort and is quite error-prone [10]. Due to
the ambiguous nature of the symptoms, the limitations of mammography and other
screening methods, and the potential for recurrence after care, detecting a cancer in
its initial phases is extremely challenging [11]. Therefore, high-resolution medical
diagnostics in cancer investigations will lead to the development of better predictive
models [12]. An analysis of studies on the identification and management of cancer in
the literature shows that the application of AI approaches is expanding [13].
Additionally, it has come to light that AI techniques are more effective than
conventional analysis methods such as statistical and multivariate analysis. Among AI
techniques, the DL approach in particular produces excellent outcomes [14].

Fig. 1 Categorization of DL neural networks

DL is a specific kind of neural network with numerous hidden layers and has recently
been implemented in many different industries [15]. It has demonstrated particularly
high efficiency in use cases such as voice recognition and image detection within
advanced devices such as driverless cars and drones [14, 16, 17].
Fundamental classification tasks, including the identification of cancerous
and healthy tissue, are often carried out with conventional ML techniques.
Deep neural networks, on the other hand, offer a better way to use data matrices to
create classification models. With these models, cancer can be identified, its
progression can be observed and predicted, and timely and effective cancer therapy can
then be administered [18].
DL approaches operate by using a backpropagation algorithm to uncover intricate
structures in huge and frequently complex datasets. Existing techniques, such as
those based on conventional machine learning, are limited when it comes to handling
raw data in its native format without preprocessing [19]. The ability to learn invariant
features is a property of convolutional neural networks (CNN), a type of DL system. To
build patterns for various object identification tasks such as detection, segmentation,
and classification, CNNs use feature pooling layers, filter banks, dropout layers,
batch normalization layers, and dense layers. CNNs include a multilevel hierarchy
in which the distribution of each layer's inputs varies throughout training. To achieve
improved performance across tasks, preprocessed data is extremely desirable [20]. There
are many other CNN variations, including those with shorter connections, such as the
DenseNet architecture, which significantly reduces the number of parameters needed to
develop effective designs and benefits feature propagation [21].
ResNets, Xception, and GoogLeNet are other CNN architectures that have proved
effective recently. These networks were motivated by the need for multiscale
processing, by the degradation of performance as networks get deeper, and by the
search for better topologies with fewer parameters [22–25].
Another critical challenge in DL is the capability of an architecture to store data
over long time periods. Long Short-Term Memory (LSTM) has been suggested as a
potential remedy for this issue. Through the states of specialized units, the LSTM
design enforces constant error flow and is local in time and space [26].
Transfer learning is another DL concept worth mentioning. It involves applying
features taken from deep convolutional neural networks to new tasks. The need for
this arises because the new tasks may differ significantly from the original tasks
and there may not be enough labeled data to train a DL architecture for them.
Transfer learning also allows features to be adapted with ease so that they
generalize well [27–29].
DL techniques utilized in cancer detection and treatment are investigated in this
paper. The purpose of the study is to demonstrate, with the help of the literature, the
effectiveness of the deep learning approach, one of the machine learning techniques,
in treating a condition like cancer, as well as the methodologies and techniques that
are employed and how they are applied [30].

2 Deep Learning

2.1 Basics of Deep Learning

DL has gained a lot of popularity and success in nearly every industry and has
emerged as a useful tool for understanding how machines perceive the world. DL
techniques are applied in fields including speech recognition, image classification,
video analysis, and natural language processing [31]. Analysis is performed on the
basis of a mathematical model created by DL, without using any hand-crafted attribute
extractor. The scope for generalization is one of the key benefits of DL techniques:
a trained neural network can be reused for additional applications and data types.
When the data set is inadequate, however, DL performs poorly [32].
DL is a machine learning approach that capitalizes on layers of nonlinear processing
units [15]. The output of each layer is fed into the subsequent layer as an input.
In the DL approach, representations of the data are learned at multiple feature levels
[33]. A hierarchy is created in the representation by deriving higher-level features
from lower-level features. While generally based on ANNs, DL techniques include
more hidden layers and neurons [34]. DL techniques show excellent outcomes when
processing a variety of data types, including text, audio, and video [35, 36]. There
are several applications of DL, including information retrieval, audio and speech
processing [14], multi-modal and multi-task learning, Natural Language Processing
(NLP), image segmentation, and image recognition [16].

2.2 Cancer Diagnosis with DL

When making a diagnosis, doctors frequently draw on their own knowledge, abilities,
and experience. A doctor can never guarantee that a diagnosis is accurate, regardless
of how talented the doctor is, and diseases are often misdiagnosed. AI technologies
therefore come onto the agenda, because AI has the capacity to evaluate vast
quantities of data, resolve complicated issues, and make very accurate predictions [4].
One of the most modern methods in AI, the DNN, comprises a number of computational
methods that are useful for extracting information from images. Many medical
disciplines, such as radiology and pathology, have used DL algorithms for various
medical tasks. Good efficiency has also been achieved in using DL tools for tumor
biology and other fields, such as medical imaging of many species [16].

2.3 Deep Neural Network Characteristics

A basic neural network consists of an input layer that is connected directly to the
output. DNNs contain several hidden layers that are efficient at handling complicated
problems, and each layer's weights are modified using the delta learning technique.
By including more hidden layers, deep neural networks can also discover complex
nonlinear interactions. DNNs are employed in both unsupervised and supervised
learning situations. Although learning occurs relatively slowly, good performance can
be achieved, and DNNs are typically employed for classification and regression
purposes [34, 35].
In [37], lesions were identified and differentiated using a DNN and endoscopic
imaging. No appreciable difference in diagnostic performance was found between the
artificial intelligence system and skilled endoscopists. The neural network approach
demonstrated great accuracy in discriminating non-cancerous lesions, as well as high
sensitivity [37].
The ability of deep neural networks to identify cancer, specifically lung cancer, in
low-dose computed tomography and positron emission tomography scans was examined in
[38]. The DNN algorithm was shown to give excellent results in detecting lung cancer.
The work also demonstrated that efforts to screen for lung cancer became more
successful as a result of the continued development of this technique [38].
• A DNN is a type of neural network with more than two layers and a specific
complexity level [27].
• Advanced mathematical modeling is used to get deeper understanding, and as a
result, the processing of data or features is considered to be complex.
• The task of pattern recognition is carried out by a neural network, which is
a metaphor for the activity of the human brain [8, 20]. In particular, patterns
are recognized to classify cells into non-cancerous and cancerous ones and for
tracking input through various simulated neural association layers [39, 40].
• Dealing with unlabeled data is the major objective of using this network, with
each layer carrying out specific types of tasks [11].

3 Architectures of Deep Learning Neural Networks

Based on the learning technique, the DL neural network architectures are classi-
fied into 4 categories: supervised, semi-supervised, unsupervised, and reinforcement
learning [41]. Figure 1 shows how DL neural networks are categorized.

3.1 Deep Unsupervised Learning

Deep unsupervised learning architectures examine the internal representation of the
data, employing a few features without the need for any labeled data. Dimensionality
reduction and clustering are typical unsupervised methods. Restricted Boltzmann
Machines (RBM) and Auto-Encoders (AE) are among the deep unsupervised learning
architectures [42].

3.2 Deep Supervised Learning

Architectures for supervised learning use labeled data for training. The inputs
together with their target results are fed to the network [43]. The trained model is
then validated during testing. Recurrent Neural Networks (RNN), Long Short-Term
Memory (LSTM), Convolutional Neural Networks (CNN), and gated recurrent units (GRU)
are a few typical methods used under supervised learning [17].

3.3 Deep Semi-supervised Learning

Partially labeled data is used for the training phase in deep semi-supervised learning
architectures. A few semi-supervised learning architectures include LSTM, RNN,
Generative Adversarial Networks (GAN), GRU, and deep reinforcement learning
[44].

4 Types of Deep Learning Architectures for Cancer Detection

4.1 Convolutional Neural Networks (CNN)

CNNs have been effective in the analysis of 2D as well as 3D images. The majority
of CNN systems are trained with a gradient-based algorithm [26].
Compared to other neural network models, there are fewer parameters to be tuned.
The CNN architecture comprises both feature extractors and a classifier
[45]. Each feature extraction layer receives its input from the layer before it and
passes its output to the layer after it. Convolution, max pooling, and classification
are the three types of layers that make up the CNN architecture. Even-numbered layers
perform convolution, while odd-numbered layers perform max pooling. The
classification layer, the final stage of the architecture, is a fully connected layer,
and it is trained with back propagation for higher accuracy. Max pooling, global
average pooling, average pooling, and min pooling are some of the several types of
pooling procedures. The convolution layer convolves the data with a kernel and
applies a linear or nonlinear activation function to create feature maps. The
activation functions include the rectified linear, sigmoid, softmax, identity, and
hyperbolic tangent functions. The downsampling operation occurs in the pooling layer,
which is also known as the subsampling layer. Depending on the application, there are
different numbers of classification layers. Figure 2 shows
the convolution neural network architecture.

Fig. 2 Architecture of a convolution neural network
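
The convolution, activation, and pooling stages described in this section can be illustrated with a small pure-Python sketch. It is not taken from any cited work: the 5 × 5 input, the 2 × 2 kernel values, and the 2 × 2 pooling window are assumptions chosen so that the numbers are easy to follow by hand.

```python
def conv2d(image, kernel):
    """Valid (no padding, stride 1) 2D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(feature_map):
    """Rectified linear activation applied element-wise."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [[max(feature_map[i * size + a][j * size + b]
                 for a in range(size) for b in range(size))
             for j in range(out_w)]
            for i in range(out_h)]

image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 3, 1, 0, 1],
]
kernel = [[1, 0], [0, -1]]  # hypothetical 2x2 filter

feature_map = relu(conv2d(image, kernel))  # 4x4 feature map
pooled = max_pool(feature_map)             # 2x2 map after downsampling
```

A classification stage would flatten `pooled` and pass it to one or more fully connected layers; that part is omitted here to keep the sketch short.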



4.2 Multi-scale Convolution Neural Network

A multi-scale convolutional neural network is created by modifying the conventional
CNN [46]. It consists of three convolution layers, a rectified linear unit (ReLU)
layer, a max-pooling layer, and two fully connected layers. The input image is
downsampled and features are extracted before being passed to the multi-scale
CNN.

4.3 LeNet-5

This is a 7-stage convolutional neural network that is used to classify handwritten
digits. It takes input images of size 32 × 32; more complicated scenarios call for a
larger number of convolution layers. Figure 3 shows the LeNet design,
which consists of two convolutional layers, subsampling layers, and fully connected
layers. Gaussian connections are used in the single output layer [47].
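How the 32 × 32 input shrinks as it passes through the convolution and subsampling layers can be worked out with simple arithmetic. The sketch below assumes the 5 × 5 convolution kernels and 2 × 2 subsampling windows of the original LeNet-5 description; those sizes and the layer names C1, S2, C3, S4 are not stated in this chapter.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size of a feature map after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_output_size(size, window):
    """Spatial size after non-overlapping subsampling."""
    return size // window

# Assumed layer sequence: (name, kind, kernel or window size).
lenet_layers = [("C1", "conv", 5), ("S2", "pool", 2),
                ("C3", "conv", 5), ("S4", "pool", 2)]

size = 32
shapes = [("input", size)]
for name, kind, k in lenet_layers:
    if kind == "conv":
        size = conv_output_size(size, k)
    else:
        size = pool_output_size(size, k)
    shapes.append((name, size))

for name, s in shapes:
    print(f"{name}: {s} x {s}")
```

Under these assumptions the feature maps shrink from 32 × 32 to 5 × 5 before they reach the fully connected layers.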

Fig. 3 Architecture of LeNet-5

4.4 AlexNet

While AlexNet's design is similar to that of LeNet, it has deeper layers, more
filters per layer, and stacked convolutional layers. A ReLU activation is applied
after every convolutional and fully connected layer. This was the winning
architecture in 2012, reducing the error from 26% to 15.3%. It includes data
augmentation, dropout, max pooling, and ReLU activations in addition to 11 × 11,
5 × 5, and 3 × 3 convolutional kernels [48]. In Fig. 4, the AlexNet architecture is
shown.

Fig. 4 Architecture of AlexNet

4.5 ZFNet

Although ZFNet's architecture was similar to AlexNet, its settings had been
fine-tuned, making it the 2013 challenge winner. It reached an error rate of 14.8%.
The number of weights is reduced by using 7 × 7 kernels rather than 11 × 11
kernels. The precision is increased as a result of reducing the number of tuning
parameters [49].
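The saving in weights from the smaller kernel can be checked with a short calculation. The number of weights in a convolution layer is the kernel area times the input channels times the number of filters; the 3 input channels and 96 filters used below are assumptions for illustration, not figures given in this chapter.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights (biases excluded) in a square-kernel convolution layer."""
    return kernel_size * kernel_size * in_channels * out_channels

in_channels, filters = 3, 96  # assumed first-layer configuration

weights_11 = conv_weights(11, in_channels, filters)
weights_7 = conv_weights(7, in_channels, filters)
ratio = weights_7 / weights_11

print(f"11 x 11 kernels: {weights_11} weights")
print(f"7 x 7 kernels:   {weights_7} weights ({ratio:.1%} of the 11 x 11 count)")
```

The ratio depends only on the kernel areas (49/121), so it is the same for any channel and filter counts.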

4.6 GoogleNet

GoogLeNet, whose name pays homage to LeNet, is built on the inception structure. It
has 22 layers, and throughout testing the error rate decreased gradually from
6.66 to 3.66%. The architecture was the winner of ILSVRC 2014 [46]. When compared
to the conventional CNN architecture, it has a reduced computational complexity.
Compared to other architectures like AlexNet and VGG [50], it was less frequently
used. In Fig. 5, the GoogleNet architecture is shown.

Fig. 5 Architecture of GoogleNet

4.7 VGGNet

The VGGNet, which consists of sixteen convolution layers with several filters, was
the runner-up in the ILSVRC 2014 classification task [39]. Feature extraction with
this architecture has been found to be effective; however, parameter tuning is quite
important. Three VGG models with 11, 16, and 19 layers were proposed: VGG-11,
VGG-16, and VGG-19. All VGG models end with three fully connected layers. Figure 6
shows the architecture of the VGGNet.

Fig. 6 Architecture of VGGNet

4.8 ResNet

ResNet, which won ILSVRC 2015, employs skip connections and batch normalization
[51]. Its computational complexity is lower than that of VGGNet. The skip
connections play a role similar to that of gated recurrent units. The network has
152 layers in total and keeps the error as low as 3.57%. It provides a solution to
the vanishing gradient problem. It is a traditional feed-forward NN with residual
connections [52]. It consists of a number of residual blocks, whose operation varies
depending on the architecture. In Fig. 7, the residual network is shown.

Fig. 7 Architecture of ResNet
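
The idea of a skip connection can be sketched as follows. This is a simplified illustration, not the ResNet implementation: it assumes small fully connected layers in place of convolutions, omits batch normalization, and uses hand-picked weights.

```python
def relu(vector):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in vector]

def linear(vector, weights, biases):
    """Fully connected layer: one weighted sum per output node."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]

def residual_block(x, w1, b1, w2, b2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU in between."""
    hidden = relu(linear(x, w1, b1))
    residual = linear(hidden, w2, b2)
    # Skip connection: the input bypasses F and is added back to its output.
    return relu([r + xi for r, xi in zip(residual, x)])

zeros = [[0.0, 0.0], [0.0, 0.0]]
no_bias = [0.0, 0.0]
x = [1.0, 2.0]

# With all-zero weights F(x) = 0, so the block passes the input through unchanged.
identity_out = residual_block(x, zeros, no_bias, zeros, no_bias)

# With non-zero (hypothetical) weights the block learns a correction on top of x.
adjusted_out = residual_block(x, [[1.0, 0.0], [0.0, 1.0]], no_bias,
                              [[0.5, 0.0], [0.0, -1.0]], no_bias)
```

Because the input is added back unchanged, gradients have a direct path through the addition, which is one common explanation of why residual networks ease the vanishing gradient problem.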

4.9 Fully Convolutional Networks (FCNs)

In the fully convolutional network, the fully connected layer of the classical CNN
is replaced by a fully convolutional layer, an up-sampling layer, and a
de-convolution layer, as shown in Fig. 8. The architecture was designed so that the
up-sampling and de-convolution layers act as the reverse counterparts of the pooling
and convolution layers. Adding the up-sampling and de-convolution layers to the
design increased its accuracy [40, 41].

Fig. 8 Architecture of fully convolutional networks

4.10 U-Net

U-Net, which has two paths, was created for the segmentation of medical images.
The first path is an encoder that captures the context of the image, while the
second path is a decoder built from transposed convolutions [53, 54].
Figure 9 shows the U-Net.

4.11 Recurrent Neural Networks

Figure 10 shows the RNN's fundamental structure, and various RNN design variations
are described in [55]. As seen in Fig. 10, the recurrent neural network includes
numerous functional blocks. Recurrent neural networks require memory because they
use prior states as input to determine their present state, and they are susceptible
to the vanishing gradient problem. They operate on sequential data, and the
connections among their nodes form a directed graph. RNNs are used to convert input
sequences into fixed-size vectors. Using an RNN in combination with a convolutional
layer extends the effective pixel neighborhood. RNNs are used in machine
translation, time series prediction, and NLP. An example of an RNN is the long
short-term memory (LSTM) network [56].
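The way an RNN reuses its prior state can be shown with a deliberately tiny example. The sketch assumes a single scalar hidden state, a tanh activation, and fixed hypothetical weights `w_x`, `w_h`, and `b`; real networks use weight matrices learned from data.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent update: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def encode_sequence(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Run the recurrence over a sequence and return the final state and all states."""
    h = 0.0  # initial state
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)  # the previous state feeds the present one
        states.append(h)
    return h, states

final_state, states = encode_sequence([1.0, 0.5, -0.3, 0.8])
short_state, _ = encode_sequence([1.0])
```

Whatever the length of the input sequence, the final state has the same size, which is what allows an RNN to map sequences to fixed-size vectors.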

Fig. 9 Architecture of U-Net

4.12 Autoencoders

The autoencoder is a potent unsupervised learning architecture with three
components: encoder, code, and decoder. The encoder compresses the input into a
more compact representation, so the compressed image is a distorted version of the
input. The code represents the compressed input; this layer, which sits between the
encoder and the decoder, is also referred to as the bottleneck.
Figure 11 shows the construction of the autoencoder. The decoder converts the code
into a reconstruction of the initial input. The key characteristics of an
autoencoder are that it is lossy and data-specific. Four hyperparameters need to be
set before training the architecture: the code size, the layer count, the number of
nodes per layer, and the loss function. The application areas of the autoencoder
include dimension reduction, image compression, image denoising,
and feature extraction [57, 58].
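The encoder, code, and decoder stages and the lossy nature of the reconstruction can be illustrated without any training. In the sketch below the encoder and decoder are fixed, hand-written functions standing in for learned layers; the code size of 2 and the mean squared error loss are assumptions made for the example.

```python
def encode(x):
    """Encoder: compress 4 inputs into a 2-value code by averaging neighbouring pairs."""
    return [(x[0] + x[1]) / 2, (x[2] + x[3]) / 2]

def decode(code):
    """Decoder: expand the 2-value code back to 4 values by repeating each entry."""
    return [code[0], code[0], code[1], code[1]]

def mse(original, reconstruction):
    """Mean squared reconstruction error, a common autoencoder loss function."""
    return sum((p - q) ** 2 for p, q in zip(original, reconstruction)) / len(original)

x = [1.0, 3.0, 2.0, 2.0]
code = encode(x)                     # bottleneck representation
reconstruction = decode(code)        # approximate copy of the input
loss = mse(x, reconstruction)        # non-zero: the compression is lossy
```

In a real autoencoder the encoder and decoder weights are adjusted during training to make this loss as small as possible on the training data.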

Fig. 10 Architecture of recurrent neural networks

4.13 Deep Belief Networks

A deep belief network consists of Restricted Boltzmann Machines (RBMs) for the
pre-training phase and a feed-forward network for the fine-tuning phase. The
feed-forward network receives the features that the RBM has extracted from the input
data vectors. Deep belief networks use back propagation with a slower learning rate
and have numerous hidden layers. Their primary advantage is the capacity to learn
higher-level features from the features present in earlier layers, thanks to the
layer-by-layer learning strategy [59, 60]. In Fig. 12, the deep belief network
architecture is shown.

Fig. 11 Architecture of autoencoders

5 Steps for Diagnosis of Cancer by Medical Imaging

Medical imaging techniques such as MRI, CT, and ultrasound are used to
evaluate the healthy function of anatomical organs and to analyze diseases [61].
Cancer diagnosis and therapy planning depend crucially on medical imaging
modalities. Preprocessing, often known as filtering, is the initial step in the
processing of medical images. The goal of filtering is either to eliminate image
noise introduced during acquisition or to enhance image quality so that more accurate
details are obtained [62]. The term “segmentation” describes the method of
identifying the region of interest (ROI); in the context of medical images, the ROI
stands for anatomical organs or any abnormalities associated with them, such as
tumors or cysts. To classify cancer intensity, the classification step typically uses
an ML algorithm. Compression is the process of using machine-assisted techniques to
make files smaller so that they can be stored and transferred more easily. The table
shows the machine learning methods that can be used in each stage of cancer diagnosis
[63].
When assessing an ailment, professionals depend heavily on their first-hand
observations, abilities, and experience. A doctor can never be completely sure that
an assessment of a condition is entirely right, and doctors do sometimes get it
wrong. This motivates the use of AI-powered automated systems, because artificial
intelligence (AI) can evaluate enormous volumes of information, handle complicated
problems, and predict accurately. One of the most modern methods in AI, deep neural
networks, comprises a number of computational models that are useful for extracting
information from digital images. DL algorithms are utilized in several medical
professions [4, 16].
The steps of cancer diagnosis are as follows.

Fig. 12 Architecture of deep belief networks

5.1 Cleaning and Pre-processing

Pre-processing is the initial stage in the identification process, since raw images
contain noise. It improves the quality of an image for subsequent use by
eliminating unnecessary image data, known as image noise. If the noise is not
removed, misclassification may occur. It is therefore crucially important to clean
the images properly and convert them into a standard form in order to reach high
accuracy levels [3].
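One common way to remove isolated noisy pixels is a median filter. The sketch below is an illustration chosen for this text, not a method prescribed by the cited works; the 3 × 3 window, the border handling, and the toy 4 × 4 image are assumptions.

```python
from statistics import median

def median_filter(image, radius=1):
    """Replace each pixel by the median of its neighbourhood.

    The window is (2 * radius + 1) pixels wide; at the borders only the
    pixels that actually exist are used.
    """
    height, width = len(image), len(image[0])
    filtered = []
    for i in range(height):
        row = []
        for j in range(width):
            window = [image[a][b]
                      for a in range(max(0, i - radius), min(height, i + radius + 1))
                      for b in range(max(0, j - radius), min(width, j + radius + 1))]
            row.append(median(window))
        filtered.append(row)
    return filtered

# Toy image: uniform background with a single noisy pixel of value 255.
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
clean = median_filter(noisy)
```

The outlier disappears because a single extreme value never becomes the median of its window, while the uniform background is left unchanged.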

5.2 Image Segmentation

Image segmentation refers to dividing an image into different sections. Segmentation
methods are grouped into pixel-based, region-based, model-based, and threshold-based
segmentation. There are also histogram thresholding, adaptive thresholding, and
boundary detection approaches. These strategies are also used in combination [3, 64, 65].
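Threshold-based segmentation is the simplest of these approaches to show in code. The sketch below labels every pixel brighter than a threshold as part of the region of interest; using the mean intensity as the threshold and the toy 4 × 4 scan are assumptions made for illustration.

```python
def threshold_segment(image, threshold):
    """Return a binary mask: 1 where the pixel exceeds the threshold, else 0."""
    return [[1 if value > threshold else 0 for value in row] for row in image]

def mean_intensity(image):
    """Average pixel value, used here as a simple global threshold."""
    pixels = [value for row in image for value in row]
    return sum(pixels) / len(pixels)

# Toy scan: a bright 2x2 region on a dark background.
scan = [
    [12, 15, 11, 14],
    [13, 200, 210, 12],
    [14, 205, 220, 15],
    [11, 13, 12, 10],
]
threshold = mean_intensity(scan)
mask = threshold_segment(scan, threshold)
```

Adaptive thresholding follows the same idea but computes a separate threshold for each local neighbourhood instead of one global value.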

5.3 Post Processing

After image segmentation, post-processing operations such as opening and closing,
island removal, region merging, border expansion, and smoothing are performed [3].
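Island removal by morphological opening can be sketched on a binary mask as follows. The 3 × 3 structuring element, the treatment of pixels outside the image as background, and the toy mask are assumptions made for this illustration.

```python
def _pixel(mask, i, j):
    """Pixel value, treating everything outside the image as background (0)."""
    if 0 <= i < len(mask) and 0 <= j < len(mask[0]):
        return mask[i][j]
    return 0

def erode(mask):
    """A pixel stays 1 only if its whole 3x3 neighbourhood is 1."""
    return [[int(all(_pixel(mask, i + a, j + b)
                     for a in (-1, 0, 1) for b in (-1, 0, 1)))
             for j in range(len(mask[0]))]
            for i in range(len(mask))]

def dilate(mask):
    """A pixel becomes 1 if any pixel in its 3x3 neighbourhood is 1."""
    return [[int(any(_pixel(mask, i + a, j + b)
                     for a in (-1, 0, 1) for b in (-1, 0, 1)))
             for j in range(len(mask[0]))]
            for i in range(len(mask))]

def opening(mask):
    """Erosion followed by dilation: removes small islands, keeps larger regions."""
    return dilate(erode(mask))

# Segmented mask with a 3x3 region and one isolated pixel (an "island").
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
opened = opening(mask)
```

Closing applies the same two operations in the opposite order (dilation, then erosion) and fills small holes instead of removing small islands.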

6 Diagnosis of Different Types of Cancers Using DL

Table 1 shows DL architectures for various cancer diagnoses. Neural network
architectures have been extremely useful in illness detection and have also
contributed to research on cancers that affect different organs. The convolutional
sparse encoder was found to be appropriate for all categories of 3-dimensional
datasets in [66]. In [67], lesion identification and cancer staging were
accomplished using a CNN and handcrafted features. In another
work [68], GoogLeNet was determined to be more successful, with an accuracy
of 85%, as compared to AlexNet, with an accuracy of 82%, and the VGGNet, with
an accuracy of about 84%. When compared to the conventional predictor based on
texture analysis, the model that combined a pre-trained CNN with an SVM was more
successful at categorizing tumor tissues in digital mammograms [69].
The researchers in [70] used a DL method to perform studies on breast cancer
patients, making predictions with a Cox prediction model and genomic datasets.
They show that performance improves when abundant information is available and is
used to integrate and simplify biomarkers and gene regulation for prediction.
Shimizu and Nakayama [71] used the TCGA database to identify breast cancer genes
and to build a prognostic prediction. They employed AI to identify 184 genes, using
ML algorithms such as the Random Forest classifier along with DL networks.
Furthermore, they employed a prognostic genetic score that utilized just 23 of the
184 identified genes.
Liu et al. [72] propose a CNN model that is capable of identifying tiny cancerous
tumors in gigapixel pathology slides. The system suggested in Cruz-Roa et al. [73]
identifies aggressive lesions in whole-slide images while reducing human effort and
time. On breast ultrasound lesion images, CNN architectures such as LeNet, U-Net,
transfer learning, and AlexNet were thoroughly analyzed, and AlexNet and patch-based
LeNet were found to be the most accurate architectures [74].
Even before DNN tumor identification, the ROI was extracted using the different
watershed and Gaussian mixture model (GMM) algorithms in Das et al. [75]. For the
segmentation of liver tumors, the FCN structure U net was proposed, with subsequent
processing via 3D linked item tagging in order to get better segmentation results [76].
CNNs were proved to be more accurate classifiers than classical machine learning
algorithms [77]. The DNN was shown to be effective for segmenting the cancerous
growth of cells, and it is also appropriate for segmenting tiny lung nodules. Deep
Neural Network efficiency grows as training data increases [78]. The Convolutional
204 A. Singh et al.

Table 1 Deep learning architectures for cancer diagnosis

| References | Cancer type(s) | Type of data/imaging | DL architecture used | Performance metrics |
|---|---|---|---|---|
| [70] | Breast | Gene expression data | Multi-omics NN | Enhanced performance with more omics data |
| [71] | Breast | The Cancer Genome Atlas | Random forest, NN | Log-rank p < 0.05 |
| [72] | Breast | Pathology | Convolutional neural network (CNN) | Sensitivity: 73% |
| [73] | Breast | Pathology | ConvNet | Positive predictive value: 71.6% |
| [74] | Breast | Ultrasound | AlexNet (CNN) | FPs/image: 0.16; TPF: 0.98; F-measure: 0.91 |
| [75] | Liver | Computed tomography (CT) scan/3D | Deep neural network | Accuracy: 99.4% |
| [76] | Liver | Computed tomography scan | Back-propagation neural network | Accuracy: 73.2% |
| [77] | Liver | Computed tomography scan | Convolutional neural network (CNN) | Precision: 82.67%; Dice: 80.06%; Recall: 84.34% |
| [78] | Lung | Computed tomography scan | Deep neural network (DNN) | Sensitivity: 78.2%; Accuracy: 82.1%; Specificity: 86.13% |
| [79] | Lung | Computed tomography scan | Deep neural network (DNN) | Sensitivity: 78.9% |
| [80] | Lung | Computed tomography scan | ResNet | Sensitivity: 0.54 |
| [81] | Skin | Standard images from camera | Deep convolutional neural networks (DCNN) | Accuracy: 98.55; Sensitivity: 95% |
| [82] | Skin | Dermoscopy images | CNN with rectified linear unit (ReLU) activation | Accuracy: 86.67% |
| [83] | Colon | Histopathology image | Shallow neural network | Accuracy: 84% |
| [84] | Astrocytic tumor | Microarray gene dataset | Artificial neural network (ANN) | Accuracy: 96.15% |
Detection of Cancer Using Deep Learning Techniques 205

Table 1 (continued)

| References | Cancer type(s) | Type of data/imaging | DL architecture used | Performance metrics |
|---|---|---|---|---|
| [85] | Prostate | Multiparametric magnetic resonance imaging (mpMRI)/3D | XmasNet (CNN) | AUC: 0.84 |
| [86] | Prostate | Multiparametric magnetic resonance imaging (mpMRI) | Deep convolutional neural networks (DCNN) | AUC: 0.897 |
| [87] | Brain | Magnetic resonance imaging (MRI) | Input cascade convolutional neural network | Sensitivity: 0.84; Specificity: 0.9 |

Neural Network proposed by Golan et al. [79] has two stages: the first extracts
spatial features, and the second performs classification. The DL architecture was
combined with an SVM classifier to identify lung nodules, and a rule-based method
reduced false positives. The modified ResNet design outperforms the conventional
ResNet architecture for lesion segmentation [80].
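
The 3D post-processing step mentioned above for liver tumor segmentation, and rule-based false-positive reduction more generally, is often implemented by labeling connected components in the predicted mask and discarding very small ones. The following is a minimal sketch of that idea in plain Python, not the method of any cited paper: the function names, the 6-connectivity choice, the `min_voxels` threshold, and the toy mask are all assumptions made for illustration.

```python
from collections import deque


def label_components_3d(mask):
    """Label 6-connected foreground components in a 3D binary mask.

    mask is a nested list indexed as mask[z][y][x] with 0/1 entries.
    Returns (labels, sizes): labels has the same shape, with 0 for background
    and 1..n for components; sizes maps each label to its voxel count.
    """
    depth, height, width = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * width for _ in range(height)] for _ in range(depth)]
    sizes = {}
    # Face neighbours only (6-connectivity); an assumed, illustrative choice.
    neighbours = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                  (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    next_label = 0
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if not mask[z][y][x] or labels[z][y][x]:
                    continue
                # Start a breadth-first flood fill from an unlabeled voxel.
                next_label += 1
                labels[z][y][x] = next_label
                queue = deque([(z, y, x)])
                count = 0
                while queue:
                    cz, cy, cx = queue.popleft()
                    count += 1
                    for dz, dy, dx in neighbours:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        if (0 <= nz < depth and 0 <= ny < height
                                and 0 <= nx < width
                                and mask[nz][ny][nx]
                                and not labels[nz][ny][nx]):
                            labels[nz][ny][nx] = next_label
                            queue.append((nz, ny, nx))
                sizes[next_label] = count
    return labels, sizes


def remove_small_components(mask, min_voxels):
    """Return a copy of mask keeping only components with >= min_voxels voxels."""
    labels, sizes = label_components_3d(mask)
    return [[[1 if lab and sizes[lab] >= min_voxels else 0 for lab in row]
             for row in plane] for plane in labels]


# Toy 2x3x3 "prediction": one 4-voxel blob plus one isolated voxel.
predicted = [
    [[1, 1, 0],
     [1, 0, 0],
     [0, 0, 0]],
    [[1, 0, 0],
     [0, 0, 0],
     [0, 0, 1]],
]
cleaned = remove_small_components(predicted, min_voxels=2)
```

In practice the size threshold would be tuned on validation data, and a library routine would normally replace the hand-written flood fill for full-size CT volumes.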
Additionally, a CNN was employed to detect melanoma in images from standard
cameras [81]. A CNN has also been proposed for detecting skin lesion borders in
dermoscopy images [82]. A shallow network was used to analyze high-dimensional gene
data in order to diagnose cancerous cell growth in histological images of the
colon [83]. Petalidis et al. [84] published genomic data for astrocytic tumors. To
address the need for accurate classification of these cancers, they used a neural
network technique to combine features from their histological subtypes. They
identified 59 genes in this research and classified these subtypes on custom and
independent data with an accuracy of 96.15%.
Prostate cancer was identified in MRI images using XmasNet, a CNN-based
algorithm [85]. A deep convolutional neural network applied to multiparametric MRI
reached an AUC of 0.897 [86]. On the BRATS dataset, brain tumors are segmented using
a deep CNN with a cascaded design, which achieved good performance [87].
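
Most of the figures reported in Table 1 (sensitivity, specificity, precision, recall, accuracy, F-measure, Dice) derive from the four counts of a binary confusion matrix. The sketch below shows the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from any of the cited studies.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute common diagnostic metrics from confusion-matrix counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives, and false negatives.
    """
    sensitivity = safe_div(tp, tp + fn)   # also called recall or TPF
    specificity = safe_div(tn, tn + fp)
    precision = safe_div(tp, tp + fp)     # positive predictive value
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    f_measure = safe_div(2 * precision * sensitivity, precision + sensitivity)
    # For binary masks, Dice equals the F-measure.
    dice = safe_div(2 * tp, 2 * tp + fp + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "dice": dice,
    }


# Hypothetical counts for 200 cases (100 diseased, 100 healthy).
example = binary_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Because the studies in Table 1 report different subsets of these metrics on different datasets, the values are not directly comparable across rows.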

7 Conclusions

DL has proven effective at feature extraction, and the extracted features have
improved cancer prognosis and prediction. DL models have transformed cancer diagnosis
and prediction because of their superior feature learning, and deep architectures are
now widely used in cancer cell segmentation and classification. Data augmentation was
critical for improving system performance in cancer diagnosis and prediction tasks.
DL solutions are evaluated and validated on criteria such as reproducibility and
generalizability in cancer treatment. These
techniques helped in the early detection of cancer and contributed to patient recovery
or life extension.
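
As an illustration of the data augmentation mentioned above, the sketch below generates flipped and rotated copies of a small image patch. The chapter does not specify which transformations the cited studies used, so the choice of flips and 90-degree rotations, the function names, and the toy patch are assumptions; images are plain nested lists to keep the example dependency-free.

```python
def flip_horizontal(image):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the row order (top-bottom flip)."""
    return [row[:] for row in image[::-1]]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def augment(image):
    """Return the original patch plus five label-preserving variants."""
    rot90 = rotate_90_clockwise(image)
    rot180 = rotate_90_clockwise(rot90)
    rot270 = rotate_90_clockwise(rot180)
    return [
        [row[:] for row in image],  # unchanged copy
        flip_horizontal(image),
        flip_vertical(image),
        rot90,
        rot180,
        rot270,
    ]


# Toy 2x2 grayscale patch; real inputs would be full image tiles.
patch = [[1, 2],
         [3, 4]]
variants = augment(patch)
```

Such transformations suit modalities where orientation carries no diagnostic meaning, for example histopathology tiles; for other modalities the set of safe transformations would need to be chosen with care.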
DL-based technological innovation has started to benefit the local and national
medical sectors. Consequently, it is advantageous to apply DL technology in cancer
diagnostics and general medicine in order to gain further theoretical understanding.
Researchers studying ML algorithms for disease diagnosis, as well as experts in
treatment planning and delivery, stand to benefit from the conclusions of this work.

References

1. Grisold, W., Soffietti, R., Oberndorfer, S., & Cavaletti, G. (Eds.). (2021). Effects of
cancer treatment on the nervous system.
2. Tang, J., Rangayyan, R. M., Xu, J., El Naqa, I., & Yang, Y. (2009). Computer-aided detection
and diagnosis of breast cancer with mammography: Recent advances. IEEE Transactions on
Information Technology in Biomedicine, 13(2), 236–251.
3. Munir, K., Elahi, H., Ayub, A., Frezza, F., & Rizzi, A. (2019). Cancer diagnosis using deep
learning: A bibliographic review. Cancers, 11(9), 1235.
4. Huang, S., Yang, J., Fong, S., & Zhao, Q. (2020). Artificial intelligence in cancer diagnosis
and prognosis: Opportunities and challenges. Cancer letters, 471, 61–71.
5. Cancer Facts and Figures. (2019). American Cancer Society.
https://www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/annualcancerfacts-andfigures/2019/cancer-facts-and-figures-2019.pdf
6. Bhardwaj, P., Guhan, T., & Tripathy, B. K. (2021). Computational biology in the lens of CNN.
In S. S. Roy & Y. H. Taguchi (Eds.), Handbook of machine learning applications for genomics
(Chapter 5). Studies in Big Data. ISBN: 978-981-16-9157-7.
7. Tripathy, B. K., & Anuradha, J. (2015). Soft computing-advances and applications. Cengage
Learning Publishers, New Delhi. ASIN: 8131526194. ISBN-13: 9788131526194.
8. Rungta, R. K., Jaiswal, P., & Tripathy, B. K. (2022). A deep learning based approach to measure
confidence for virtual interviews. In A. K. Das et al. (Eds.), Proceedings of the 4th International
Conference on Computational Intelligence in Pattern Recognition (CIPR), CIPR 2022 (pp. 278–291).
LNNS 480.
9. Bhandari, A., Tripathy, B. K., Jawad, K., Bhatia, S., Rahmani, M. K. I., & Mash, A. (2022).
Cancer detection and prediction using genetic algorithms. Computational Intelligence and
Neuroscience, 2022, 18. https://doi.org/10.1155/2022/1871841
10. Allahyar, A., Ubels, J., & de Ridder, J. (2019). A data-driven interactome of synergistic genes
improves network-based cancer outcome prediction. PLoS Computational Biology, 15(2),
e1006657.
11. Adate, A., Tripathy, B. K., Arya, D., & Shaha, A. (2020). Impact of deep neural learning on
artificial intelligence research. In S. Bhattacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy
(Eds.), Deep learning research and applications (pp. 69–84). De Gruyter Publications.
https://doi.org/10.1515/9783110670905-004
12. Mitchell, M. J., Jain, R. K., & Langer, R. (2017). Engineering and physical sciences in oncology:
Challenges and opportunities. Nature Reviews Cancer, 17(11), 659–675.
13. Obermeyer, Z., & Emanuel, E. J. (2016). Predicting the future—big data, machine learning,
and clinical medicine. The New England Journal of Medicine, 375(13), 1216.
14. Graves, A., Mohamed, A. R., & Hinton, G. (2013). Speech recognition with deep recurrent
neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal
Processing (pp. 6645–6649). IEEE.
15. Bhattacharyya, D. S., Snasel, V., Hassanian, A. E., Saha, S., & Tripathy, B. K. (2020). Deep
learning research with engineering applications. De Gruyter Publications. ISBN: 3110670909,
9783110670905. https://doi.org/10.1515/9783110670905
16. Bose, A., & Tripathy, B. K. (2020). Deep learning for audio signal classification. In S.
Bhattacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy (Eds.), Deep learning research and
applications (pp. 105–136). De Gruyter Publications.
https://doi.org/10.1515/9783110670905-00660
17. Singhania, U., & Tripathy, B. K. (2021). Text-based image retrieval using deep learning. In
Encyclopedia of information science and technology (5th ed., p. 11).
https://doi.org/10.4018/978-1-7998-3479-3.ch007
18. Yagna Sai Surya, K., Geetha Rani, T., & Tripathy, B. K. (2022). Social distance monitoring
and face mask detection using deep learning. In J. Nayak, H. Behera, B. Naik, S. Vimal, & D.
Pelusi (Eds.), Computational intelligence in data mining (Vol. 281). Smart Innovation, Systems
and Technologies. Springer, Singapore. https://doi.org/10.1007/978-981-16-9447-9_36
19. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In International Conference on Machine Learning (pp. 448–
456). PMLR.
20. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected
convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (pp. 4700–4708).
21. Kyi, C. W., Birriel, P. C., Davidsen, T. M., Ferguson, M. L., Gesuwan, P., Griner, N. B., Gerhard,
D. S., et al. (2020). NCI office of cancer genomics supports multidisciplinary genomics research
initiatives to advance precision oncology. Cancer Research, 80(16_Supplement), 5862–5862.
22. Pogorelov, K., Randel, K. R., Griwodz, C., Eskeland, S. L., de Lange, T., Johansen, D.,
Halvorsen, P., et al. (2017). Kvasir: A multi-class image dataset for computer aided gastroin-
testinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference
(pp. 164–169).
23. Mesri, M., An, E., Hiltke, T., Robles, A. I., Rodriguez, H., & CPTAC Investigators. (2022).
NCI’s clinical proteomic tumor analysis consortium: A proteogenomic cancer analysis
program. Cancer Research, 82(12_Supplement), 6331–6331.
24. Gupta, P., Bhachawat, S., Dhyani, K., & Tripathy, B. K. (2021). A study of gene characteristics
and their applications using deep learning (Chapter 4). In S. S. Roy & Y. H. Taguchi (Eds.),
Handbook of Machine Learning Applications for Genomics (Vol. 103). Studies in Big Data.
ISBN: 978-981-16-9157-7.
25. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8),
1735–1780.
26. Maheswari, K., Shaha, A., Arya, D., Tripathy, B. K., & Rajkumar, R. (2020). Convolutional
neural networks: A bottom-up approach. In S. Bhattacharyya, A. E. Hassanian, S. Saha, & B.
K. Tripathy (Eds.), Deep Learning Research with Engineering Applications (pp. 21–50). De
Gruyter Publications. https://doi.org/10.1515/9783110670905-002
27. Tripathy, B. K., & Deepthi, P. H. (2015). Application of spatial FCM in detecting cancer cells.
IIMT Research Network (pp. 1–6, 96–100). ISBN 878-93-82208-77-8.
28. Zhong, Z., Sun, L., & Huo, Q. (2019). An anchor-free region proposal network for Faster
R-CNN-based text detection approaches. International Journal on Document Analysis and
Recognition (IJDAR), 22(3), 315–327.
29. Hanefi Calp, M. (2021). Use of deep learning approaches in cancer diagnosis. In Deep Learning
for Cancer Diagnosis (pp. 249–267). Springer, Singapore.
30. Karahan, Ş., & Akgül, Y. S. (2016). Eye detection by using deep learning. In 2016 24th Signal
Processing and Communication Application Conference (SIU) (pp. 2145–2148). IEEE.
31. İnik, Ö., & Ülker, E. (2017). Derin öğrenme ve görüntü analizinde kullanılan derin
öğrenme modelleri. Gaziosmanpaşa Bilimsel Araştırma Dergisi, 6(3), 85–104.
32. Şeker, A., Diri, B., & Balık, H. H. (2017). Derin öğrenme yöntemleri ve uygulamaları hakkında
bir inceleme. Gazi Mühendislik Bilimleri Dergisi, 3(3), 47–64.
33. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends® in Machine
Learning, 2(1), 1–127.
34. Tripathy, B. K., Raju, H., & Kaul, D. (2018). Deep learning in health care. In V. Santhi
(Ed.), Deep learning for remote sensing and GIS: Frontier advancements and applications
(accepted). CRC Publications.
35. Ravì, D., Wong, C., Deligianni, F., Berthelot, M., Andreu-Perez, J., Lo, B., & Yang, G. Z. (2016).
Deep learning for health informatics. IEEE Journal of Biomedical and Health Informatics,
21(1), 4–21.
36. Küçük, D., & Arici, N. (2018). Doğal Dil İşlemede Derin Öğrenme Uygulamalari Üzerine Bir
Literatür Çalişmasi. Uluslararası Yönetim Bilişim Sistemleri ve Bilgisayar Bilimleri Dergisi,
2(2), 76–86.
37. Ohmori, M., Ishihara, R., Aoyama, K., Nakagawa, K., Iwagami, H., Matsuura, N., Tada,
T., et al. (2020). Endoscopic detection and differentiation of esophageal lesions using a deep
neural network. Gastrointestinal Endoscopy, 91(2), 301–309.
38. Schwyzer, M., Ferraro, D. A., Muehlematter, U. J., Curioni-Fontecedro, A., Huellner, M. W.,
Von Schulthess, G. K., Kaufmann, P. A., Burger, I. A., & Messerli, M. (2018). Automated
detection of lung cancer at ultralow dose PET/CT by deep neural networks–initial results. Lung
Cancer, 126, 170–173.
39. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Rabinovich, A., et al. (2015).
Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (pp. 1–9).
40. Sihare, P., Ullah Khan, A., Bardhan, P., & Tripathy, B. K. (2022). COVID-19 detection using
deep learning: A comparative study of segmentation algorithms. In A. K. Das et al. (Eds.),
Proceedings of the 4th International Conference on Computational Intelligence in Pattern
Recognition (CIPR) (pp. 1–10), CIPR 2022, LNNS 480.
41. Dai, J., Li, Y., He, K., & Sun, J. (2016). R-fcn: Object detection via region-based fully
convolutional networks. Advances in Neural Information Processing Systems, 29.
42. Raina, R., Madhavan, A., & Ng, A. Y. (2009). Large-scale deep unsupervised learning using
graphics processors. In Proceedings of the 26th Annual International Conference on Machine
Learning (pp. 873–880).
43. Tripathy, B. K., Dash, S., & Patro, B. N. (2012). Study of classification accuracy of microarray
data for cancer classification using multivariate and hybrid feature selection method. IOSR
Journal of Engineering (IOSRJEN), 2(8), 112–119. ISSN: 2250-302.
44. Adate, A., & Tripathy, B. K. (2017). Understanding single image super-resolution techniques
with generative adversarial networks. In J. Bansal, K. Das, A. Nagar, K. Deep, & A. Ojha (Eds.),
Soft computing for problem solving (Vol. 816, pp. 833–840). Advances in Intelligent Systems and
Computing. Springer.
45. Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., & Alsaadi, F. E. (2017). A survey of deep neural
network architectures and their applications. Neurocomputing, 234, 11–26.
46. Mustafa, H. T., Yang, J., & Zareapoor, M. (2019). Multi-scale convolutional neural network
for multi-focus image fusion. Image and Vision Computing, 85, 26–35.
47. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
48. Kaul, D., Raju, H., & Tripathy, B. K. (2022). Deep learning in healthcare. In D. P. Acharjya, A.
Mitra, & N. Zaman (Eds.), Deep learning in data analytics: Recent techniques, practices and
applications (Vol. 91, pp. 97–115). Studies in Big Data. Springer, Cham.
https://doi.org/10.1007/978-3-030-75855-4_6
49. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image
recognition. Preprint retrieved from arXiv:1409.1556.
50. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Fei-Fei, L., et al. (2015).
Imagenet large scale visual recognition challenge. International Journal of Computer Vision,
115(3), 211–252.
51. Tripathy, B. K., Garg, N., & Nikhitha, P. (2014). Image retrieval using latent feature learning
by deep architecture. In Proceedings of the IEEE ICCIC2014 (pp. 663–666).
52. Targ, S., Almeida, D., & Lyman, K. (2016). Resnet in resnet: Generalizing residual architec-
tures. Preprint retrieved from arXiv:1603.08029.
53. Tripathy, B. K., Parikh, S., Ajay, P., & Magapu, C. Brain MRI segmentation techniques based
on CNN and its variants (Chapter 10). In J. Chaki (Ed.), Brain tumor MRI image segmentation
using deep learning techniques (pp. 161–182). Elsevier Publications.
https://doi.org/10.1016/B978-0-323-91171-9.00001-6
54. Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T., & Ronneberger, O. (2016). 3D U-Net:
learning dense volumetric segmentation from sparse annotation. In International Conference
on Medical Image Computing and Computer-Assisted Intervention (pp. 424–432). Springer,
Cham.
55. Baktha, K., & Tripathy, B. K. (2017). Investigation of recurrent neural networks in the field
of sentiment analysis. In International Conference on Communication and Signal Processing
(ICCSP) (pp. 2047–2050). https://doi.org/10.1109/ICCSP.2017.8286763
56. Adate, A., & Tripathy, B. K. (2019). S-LSTM-GAN: Shared recurrent neural networks with
adversarial training. In A. Kulkarni, S. Satapathy, T. Kang, A. Kashan (Eds.), Proceedings of
the 2nd International Conference on Data Engineering and Communication Technology (Vol.
828, pp. 107–115). Advances in Intelligent Systems and Computing. Springer, Singapore.
57. Loey, M., El-Sawy, A., & El-Bakry, H. (2017). Deep learning autoencoder approach for
handwritten arabic digits recognition. Preprint retrieved from arXiv:1706.06720.
58. Thomas, S. A., Race, A. M., Steven, R. T., Gilmore, I. S., & Bunch, J. (2016). Dimensionality
reduction of mass spectrometry imaging data using autoencoders. In 2016 IEEE Symposium
Series on Computational Intelligence (SSCI) (pp. 1–7). IEEE.
59. Keyvanrad, M. A., & Homayounpour, M. M. (2014). A brief survey on deep belief networks and
introducing a new object oriented toolbox (DeeBNet). Preprint retrieved from arXiv:1408.3264.
60. Hinton, G. E. (2009). Deep belief networks. Scholarpedia, 4(5), 5947.
61. Jeong, J. (2017). Deep learning for cancer screening in medical imaging. Hanyang Medical
Reviews, 37(2), 71–76.
62. Pereira, G. C., Traughber, M., & Muzic, R. F. (2014). The role of imaging in radiation therapy
planning: past, present, and future. BioMed Research International.
63. Adate, A., & Tripathy, B. K. (2018). Deep learning techniques for image processing. In S.
Bhattacharyya, H. Bhaumik, A. Mukherjee, & S. De (Eds.), Machine learning for big data
analysis (pp. 69–90). De Gruyter, Berlin, Boston.
https://doi.org/10.1515/9783110551433-00357
64. Jain, S., Singhania, U., Tripathy, B., Nasr, E. A., Aboudaif, M. K., & Kamrani, A. K. (2021).
Deep learning-based transfer learning for classification of skin cancer. Sensors (Basel), 21(23),
8142. https://doi.org/10.3390/s21238142
65. Tong, N., Lu, H., Ruan, X., & Yang, M. H. (2015). Salient object detection via bootstrap
learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(pp. 1884–1892).
66. Kallenberg, M., Petersen, K., Nielsen, M., Ng, A. Y., Diao, P., Igel, C., Lillholm, M., et al.
(2016). Unsupervised deep learning applied to breast density segmentation and mammographic
risk scoring. IEEE Transactions on Medical Imaging, 35(5), 1322–1331.
67. Wang, H., Roa, A. C., Basavanhally, A. N., Gilmore, H. L., Shih, N., Feldman, M., Madabhushi,
A., et al. (2014). Mitosis detection in breast cancer pathology images by combining handcrafted
and convolutional neural network features. Journal of Medical Imaging, 1(3), 034003.
68. Ertosun, M. G., & Rubin, D. L. (2015). Probabilistic visual search for masses within mammog-
raphy images using deep learning. In 2015 IEEE International Conference on Bioinformatics
and Biomedicine (BIBM) (pp. 1310–1315). IEEE.
69. Turkki, R., Linder, N., Kovanen, P. E., Pellinen, T., & Lundin, J. (2016). Antibody-supervised
deep learning for quantification of tumor-infiltrating immune cells in hematoxylin and eosin
stained breast cancer samples. Journal of Pathology Informatics, 7(1), 38.
70. Huang, Z., Zhan, X., Xiang, S., Johnson, T. S., Helm, B., Yu, C. Y., Huang, K., et al. (2019).
SALMON: Survival analysis learning with multi-omics neural networks on breast cancer.
Frontiers in Genetics, 10, 166.
71. Shimizu, H., & Nakayama, K. I. (2019). A 23 gene–based molecular prognostic score precisely
predicts overall survival of breast cancer patients. eBioMedicine, 46, 150–159.
72. Liu, Y., Gadepalli, K., Norouzi, M., Dahl, G. E., Kohlberger, T., Boyko, A., Stumpe, M. C.,
et al. (2017). Detecting cancer metastases on gigapixel pathology images. Preprint retrieved
from arXiv:1703.02442.
73. Cruz-Roa, A., Gilmore, H., Basavanhally, A., Feldman, M., Ganesan, S., Shih, N. N.,
Tomaszewski, J., González, F. A., & Madabhushi, A. (2017). Accurate and reproducible inva-
sive breast cancer detection in whole-slide images: A deep learning approach for quantifying
tumor extent. Scientific Reports, 7(1), 1–14.
74. Yap, M. H., Pons, G., Marti, J., Ganau, S., Sentis, M., Zwiggelaar, R., Davison, A. K., & Marti,
R. (2017). Automated breast ultrasound lesions detection using convolutional neural networks.
IEEE Journal of Biomedical and Health Informatics, 22(4), 1218–1226.
75. Das, A., Acharya, U. R., Panda, S. S., & Sabut, S. (2019). Deep learning based liver cancer detec-
tion using watershed transform and Gaussian mixture model techniques. Cognitive Systems
Research, 54, 165–175.
76. Devi, P., & Dabas, P. (2015). Liver tumor detection using artificial neural networks for medical
images. International Journal of Innovative Research Science Technology, 2(3), 34–38.
77. Li, W. (2015). Automatic segmentation of liver tumor in CT images with deep convolutional
neural networks. Journal of Computer and Communications, 3(11), 146.
78. Gruetzemacher, R., & Gupta, A. (2016). Using deep learning for pulmonary nodule detection &
diagnosis.
79. Golan, R., Jacob, C., & Denzinger, J. (2016). Lung nodule detection in CT images using deep
convolutional neural networks. In 2016 International Joint Conference on Neural Networks
(IJCNN) (pp. 243–250). IEEE.
80. Kuan, K., Ravaut, M., Manek, G., Chen, H., Lin, J., Nazir, B., Chen, C., Howe, T. C., Zeng,
Z., & Chandrasekhar, V. (2017). Deep learning for lung cancer detection: tackling the kaggle
data science bowl 2017 challenge. Preprint retrieved from arXiv:1705.09435.
81. Jafari, M. H., Karimi, N., Nasr-Esfahani, E., Samavi, S., Soroushmehr, S. M. R., Ward, K., &
Najarian, K. (2016). Skin lesion segmentation in clinical images using deep learning. In 2016
23rd International Conference on Pattern Recognition (ICPR) (pp. 337–342). IEEE.
82. Sabouri, P., & GholamHosseini, H. (2016). Lesion border detection using deep learning. In
2016 IEEE Congress on Evolutionary Computation (CEC) (pp. 1416–1421). IEEE.
83. Chen, H., Zhao, H., Shen, J., Zhou, R., & Zhou, Q. (2015). Supervised machine learning model
for high dimensional gene data in colon cancer detection. In 2015 IEEE International Congress
on Big Data (pp. 134–141). IEEE.
84. Petalidis, L. P., Oulas, A., Backlund, M., Wayland, M. T., Liu, L., Plant, K., Happerfield, L.,
Freeman, T. C., Poirazi, P., & Collins, V. P. (2008). Improved grading and survival prediction
of human astrocytic brain tumors by artificial neural network analysis of gene expression
microarray data. Molecular Cancer Therapeutics, 7(5), 1013–1024.
85. Liu, S., Zheng, H., Feng, Y., & Li, W. (2017). Prostate cancer diagnosis using deep learning
with 3D multiparametric MRI. In Medical Imaging 2017: Computer-Aided Diagnosis (Vol.
10134, pp. 581–584). SPIE.
86. Tsehay, Y. K., Lay, N. S., Roth, H. R., Wang, X., Kwak, J. T., Turkbey, B. I., Pinto, P. A.,
Wood, B. J., & Summers, R. M. (2017). Convolutional neural network based deep-learning
architecture for prostate cancer detection on multiparametric magnetic resonance images. In
Medical Imaging 2017: Computer-Aided Diagnosis (Vol. 10134, pp. 20–30). SPIE.
87. Havaei, M., Davy, A., Warde, D., Biard, A., Courville, A., Bengio, Y., Pal, C., Jodoin, P. M., &
Larochelle, H. (2017). Brain tumor segmentation with deep neural networks. Medical Image
Analysis, 35, 18–31.
