Smart Computer Vision
Series Editor
Imrich Chlamtac, European Alliance for Innovation, Ghent, Belgium
The impact of information technologies is creating a new world that is not yet fully
understood. The extent and speed of the economic, lifestyle, and social changes
already perceived in everyday life are hard to estimate without understanding the
technological driving forces behind them. This series presents contributed volumes
featuring the latest research and development in the various information engineering
technologies that play a key role in this process. The range of topics, focusing
primarily on communications and computing engineering, includes, but is not
limited to, wireless networks; mobile communication; design and learning; gaming;
interaction; e-health and pervasive healthcare; energy management; smart grids;
Internet of Things; cognitive radio networks; computation; cloud computing;
ubiquitous connectivity; and, more generally, smart living and smart cities. The
series publishes a combination of expanded papers selected from hosted and
sponsored European Alliance for Innovation (EAI) conferences that present cutting-edge,
global research, as well as new perspectives on traditional related engineering
fields. This content, complemented with open calls for contributions of book titles
and individual chapters, maintains Springer’s and EAI’s high standards of academic
excellence. The audience for the books consists of researchers, industry professionals,
and advanced-level students, as well as practitioners in related fields of activity,
including information and communication specialists, security experts, economists,
urban planners, doctors, and, in general, representatives of all those walks of life
affected by and contributing to the information revolution.
Indexing: This series is indexed in Scopus, Ei Compendex, and zbMATH.
About EAI - EAI is a grassroots member organization initiated through cooperation
between businesses and public, private, and government organizations to address
the global challenges of Europe’s future competitiveness and to link the European
research community with its counterparts around the globe. EAI reaches out to
hundreds of thousands of individual subscribers on all continents and collaborates
with an institutional member base including Fortune 500 companies, government
organizations, and educational institutions, providing a free research and innovation
platform. Through its open, free membership model, EAI promotes a new research
and innovation culture based on collaboration, connectivity, and recognition of
excellence by the community.
B. Vinoth Kumar • P. Sivakumar • B. Surendiran •
Junhua Ding
Editors
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
retrieval. It uses the K-Means clustering algorithm and Hamming distance for faster
image retrieval.
Chapter 8 provides a bio-inspired convolutional neural network (CNN)-based
model for COVID-19 diagnosis. A cuckoo search algorithm is used to improve
the performance of the CNN model. Chapter 9 presents a convolutional CapsNet for
detecting COVID-19 using chest X-ray images. The model obtains fast and
accurate diagnostic results with fewer trainable parameters.
Chapter 10 proposes a deep learning framework for an automated hand gesture
recognition system. The proposed framework classifies the input hand gestures,
each represented by a histogram of oriented gradients (HOG) feature vector, into a
predefined number of gesture classes.
Chapter 11 presents a new hierarchical deep-learning-based approach for semantic
segmentation of 3D point clouds. It involves a nearest neighbor search for local
feature extraction, followed by an auxiliary pretrained network for classification.
Chapter 12 presents a model for automatic colorization of grayscale images
without human intervention. The model predicts colors for new images with good
accuracy, close to the real images. In the future, such automatic colorization
techniques can help restore vintage grayscale images or movies with their details
clearly visible.
Chapter 13 proposes a generative adversarial network (GAN) for hyperspectral
image classification. It uses dynamic mode decomposition (DMD) to reduce
redundant features in order to attain better classification. Chapter 14 presents a brief
introduction to the methodologies used for identifying diabetic retinopathy. It
also uses convolutional neural network models to achieve effective classification
of retinal fundus images for diabetic detection.
Chapter 15 proposes a modified differential evolution (DE), best neighborhood
DE (BNDE), to solve discrete-valued benchmark and real-world optimization
problems. The proposed algorithm increases the exploitation and exploration capabilities
of DE and reaches the optimal solution faster. In addition, the proposed
algorithm is applied to grayscale image enhancement.
Chapter 16 presents an overview of the main swarm-based solutions proposed
to solve problems related to computer vision. It presents a brief description of
the principles behind swarm algorithms, as well as the basic operations of swarm
methods that have been applied in computer vision.
We are grateful to the authors and reviewers for their excellent contributions toward
making this book possible. Our special thanks go to Mary James (EAI/Springer
Innovations in Communication and Computing) for the opportunity to organize this
edited volume.
A Systematic Review on Machine
Learning-Based Sports Video
Summarization Techniques
1 Introduction
V. Vasudevan
Department of CSE, Nitte Meenakshi Institute of Technology, Bengaluru, India
M. S. Gounder ()
Department of ISE, Nitte Meenakshi Institute of Technology, Bengaluru, India
Fig. 1 Number of publications in sports video summarization from 2000 to 2020. (Data from
Google Scholar advanced search with “sports video summarization” OR “sports highlights”
anywhere in the article)
Fig. 2 Number of publications based on types of popular sports videos used to generate video
highlights from 2000 to 2020 (e.g., soccer 938, tennis 578, cricket 206, handball 52). (Data from
Google Scholar advanced search with “sports video summarization” OR “sports highlights”
< type of sport > anywhere in the article)
event. Figure 2 shows the publications based on the types of sports videos used
to generate highlights where “type of sport” is substituted with soccer/football,
basketball, baseball, etc. Based on the various criteria considered, along with the
number of publications in the literature, we have confined the scope of our review to
soccer, tennis, and cricket. The rest of this paper is organized as follows.
In Sect. 2, we review the techniques established for sports video summarization
since 2000. Some important ideas, algorithms, and methods that evolved over time
for video highlight generation, specific to two popular sports, namely soccer and
cricket, are reviewed in greater depth, with a quick review of other sports, in
Sect. 3. In Sect. 4, the scope of future research, weaknesses in the methods used,
and possible solutions are discussed. We conclude the paper in Sect. 5.
A Systematic Review on Machine Learning-Based Sports Video. . . 3
Most sports events can be summarized based on features such as color, motion,
gestures of players or umpires/referees, combinations of audio and visual cues, text
that displays the scores, and objects. For example, soccer matches and the short
version of cricket matches are played with different colored jerseys for each team.
Interpreting the signals of referees or umpires involves identifying some key gestures.
The method in [42] proposed a dominant-color-based video summarization to
extract highlights from sports video. The key frames are extracted based on
color histogram analysis. Such features provide additional confidence when visual
features are used for key frame extraction, though this method does not use any
other visual features to identify the key frames. Nevertheless, the color factor
clearly plays an important role in sports video summarization. Figure 5 shows all
such factors that influence sports video summarization. The dominant field color
is one of the major features in sports like cricket, soccer, and other field games.
It can also be used to classify on-field and off-field events, the crowd, and player
or umpire detection [64, 67].
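The dominant-color idea above can be sketched as a simple histogram-difference key frame selector. This is a minimal illustration, not the exact method of [42]: the bin count, distance measure, and threshold are illustrative assumptions, and frames are represented as flat lists of intensities for brevity.

```python
def histogram(frame, bins=8, levels=256):
    """Intensity histogram of a frame given as a flat list of pixel values."""
    h = [0] * bins
    step = levels // bins
    for p in frame:
        h[min(p // step, bins - 1)] += 1
    # Normalize so frames of different sizes are comparable.
    n = len(frame)
    return [c / n for c in h]

def key_frames(frames, threshold=0.5):
    """Indices of frames whose histogram differs sharply from the previous frame."""
    keys = [0]  # the first frame always starts a shot
    prev = histogram(frames[0])
    for i, f in enumerate(frames[1:], start=1):
        h = histogram(f)
        dist = sum(abs(a - b) for a, b in zip(prev, h))  # L1 distance
        if dist > threshold:
            keys.append(i)
        prev = h
    return keys
```

A real pipeline would compute per-channel color histograms on decoded video frames (e.g., with OpenCV) rather than flat intensity lists.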
Motion-based and gesture features are proposed in [63, 67]. Motion is key in any
sport, and when camera motion is also considered, extracting the events or key
frames becomes more complicated. In [63], the summarization is event based and
presented as cinematography. The authors propose an interesting method that not
only summarizes the soccer video but also identifies scenes of intensive competition
between players and emotional events. In addition, they classify the clips based
on cinematographic features such as video production techniques, shot transitions,
and camera motions. Intensive competition is identified from the movements of
players, attacks, defense of the goal, etc. The reactions of players or the crowd
are considered for emotional moments. This also accounts for what happened in
the scene, who was involved in it, and how the players and audience reacted to
the scene.
The video clips are converted into segments of semantic shots and then each of
them into clips based on camera motion. The interest level of each clip is measured
based on cinematography and motion features. This work also classifies shots as
long, medium, and close views. Interestingly, a segment is created based on the
semantics of the scene, thus forming a semantic structure of the soccer video. The
factors influencing the summarization in this method, as per Fig. 4, are events,
movement, object/event detection, and camera motion.
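The long/medium/close view classification mentioned above is often approximated by the fraction of field-colored pixels in a frame. The sketch below is a common heuristic, not necessarily the method of [63]; the green hue range and the thresholds are illustrative assumptions.

```python
def view_class(frame, field_hue=(35, 85), long_t=0.6, medium_t=0.3):
    """Classify a shot view by the fraction of field-colored (grass) pixels.

    `frame` is a list of hue values (0-179, OpenCV-style); the hue range and
    thresholds are illustrative assumptions, not values from the literature.
    """
    lo, hi = field_hue
    field = sum(1 for h in frame if lo <= h <= hi) / len(frame)
    if field >= long_t:
        return "long"       # mostly grass: wide view of the field
    if field >= medium_t:
        return "medium"     # field plus sizable player regions
    return "close"          # close-up: little or no grass visible
```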
In [67], the authors introduced a dataset called SNOW, which is used to identify
the pose of umpires in cricket. They identified four important events in cricket,
namely, Six, No Ball, Out, and Wide, based on the umpire’s pose or gesture
recognition. Pretrained convolutional networks such as Inception V3 and VGG19
have been used for extracting the features, and the classification of the poses is
performed using an SVM. The authors have created a database for public use and
made it available online for download. Some of the factors influencing video
summarization in this method are visual cues and object detection.
Another interesting feature used in most video summarization methods is
audio [9, 35, 44, 45, 66, 92]. As Fig. 5 shows, audio features such as the
commentator’s voice and crowd cheering are key factors influencing video
summarization. In [9], an audiovisual-based approach is presented: the audio
signal’s instantaneous energy and local periodicity are measured to extract
highlight moments in the soccer game. In the work proposed in [35], the
commentator’s voice, the referee’s whistle, and crowd noise are considered to find
exciting events. The events related to soccer games are goals, penalty shootouts,
red cards, etc. The authors also considered audio noise during such events, such as
musical instruments, and applied Empirical Mode Decomposition to filter it out.
Another method proposed in [45] also applies audiovisual features to extract key
frames from sports video. The audio features considered here are excitement
events identified by spikes in the signal due to crowd cheering. In addition, visual
features such as scorecard detection have been proposed using deep learning
methods. As in [9], the influencing factors are visual cues and crowd audio.
The authors of [44] proposed an interesting method to detect highlights based on
referee whistle detection. A band-pass filter is designed to accentuate the referee’s
whistle and suppress other audio events. A decision-rules-based [1, 39, 61, 84]
time-dependent threshold method is applied to detect the regions where the whistle
sound occurs. The authors tested on an extensive 12-hour signal from various
soccer, football, and basketball games. Like [9, 35, 44, 45, 83], the method
proposed in [66] employs audio features such as the spectrum of the
signal during key events such as goals. This is applied on top of key event detection
using visual and color features. Some of the factors used in this method are audio,
color, visual cues, replays, excitement, batting and bowling shots [3], and player
detection. This method depends strongly on the video production style, and its
detection accuracy is only 70%. A time-frequency-based feature extraction is used
to calculate local autocorrelations on complex Fourier values in [92]. The extracted
features are then used to detect exciting scenes. The authors considered
environmental noise and showed that the method is robust with better performance.
However, the commentator’s voice is not considered.
Looking carefully at all the methods that use audio as one of the features, there is
certainly scope for gaining additional confidence in extracting or identifying key
frames or key events in a sports video. The majority of them focus on identifying
events based on crowd or spectator cheering. Though there is additional noise,
some methods such as [44, 66, 92] apply techniques to deal with it.
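The band-energy idea behind whistle detection [44] can be sketched with the Goertzel algorithm, which measures signal power at a single frequency and therefore acts like a very narrow band-pass filter. The whistle frequency, window size, and threshold below are illustrative assumptions, not values from [44].

```python
import math

def goertzel_power(samples, freq, rate):
    """Normalized power of `samples` at `freq` Hz via the Goertzel algorithm."""
    k = 2.0 * math.cos(2.0 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + k * s1 - s2
        s2, s1 = s1, s0
    power = s1 * s1 + s2 * s2 - k * s1 * s2
    return power / (len(samples) ** 2)  # ~A^2/4 for a bin-aligned tone of amplitude A

def whistle_windows(signal, rate, freq=3500.0, win=256, threshold=0.01):
    """Indices of fixed-size windows whose energy at `freq` exceeds `threshold`."""
    hits = []
    for w in range(len(signal) // win):
        chunk = signal[w * win:(w + 1) * win]
        if goertzel_power(chunk, freq, rate) > threshold:
            hits.append(w)
    return hits
```

A time-dependent threshold, as in [44], would replace the fixed `threshold` with one adapted to the recent background energy.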
Clustering-based methods work by clustering similar frames or shots and then
processing these clusters as required. In [13], a Fuzzy C-Means clustering method
is applied to cluster video frames based on color features. A shot detection
algorithm is also used to find the number of clusters. The authors attempt to
improve computation speed and accuracy through this method of video
summarization. Another method [54] develops a hierarchical summary based on a
state transition model. In addition, the authors use other cues like text, audio, and
expert’s choice to improve the accuracy of the proposed algorithm. This method
uses visual, text, and pitch cues as factors of influence (Fig. 5). In [50], a
neuro-fuzzy-based approach is proposed to segment the shots. The content of the
shots is identified by slots or windows that are more semantic. Hierarchical
clustering of the identified windows provides a textual summary to the user based
on the video content in the shot. The method claims to generate textual annotations
of video automatically. It is also used to compare the similarity between any two
video shots. In [79], a statistical classifier based on Gaussian mixture models with
unsupervised model adaptation is proposed. This method mainly uses audio features
to find the mismatch between test and pretrained models, which is also discussed in
Sect. 2.3.
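The frame-clustering step common to these methods can be sketched with a plain K-Means pass over scalar frame features (e.g., mean intensity). This is a simplified stand-in: [13] uses Fuzzy C-Means with soft memberships, whereas the sketch below uses hard assignments and one-dimensional features.

```python
def kmeans_1d(values, k, iters=20):
    """Cluster scalar frame features (e.g., mean intensity) into k >= 2 groups."""
    # Initialize centers spread across the sorted range of values.
    srt = sorted(values)
    centers = [srt[(len(srt) - 1) * i // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its cluster.
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers
```

After clustering, a summary would pick the frame closest to each center as a representative.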
All sports and games have moments that can be identified as moments of
excitement. These moments can be part of the players’ reactions, crowd reactions,
the referee’s actions, or even the commentator’s reactions. Players’ reactions and
expressions include high-fives, fist pumps, and aggressive, tense, or smiling
expressions. The crowd’s and commentator’s excitement can be identified by the
energy of the audio signal or the tone of the commentators. Some of the works
already reported based on audio [9, 35, 44, 45, 66, 92] use these features to extract
key events.
In addition, works like [59, 75, 79] exploit such excitement-based features to
identify key events that eventually contribute to summarizing a sports video. In
[58], the authors use multiple features to identify the key frames. Information from
players’ reactions and expressions, spectator cheering, and the commentator’s tone
is used to identify key events. These methods have been applied to summarize
sports like tennis and golf. The excitement-based highlight generation reported in
[75] considers secondary events in a cricket match, such as dropped catches and
pressure moments, based on a strategy that includes the loudness of a video, the
category associated with the primary event, and replays. Player celebration
detection, crowd excitement, appeals, and intense commentary are considered
excitement features. The method has been extensively tested on cricket videos.
Another method that exploits the commentator’s speech is proposed in [79]. It uses
a statistical classifier based on Gaussian mixture models with unsupervised model
adaptation. The acoustic mismatch between the training and testing data is
compensated using a maximum a posteriori adaptation method.
Every sport has its own list of key events, and summarization can be carried out
based on them. This obviously lets viewers get the most exciting events in the
sport of their choice. For example, soccer has key events such as goals, fouls, and
shots. A substantial number of publications [4, 7, 34, 38, 39, 41, 48, 65, 75, 76,
81, 86, 91] address video summarization based on key events. In [4], an
unsupervised framework for soccer goal event detection using an external textual
source, typically reports from sports websites, is proposed. Instead of segmenting
the actual video based on visual and aural content [73], this method claims to
be more efficient since non-eventful segments are discarded. The method seems
very promising and can be applied to any sport that has live text coverage on
websites. An approach based on language-independent, multistage classification is
employed for the detection of key acoustic events in [7]. The method has been
applied to rugby. Though it is similar to most approaches using audio features, it
differs in treating audio events independently of language. A hybrid approach
based on learning and non-learning methods is proposed in [34] to automatically
summarize sports video. The key events are goals, fouls, shots, etc. An SVM-based
method is applied for shot boundary detection, and a view classification algorithm
is used for identifying game-field intensities and
player bounding box sizes. In [39], an automatic sports video summarization is
proposed based on key events derived from replays. As shown in Fig. 5, the factor
that influences this work is the replay. The frames corresponding to replays are
enclosed between gradual transitions, and a thresholding-based approach is
employed to detect these transitions. For each key event, a motion history image is
generated by applying a Gaussian mixture model. A trained extreme learning
machine (ELM) classifier is used to learn the various events for labeling key
events, detecting replays, and generating the game summary. They applied this
method to four different sports across 20 videos. In contrast to event detection
within a game, a method has been proposed in [41] to classify crowd events by
categorizing video content into marriage, cricket, shopping mall, and Jallikattu.
The method applies a deep CNN, learning features from the training data.
However, the method is limited to classifying events into labeled outcomes.
Interestingly, a more customizable highlight generation method is proposed in
[48]. The videos are divided into semantic slots, and importance-based event
selection is employed to include the important events in the highlights. The authors
considered cricket videos for highlight generation. Again, the work proposed in
[75] exploits audio intensity in addition to replays, player celebrations, and playfield
scenarios as key events. Further, player stroke segmentation [26–29] and
compilation in cricket can be used for highlight generation specific to a player in a
match. A more general analysis of various computer vision systems from a soccer
video semantics point of view is given in [81]. The interpretation of the scene is
based on the complexity of the semantics. This work investigates and analyzes
various approaches.
Object detection is one of the most important computer vision tasks applied in
video summarization. Techniques for detecting objects in each image or frame
have gone through remarkable breakthroughs. Hence, it is essential to understand
the evolution as well as the state of the art in object detection. This covers a wide
range of techniques, from simple histogram-based approaches to computationally
intensive deep learning techniques. The techniques evolved over two decades, in
turn addressing challenges [96] in object detection that include, but are not limited
to, the following aspects: objects under different viewpoints, illuminations, and
intraclass variations; object rotation and scale changes; accurate object localization;
dense and occluded object detection; and detection speed-up. Figure 6 shows the
predominant object detection techniques (object detectors) that evolved over two
decades, including the latest developments in 2021. Between 2000 and 2011, that
is, before the rebirth of deep convolutional neural networks, more subtle and
robust techniques were applied to detect objects in a frame or image. These are
referred to as traditional object detectors. With limited computing resources,
researchers made remarkable contributions to detecting objects
based on handcrafted features. Between 2001 and 2004, Viola and Jones achieved
real-time detection of human faces with the help of a Pentium III CPU [87]. This
detector was named the VJ detector and works with the sliding window concept.
The VJ detector improved its detection performance and reduced computation
overhead through the integral image, which uses Haar wavelets; feature selection
with the help of the AdaBoost algorithm; and detection cascades, a multistage
detection paradigm that spends more computation on face targets than on
background windows [88, 96]. This approach can certainly contribute to player
detection in any sports video. In 2005, the histogram of oriented gradients (HOG)
detector was proposed by Dalal and Triggs [11]. It was another important
milestone, as it balances feature invariance. To detect objects of various sizes, the
HOG detector rescales the input frame or image multiple times while keeping the
detection window size the same. The HOG detector was one of the most important
object detectors used in various computer vision applications, including sports
video processing. Between 2008 and 2012, the Deformable Part-Based Model
(DPM) and its variants were the peak of object detectors in the traditional era.
DPM was proposed by Felzenszwalb in 2008 [18] as an extension of HOG, and its
variants were later proposed by Girshick. DPM uses a divide-and-conquer
approach where the model learns by decomposing an object and then assembling
the decomposed object parts to form the complete object. The model comprises a
root filter and many part filters. It has been further enriched [16, 17, 21, 22] to deal
with real-world objects with significant variations. A weakly supervised learning
method was developed in DPM to learn all the configurations of part filters as
latent variables. This was further formulated as Multi-Instance Learning, and
important techniques such as hard negative mining, bounding box regression, and
context priming were also applied to improve detection performance [16, 21].
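The core of the HOG descriptor, a histogram of gradient orientations weighted by gradient magnitude, can be sketched for a single cell as follows. Block normalization and the sliding detection window of the full detector [11] are omitted; the bin count of 9 follows the common choice for unsigned orientations.

```python
import math

def hog_cell(img, bins=9):
    """Unsigned gradient-orientation histogram for one cell.

    `img` is a 2-D list of intensities; border pixels are skipped because
    central differences need both neighbors.
    """
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]      # central differences
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            hist[min(int(ang / (180.0 / bins)), bins - 1)] += mag
    return hist
```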
From 2012 onwards, the deep learning era began with the ability to learn high-level
feature representations of an image [50], enabled by the availability of the
necessary computational resources. Then, region-based CNN (RCNN) [24] was
proposed and became breakthrough research in object detection using deep
learning models. In this era, there have been two genres of object detection:
two-stage detection with a coarse-to-fine process, and one-stage detection that
completes the process in a single step [96]. RCNN extracts a set of object proposals
by selective search. Each proposal is then rescaled to a fixed-size image and fed to
a CNN model trained on AlexNet [50] to extract features. Finally, linear SVM
classifiers are used to predict the presence of an object within each region. RCNN
achieved significant performance improvement over DPM. Even so, it had the
drawback of redundant feature computations on many overlapping proposals,
which led to slow detection even with a GPU. In the same year, the Spatial Pyramid
Pooling Network (SPPNet) [33] was proposed to overcome this drawback. Earlier
CNN models required a fixed-size input, for example, a 224×224 image for
AlexNet [51]. In SPPNet, the Spatial Pyramid Pooling layer generates a
fixed-length representation regardless of the size of the image or region of interest,
without rescaling it. SPPNet proved more than twenty times faster than RCNN by
avoiding redundancy when computing the convolutional features. Though SPPNet
improved detection speed, it had drawbacks: training was still multistage, and it
only fine-tuned its fully connected (FC) layers. In 2015, Fast RCNN [23] was
proposed to overcome the drawbacks of SPPNet. Fast RCNN trains a detector and
a bounding box regressor simultaneously under the same network configuration.
This improved detection speed 200-fold over RCNN. Though detection speed
improved, it was still limited by proposal detection. This led to Faster RCNN [71],
where object proposals are generated with a CNN model. Faster RCNN was the
first end-to-end and near real-time object detector. Even though Faster RCNN
overcame the drawback of Fast RCNN, there was still computational redundancy,
which led to further developments, namely R-FCN [10] and Light-Head RCNN
[56]. In 2017, the Feature Pyramid Network (FPN) [57] was proposed. It uses all
the layers in a top-down architecture to build high-level semantics at all scales.
FPN has become a basic building block of the latest detectors.
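All the region-based detectors above share a common post-processing step: non-maximum suppression (NMS) over scored boxes using intersection over union (IoU). A minimal sketch of both, with boxes as (x1, y1, x2, y2) tuples:

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box wins
        keep.append(best)
        # Drop every remaining box that overlaps the winner too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < thresh]
    return keep
```

The 0.5 IoU threshold is a conventional default; detectors tune it per task.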
Meanwhile, in 2015, You Only Look Once (YOLO) was proposed, the first
one-stage detector of the deep learning era. It followed an entirely different
approach from previously evolved models: it applies a single neural network to
the full image. The network divides the image into regions and predicts bounding
boxes and probabilities for each region simultaneously. Despite its improvements,
YOLO suffers a drop in localization accuracy compared with two-stage detectors,
especially for small objects. Based on the initial model, a series of improvements
[8, 68–70] further improved detection accuracy and speed. At almost the same
time, the Single Shot MultiBox Detector (SSD) [89], the second one-stage detector,
evolved. The main contribution of SSD was to introduce multi-reference and
multi-resolution detection
techniques that significantly improved the detection accuracy for small objects.
Despite their high speed and simplicity, one-stage detectors lacked the accuracy of
two-stage detectors, and hence RetinaNet [58] was proposed. RetinaNet addressed
the foreground–background class imbalance issue by introducing a new loss
function called focal loss, which reshapes the standard cross-entropy loss to put
more focus on hard, misclassified samples during training. With focal loss,
RetinaNet achieved accuracy comparable to two-stage detectors while maintaining
very high detection speed. Deep learning models continue to evolve [55, 77, 94,
95], considering both detection accuracy and speed. From these object detectors
that evolved over the last two decades, and especially with the rebirth of deep
learning models, it is possible to choose and apply appropriate detectors to solve
most computer vision problems, including sports video summarization.
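The focal loss of RetinaNet [58] can be written directly for the predicted probability p_t of the true class: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), which reduces to (alpha-scaled) cross-entropy when gamma = 0.

```python
import math

def cross_entropy(p_t):
    """Standard cross-entropy for the true-class probability p_t."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0, alpha=0.25):
    """Focal loss: down-weights well-classified examples (p_t near 1).

    gamma=2 and alpha=0.25 are the defaults reported for RetinaNet.
    """
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

For a well-classified example (p_t = 0.9) the modulating factor (1 - p_t)^2 shrinks the loss by two orders of magnitude, so training gradients concentrate on the rare, hard foreground samples.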
Most of the research works in sports video summarization have used the following
objective metrics to evaluate the constructed models’ performance.
1. Accuracy: Represents the ratio of correctly classified samples against all
positives and negatives.
Accuracy = (TP + TN) / (P + N)
2. Error: Represents the ratio of misclassified samples against all positives and
negatives.
Error = (FP + FN) / (P + N)
3. Precision: Represents the ratio of true detections against all detections.
Precision = TP / (TP + FP)
4. Recall: Represents the ratio of true detections of frames against the actual
number of frames [4, 32, 34, 59, 90, 93].
Recall = TP / (TP + FN)
5. F1-Score: The harmonic mean of precision and recall.
F1-Score = 2 ∗ (Precision ∗ Recall) / (Precision + Recall)
6. Confusion Matrix (CM): Represents predicted positives and negatives over the
positives and negatives present in the chosen dataset. This is a highly recommended
model evaluation metric in the literature. In [32], a confusion matrix with goal,
foul, shoot, and non-highlight events was used to compute precision and recall
percentages.
7. Receiver Operating Characteristic (ROC) curve: TPR vs. FPR.
It is desirable to use ROC curves when evaluating binary decision problems;
they show how the number of correctly classified positive examples varies with
the number of incorrectly classified negative examples [12]. As the False Positive
Rate (FPR) increases (i.e., more non-highlight plays are allowed to be classified
incorrectly), it is desirable that the True Positive Rate (TPR) increases as quickly
as possible (i.e., the derivative of the ROC curve is high) [19].
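All of the objective metrics above derive from the four confusion matrix counts. A minimal sketch (the dictionary layout is an illustrative choice):

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, error, precision, recall, F1, and FPR from raw counts."""
    p, n = tp + fn, tn + fp           # actual positives and negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # also the TPR axis of the ROC curve
    return {
        "accuracy": (tp + tn) / (p + n),
        "error": (fp + fn) / (p + n),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),        # the FPR axis of the ROC curve
    }
```

Sweeping a detector's decision threshold and recording (fpr, recall) at each setting traces out the ROC curve described in item 7.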
Other than the above objective metrics, which were predominantly used in the last
two decades, the following user-experience (subjective) metrics were also used as
alternative performance evaluation metrics in many sports video summarization
works.
1. The quality of each summary is evaluated in seven levels: extremely good, good,
upper average, average, below average, bad, and extremely bad [64].
2. Mean Opinion Score (MOS): Considering the following user experience rating,
(i) the overall highlights viewing experience is enjoyable, entertaining, and
pleasant and not marred by unexciting scenes, (ii) the generated scenes do
not begin or end abruptly, and (iii) the scenes are acoustically and/or visually
exciting.
3. Human vs. system detected shots (closeup, crowd, replay, sixer) [5].
4. Normalized Discounted Cumulative Gain (nDCG): a standard retrieval measure
   computed as follows:
A Systematic Review on Machine Learning-Based Sports Video. . . 13
   nDCG(k) = (1/Z) * Σ_{i=1}^{k} (2^{rel_i} − 1) / log2(i + 1)

where rel_i is the relevance score assigned by the users to clip_i and Z is a
normalization factor ensuring that a perfect ranking produces an nDCG score of
1.
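The nDCG computation can be sketched as follows (the clip relevance scores in the example are illustrative):

```python
import math

def ndcg_at_k(relevances, k):
    """nDCG@k: relevances[i] is the user-assigned relevance score of the
    clip ranked at position i + 1; Z is the DCG of the ideal ranking."""
    def dcg(scores):
        return sum((2 ** rel - 1) / math.log2(i + 2)  # position i+1 -> log2(i+2)
                   for i, rel in enumerate(scores[:k]))
    z = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / z if z > 0 else 0.0

# A perfect ranking scores exactly 1; any mis-ordering scores below 1.
print(ndcg_at_k([3, 2, 1], k=3))  # -> 1.0
print(ndcg_at_k([1, 2, 3], k=3))  # below 1.0
```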
Some of the works [34, 63, 74] specific to cricket and soccer video summarization
have several weaknesses, or leave issues unaddressed. In this section, those
weaknesses and the scope for future research are discussed based on the outcomes
of the selected papers. Section 4.1 groups the weaknesses into categories.
Section 4.2 highlights the scope for further research in sports video
summarization.
Table 1 (continued)
Study  Year of publication  Major idea  Type of sports
[27] 2019 Cricket stroke dataset creation Cricket
[52] 2019 Outcome classification in cricket using deep Cricket
learning
[36] 2019 Classify bowlers Cricket
[60] 2019 AlexNet CNN-based approach Soccer
[41] 2019 CNN-based crowd event detection Cricket
[35] 2019 Decomposed audio information Soccer
[40] 2019 Confined elliptical local ternary patterns and Cricket, tennis,
extreme learning machine baseball, and
basketball
[59] 2019 Multimodal excitement features Golf and tennis
[26] 2020 Cricket stroke localization Cricket
[64] 2020 Transfer learning for scene classification Soccer
[34] 2020 Hybrid approach Soccer
[45] 2020 Content aware summarization—audiovisual Cricket, soccer,
approach rugby, basketball,
baseball, football,
tennis, snooker,
handball, hockey, ice
hockey, and
volleyball
[32] 2020 Multimodal multi-labeled extraction Soccer
In this section, the existing methods and their weaknesses are highlighted under the
categories mentioned below.
Table 2 (continued)
Study Algorithms Methods Output
[93] 1. Acoustic feature Mel bank filtering, local Highlight scene
extraction autocorrelation on complex generation
2. Highlight scene Fourier values – non-learning
detection methods
Complex Subspace Method –
Unsupervised Learning
[75] 1. Event detection 1. Frame difference for video Video summary with
2. Video shot shot important events
segmentation 2. CNN + SVM framework for
3. Replay detection replay detection
4. Scoreboard detection 3. Pretrained AlexNet for OCR
5. Playfield scenario 4. CNN + SVM for classifying
detection frames
5. Audio cues for excitement
detection
6. AlexNet for player celebration
[9] 1. Highlighted moment 1. Hot spot/special moment Resource constrained
detection through audio detection based on two acoustic summarization based on
cues features user’s narrative
2. Shot (or clip) 2. View type subsequence preference
boundary matching
detection/video 3. Lagrangian optimization and
segmentation convex-hull approximation –
3. Sub-summaries non-learning methods
detection
[80] 1. Excited speech 1. Gaussian mixture models Event highlights
segmentation through 2. Unsupervised model generated based on
pretrained pitched adaptation – average excited speech score
speech segment log-likelihood ratio score
2. Excited speech 3. Maximum a posteriori
detection adaptation – learning methods
[81] Event detection 1. Unsupervised event discovery Video highlights of
Highlight clip detection based on color histogram of cricket
oriented gradients
2. Supervised phase trains SVM
from clips labeled as highlight or
non-highlight
[10] 1. Shot boundary 1. Computation of convex hull of Collection of
detection the benefit/cost curve of each nonoverlapping
2. Video segmentation segment sub-summaries under
3. Candidate 2. Lagrangian relaxation – the given
sub-summaries non-learning methods user-preferences and
preparation duration constraint.
4. Metadata extraction
[54] 1. Pitch segmentation 1. Temporal segmentation to Match summarization
using K-means detect boundaries and wickets with semantic results
2. SVM classifier to 2. Replay detection using Hough like batting, bowling,
recognize digits transform-based tracking boundary, etc.
3. Finite state 3. Ad detection using transitions
automation model based 4. Camera motion using KLT
on semantic rules method
5. Scene change using hard cut
detection
6. Crowd view detection using
textures
7. Boundary view detection
using field segmentation
[39] 1. Excitement detection 1. Rule-based induction to find Summarized video with
2. Key events detection excited clips key events
3. Decision tree for 2. Score caption region using
video summarization temporal image averaging
3. OCR to recognize the
characters
4. Gradual transition
detection by dual
threshold-based method
[41] Event recognition CNN (baseline and VGG16) to Classification of crowd
detect predefined events video into four classes:
marriage, cricket,
Jallikattu, and shopping
mall
[42] Playfield and 1. Color histogram analysis Extracted key frame
non-playfield detection 2. Extraction of dominant color
frames-thresholding hue values –
non-learning methods
[26] 1. Construction of two 1. Pretrained C3D model with Two cricket strokes
learning-based GRU training datasets
localization pipelines 2. Boundary detection with first
2. Boundary detection frame classification
3. Modified weighted mean
TIoU for single category
temporal localization problem
[72] 1. Video shot 1. K-means clustering to build Annotated video clips
recognition visual vocabulary containing events of
2. Shot classification 2. Shot representation by bag of interest
3. Text classification words
3. Classification using multiclass
Kernel SVM
4. Linear SVM for bowler and
batsman category
[52] 1. Jittered sampling 1. Pretrained VGGNet is used on Automatically
2. Temporal ImageNet dataset for transfer generated commentary.
augmentation learning Classify the outcome of
3. Training; hyper 2. LRCN to classify the each ball as run, dot,
parameter tuning ball-by-ball activities boundary and wicket
[19] Video classification of 1. Adam optimizer for training Classified shots of
cricket shots the model cricket video that
2. CNN model with 13 layers for belongs to cut shot,
classification cover drive, straight
drive, pull shot, leg
glance shot, and scoop
shot
[53] 1. Play break detection 1. Block creation Summarized video
2. Event detection 2. Thresholding the duration based on the priorities
through visual, audio, between continuous long shots block merging
and text cues 3. Detecting low-level features
3. Peak detection (find (occurrence of replay scene, the
similar events) excited audience or commentator
speech, certain camera motion or
certain highlight event-related
sound, and crowd excitement)
4. Grass pixel ratio to detect the
boundaries in cricket
5. Audio feature extraction:
Root mean square volume
Zero crossing rate
Pitch period
Frequency
Centroid
Frequency bandwidth
Energy ratio
6. Priority assignment – non
learning methods
7. SVM to identify text line from
frame
8. Optical Character Recognition
(OCR) for text recognition –
learning methods
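Two of the low-level audio features listed for [53], root-mean-square volume and zero-crossing rate, can be computed directly from the raw waveform; a numpy sketch over a synthetic tone (the frame size and sample rate are example choices, not taken from the paper):

```python
import numpy as np

def rms_volume(frame):
    """Root-mean-square volume of one audio frame."""
    return float(np.sqrt(np.mean(frame ** 2)))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ --
    a crude indicator of how noisy or high-pitched the frame is."""
    signs = np.sign(frame)
    return float(np.mean(signs[:-1] != signs[1:]))

# Synthetic example: a 440 Hz tone sampled at 8 kHz, one 1024-sample frame.
sr, freq = 8000, 440
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * freq * t)
print(rms_volume(tone))          # close to 1/sqrt(2) for a sine wave
print(zero_crossing_rate(tone))  # roughly 2 * freq / sr
```

Excitement detectors of the kind surveyed above typically threshold such features per frame and look for sustained peaks.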
– The motivation for choosing some of the core classifiers, such as the HRF-DBN
in [74] for labeling each shot and the RF classifier for dividing the shots, is not
clearly given.
– Umpire jerseys and their colors are key elements in detecting umpire frames.
Though this appears to be a straightforward approach, there is no discussion of
the challenges faced while segmenting the frames, for example, the color
variation of jerseys under different light intensities [74].
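The lighting sensitivity noted above is commonly mitigated by thresholding on hue rather than raw RGB, since hue is largely independent of intensity. A small illustrative sketch, with all colors and thresholds invented for the example (not taken from [74]):

```python
import colorsys
import numpy as np

def hue_mask(rgb_image, hue_lo, hue_hi, min_sat=0.3):
    """Boolean mask of pixels whose hue falls in [hue_lo, hue_hi].
    Thresholding on hue (rather than raw RGB) gives some robustness
    to brightness changes, since value/intensity is ignored."""
    h, w, _ = rgb_image.shape
    mask = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            r, g, b = rgb_image[y, x] / 255.0
            hue, sat, _ = colorsys.rgb_to_hsv(r, g, b)
            mask[y, x] = hue_lo <= hue <= hue_hi and sat >= min_sat
    return mask

# Synthetic frame: left half bright blue "jersey", right half the same
# hue at half the brightness -- both halves survive the hue threshold.
frame = np.zeros((4, 8, 3), dtype=np.uint8)
frame[:, :4] = (20, 40, 220)   # bright blue
frame[:, 4:] = (10, 20, 110)   # same hue, darker
blue = hue_mask(frame, hue_lo=0.55, hue_hi=0.72)
print(blue.all())  # True: hue thresholding tolerates the intensity change
```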
Table 3 (continued)
Study Algorithms Methods Output
[47] 1. Extraction of exciting 1. Non-learning methods Generated highlights based
clips from audio cues using to detect different views on the selected labeled
short time audio energy 2. Bayes Belief Network clips based on the degree
algorithm (to assign semantic of importance
2. Event detection and concept labels to the
classification (annotation) exciting clips: goals, saves,
using hierarchical tree yellow cards, red cards,
3. Exciting clip selection and kicks in video
4. Temporal ordering of sequence) – learning
selected exciting clips method
[43] Scene classification 1. Radial basis Real-time video indexing
decompositions of a color and dataset
address space followed by
Gabor wavelets in
frequency space
2. The above is used to
train SVM classifier
[35] 1. Split audio and video 1. Empirical Mode Generated events (goals,
2. Intrinsic Mode Function Decomposition (EMD) to shots on goal, shots off
(IMF) extraction from filter the noise and extract goal, red card, yellow card,
audio signal audio penalty decision)
3. Feature extraction from 2. Non-learning methods
energy matrix of the signal to extract features and
((a) energy level of the compute shot score and
frame in shot, (b) audio summary generation
power increment, (c)
average audio energy
increment in continuous
shots, (d) whistle detector)
[66] 1. Video shot segmentation 1. VJ AdaBoost method Highlight generated based
2. MPEG-7-based audio with skin filter for face on user input
descriptor detection – learning
3. Whistle detector method
4. MPEG-7 motion 2. Other algorithms used
descriptor non-learning methods such
5. MPEG-7 color as Discrete Fourier
descriptor Transform for whistle
6. Replay detector detection
7. Persons detector
8. Long-shot detector
9. Zooms detector
[93] 1. Shot boundary detection 1. SVM and NN (replay Highlights the most
2. Shot-type, play break and scoreboard) important events that
classification 2. K-means include goals and goal
3. Replay detection 3. Hough transform attempts
4. Scoreboard detection (vertical goal post
5. Excitement event detection)
detection 4. Gabor filter (Goal Net)
6. Logo-based event 5. Volume of each audio
detection frame,
7. Audio loudness subharmonic-to-harmonic
detection ratio-based pitch
determination, dynamic
thresholds – learning and
non-learning methods
[78] 1. Define segmentation 1. Background subtraction The output video is
points using GMM for replay parameterized based on
2. Replay detection detection events over time and the
3. Player detection and 2. YOLO for player user priority list.
interpolation detection
4. Soccer event 3. Histogram of optical
segmentation flow to capture player
5. Bin-packing to select motion – learning and
subset of plays based on non-learning methods
utility from eight bins
[18] Video classification of 1. Adam optimizer for Classified shots of cricket
cricket shots training the model video that belongs to cut
2. CNN model with 13 shot, cover drive, straight
layers for classification drive, pull shot, leg glance
shot, and scoop shot
[60] 1. Shot classification 1. AlexNet CNN for shot Classified shots of sports
classification video with classes like
close, crowd, long, and
medium shots
[64] Scene classification Pretrained AlexNet CNN Classified shots into
batting, bowling, boundary,
crowd, and close-up
[90] 1. Detect candidate set for 1. Difference and Detected replay from given
logo template accumulated difference in video sequence
2. Find logo template from a window of 20 frames
the candidate set 2. K-means clustering to
3. Match the logo (pair find exact logo template
logo for replay detection) 3. Adaptive criterion:
frame difference and mean
intensity of the current
frame with those of the
logo template – learning
and non-learning methods
[82] 1. Video segmentation 1. Two-stream deep neural User-generated sports
2. Highlight classification network (1. Holistic video (UGSV)
feature stream: 2D CNN 2.
Body joint stream: 3D
CNN) – trained from lower
layer to the top layers by
using a UGSV
summarization dataset.
2. LSTM (highlight
classification)
[45] 1. Scorebox detection 1. Nonoverlapping sliding Highlight generated based
(binary map hole filling window operation on on user preferences
algorithm) frame pairs for scorebox
2. OCR to recognize text in detection
scorebox 2. OCR using deep CNN
3. Parse and clean with 25 layers
algorithm to recognize 3. Butterworth band-pass
clearly text from text filter and Savitzky-Golay
region smoothing filter for audio
4. Audio feature extraction feature extraction
5. Key frame detection 4. Speech to text using
(start and end frame Google API2 – both
estimation algorithm) learning and non-learning
methods are used.
[32] 1. Unimodal learning 1. (a) Multibranch 1. Highlight generated
2. Multimodal learning Convolutional Networks based on unimodal
3. Multimodal and (merge the convolutional learning
multi-label learning features from input frames; 2. Highlight generated
then the regression value is based on multimodal
obtained) learning
(b) 3D CNN to capture 3. Highlight generated
more temporal and motion based on multimodal,
information. multi-label learning
(c) Long-term Recurrent
Convolutional Networks
uses pretrained CNN
model to extract features.
2. Pretrained CNN features
with NN (latent features
fusion (LFF) and
pretrained CNN features
with deep NN (early
features fusion (EFF))
3. (a) Construct a network
for training each label
separately
(b) Jointly train a
multi-label network, and
extract the joint features
from the last dense layer.
[63] 1. Shot detection Non-learning methods Generated video summary
(a) Shot classification used to identify important based on user input on the
(b) Replay detection events from long, medium, length (N clips) of the
2. Video segmentation close, and replay views summary
(collection of successive
shots)
3. Clip boundary
calculation
4. Interest level measure
[34] 1. Shot boundary detection 1. Linear SVM classifier Summary with replay and
2. Shot view classification 2. Green color dominance without replay segments
(global view, medium and threshold frequencies
view, and close-up view) over player bounding box
3. Replay detection 3. Histogram difference
4. Play break detection between logo frames
5. Penalty box detection 4. Statistical measure for
6. Key event detection key event detections
– The shots are classified into only specific categories, as in [6, 9, 34, 45, 60]; the
shot detection in these works is not generalized. For example, in [34], the size
of the bounding box is the major parameter for deciding between medium and
long shots, which can fail if shadows are detected as boundaries. The authors
have not addressed such issues.
– The authors of [34] use an algorithm that finds three parallel lines in a frame to
compute the near-goal ratio. Due to perspective distortion from the camera
angle, these lines may not appear parallel; this issue is not addressed.
– The clip boundaries are detected using the camera motion [34]. This may not be
applicable to other sports where the camera keeps moving.
– In most works, the video samples and their resolutions are assumed to be much
lower than broadcast quality; the frame resolution is given as 640 × 480 in [74].
In practice, video resolution varies: either every video must be downsampled to
a standard processing size, or the method must be flexible enough to process
videos at any resolution.
– The number of samples used in [74] is significantly small, and the performance
of the models is justified only with such a low number of samples. The impact on
the algorithm's performance when the sample count increases is not specified.
– Only a few methods mention which sports series was used for the video
samples, and the reason for choosing a specific series is not clearly given.
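The resolution concern raised above can be addressed by aspect-preserving downsampling to a standard processing size; a minimal sketch, where the target width and the even-height rounding policy are illustrative choices, not requirements from any surveyed method:

```python
def fit_resolution(width, height, target_width=640):
    """Scale an arbitrary input frame size down to a standard processing
    width while preserving aspect ratio (heights rounded to even values,
    as many video codecs require)."""
    if width <= target_width:
        return width, height            # never upsample
    scale = target_width / width
    new_h = int(round(height * scale / 2)) * 2
    return target_width, new_h

print(fit_resolution(1920, 1080))  # Full HD -> (640, 360)
print(fit_resolution(3840, 2160))  # 4K      -> (640, 360)
print(fit_resolution(640, 480))    # already small -> unchanged
```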
Table 4 (continued)
Study Algorithms Methods Output
[30] 1. Shot detection 1. Structural Similarity Events are tagged by above
2. Penalty cornet and Index Measure (SSIM) event names and stitched
penalty stroke detection 2. Color segmentation and in the order of appearance
3. Umpire gesture morphological operations – based on user preferences
detection long shot/umpire shot to generate customized
4. Foul detection 3. Field color detection and highlights.
5. Replay and logo shot skin color detection –
detection close-up shot
6. Goal detection 4. Hough transformation
and morphological
operations – goal post
shot – Non-learning
methods
[37] 1. Rally scene detection 1. Unsupervised shot Extracted highlights from
clustering based on HSV unimodal integrated with
histogram multimodal
2. Correlation analysis
between court position and
ball position
3. Rally rank evaluation –
adjusted R squares –
learning methods
[40] 1. Replay segment 1. Thresholding-based Detected replay events
extraction approach (fade-in and
2. Key events detection fade-out transition during
(a) Motion pattern start and end of replay)
detection 2. Gaussian mixture model
(b) Feature extraction (GMM) to extract
silhouettes and generate
motion history image
(MHI) for each key event
3. Confined elliptical local
ternary patterns (CE-LTPs)
for feature extraction
4. Extreme learning
machine (ELM) classifier
for key event detection –
learning methods
[59] 1. Audio analysis (crowd 1. Sound net classifier Automatically extracting
cheer and the commentator (crowd cheer and highlights
excitement detection) commentator speech)
2. Visual analysis 2. Linear SVM classifier
(players – action (commentator excitement
recognition, shot boundary detection – tone based)
detection) 3. Speech to text
3. Text analysis (text conversion (commentator
based: 60 words/phrases excitement)
dictionary) 4. VGG-16 model (player
action of celebration)
5. Optical Character
Recognition (OCR) –
learning methods
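Table 4 lists the Structural Similarity Index Measure (SSIM) as the shot-detection tool in [30]; that work's exact implementation is not given, so the following single-window numpy variant is a sketch (the constants are the standard ones for 8-bit images, (0.01 × 255)² and (0.03 × 255)²):

```python
import numpy as np

def ssim(x, y, c1=6.5025, c2=58.5225):
    """Single-window Structural Similarity Index between two grayscale
    frames; 1.0 means identical, lower means more dissimilar."""
    x = x.astype(float)
    y = y.astype(float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

rng = np.random.default_rng(2)
a = rng.integers(0, 256, (32, 32))
print(ssim(a, a))        # identical frames -> 1.0
print(ssim(a, 255 - a))  # inverted frame -> much lower
```

A shot-detection pipeline would compare consecutive frames and flag a boundary when SSIM drops below a tuned threshold.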
– The methods that use the AlexNet CNN and Decision Tree classifiers do not
apply a substantial number of samples to evaluate the robustness of the model;
some used as few as 50 samples.
– Datasets have a limited number of samples, as in [25, 64, 67]. Videos were
chosen based on a predefined view type, with apparently no preprocessing
done to classify the videos into different types of views.
– Only four key events are considered in [34]. Likewise, most event-based
methods consider only the standard events of a sport (boundaries in cricket, goals
in soccer, etc.). Increasing the number of key events, or adding sub-events of the
main key events, would have strengthened this work.
– The audio cues in most works [7, 45, 66, 92] use cheer energy or the spectrum
of the commentators. Attempts to use on-field audio, such as stump microphones
and umpires' microphones, have not been reported in any of the works.
– The replay is the key to identifying the semantic segments of the video in [20, 40, 90];
if a replay is not identified, the segments carry misleading semantics.
– Human subjective evaluation may not always reveal the true results, as in
[74]; the number of samples used for evaluation is too small to support any
firm conclusion.
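The frame-difference cues that underpin replay and logo detection in works such as [90] can be illustrated with a small numpy sketch; the spike factor and the synthetic clip are assumptions made for the example, not values from the paper:

```python
import numpy as np

def frame_diffs(frames):
    """Mean absolute pixel difference between consecutive frames."""
    return [float(np.abs(frames[i].astype(int)
                         - frames[i - 1].astype(int)).mean())
            for i in range(1, len(frames))]

def candidate_transitions(frames, factor=3.0):
    """Indices of frames whose inter-frame difference spikes above
    `factor` times the median difference -- candidates for the logo
    transitions that bracket a replay segment."""
    diffs = frame_diffs(frames)
    med = float(np.median(diffs)) or 1e-6
    return [i + 1 for i, d in enumerate(diffs) if d > factor * med]

# Synthetic clip: near-static play, one bright logo flash at frame 6.
rng = np.random.default_rng(1)
base = rng.integers(100, 110, (16, 16))
frames = [base + rng.integers(0, 3, (16, 16)) for _ in range(12)]
frames[6] = rng.integers(240, 255, (16, 16))  # the logo frame
print(candidate_transitions(frames))  # -> [6, 7]: into and out of the logo
```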
The major objective of sports video summarization is to shorten a broadcast video
so that it shows only the interesting events. Sports videos are inherently long; some
sports, such as cricket, produce days-long videos, so automated highlight generation
must deal with huge volumes of video data. Under current broadcasting standards,
every video to be processed will be at high resolution, sometimes up to 4K or 8K.
Any algorithm developed to automate video summarization should address these
requirements; hence, prospective research should handle both high resolution and
high data volume.
Besides video resolution, the lack of standardized datasets is a major obstacle for
researchers trying to benchmark results. Only a handful of researchers [25, 67] have
attempted to create datasets for sports summarization; most proposed methods use
custom-built datasets or commonly available data from sources such as YouTube.
Creating and standardizing datasets with large collections of samples for each sport
is a high-priority research project.
In the literature, it is found that most of the methods [6, 21, 34, 36, 38, 43, 50,
54, 59, 67, 72, 85] applied two levels of model building. Since the method of video
5 Conclusion
high volume of data like sports video. The results of such summarization can be
instantly compared with the highlights generated by broadcasting channels at the
end of every match. Going further, such highlights should also include additional
key events and drama, not just the key events of the game. For instance, in soccer,
if a player is given a red card, manual editing will show all the events involving
that player which led to the red card, and editors sometimes also show his activities
from previous matches. The automated system should be capable of identifying
such key events and including them in the highlights. Other elements, such as the
pre- and post-match ceremonies, players' entries, and injuries to players, should
also be captured. Eventually, machine learning-based methods should learn to
account for the style of commentary, team jerseys, noise removal from crowd
cheering, series-specific scene transitions, and smooth commentary or video cuts,
potentially reducing the human editor's work. It is anticipated that this chapter
will serve as a standard reference for researchers actively developing video
summarization algorithms using learning or non-learning approaches.
References
1. Rahman, A. A., Saleem, W., & Iyer, V. V. (2019). Driving behavior profiling and prediction in KSA
using smart phone sensors and MLAs. In 2019 IEEE Jordan international joint conference on
Electrical Engineering and Information Technology (JEEIT) (pp. 34–39).
2. Ajmal, M., Ashraf, M. H., Shakir, M., Abbas, Y., & Shah, F. A. (2012). Video summarization:
Techniques and classification. In Computer vision and graphics (Vol. 7594). ISBN: 978-3-642-
33563-1.
3. Sen, A., Deb, K., Dhar, P. K., & Koshiba, T. (2021). CricShotClassify: An approach to
classifying batting shots from cricket videos using a convolutional neural network and gated
recurrent unit. Sensors, 21, 2846. https://doi.org/10.3390/s21082846
4. Halin, A. A., & Mandava, R. (2013, January). Goal event detection in soccer videos via
collaborative multimodal analysis. Pertanika Journal of Science and Technology, 21(2), 423–
442.
5. Amruta, A. D., & Kamde, P. M. (2015, March). Sports highlight generation system based on
video feature extraction. IJRSI (2321–2705), II(III).
6. Bagheri-Khaligh, A., Raziperchikolaei, R., & Moghaddam, M. (2012). A new method for shot
classification in soccer sports video based on SVM classifier. In Proceedings of the 2012 IEEE
Southwest Symposium on Image Analysis and Interpretation (SSIAI). Santa Fe, NM.
7. Baijal, A., Jaeyoun, C., Woojung, L., & Byeong-Seob, K. (2015). Sports highlights generation
based on acoustic events detection: A rugby case study. In 2015 IEEE International Conference
on Consumer Electronics (ICCE) (pp. 20–23). https://doi.org/10.1109/ICCE.2015.7066303
8. Bochkovskiy, A., Wang, C.-Y., & Liao, H.-Y. M. (2020). YOLOv4: Optimal speed and
accuracy of object detection. arXiv:2004.10934 [cs.CV].
9. Chen, F., De Vleeschouwer, C., Barrobés, H. D., Escalada, J. G., & Conejero, D. (2010).
Automatic summarization of audio-visual soccer feeds. In 2010 IEEE international conference
on Multimedia and Expo (pp. 837–842). https://doi.org/10.1109/ICME.2010.5582561
10. Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully
convolutional networks. In Advances in neural information processing systems (pp. 379–387).
11. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In 2005
IEEE Computer Society conference on Computer Vision and Pattern Recognition (CVPR ‘05)
(Vol. 1, pp. 886–893). https://doi.org/10.1109/CVPR.2005.177
12. Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves.
In Proceedings of the 23rd International Conference on Machine Learning (ICML ‘06) (pp.
233–240). ACM, New York, NY, USA. https://doi.org/10.1145/1143844.1143874
13. Asadi, E., & Charkari, N. M. (2012). Video summarization using fuzzy c-means clustering. In
20th Iranian conference on Electrical Engineering (ICEE2012) (pp. 690–694). https://doi.org/
10.1109/IranianCEE.2012.6292442
14. Ekin, A., Tekalp, A., & Mehrotra, R. (2003). Automatic soccer video analysis and summariza-
tion. IEEE Transactions on Image Processing, 12(7), 796–807.
15. Fani, M., Yazdi, M., Clausi, D., & Wong, A. (2017). Soccer video structure analysis by parallel
feature fusion network and hidden-to-observable transferring Markov model. IEEE Access, 5,
27322–27336.
16. Felzenszwalb, P. F., Girshick, R. B., & McAllester, D. (2010). Cascade object detection with
deformable part models. In 2010 IEEE computer society conference on Computer Vision and
Pattern Recognition (pp. 2241–2248). https://doi.org/10.1109/CVPR.2010.5539906
17. Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010, September). Object
detection with discriminatively trained part-based models. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 32(9), 1627–1645. https://doi.org/10.1109/TPAMI.2009.167
18. Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, mul-
tiscale, deformable part model. In 2008 IEEE conference on Computer Vision and Pattern
Recognition (pp. 1–8). https://doi.org/10.1109/CVPR.2008.4587597
19. Foysal, M. F., Islam, M., Karim, A., & Neehal, N. (2018). Shot-Net: A convolutional neural
network for classifying different cricket shots. In Recent trends in image processing and pattern
recognition. Springer Singapore.
20. Ghanem, B., Kreidieh, M., Farra, M., & Zhang, T. (2012). Context-aware learning for
automatic sports highlight recognition. In Proceedings of the 21st International Conference
on Pattern Recognition (ICPR2012) (pp. 1977–1980).
21. Girshick, R. B. (2012). From rigid templates to grammars: object detection with structured
models (Ph.D. Dissertation). University of Chicago, USA. Advisor(s) Pedro F. Felzenszwalb.
Order Number: AAI3513455.
22. Girshick, R. B., Felzenszwalb, P. F., & Mcallester, D. A. (2011). Object detection with grammar
models. In Proceedings of the 24th international conference on Neural Information Processing
Systems (NIPS’11) (pp. 442–450). Curran Associates Inc., Red Hook, NY, USA.
23. Girshick, R. (2015). Fast R-CNN. In 2015 IEEE International Conference on Computer
Vision (ICCV) (pp. 1440–1448). https://doi.org/10.1109/ICCV.2015.169
24. Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2016, January 1). Region-based con-
volutional networks for accurate object detection and segmentation. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 38(1), 142–158. https://doi.org/10.1109/
TPAMI.2015.2437384
25. Gonzalez, A., Bergasa, L., Yebes, J., & Bronte, S. (2012). Text location in complex images. In
IEEE ICPR.
26. Gupta, A., & Muthaiah, S. (2020). Viewpoint constrained and unconstrained Cricket stroke
localization from untrimmed videos. Image and Vision Computing, 100.
27. Gupta, A., & Muthaiah, S. (2019). Cricket stroke extraction: Towards creation of a large-scale
cricket actions dataset. arXiv:1901.03107 [cs.CV].
28. Gupta, A., Karel, A., & Sakthi Balan, M. (2020). Discovering cricket stroke classes in trimmed
telecast videos. In N. Nain, S. Vipparthi, & B. Raman (Eds.), Computer vision and image
processing. CVIP 2019. Communications in computer and information science (Vol. 1148).
Springer Singapore.
29. Arpan, G., Ashish, K., & Sakthi Balan, M. (2021). Cricket stroke recognition using hard and
soft assignment based bag of visual words. In Communications in computer and information
science (pp. 231–242). Springer Singapore. https://doi.org/10.1007/2F978-981-16-1092-2021
30. Hari, R. (2015, November). Automatic summarization of hockey videos. IJARET (0976–6480),
6(11).
31. Harun-Ur-Rashid, M., Khatun, S., Trisha, Z., Neehal, N., & Hasan, M. (2018). Crick-net: A
convolutional neural network based classification approach for detecting waist high no balls in
cricket. arXiv preprint arXiv:1805.05974.
32. He, J., & Pao, H.-K. (2020). Multi-modal, multi-labeled sports highlight extraction. In 2020
international conference on Technologies and Applications of Artificial Intelligence (TAAI)
(pp. 181–186). https://doi.org/10.1109/TAAI51410.2020.00041
Shot Boundary Detection from Lecture
Video Sequences Using Histogram
of Oriented Gradients and Radiometric
Correlation
1 Introduction
to such demands for the users [1]. In this process, a video is first segmented into successive shots. To automate shot segmentation, the subsequent frames must be analyzed for changes in visual content; these changes can be abrupt or gradual. After the shot boundaries are detected, key frames are extracted from each shot. Key frames provide a suitable abstraction and framework for video indexing, browsing, and retrieval. Their use significantly reduces the amount of data required in video indexing and provides an organizational framework for dealing with video content. Users searching for a video of interest browse the videos randomly and view only certain key frames that match the content of the search query. CBVR has various stages, such as shot segmentation, key frame extraction, feature extraction, feature indexing, retrieval mechanism, and result ranking [1].
These key frames are used for image-based video retrieval, where an image is given as a query to retrieve a video from a collection of lecture videos. A variety of approaches have been reported in the literature. The simplest is the pixel-wise difference between consecutive frames [2], but it is very sensitive to camera motion. An approach based on local statistical differences is proposed in [3]: the image is divided into a few regions, and statistical measures such as the mean and standard deviation of the gray levels within the regions are compared. However, this approach is computationally burdensome. The most common and popular methods for shot boundary detection are based on histograms [4–6]. The simplest one computes the gray-level or color histogram of two consecutive images; if the sum of the bin-wise differences between the two histograms is above a threshold, a shot boundary is assumed. These approaches are relatively stable, but the absence of spatial information may produce substantial dissimilarities between frames and hence reduce accuracy. Mutual information computed from the joint histogram of consecutive frames has also been used for this task [7]. Well-known machine learning and pattern recognition methods, such as neural networks [8], KNN [9], fuzzy clustering [10, 11], and support vector machines [12], have also been applied to shot boundary detection. A shot boundary detection method based on orthogonal polynomials is proposed in [13], where an orthogonal polynomial function is used to identify the shot boundaries in the video sequence.
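To make the histogram-based baseline concrete, the following minimal sketch flags a boundary wherever the bin-wise histogram difference between consecutive frames exceeds a threshold. The bin count and threshold value are illustrative assumptions, not values taken from the cited works:

```python
import numpy as np

def gray_histogram(frame, bins=64):
    """Normalized gray-level histogram of a 2-D uint8 frame."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def histogram_shot_boundaries(frames, threshold=0.5, bins=64):
    """Flag a shot boundary at frame i when the sum of bin-wise
    differences between the histograms of frames i-1 and i
    exceeds the threshold."""
    boundaries = []
    prev = gray_histogram(frames[0], bins)
    for i, frame in enumerate(frames[1:], start=1):
        cur = gray_histogram(frame, bins)
        if np.abs(cur - prev).sum() > threshold:
            boundaries.append(i)
        prev = cur
    return boundaries
```

As the surrounding discussion notes, such a detector is stable but blind to spatial layout: two very different frames with similar gray-level statistics produce no boundary.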
In essence, previous works show that researchers have proposed numerous types of features and dissimilarity measures. Many state-of-the-art techniques suffer from the difficulty of selecting thresholds and window sizes, and such methods limit the accuracy of shot boundary detection by generating false positives under illumination change. The next phase after shot detection is key frame extraction. A key frame is a representative frame for an individual shot. One popular approach to key frame extraction uses singular value decomposition (SVD) and correlation minimization [14, 15]. Another method is KS-SIFT [16]: it extracts local visual features using SIFT, represented as feature vectors, from a selected group of frames of a video shot. The KS-SIFT method analyzes those feature vectors to eliminate near-duplicate key frames, helping to keep a compact key frame set. But it takes more computation time, and the approach
Shot Boundary Detection from Lecture Video Sequences Using Histogram. . . 37
The block diagram of the proposed shot boundary detection scheme is shown in Fig. 1. The proposed scheme follows three steps: feature extraction, shot boundary detection, and key frame extraction. Initially, HOG features are extracted from all the frames of the sequence. The extracted HOG feature vector of each frame is compared with that of the subsequent frame using the radiometric correlation [23] measure. The local entropy of the radiometric correlation is then computed to identify the shot boundaries in the lecture video. In the next step, the key frames of each shot are extracted by analyzing the peaks and valleys of the radiometric correlation.
Fig. 1 Block diagram of the proposed scheme: frames are extracted from the video, the ith and (i+1)th frames are compared, the entropy is computed over a sliding window, and a shot boundary is declared if the entropy is below the threshold
In the proposed scheme, we use the HOG feature for our analysis. The HOG feature was initially proposed for object detection [23] in computer vision and image processing. The method is based on evaluating well-normalized histograms of image gradient orientations. The basic idea is that object appearance and shape can often be characterized well by the distribution of intensity gradients or edge directions, without precise knowledge of the corresponding gradient or edge positions. HOG captures the edge or gradient structure that describes the shape, using a representation with an easily controllable degree of invariance to local geometric and photometric transformations: translations or rotations make little difference if they are much smaller than the spatial or orientation bin size. Since it is gradient based, it captures object shape information very well. The histograms are contrast-normalized by calculating a measure of the intensity across the image and then normalizing all the values. As comparing consecutive frames of the video is the key to detecting a shot boundary, HOG is a good choice because it is computationally fast. The proposed shot boundary detection algorithm uses the radiometric correlation and an entropic measure for shot transition identification, discussed in detail in the next section.
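For intuition, a heavily simplified HOG-style descriptor can be sketched as below: a single gradient-magnitude-weighted orientation histogram over the whole frame, omitting the cell and block contrast-normalization structure of the full method. The bin count is an illustrative assumption:

```python
import numpy as np

def hog_like_descriptor(frame, n_bins=9):
    """Simplified HOG-style descriptor: a gradient-magnitude-weighted
    histogram of unsigned gradient orientations over the whole frame.
    (The full HOG additionally uses cells and contrast-normalized
    overlapping blocks.)"""
    f = frame.astype(float)
    gy, gx = np.gradient(f)                      # image gradients per axis
    mag = np.hypot(gx, gy)                       # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, np.pi), weights=mag)
    norm = np.linalg.norm(hist) + 1e-12          # simple contrast normalization
    return hist / norm
```

A frame dominated by vertical edges concentrates its mass in the horizontal-gradient bin, which is exactly the shape sensitivity the proposed scheme relies on.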
The basic idea behind shot boundary detection in a lecture sequence is to find the similarity/correlation between consecutive frames of the video and to locate discontinuities in it. In this regard, we consider a radiometric correlation-based similarity measure to find the correlation between the frames. The extracted HOG features of consecutive frames are compared to estimate the radiometric correlation. Here, it is assumed that the time instant coincides with the frame instant.
the frame instant. Let the successive frames of a sequence is represented by It (x, y)
−−→
and It − 1 (x, y) and the extracted HOG feature vectors be represented by HOGt and
−−→
HOGt−1 , respectively. Then the radiometric correlation is given by [23]
−−−→ −−−−−→ −−−→ −−−−−→
m HOGt .HOGt−1 − m HOGt m HOGt−1
R (It (x, y) , It−1 (x, y)) = ,
−−−→ −−−−−→
v HOGt v HOGt−1
(1)
−−−→ −−−−−→
where m HOGt .HOGt−1 represents the mean of the product of the extracted
feature vectors and can be obtained as
40 T. Veerakumar et al.
$$m\left(\overrightarrow{HOG}_t\right) = \frac{1}{n} \sum_{i=1}^{n} HOG_{(t,i)}, \quad (3)$$

and

$$v\left(\overrightarrow{HOG}_t\right) = \frac{1}{n} \sum_{i=1}^{n} \left(HOG_{(t,i)} - m\left(\overrightarrow{HOG}_t\right)\right)^2. \quad (4)$$
The radiometric correlation varies in the range [0, 1]. From the radiometric
correlation values obtained, a threshold is required to detect the shot boundary.
The radiometric correlation values for consecutive frames are calculated. So, for
N frames, (N−1) radiometric correlation values can be obtained. Figure 2a shows
the plot of radiometric correlation vs. frames of lecture video 1 sequence.
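A direct transcription of Eqs. (1), (3), and (4) into code might look as follows. The small epsilon in the denominator is an implementation detail assumed here to avoid division by zero for constant feature vectors; it is not part of the original formulation:

```python
import numpy as np

def radiometric_correlation(hog_t, hog_t1, eps=1e-12):
    """Radiometric correlation R of Eq. (1) between two HOG vectors:
    (mean of the product minus product of the means) divided by the
    product of the standard deviations (square roots of Eq. (4))."""
    hog_t = np.asarray(hog_t, dtype=float)
    hog_t1 = np.asarray(hog_t1, dtype=float)
    m_prod = np.mean(hog_t * hog_t1)            # Eq. (2): mean of the product
    m_t, m_t1 = hog_t.mean(), hog_t1.mean()     # Eq. (3): means
    v_t, v_t1 = hog_t.var(), hog_t1.var()       # Eq. (4): variances
    return (m_prod - m_t * m_t1) / (np.sqrt(v_t) * np.sqrt(v_t1) + eps)

def correlation_sequence(hog_vectors):
    """For N frames, the (N-1) correlations between consecutive frames."""
    return [radiometric_correlation(a, b)
            for a, b in zip(hog_vectors, hog_vectors[1:])]
```

Identical consecutive descriptors yield a correlation of 1, while a shot change drives the value sharply downward, which is the discontinuity the next stage searches for.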
After obtaining the radiometric correlation, the next step is shot boundary detection. The aim now is to identify the discontinuity point in this radiometric distribution of the consecutive frames. In Fig. 2a, it can be seen that there is a significant difference in the radiometric correlation values from one frame to another. However, finding the discontinuity in these values that corresponds to a shot transition is very difficult, and placing a threshold directly on these similarity values is not a good idea, as they vary widely. So we take the help of a moving window-based entropy measure on the radiometric correlation: rather than thresholding the radiometric correlation values themselves, a one-dimensional overlapping moving window is slid over these values to compute the entropic measure. This improves the performance.
A moving window is considered over the radiometric correlation plot, and the entropy is calculated for each location of the window. In information theory, entropy is used as a measure of uncertainty, and this
Fig. 2 Lecture video 1: (a) radiometric correlation for different frames, (b) corresponding entropy values
gives the average information. Hence, the use of entropy in our work will reduce
the randomness or vagueness of the local radiometric correlation. We calculate the
entropy $E_m$ at each point (frame) $m$ of the radiometric correlation values using the formula

$$E_m = \sum_{i \in \eta_m} p_i \log \frac{1}{p_i}, \quad (5)$$
$$\sigma_m = \sigma_l + \sigma_r, \quad (6)$$
where $\sigma_m$ is the total variance at frame position $m$, and $\sigma_l$ and $\sigma_r$ are the variances computed from the entropic information on the left and right sides of $m$. Then the threshold value is obtained by finding a point such that

$$Th = \arg\min_j \sigma_j, \quad (7)$$
where $j$ indexes the candidate shot-transition positions. For lecture video 1, we apply Eq. (7), and the shot boundary is detected at the 247th frame. This video has two shots, and hence one shot boundary is detected. The method also handles videos with more than two shots: for a video with $P$ shots, the total number of thresholds ($Th$) will be $(P-1)$. For automatic shot boundary detection, we assume that $j$ is a vector, represented by $\vec{j} = \{j_1, j_2, \ldots, j_{P-1}\}$, with at most $(P-1)$ components. The threshold is then likewise represented by a vector, $\overrightarrow{Th} = \{Th_1, Th_2, \ldots, Th_{P-1}\}$.
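The moving-window entropy of Eq. (5) and the variance-splitting threshold of Eqs. (6) and (7) can be sketched as below. The window size and the histogram binning used to estimate the probabilities p_i are illustrative assumptions, since the chapter does not specify them:

```python
import numpy as np

def window_entropy(correlations, window=15, bins=16):
    """Eq. (5): Shannon entropy of the correlation values inside an
    overlapping moving window centered at each frame position m.
    Probabilities p_i are estimated by histogramming the window."""
    corr = np.asarray(correlations, dtype=float)
    half = window // 2
    entropies = np.empty(len(corr))
    for m in range(len(corr)):
        seg = corr[max(0, m - half): m + half + 1]
        p, _ = np.histogram(seg, bins=bins, range=(0.0, 1.0))
        p = p / p.sum()
        p = p[p > 0]                             # log(1/p) is undefined at p = 0
        entropies[m] = np.sum(p * np.log(1.0 / p))
    return entropies

def boundary_from_entropy(entropies):
    """Eqs. (6)-(7): pick the position m minimizing the sum of the
    variances of the entropy values to its left and to its right."""
    e = np.asarray(entropies, dtype=float)
    sigma = [e[:m].var() + e[m:].var() for m in range(1, len(e))]
    return int(np.argmin(sigma)) + 1
```

For more than two shots, `boundary_from_entropy` would be applied recursively or extended to return the (P−1) smallest-variance split points, matching the vector threshold described above.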
Once the shot boundaries are extracted from a given sequence, key frames need to be extracted to represent each shot. It can be seen from the graph in Fig. 3 that the similarity measure varies within a particular shot. The maxima of this variation represent the frames that are most similar to their neighboring frames. The idea here is to pick as key frames the frames at the maxima of the similarity distribution. These maxima are the peaks of the distribution; if we can properly isolate them, we can find the key frames. However, there is temporal redundancy between consecutive frames of the video; hence, it is not a good idea to take two maxima that are close to each other. It is also to be noted that most shots contain significant variation in the radiometric similarity measure due to noise or illumination change. Hence, before the maxima are picked for shot representation, the similarity distribution is smoothed by a one-dimensional smoothing filter. Using this scheme, the key frames for the different shots are detected.
Figure 3 shows the key frames for the different shots (a total of three) of the lecture video 1 sequence: shot 1 [41, 187, 323], shot 2 [434, 772, 1013, 1249], and shot 3 [1291, 1345, 1394, 1464]. Once the key frames are extracted, it is checked whether the visual contents of two consecutive key frames are the same: the radiometric correlation is computed between consecutive key frames, and only significant key frames are selected as the final key frames for a particular shot. Applying this to lecture video 1, we obtained the final key frames [187, 1013, 1291, 1394].
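The key frame selection step described above (smoothing the similarity curve, then picking well-separated maxima) can be sketched as follows. The filter width and the minimum peak separation are illustrative assumptions:

```python
import numpy as np

def select_key_frames(similarity, smooth_width=5, min_gap=20):
    """Smooth the per-frame similarity values with a 1-D moving-average
    filter, then pick local maxima that are at least min_gap frames
    apart, mirroring the temporal-redundancy argument in the text."""
    kernel = np.ones(smooth_width) / smooth_width
    s = np.convolve(similarity, kernel, mode="same")   # 1-D smoothing filter
    # local maxima of the smoothed curve
    peaks = [i for i in range(1, len(s) - 1) if s[i - 1] < s[i] >= s[i + 1]]
    # keep the strongest peaks while enforcing a minimum temporal gap
    selected = []
    for i in sorted(peaks, key=lambda i: s[i], reverse=True):
        if all(abs(i - j) >= min_gap for j in selected):
            selected.append(i)
    return sorted(selected)
```

The subsequent pruning stage (comparing the radiometric correlation between consecutive key frames) would then thin this list down to the final key frames.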
Fig. 3 Location of the shots and key frames of lecture video 1. (a) Location of the shots. (b) Shot
1 key frames: [41, 187, 323]. (c) Shot 2 key frames [434, 772, 1013, 1249]. (d) Shot 3 key frames
[1291, 1345, 1394, 1464]
To assess the effectiveness of the proposed algorithm, the results obtained using the proposed methodology are compared with those obtained using six different state-of-the-art techniques. The proposed algorithm is implemented in MATLAB and run on a Pentium D 2.8 GHz PC with 2 GB RAM and the Windows 8 operating system. Experiments are carried out on several lecture sequences; for illustration, we provide results on seven test sequences. This section is further divided into two parts: (i) analysis of results and (ii) discussions and future works. The former presents a detailed visual illustration of the different sequences; the latter presents a quantitative analysis of the results and discusses the proposed scheme and future works.
Fig. 4 Key frames for lecture video 1 [187, 1013, 1291, 1394] out of 1497 frames and three shots
Four key frames are extracted from lecture video 1, given by the frame numbers [187, 1013, 1291, 1394]. These extracted key frames are shown in Fig. 4. The corresponding visual illustrations of the radiometric correlation and the extracted key frames are shown in Figs. 2 and 3.

Similarly, the radiometric correlation values for lecture video 2 are computed; the resulting graph is shown in Fig. 5. It may be observed that one shot boundary, i.e., two shots, is detected. Peak and valley analysis reveals four major peaks in shot 1 and three major peaks in shot 2. The red marks in Fig. 5 show the maxima (peaks) selected as key frames, given by the frame numbers [18, 78, 143, 228, 277, 389, 467]. However, many of the key frames selected at this stage have large mutual correlation; hence, after refinement (as discussed in Sect. 2.4), we obtained the two key frames shown in Fig. 6.
The third example considered in our experiments is the lecture video 3 sequence. The radiometric correlation plot and the corresponding entropy value plot for this sequence are shown in Fig. 7. The automated thresholding scheme on the entropic plot produces three shots for this sequence. The key frame extraction process yields 11 key frames; after pruning, six key frames are retained, as shown in Fig. 8.
Similar experiments are conducted on other sequences to validate our results. The fourth example is the lecture video 4 sequence. The radiometric correlation plot with the corresponding entropy values is shown in Fig. 9, and the key frames extracted from this sequence are shown in Fig. 10. This sequence contains a total of four shots; after the proposed pruning mechanism, a total of four key frames are detected. It may be noted that this video contains views with camera movement/jitter; however, the proposed scheme handles this without false detections. A detailed discussion with an example of camera jitter is provided in Sect. 3.2.

The next example considered is the lecture video 5 sequence, whose radiometric correlation plot and selected key frames are shown in Figs. 11 and 12. This sequence has several instances of fade-in and fade-out. Nevertheless, the proposed scheme is able to identify the exact number of key frames as two. A detailed analysis with an example is also provided in Sect. 3.2.
Fig. 5 Radiometric correlation, corresponding entropy plots, and extracted key frames from each
shot for lecture video 2 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Location of
the shots. (d) Key frames of shot 1 [18, 78, 143, 228]. (e) Key frames of shot 2 [277, 389, 467]
The next examples considered are the lecture video 6 and lecture video 7 sequences. The entropic value plots with shot categorization and the selected key frames are provided in Figs. 13, 14, 15, and 16. In these two sequences, the scene undergoes
Fig. 6 Key frames for lecture video 2 [143, 389] out of 505 frames and two shots
zoom-in and zoom-out conditions. However, this does not affect the results of the proposed scheme.

All the sequences considered in our experiments validate the proposed scheme under different challenging scenarios: camera motion, scenes with different subtopics, and camera zoom-in and zoom-out. A detailed analysis of these scenarios is provided in the next section.
In this section, we provide a quantitative analysis of the results together with brief discussions of the advantages, disadvantages, and other issues related to the proposed work. The efficiency of the algorithm is assessed through key frame extraction quality and computational complexity. The computational times for the proposed and existing algorithms are given in Tables 1 and 2. From these tables, it can be observed that the proposed algorithm takes more computational time than PWPD, CHBA, and ECR, but those algorithms perform poorly in key frame extraction compared to the proposed algorithm. The key frame extraction results of the other existing algorithms, such as LTD, KS-SIFT, and RPCA, are similar to those of the proposed algorithm, but the proposed algorithm requires much less computational time. We can therefore conclude that the proposed algorithm outperforms the others in key frame extraction with lower computational complexity.
It is worth mentioning that shot boundary identification from a lecture sequence is a challenging task. The similarity among the frames of the video contains a large amount of uncertainty due to variation in color, artificial effects like fade-in and fade-out, illumination changes, object motion, camera jitter, and zooming and shrinking. The proposed scheme provides better results for all of these scenarios, and its performance in each of them is discussed with examples as follows.

Figure 17 shows two examples of the motion blur condition. The two examples depict two frames from two different shots of lecture video 2 and lecture video
Fig. 7 Radiometric correlation, corresponding entropy plots, and extracted key frames from each
shot for lecture video 3 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Location of
the shots. (d) Key frames of the shot 1 [7, 90, 150, 231]. (e) Key frames of the shot 2 [241, 344,
429]. (f) Key frames of the shot 3 [481, 562, 634, 706]
Fig. 8 Key frames for lecture video 3 [7, 150, 241, 429, 481, 706]
single shot. This happens due to the capability of HOG features used in the proposed
scheme.
Figure 18 shows an example of shots with fade-in and fade-out conditions. The other schemes are unable to represent each sequence as two shots; instead, they detect three different shots: the texts on the board, the professor, and the fade-in/fade-out frames. The proposed scheme, however, correctly represents each sequence as two shots. This is due to the capability of the entropic measures, which diminish the effects of variation in the radiometric correlation measure.
Figure 19 shows an example from the lecture video 4 sequence in which the sequence undergoes camera jitter and movement. These frames are detected as being part of a single shot due to the effective application of the HOG feature with the radiometric similarity measure.
A similar analysis is made on the lecture video 4 and lecture video 6 sequences with zoomed-in and zoomed-out conditions (shown in Fig. 20). The view variation caused by camera zoom-in and zoom-out is likewise detected as a single shot by the proposed scheme, as shown in Fig. 20. This is thanks to the integration of radiometric similarity with entropic measures, which deals with real-life uncertainty and efficiently detects the shot transitions in the considered challenging scenarios. Figure 21 shows another example, with noise. The proposed scheme does not split these frames into different shots, whereas the existing techniques fail to keep them together.
From the above analysis, we find that the proposed scheme provides better results against variation in color, artificial effects like fade-in and fade-out, illumination changes, object motion, camera jitter, zooming and shrinking, and noisy video scenes. It is to be noted that most of the false key frame detections by the other schemes considered for comparison in Tables 1 and 2 are due to the abovesaid effects. The effectiveness of the proposed scheme can be explained in two phases. In the first phase, the HOG features preserve the shape information of a given lecture video, including details of the texts on the board, drawings, slides, pictures, the teaching professor, etc. It is also to be noted that, as reported in the literature, the HOG feature
Fig. 9 Radiometric correlation, corresponding entropy plots, and extracted key frames from each
shot for lecture video 4 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Key frames
of the shot 1 and 2 [5, 127, 351, 494]. (d) Key frames of the shot 3 and 4 [502, 647, 808, 922,
1018]
is found to provide good results against illumination changes, motion blur, and noisy video scenes. This is quite evident from Figs. 17, 18, 19, 20, and 21. In the second phase, the radiometric similarity between the frames is computed, and the variation in it is reduced by mapping it to an entropic scale. This minimizes false key frame detections and is effective against fade-in, fade-out, zoomed-in and zoomed-out, and other irrelevant effects in the video.
50 T. Veerakumar et al.
Fig. 10 Key frames for lecture video 4 [127, 494, 647, 1018]
Fig. 11 Radiometric correlation, corresponding entropy plots, and extracted key frames from each
shot for lecture video 5 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Key frames
of the shots 1, 2, and 3 [2, 245, 487, 721, 776, 909]
A few parameters used in the proposed scheme need further discussion. One of the important parameters used in this article is the one-dimensional window size, i.e., the neighborhood used for computing the entropy from the radiometric similarity plot. In the proposed scheme, we have used a fixed window
Fig. 13 Radiometric correlation, corresponding entropy plots, and extracted key frames from each
shot for lecture video 6 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Key frames
of the shots 1 and 2 [2128, 481, 1422]
of size (7 × 1) for all the considered video sequences. However, a variable-sized window may also be considered. For all the considered sequences, the choice of window size may affect the performance of the proposed scheme. If the number of frames in a particular video is high and a small window is chosen, many false shot transitions will be detected; if the number of frames is low and a large window is chosen, a few shot transitions may be missed. A tabular representation of the performance of the proposed scheme on all the considered sequences with different window sizes is provided in Table 3. The proposed scheme is tested with window sizes (11 × 1), (9 × 1), (7 × 1), (5 × 1), and (3 × 1), and the number of key frames detected by each window size is presented in Table 3. It is also observed from this
Fig. 15 Radiometric correlation, corresponding entropy plots, and extracted key frames from each
shot for lecture video 7 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Key frames
of the shots 1, 2, and 3 [243, 824, 1236, 1618, 2015, 4143]
Fig. 16 Key frames for lecture video 7 [243, 1236, 2015, 4143]
table that the window sizes (5 × 1), (7 × 1), and (9 × 1) give almost the same results for most of the sequences in terms of the number of key frames detected, and the results obtained with the (7 × 1) and (9 × 1) windows are identical. Hence, averaging the results obtained by manual trial and error infers that a (7 × 1) window provides an acceptable result, so we fixed the window size at (7 × 1). It is to be noted that all experiments are performed on a frame size of 320 × 240.
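The windowed entropy computation discussed above can be sketched as follows (a simplified illustration assuming correlation values in [0, 1]; the function name, bin count, and boundary handling are assumptions, not the authors' exact formulation):

```python
import numpy as np

def local_entropy(corr, w=7, bins=10):
    """Shannon entropy of radiometric-correlation values inside a sliding (w x 1) window."""
    half = w // 2
    out = np.zeros(len(corr))
    for i in range(len(corr)):
        win = corr[max(0, i - half): i + half + 1]       # 1-D neighborhood of size <= w
        p, _ = np.histogram(win, bins=bins, range=(0.0, 1.0))
        p = p / p.sum()                                   # empirical pdf of the window
        nz = p[p > 0]
        out[i] = -(nz * np.log2(nz)).sum()                # entropy in bits
    return out

corr = np.ones(50) * 0.9
corr[25:] = 0.2              # abrupt drop in similarity, e.g., at a shot boundary
ent = local_entropy(corr)    # entropy peaks only where the window straddles index 25
```

Frames inside a shot yield near-constant correlation (entropy close to zero), while windows straddling a boundary mix two correlation levels and produce an entropy peak, which is what the thresholding exploits.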
In this article, all the results reported in Figs. 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, and 21 and Tables 1, 2, and 3 for comparison with the proposed scheme were produced by the authors in their working lab using MATLAB. The codes for all the considered techniques were implemented in an optimized manner so as to validate the proposed scheme on the same scale.
Table 1 Comparison of different lecture videos with existing algorithms. For each method and video, the number of key frames, the key frame indices, and the computational time (CT, in seconds) are listed.

Lecture video 1 (# frames = 1497)
PWPD [2]: 4 key frames [246, 771, 1024, 1295], CT 182.84
CHBA [4]: 4 key frames [246, 771, 1024, 1295], CT 456.93
ECR [14]: 7 key frames [246, 771, 1024, 1295, 1314, 1321, 1326], CT 985.66
LTD [5]: 6 key frames [186, 558, 838, 943, 1189, 1266], CT 1208.36
KS-SIFT [16]: 6 key frames [186, 558, 751, 943, 1076, 1189], CT 1319.57
RPCA [17]: 7 key frames [186, 558, 838, 943, 1076, 1189, 1266], CT 1328.66
Proposed: 4 key frames [186, 772, 1289, 1392], CT 845.36

Lecture video 2 (# frames = 505)
PWPD [2]: 1 key frame [240], CT 13.45
CHBA [4]: 5 key frames [90, 134, 180, 240, 270], CT 34.99
ECR [14]: 2 key frames [12, 240], CT 31.29
LTD [5]: 5 key frames [79, 143, 277, 389, 467], CT 68.94
KS-SIFT [16]: 5 key frames [9, 143, 223, 389, 467], CT 72.37
RPCA [17]: 6 key frames [9, 143, 223, 277, 389, 467], CT 75.22
Proposed: 2 key frames [143, 389], CT 45.96

Lecture video 3 (# frames = 737)
PWPD [2]: 2 key frames [240, 480], CT 66.02
CHBA [4]: 10 key frames [66, 180, 240, 273, 303, 420, 480, 531, 600, 681], CT 524.19
ECR [14]: 9 key frames [130, 240, 480, 601, 605, 606, 609, 613, 642], CT 560.33
LTD [5]: 7 key frames [66, 150, 240, 429, 480, 642, 706], CT 665.87
KS-SIFT [16]: 7 key frames [66, 180, 273, 420, 480, 681, 706], CT 705.69
RPCA [17]: 7 key frames [10, 150, 240, 429, 481, 600, 706], CT 719.02
Proposed: 6 key frames [7, 150, 241, 429, 481, 706], CT 410.73

Lecture video 4 (# frames = 1025)
PWPD [2]: 5 key frames [127, 394, 847, 981, 1018], CT 255.22
CHBA [4]: 5 key frames [112, 506, 647, 901, 1001], CT 502.81
ECR [14]: 8 key frames [102, 409, 647, 709, 811, 905, 992, 1013], CT 1027.92
LTD [5]: 7 key frames [27, 323, 419, 647, 709, 899, 1001], CT 1278.45
KS-SIFT [16]: 5 key frames [127, 480, 617, 712, 1022], CT 1899.14
RPCA [17]: 8 key frames [102, 399, 619, 700, 833, 909, 999, 1020], CT 1928.83
Proposed: 4 key frames [127, 494, 647, 1018], CT 1021.22
The proposed scheme is mainly designed for lecture video segmentation, i.e., shot boundary detection in lecture video sequences. It is to be noted that a lecture sequence mostly has two or three different kinds of frames or shots, which include the face of the professor, written texts/slides, and the hand of the professor. The transitions between these shots in a video typically occur in the order face of the professor to hand, hand to texts on the board, board to hand, and hand back to the professor's face. In a few cases, the view may change from text to the face of the professor and back to the text. Hence, before the start of a new topic or subtopic, most videos undergo a transition from the old topic/subtopic to the face of the professor to the new topic/subtopic, and the proposed scheme detects these as three different shots.
However, in rare cases, the transition may occur directly from the old topic/subtopic to the new topic/subtopic. In this case, it is difficult to identify the shot transition. Segments of the radiometric correlation plot and the corresponding entropic values for a shot containing a combination of two subtopics are shown in Fig. 22. The proposed scheme fails to separate the two contents into two different shots because the significant change in scene view is not reflected in the radiometric similarity; hence, the entropic plot fails to distinguish them. One way to solve this issue is to split the radiometric correlation plot into different parts and compute the entropy values locally at each location. Figure 23 shows such an example, where the entropic plot becomes easily separable at the change of topic/subtopic in the video. This is a preliminary result, and the splitting of the radiometric plot is currently manual; in the future, we would like to work further on this issue. The proposed scheme mainly identifies gradual shot transitions; in the future, we would also like to develop techniques that can determine soft transitions.
Fig. 19 Detected as part of a single shot with camera movements and jitter
Fig. 20 Detected as a single shot for zoomed-in and zoomed-out conditions with view variation for different videos
Fig. 22 Radiometric
correlation plot and entropic
values for a shot with a
combination of two subtopics
4 Conclusions
In this article, a shot boundary detection and key frame extraction technique for lecture video sequences is proposed, using an integration of HOG and radiometric correlation with an entropy-based thresholding scheme. In the proposed approach,
Fig. 23 Radiometric
correlation plot and obtained
split entropic values
the advantages of the HOG feature are explored to describe each frame effectively. The similarities between the n-dimensional HOG features extracted from consecutive image frames are compared using the radiometric correlation measure. The radiometric correlation over the complete video is found to contain a significant amount of uncertainty due to variations in color, illumination, and camera and object motion. To deal with these uncertainties, entropic thresholding is applied to it to find the shot boundaries. After detection of the shot boundaries, the key frames of each shot are obtained by analyzing the peaks and valleys of the entropy-associated pdf of the radiometric correlation measures. The proposed scheme is tested on several lecture sequences, and results on seven lecture video sequences are reported here. The results obtained by the proposed scheme are compared against six existing state-of-the-art techniques in terms of computational time and shot detection, and the proposed scheme is found to be better.
References
1. Hu, W., Xie, N., Li, L., Zeng, X., & Maybank, S. (2011). A survey on visual content-based
video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C:
Applications and Reviews, 41, 797–819.
2. Zhang, H. J., Kankanhalli, A., & Smoliar, S. W. (1993). Automatic partitioning of full-motion
video. ACM/Springer Multimedia System, 1, 10–28.
3. Huang, C. L., & Liao, B. Y. (2001). A robust scene-change detection method for video
segmentation. IEEE Transactions on Circuits and Systems for Video Technology, 11, 1281–
1288.
4. Boreczky, J. S., & Rowe, L. A. (1996). Comparison of video shot boundary detection
techniques. Proceedings of SPIE, 2670, 170–179.
5. Grana, C., & Cucchiara, R. (2007). Linear transition detection as a unified shot detection
approach. IEEE Transactions on Circuits and Systems for Video Technology, 17, 483–489.
6. Patel, N. V., & Sethi, I. K. (1997). Video shot detection and characterization for video
databases. Pattern Recognition, 30, 583–592.
7. Cernekova, Z., Pitas, I., & Nikou, C. (2006). Information theory-based shot cut/fade detection
and video summarization. IEEE Transactions on Circuits and Systems for Video Technology,
16, 82–91.
8. Lee, M. H., Yoo, H. W., & Jang, D. S. (2006). Video scene change detection using neural
network: Improved ART2. Expert Systems and Applications, 31, 13–25.
9. Cooper, M., & Foote, J. (2005). Discriminative techniques for keyframe selection. In Proceed-
ings of ICME (pp. 502–505). Amsterdam, The Netherlands.
10. Haoran, Y., Rajan, D., & Chia, L. T. (2006). A motion-based scene tree for browsing and
retrieval of compressed video. Information Systems, 31, 638–658.
11. Cooper, M., Liu, T., & Rieffel, E. (2007). Video segmentation via temporal pattern classifica-
tion. IEEE Transactions on Multimedia, 9, 610–618.
12. Duan, F. F., & Meng, F. (2020). Video shot boundary detection based on feature fusion and
clustering technique. IEEE Access, 8, 214633–214645.
13. Abdulhussain, S. H., Ramli, A. R., Mahmmod, B. M., Saripan, M. I., Al-Haddad, S. A. R., &
Jassim, W. A. (2019). Shot boundary detection based on orthogonal polynomial. Multimedia
Tools and Applications, 78(14), 20361–20382.
14. Lei, S., Xie, G., & Yan, G. (2014). A novel key-frame extraction approach for both video
summary and video index. The Scientific World Journal, 1–9.
15. Bendraou, Y., Essannouni, F., Aboutajdine, D., & Salam, A. (2017). Shot boundary detection
via adaptive low rank and SVD-updating. Computer Vision and Image Understanding, 161,
20–28.
16. Barbieri, T. T. S., & Goularte, R. (2014). KS-SIFT: a keyframe extraction method based on
local features. In IEEE International Symposium on Multimedia (pp. 13–17). Taichung.
17. Dang, C., & Radha, H. (2015). RPCA-KFE: Key frame extraction for video using robust
principal component analysis. IEEE Transactions on Image Processing, 24, 3742–3753.
18. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection.
Proceedings of CVPR, 1, 886–893.
19. Spagnolo, P., Orazio, T. D., Leo, M., & Distante, A. (2006). Moving object segmentation by
background subtraction and temporal analysis. Image and Vision Computing, 24, 411–423.
20. Zabih, R., Miller, J., & Mai, K. A. (1995). A feature-based algorithm for detecting and
classifying scene breaks. In Proceedings of ACM Multimedia (pp. 189–200). San Francisco,
CA.
21. Singh, A., Thounaojam, D. M., & Chakraborty, S. (2020, June). A novel automatic shot
boundary detection algorithm: Robust to illumination and motion effect. Signal, Image Video
Process., 14(4), 645–653.
22. Subudhi, B. N., Veerakumar, T., Esakkirajan, S., & Chaudhury, S. (2020). Automatic lecture
video skimming using shot categorization and contrast based features. Expert Systems with
Applications, 149, 113341.
23. Shen, R. K., Lin, Y. N., Juang, T. T. Y., Shen, V. R. L., & Lim, S. Y. (2018, March). Automatic
detection of video shot boundary in social media using a hybrid approach of HLFPN and
keypoint matching. IEEE Transactions on Computational Social Systems, 5(1), 210–219.
Detection of Road Potholes Using
Computer Vision and Machine Learning
Approaches to Assist the Visually
Challenged
1 Introduction
It can be challenging for blind people to move around different places indepen-
dently. The presence of potholes, curbs, and staircases is a hindrance for blind
people to travel to various places freely without having to rely on others. The
necessity of identifying potholes, curbs, and other obstacles on the pathway has led many researchers to build smart systems to assist blind people. Various smart systems, incorporated into walking sticks, wearable devices, etc., have been proposed to achieve the aim of pothole detection for blind users.
The proposed system is a vision-based experimental study that employs machine
learning classification with computer vision techniques and a deep learning object
detection model to detect potholes with improved precision and speed. In the machine learning classification with computer vision approach, the images are preprocessed, and feature extraction methods such as HOG (Histogram of Oriented Gradients) and LBP (Local Binary Pattern) are applied, under the assumption that fusing the HOG and LBP feature vectors will improve the classification performance. Various classification models are implemented and compared using performance evaluation metrics and methodologies. The process is extended to pothole localization for the images that are classified as pothole images, and the hypothesis, i.e., that the fusion of feature extraction methods improves the performance of the classification model, is verified. The second approach is
pothole detection using a deep learning model. Through the years, deep learning
has proven to provide reliable solutions to real-world problems involving computer
vision and image analysis. The convolutional neural network in deep learning plays
a vital role in extracting features and classifying the data precisely. In this approach,
YOLO v3 model is implemented for the pothole detection system. The results of
the detection of potholes are analyzed, and the efficiency of the proposed system in
outdoor real-time navigation for visually challenged people is studied.
2 Related Works
Mae M. Garcillanosa et al. [1] implemented a system to detect and report the
presence of potholes using image processing techniques. The system was installed
in a vehicle with a camera and Raspberry Pi that will monitor the pavements. The
processing was performed on the real-time video at a rate of 8 frames per second.
Canny edge detection, contour detection, and final filtering were carried out on each
video frame. The location and image of the pothole are captured when the pothole is
detected, which can later be viewed. The system achieved an accuracy of 93.72% in
pothole detection, but an improvement was required in recognizing the normal road
conditions. The total processing time was 0.9967 seconds for video frames with
the presence of potholes and 0.8994 seconds for video frames with normal road
conditions.
Aravinda S. Rao et al. [2] proposed a system to detect potholes, staircases,
and curbs using a systematic computer vision algorithm. An Electronic Travel Aid
(ETA) equipped with a camera and a laser was employed to capture the pathway. The
camera was mounted on the ETA with an angle of 30◦ –45◦ between the camera and
the vertical axis and a distance of 0.5 meters between the camera and the pathway.
The Canny edge detection algorithm and Hough transform were used to process
each frame in the video to detect the laser lines. The output of the Hough transform
that depicts the number of intersecting lines was transformed into the Histogram of
Intersections (HoI) feature. The Gaussian Mixture Model (GMM) learning model
was utilized to detect whether the pathway is safe or unsafe. The system gave an
accuracy of over 90% in detecting the potholes. Since the system uses laser patterns
to identify the potholes, it can only be used during the nighttime.
Kanza Azhar et al. [3] proposed a system to detect the presence of potholes for
proper maintenance of the roadways. For classifying pothole/non-pothole images,
HOG (Histogram of Oriented Gradients) representation of the input image was
generated. The HOG feature vector was provided to the Naïve Bayes classifier as
it has higher scalability and strong independent nature. For the images classified
as an image containing pothole(s), localization of pothole(s) was carried out using
a technique called graph-based segmentation using normalized cut. The system
attained an accuracy of 90%, precision of 86.5%, recall of 94.1%, and a processing
time of 0.673 seconds.
The core idea of the research work by Muhammad Haroon Yousaf et al. [4]
is to detect and localize the potholes in an input image. The input image was
converted from RGB color space to grayscale and resized to 300 × 300 pixels.
The system was implemented using the following steps: feature extraction, visual
portable device attached to the foldable walking stick was assembled with the
following: ATmega328 8-bit microcontroller, HC-SR04 ultrasonic distance sensor,
signal conditioner, pressure sensor, speaker, android device, walking stick, buzzer
(piezoelectric), and power supply (using Li-Ion rechargeable batteries - 2500 mAh,
AA1.2V X 4). The pressure sensor was attached to the bottom end of the walking
stick. When the user strikes the walking stick on the ground, the reading is taken
from the pressure sensor as well as the ultrasonic sensor. Using the Pythagoras
theorem, a predefined value is set for the value that would be sensed by the ultrasonic
sensor. If the currently sensed value exceeds the predefined value, the system informs the user of the presence of obstacles like potholes; if the currently sensed value is less than the predefined value, the system informs the user of the presence of obstacles like speed breakers. The sensitivity of object detection exceeded 96%.
It can be noted from the previous works that there is scope for improvement in detection accuracy as well as processing speed, and that false-negative outcomes in the detection results can be reduced. Most of the related works are targeted at
periodic assessment and maintenance of the roadways in which the system takes
high runtime, whereas the pothole detection for the visually challenged requires
the system to perform with high speed and accuracy that swiftly alerts the user
if any pothole is detected. Thus, the main idea behind the proposed approach is
to develop a precise, fast pothole detection system that is effective and beneficial
for the visually challenged. Two approaches (machine learning algorithm with
computer vision techniques and a deep learning model) were implemented using
suitable machine learning and deep learning models for real-time pothole detection.
The system is trained with pothole images of various shapes and textures to provide
a broad solution. In the case of machine learning algorithm and computer vision
approach, the system performs localization of pothole region only if the image
is classified as a pothole image. This step helps in improving the computational
efficiency of the system as it reduces the number of false-negative outcomes by the
system.
3 Methodologies
Mx = [−1 0 1] (1)

My = [−1 0 1]T (2)

Equations 3 and 4 determine the gradient magnitude "g" and the gradient angle "θ":

g = √(gx² + gy²) (3)

θ = tan⁻¹(gy/gx) (4)
Assume that the images are resized to a standard size of 200 × 152 pixels and that the parameters pixels per cell, cells per block, and number of orientations are set to (8, 8), (2, 2), and 9, respectively. Each image is thereby divided into 475 (25 × 19) nonoverlapping cells of 8 × 8 pixels. In each cell, the magnitude values of the 64 pixels are accumulated into a nine-bin gradient orientation histogram (Fig. 2).
Fig. 2 Gradient orientation histogram of a cell (gradient magnitude vs. orientation in degrees)
A block of 2 × 2 cells is slid across the image. In every block, the corresponding
histogram of each of the four cells is normalized into a 36 × 1 element vector. This
process is repeated until the feature vector of the entire image is computed. The prime benefit of the HOG feature descriptor is its capability of extracting basic yet meaningful information about an object, such as its shape and outline. It is simpler and faster to compute, though less powerful, than deep learning object detection models.
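The cell and block computations described above can be sketched in plain NumPy (a simplified illustration of the standard HOG pipeline under the stated parameters, not the authors' implementation; the function name and the small stabilizing constant are assumptions):

```python
import numpy as np

def hog_features(img, cell=8, bins=9, block=2):
    """Simplified HOG: 1-D derivative masks, per-cell orientation histograms, block normalization."""
    img = img.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]       # horizontal mask [-1 0 1]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]       # vertical mask [-1 0 1]^T
    mag = np.hypot(gx, gy)                        # gradient magnitude g
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180    # unsigned orientation in [0, 180)
    n_cy, n_cx = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, bins))
    bin_w = 180 / bins
    for i in range(n_cy):                         # magnitude-weighted 9-bin histogram per 8x8 cell
        for j in range(n_cx):
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            idx = np.minimum((a // bin_w).astype(int), bins - 1)
            np.add.at(hist[i, j], idx, m)
    feats = []
    for i in range(n_cy - block + 1):             # slide a 2x2-cell block across the image
        for j in range(n_cx - block + 1):
            v = hist[i:i+block, j:j+block].ravel()           # 36-element block vector
            feats.append(v / (np.linalg.norm(v) + 1e-6))     # L2 block normalization
    return np.concatenate(feats)

img = np.arange(152 * 200, dtype=float).reshape(152, 200) % 97  # stand-in 200 x 152 image
f = hog_features(img)   # (19-1) x (25-1) blocks x 36 values = 15552 features
```

For a 200 × 152 image with these parameters, the descriptor length works out to 18 × 24 blocks of 36 values each, i.e., 15552 features.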
LBP Feature Descriptor
Local Binary Patterns (LBP) feature descriptor is mainly used for texture classifica-
tion. To compute the LBP feature vector, neighborhood thresholding is computed for
each pixel in the image, and the existing pixel value is replaced with the threshold
result. For example, the image is divided into 3 × 3 pixel cells as shown in Fig.
3. The pixel value of the eight neighbors is compared with the value of the center
pixel (value = 160). If the value of the center pixel is greater than the pixel value
of the neighbor, the neighboring pixel takes the value “0”; otherwise, it is “1.” The
resultant eight-bit binary code is converted into a decimal number and stored as
the center pixel value. This procedure is implemented for all the pixels in the input
image. A histogram is computed for the image with the number of bins ranging
from 0 to 255 where each bin denotes the frequency of that value. The histogram is
normalized to obtain a one-dimensional feature vector.
The main advantages of the LBP feature descriptor are computational simplicity
and discriminative nature. Such properties in a feature descriptor are highly useful
in real-time settings.
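The per-pixel thresholding and encoding described above can be sketched in NumPy (a minimal 3 × 3 LBP sketch; the bit ordering of the neighbors and the function name are assumptions):

```python
import numpy as np

def lbp_histogram(img):
    """3x3 LBP: threshold the 8 neighbors against the center pixel, form an
    8-bit code per pixel, then build a normalized 256-bin histogram."""
    img = img.astype(int)
    c = img[1:-1, 1:-1]                     # center pixels (borders skipped)
    code = np.zeros_like(c)
    # neighbor offsets in a fixed clockwise order, each contributing one bit
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy: img.shape[0] - 1 + dy, 1 + dx: img.shape[1] - 1 + dx]
        code |= (nb >= c).astype(int) << bit   # neighbor >= center -> bit "1"
    hist = np.bincount(code.ravel(), minlength=256).astype(float)
    return hist / hist.sum()                # normalized 1-D feature vector

img = (np.arange(100).reshape(10, 10) * 7) % 256   # stand-in grayscale patch
h = lbp_histogram(img)
```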
68 U. Akshaya Devi and N. Arulanand
Accuracy = (TP + TN) / (TP + TN + FP + FN) (5)

Precision = TP / (TP + FP) (6)

Recall = TP / (TP + FN) (7)

F1 score = 2 × (precision × recall) / (precision + recall) (8)
where TP, TN, FP, FN corresponds to true positive, true negative, false positive, and
false negative instances, respectively. Accuracy is the number of correctly predicted
instances out of all the instances. Precision quantifies the number of positively
predicted instances that actually belong to the positive class. Recall quantifies the
number of correctly predicted positive instances made out of all positive instances
in the dataset. The F1 score, also called F-score or F-measure, provides a single score that balances both precision and recall; it can be described as a weighted average of precision and recall.
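Equations 5 to 8 translate directly into code; for instance (the confusion-matrix counts below are illustrative values, not results from the article):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts (Eqs. 5-8)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# 90 true positives, 85 true negatives, 15 false positives, 10 false negatives
acc, prec, rec, f1 = classification_metrics(90, 85, 15, 10)
# acc = 0.875, rec = 0.9
```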
In addition to the performance metrics, the AUC-ROC curve is plotted for binary
classifiers. The ROC curve (Receiver Operating Characteristic curve) is a probability curve that plots the TPR (true-positive rate, or recall) against the FPR (false-positive rate). The AUC (Area Under the Curve) score defines the capability of the model
to distinguish between the positive class and negative class. The score usually ranges
from 0.0 to 1.0 where a score of 0.0 denotes the inability of the model to distinguish
between positive/negative classes and a value of 1.0 denotes the strong ability of the
model to distinguish between positive/negative classes.
Localization of Potholes
The three steps in localization of potholes are pre-segmentation using k-means
clustering, construction of Region Adjacency Graph (RAG), and normalized graph
cut. In the pre-segmentation stage, the image is segmented using k-means clustering.
The result of this step will give the centroid of all the segmented clusters. In
the second step, the Region Adjacency Graph is constructed using mean colors.
The obtained clusters are represented as nodes where any two adjacent nodes are
separated by an edge in the RAG. The nodes that are similar in color are merged,
and the value of edges is set as the difference in the average of RGB of the adjacent
nodes. On the Region Adjacency Graph, a two-way normalized cut is performed
recursively as step 3. Thereby, the result will contain a set of nodes where any two
points in the same node have a high degree of similarity and any two points in
different nodes have a high degree of dissimilarity. As a result, the pothole region
can be clearly differentiated from the other regions of the image.
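The pre-segmentation step (step 1) can be sketched as a plain NumPy k-means over pixel colors; this illustrates only the clustering idea, while the RAG construction and normalized cut are typically handled by a library such as scikit-image. The function name, iteration count, and synthetic colors are assumptions:

```python
import numpy as np

def kmeans_colors(pixels, k=3, iters=10, seed=0):
    """Cluster RGB pixel vectors with k-means; returns per-pixel labels and centroids."""
    rng = np.random.default_rng(seed)
    # initialize centroids from randomly chosen pixels
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # distance of every pixel to every centroid, then nearest-centroid assignment
        d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = pixels[labels == c].mean(axis=0)   # recompute centroid
    return labels, centers

# three well-separated synthetic color groups (e.g., dark road, pothole shadow, marking)
pixels = np.vstack([np.full((40, 3), v) for v in (20.0, 120.0, 230.0)])
labels, centers = kmeans_colors(pixels)
```

The returned centroids correspond to the mean colors used when building the Region Adjacency Graph in step 2.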
YOLO V3 Model
The YOLO (You Only Look Once) object detection algorithm is based on a single deep convolutional neural network; YOLO v3 uses the Darknet-53 architecture. The YOLO v3 model treats detection as a single regression problem, in which one neural network predicts bounding boxes and class probabilities directly from the full image.
Fig. 4 YOLO v3 model (reproduced from Joseph Redmon et al. 2016) [12]
For each bounding box, values of x, y, width, height, and confidence score are
predicted. The x and y values represent the center coordinates of the bounding
box with respect to the grid cell. The product of conditional class probabilities
(P(Classi |Object)) and the individual bounding box confidence scores gives the
confidence scores of each class in the bounding box. This score indicates the
probability of the presence of a class in the box and how well the predicted box
fits the object. The main advantages of the YOLO object detection algorithm are the
fast processing of images in real-time and low false detections.
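The score computation described above is a simple elementwise product; for example (the probabilities below are made-up values, not outputs of a trained network):

```python
import numpy as np

# conditional class probabilities P(Class_i | Object) for one grid cell (assumed values)
class_probs = np.array([0.7, 0.2, 0.1])
# bounding-box confidence Pr(Object) * IOU for one predicted box (assumed value)
box_confidence = 0.9
# class-specific confidence scores for that box
class_scores = class_probs * box_confidence   # -> [0.63, 0.18, 0.09]
```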
4 Implementation
The proposed work was implemented on Intel Core i5 1.60 GHz CPU with 8 GB
RAM. To implement the machine learning models with computer vision techniques,
the Jupyter Notebook Web application was used to write and execute the Python code. A pothole detection dataset from Kaggle was used as dataset 1 [10]. The size
of the dataset was 197 MB containing 320 pothole images and 320 non-pothole
images (640 images in total). Dataset 2 was created manually (using Google image
Fig. 5 (a) Selected region of interest (ROI). (b) After setting the pixels of the region external to
the ROI to 0
search) with 504 images consisting of 252 pothole images and 252 non-pothole
images. The size of the dataset was 14.5 MB. OpenCV (Open-Source Computer
Vision Library) is an open-source library that is mainly used for programming real-
time applications that involve image processing and computer vision models. In this
work, the OpenCV library was used to read an image from the source directory,
convert it from RGB to grayscale, resize the image, and filter the image.
To ensure that all the images have a standard size, the scale of the images was
resized to 128 × 96 pixels. Since the images will be divided into 8 × 8 patches
during the feature extraction stage, a size of 128 × 96 pixels is preferable. The RGB images in the dataset contain three layers of pixel values ranging from 0 to 255, which is computationally expensive to process. Thus, RGB-to-grayscale conversion was performed to reduce the computational complexity. The Gaussian filter
(Gaussian blur) is a widely used image filtering technique to reduce noise and
intricate details. This low-pass blurring filter that smooths the edges and removes
noise from an image is considered to be efficient for thresholding, edge detection,
and finding contours in an image. Thus, it will improve the efficiency of the pothole
localization procedure during region clustering and construction of the Region
Adjacency Graph (RAG). A Gaussian filter of kernel size 5 was applied to the image.
The pathway/road in the input image is the only required portion to determine the
presence of potholes (Fig. 5a). The remaining portion of the image was selected as a
polygonal region, and the pixel values were set to 0 (Fig. 5b). Therefore, the portion
of the road/pathway was selected as the region of interest. The HOG features and
a fusion of HOG and LBP features were extracted. These features were applied to
various machine learning classifiers to classify the pothole and non-pothole images.
The Adaboost, Gaussian Naïve Bayes, Random Forest, and Support Vector
Machine algorithms were selected for the proposed work. The train/test set was
split in the ratio of 70:30. To find optimum parameters for the classifiers, the grid
search algorithm was used. The grid search algorithm chooses the hyperparameters
by employing exhaustive search on the set of parameters given for the classification
model. It estimates the performance for every combination of the given parameters
and chooses the best performing combination of hyperparameters. The RBF (radial
basis function) kernel SVM was selected using the grid search method. The values of
hyperparameters c and gamma were set to 100 and 1, respectively. For the Random
Forest classifier, the values of hyperparameters such as n_estimators (total number
of trees), criterion, max_depth (maximum depth of the tree), min_samples_leaf
(minimum number of instances needed to be at a leaf node), and min_samples_split
(minimum number of instances needed to split an internal node) were set to
100, “gini” (Gini impurity), 5, 5, 5 respectively. For the Adaboost classifier, the
values of hyperparameters such as n_estimators (maximum number of estimators
required) and learning rate were set to 200 and 0.2, respectively. The machine
learning classifiers were evaluated using a cross-validation method. Subsequently,
the pothole localization was performed for the positively predicted images.
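The exhaustive search just described can be sketched as a small pure-Python routine. The scoring function and the candidate values other than the chosen C = 100 and gamma = 1 are assumptions for illustration; in practice sklearn's GridSearchCV performs this search with cross-validation:

```python
import itertools

def grid_search(param_grid, score_fn):
    """Evaluate every hyperparameter combination exhaustively and keep the best,
    mirroring what a grid search does internally."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)          # e.g. cross-validated accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# A plausible grid for the RBF-kernel SVM; only C = 100 and gamma = 1
# (the values chosen in the text) are taken from the chapter.
svm_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
```

The cost is the product of the candidate-list lengths (here 4 × 3 = 12 model fits per cross-validation fold), which is why grids are usually kept small.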
To implement the deep learning model, a Google Colaboratory notebook with
a single GPU was utilized. The size of the initial dataset was 270 MB with 1106
labeled pothole images [11]. Image data augmentation techniques, which process
and modify the original image to create variations of it, were employed on the
images of the initial dataset. Techniques such as horizontal flip, change of
image contrast, and incorporation of Gaussian noise were adopted to synthetically
expand the size of the dataset. The resultant images of various data augmentation
operations are shown in Fig. 6.
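The three augmentation operations can be sketched in NumPy as follows; the contrast factor and the noise standard deviation are illustrative values, not parameters reported in the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

def horizontal_flip(img):
    """Mirror the image left-right."""
    return img[:, ::-1].copy()

def change_contrast(img, factor=1.5):
    """Scale pixel values about the mid-gray level and clip to [0, 255]."""
    out = (img.astype(float) - 128.0) * factor + 128.0
    return np.clip(out, 0, 255).astype(np.uint8)

def add_gaussian_noise(img, sigma=10.0):
    """Add zero-mean Gaussian noise and clip back to the valid pixel range."""
    noisy = img.astype(float) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Applying each transform to every original image is how a 1106-image set can be expanded toward the 5000 images used for training.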
The data augmentation methods benefit the deep learning model as larger training
data leads to an enhanced generalization of the neural network, reduction of
overfitting, and improvement in real-time detections. The dataset obtained after data
augmentation was 773 MB in size with 2500 pothole images and 2500 non-pothole
images. The size of the input images was 416 × 416 pixels. The object labels in each
image were represented using a text file containing five parameters: object class,
x-center, y-center, width, and height. The object class is an integer assigned to
each object, with values ranging from 0 to (number of classes − 1). The x-center,
y-center, width, and height are float values relative to the width and height of the
image. The dataset was split into train/test sets in a ratio of 70:30. The number
of iterations was set to 6000, and the batch sizes for training and testing were set
to 64 and 1, respectively. Performance metrics such as precision, recall, F1 score,
mean Average Precision (mAP), and prediction time were measured on the test set.
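The label format described above can be produced from a pixel-space bounding box with a small helper (a hypothetical function written for illustration, not part of any YOLO tooling):

```python
def to_yolo_label(obj_class, xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel-space bounding box to the YOLO text-file format:
    class index plus x-center, y-center, width, height, all normalized to [0, 1]."""
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{obj_class} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
```

Because the five values are fractions of the image dimensions, the same label file remains valid if the image is rescaled before training.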
5 Result Analysis
In the machine learning and computer vision approach, the classification report
comprising accuracy, precision, recall, and F1 score was generated and tabulated
(Table 1) for all the models. To estimate the classification models accurately, the k-
fold cross-validation method was utilized. In k-fold cross-validation, the dataset is
divided into k equal-sized partitions. The classifier is trained on k−1 partitions, and
the remaining partition is used for testing; the scores of the k runs are averaged and
used for performance estimation. In this work, the machine learning models were
evaluated using the 10-fold cross-validation method.
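The k-fold procedure can be sketched in plain Python (sklearn's KFold and cross_val_score provide the same behavior in practice; the helper names are ours):

```python
def k_fold_indices(n_samples, k=10):
    """Split sample indices into k roughly equal partitions; each partition
    serves once as the test fold while the remaining k-1 form the training fold."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        test = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

def cross_val_mean(scores):
    """Average the k per-fold scores into a single performance estimate."""
    return sum(scores) / len(scores)
```

Every sample appears in exactly one test fold, so the averaged score uses the whole dataset for evaluation without ever testing on training data.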
Detection of Road Potholes Using Computer Vision and Machine Learning. . . 73
Fig. 6 (a) Original image and the resultant images of (b) horizontal flip, (c) contrast change, and
(d) Gaussian noise addition
Table 1 Classification performance report for different feature sets as input (dataset 1)

                 HOG features                            Combination of HOG and LBP features
ML classifiers   Accuracy  Precision  Recall  F1 score   Accuracy  Precision  Recall  F1 score
Adaboost         86.66%    87%        87%     87%        96.66%    97%        97%     97%
Naïve-Bayes      85.33%    87%        85%     85%        90.66%    91%        91%     91%
Random Forest    87.33%    88%        87%     87%        95.33%    96%        95%     95%
SVM              90.66%    91%        91%     91%        92.66%    93%        93%     93%
The average accuracy scores acquired using the 10-fold cross-validation for all the
classifiers are shown in Table 2.
The ROC curve (Receiver Operating Characteristic curve) is plotted for models
that use only the HOG feature set and models that use a fusion of HOG and LBP
feature sets (Fig. 7).
The AUC scores were computed from the ROC curves, and the results were
tabulated (Table 3). It can be noted that the AUC score for all the classification
models that use the fused HOG and LBP feature set is above 90%. This
performance improvement shows that the adoption of fused HOG and LBP
features for the classification model helps achieve better results.
Fig. 7 ROC curves for the classification models that use (a) the HOG feature set extracted from
images of dataset 1, (b) the fusion of HOG and LBP feature sets extracted from images of dataset
1, (c) the HOG feature set extracted from images of dataset 2, and (d) the fusion of HOG and LBP
feature sets extracted from images of dataset 2
The Adaboost classification algorithm shows the best performance among all the
classifiers. Further, the exact location of the pothole region must be determined and
highlighted. Therefore, the normalized graph cut segmentation using RAG (Region
Adjacency Graph) was employed for pothole localization in positively classified
images. Figures 8 and 9 depict the process and result of pothole detection using
classification and localization.
In the deep learning approach, detection results of the YOLO v3 model run on
the test data are shown in Table 4. The prediction time for YOLO v3 was 26.90
milliseconds. The sample output of pothole detection by the YOLO v3 model is
shown in Fig. 10.
Based on the outcome of classification using HOG features and fusion of HOG
and LBP features, it is evident that the fusion of HOG and LBP features improves
the classification performance of the machine learning models.
The classification results of machine learning algorithms convey that the
Adaboost classifier with the HOG and LBP feature set outperforms all the other
classifiers. For creating a bounding box around the pothole region, localization of
potholes was performed using normalized graph cut using RAG (Region Adjacency
Graph). The overall detection time for this approach is approximately 0.35 seconds.
Fig. 8 A step-by-step illustration of normalized graph cut segmentation using Region Adjacency
Graph (RAG)
Fig. 9 Sample output of Adaboost classification and normalized graph cut segmentation using
RAG for detection of potholes
This approach of pothole detection does not require high-performance processors
such as GPUs to run smoothly. However, the results include a few false positives
during the localization of potholes.
The YOLO v3 model achieved a mean Average Precision (mAP) of 88.01% and
a faster inference time to detect the pothole(s) in an image. With a prediction time of
26.90 milliseconds, the model can process up to 37 frames per second. However, the
requirement of higher processing power and disk space makes the model unlikely
to be used in low-power edge devices.
78 U. Akshaya Devi and N. Arulanand
Fig. 10 Sample output of pothole detection performed using YOLO v3 model for pothole and
non-pothole images
6 Conclusion
References
1. Garcillanosa, M. M., Pacheco, J. M. L., Reyes, R. E., & San Juan, J. J. P. (2018). Smart
detection and reporting of potholes via image-processing using Raspberry-Pi microcontroller.
In 10th international conference on knowledge and smart technology (KST), Chiang Mai,
Thailand. 31 Jan–3 Feb, 2018.
2. Rao, A. S., Gubbi, J., Palaniswami, M., & Wong, E. (2016). A vision-based system to detect
potholes and uneven surfaces for assisting blind people. In IEEE international conference on
communications (ICC), Kuala Lumpur, Malaysia. 22–27 May, 2016.
3. Azhar, K., Murtaza, F., Yousaf, M. H., & Habib, H. A. (2016). Computer vision based detection
and localization of potholes in Asphalt Pavement images. In IEEE Canadian conference on
electrical and computer engineering (CCECE), Vancouver, BC, Canada. 15–18 May, 2016.
4. Yousaf, M. H., Azhar, K., Murtaza, F., & Hussain, F. (2018). Visual analysis of asphalt
pavement for detection and localization of potholes. Advanced Engineering Informatics,
Elsevier, 38, 527–537.
5. Ouma, Y. O., & Hahn, M. (2017). Pothole detection on asphalt pavements from 2D-colour
pothole images using fuzzy c-means clustering and morphological reconstruction. Automation
in construction, Elsevier, 83, 196–211.
6. Kang, B.-H., & Choi, S.-I. (2017). Pothole detection system using 2D LiDAR and camera. In
Ninth international conference on ubiquitous and future networks (ICUFN), Milan, Italy. 4–7
July, 2017.
7. Buza, E., Omanovic, S., & Huseinovic, A. (2013). Pothole detection with image processing
and spectral clustering. In Recent advances in computer science and networking, 2013.
8. Ping, P., Yang, X., & Gao, Z. (2020). A deep learning approach for street pothole detection. In
IEEE sixth international conference on big data computing service and applications, Oxford,
UK. 3–6 Aug, 2020.
9. Ray, A., & Ray, H. (2019). Smart portable assisted device for visually impaired people. In
International conference on intelligent sustainable systems (ICISS), Palladam, India, 21–22
Feb, 2019.
10. Atulya Kumar. (2020). Kaggle pothole detection dataset. https://www.kaggle.com/atulyakumar98/pothole-detection-dataset
11. Atikur Rahman Chitholian. (2020). YOLO v3 pothole detection dataset. https://public.roboflow.com/object-detection/pothole
12. Redmon, J. (2016). You only look once: Unified, real-time object detection. In IEEE conference
on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA. 27–30 June, 2016.
Shape Feature Extraction Techniques for
Computer Vision Applications
1 Introduction
E. F. I. Raj ()
Department of Electrical and Electronics Engineering, Dr. Sivanthi Aditanar College of
Engineering, Tiruchendur, Tamil Nadu, India
M. Balaji
Department of Electrical and Electronics Engineering, SSN College of Engineering, Chennai,
Tamil Nadu, India
Preprocessing is applied in this stage to improve image quality by reducing noise
and making images clearer for measuring the required features [9]. The feature
extraction stage extracts features from preprocessed images to make the recognition
task easier and more accurate. Many feature extraction techniques are available to
extract the important features of the object present in the image. The retrieved
features are then saved in a database. The classifier will then utilize the database
to look for and identify a comparable image based on the input image attributes.
Among all of these procedures, feature extraction is among the most important for
making object detection simpler and more precise [10].
Shape feature extraction is important in various applications, including (1)
shape retrieval, (2) shape recognition and classification, (3) shape alignment and
registration, and (4) shape estimation and simplification. Shape retrieval is the
process of looking for full shapes that seem to be identical to a query shape in a large
database of shapes [11]. In general, all shapes that are within a specific distance
of the query, or the first limited shapes with the shortest distance, are calculated.
Shape recognition and classification is the process of determining if a given shape
resembles a model well or which database class is the most comparable. The
processes of converting or interpreting one shape to match other shapes completely
or partially are known as shape alignment and registration [12]. Estimation and
simplification of shapes reduce the number of elements (points, segments, etc.)
while maintaining similarity to the original.
2 Feature Extraction
The layout, texture, color, and shape of an object are used by the majority of image
retrieval systems. Its shape defines the physical structure of an object. Moment,
region, border, and so on can all be used to depict it. These depictions can be used
to recognize objects, match shapes, and calculate shape dimensions. The structural
patterns of surfaces of cloth, grass, grain, and wood are examples of texture.
Normally, it refers to the repeating of basic texture pieces known as Texel. A Texel
is made up of many pixels that are placed in a random or periodic pattern. Artificial
textures are usually periodic or deterministic, but natural textures are often random.
Linear, uneven, smooth, fine, or coarse textures are all possibilities. Texture can be
divided into two types in image analysis: statistical and structural. The textures in
a statistical approach are random. Textures in the structural approach are entirely
structural and predictable, repeating according to certain deterministic or random
placement principles. Another approach proposed in the literature combines
statistical and structural analysis; these are called mosaic models, which represent
a random geometrical process. Although texture, color, and shape are key aspects
of image retrieval, they are rendered useless when the image in the database or the
input image lacks such qualities, for example, when the query image is a line
drawing consisting simply of black and white lines.
Kim et al. [13] propose a novel watermarking algorithm for grayscale text document
images. Edge image matching is a common comparison technique in computer
vision and image retrieval. The edge directions histogram is an important tool
for object detection in images that lack color information or contain identical
color information [14]. For this feature extraction, the edges are extracted using the
Canny edge operator, and the related edge directions are then quantized into 72 bins
of 5° each [15]. Histograms of edge directions (HED) can also be used to represent
shapes.
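The quantization into 72 bins of 5° each (360/72 = 5) can be sketched as follows; the edge angles would come from a Canny/gradient step, and the normalization is our addition:

```python
import numpy as np

def edge_direction_histogram(angles_deg, n_bins=72):
    """Quantize edge directions (degrees) into 72 bins of 5 degrees each
    and normalize so that histograms from different images are comparable."""
    bins = (np.asarray(angles_deg, dtype=float) % 360.0 // (360.0 / n_bins)).astype(int)
    hist = np.bincount(bins, minlength=n_bins)
    return hist / max(hist.sum(), 1)
```

Two images can then be compared by any histogram distance (Euclidean, chi-square, etc.) over the 72-dimensional descriptor.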
This detector has been utilized in a wide range of image matching applications,
demonstrating its effectiveness for efficient motion tracking [16]. Although these
feature detectors are commonly referred to as corner detectors, they are capable of
detecting any image region with significant gradients in all directions at a
predetermined scale. This approach is ineffective for matching images of varying
sizes because it is sensitive to variations in image scale.
In the Angular Radial Partitioning (ARP) approach, edge detection is conducted
after the images stored in the database are converted to grayscale [22]. Surrounding
circles partition the edge image to achieve scale invariance: the surrounding circle
is found from the intersection points of the edge, and angles are measured for the
feature extraction technique employed in the image retrieval comparison procedure.
The approach takes advantage of the surrounding circle of an object's edge image
to generate a number of radial divisions; after creating the surrounding circle,
equidistant circles are created to extract the features required for scale invariance.
A histogram made up of edge pixels is called the Edge Histogram Descriptor (EHD)
[29]. It is an excellent texture signature approach that can also be used to match
images, but its main drawback is that it is a rotation-variant approach. The MPEG-7
standard defines the EHD in its texture section [30]. This technique is beneficial
for image-to-image matching; however, the descriptor is ineffective at describing
rotation-invariant content.
The shape is a critical fundamental feature that is used to describe the content of
an image. However, as a result of occlusion, noise, and arbitrary distortion, the
shape of the object is frequently corrupted, complicating the object recognition
problem. Shapes are represented using shape characteristics that are either based on
the boundary plus the inside content or on the shape boundary information. Object
identification uses a variety of shape features, which are evaluated based on how
well they allow users to retrieve comparable forms from the database.
4 Shape Signature
The shape signature refers to the one-dimensional shape feature function obtained
from the shape’s edge coordinates. The shape signature generally holds the
perspective shape property of the object. Shape signatures can define the entire
shape; they are also commonly used as a preprocessing step before other feature
extraction techniques.
Centroid distance function (CDF) is defined as the distance of contour points from
the shape’s centroid (x0 , y0 ) and is represented by Eq. (1) [37].
r(n) = √((x(n) − x0)² + (y(n) − y0)²)   (1)
The centroid is located at the coordinates (x0 , y0 ), which are the average of the x
and y coordinates for all contour points. A shape’s boundary is made up of a series
of contour or boundary points. A radius is a straight line that connects the centroid
to a point on the boundary. The Euclidean distance is used in the CDF model to
capture a shape’s radii lengths from its centroid at regular intervals as the shape’s
descriptor [38]. Let θ be the regular interval (in degrees) between two radii (Fig.
2). K = 360/θ then gives the number of intervals. All radii lengths are normalized
by dividing by the longest of the extracted radii lengths.
Moreover, without loss of generality, assume that the intervals are taken
clockwise from the x-axis. The shape descriptor can then be represented as a vector,
as illustrated in Eq. (2). Figure 3 illustrates the centroid distance function plot of
a shape boundary.
S = {r0, rθ, r2θ, . . . , r(K−1)θ}   (2)
This method has some advantages and disadvantages. It is translation-invariant
because subtracting the centroid removes the shape's position from the edge
coordinates; this is its main advantage. The main drawback is that the method
fails to properly depict the shape if there are multiple boundary points at the
same interval.
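A minimal NumPy sketch of the centroid distance signature of Eqs. (1) and (2) follows; taking the first boundary point found in each angular interval is one simple resolution of the multiple-points-per-interval ambiguity noted above, and the function name is ours:

```python
import numpy as np

def centroid_distance_signature(contour, K=36):
    """Centroid distance function of Eq. (1): radii from the centroid to the
    boundary, sampled over K regular angular intervals (theta = 360/K degrees)
    and normalized by the longest radius, as in Eq. (2)."""
    pts = np.asarray(contour, dtype=float)
    centroid = pts.mean(axis=0)
    rel = pts - centroid
    r = np.hypot(rel[:, 0], rel[:, 1])                        # Eq. (1)
    ang = np.degrees(np.arctan2(rel[:, 1], rel[:, 0])) % 360.0
    signature = np.zeros(K)
    for k in range(K):
        # Keep one representative radius per interval (the first point found).
        in_bin = (ang // (360.0 / K)).astype(int) == k
        if in_bin.any():
            signature[k] = r[in_bin][0]
    return signature / signature.max()
```

Because the centroid is subtracted before measuring the radii, translating the contour leaves the signature unchanged, which is the translation invariance claimed in the text.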
The chord length function (CLF) is calculated from the shape contour without using
a reference point [39]. As shown in Fig. 3, the CLF of each contour point C is
the shortest distance between C and the other contour point C’ such that line CC’
is orthogonal to the tangent vector at C. This method also has some merits and
demerits. The important merit is that it is translation-invariant and addresses the
issue of biased reference points (the centroid is frequently biased by contour
defections or noise). The demerit is that the chord length function is extremely
sensitive to noise; even smoothed shape boundaries can cause an extreme burst
in the signature.
In the area function (AF) approach, as the contour points along the shape edge are
traversed, the area of the triangle formed by two consecutive contour points and
the centroid changes as well [40]. This yields an area function that can be thought of
as a shape representation. It is illustrated in Fig. 4. Let An denote the area between
consecutive edge points Pn, Pn+1, and the centroid C. The area function approach
and its plot of a shape boundary are shown in Figs. 5 and 6.
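The area function can be sketched with the shoelace formula applied to each triangle (Pn, Pn+1, C); the function name is ours:

```python
import numpy as np

def area_function(contour):
    """Area A_n of the triangle formed by each pair of consecutive contour
    points P_n, P_{n+1} and the centroid C, via the shoelace (cross-product)
    formula, yielding the AF signature."""
    pts = np.asarray(contour, dtype=float)
    c = pts.mean(axis=0)
    p = pts - c                          # P_n relative to the centroid
    q = np.roll(p, -1, axis=0)           # P_{n+1}, wrapping around the contour
    return 0.5 * np.abs(p[:, 0] * q[:, 1] - p[:, 1] * q[:, 0])
```

For a closed contour the triangle areas sum to the total enclosed area, which gives a quick sanity check on the signature.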
The shape is an important visual and emerging feature for explaining image content.
One of the most difficult problems in developing effective content-based image
retrieval is the usage of object shape [41]. Because determining the similarity
between shapes is difficult, a precise description of shape content is hard to
formulate. Thus, in shape-based image retrieval, two steps are critical: shape feature
extraction and similarity calculation among the extracted features. Some of the
real-time applications are discussed below.
Step 3: Convert all images to binary so that the fruit pixels are 1s and the residual
pixels are 0s [43], as shown in Fig. 8.
Step 4: The Canny edge detector is an edge detection operator that detects a wide
range of edges in images using a multistage approach. The Canny edge detection
algorithm [44] is used to extract the fruit contour, as shown in Fig. 9.
Step 5: For each image, compute the centroid distance [45]. Figure 10 depicts Fig.
9’s centroid distance plot.
Step 6: Euclidean distance measurement is used to compare the centroid distance
between training and testing images [46].
Step 7: The test fruit image is distinguished from the training images by the smallest
difference [47].
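Steps 6 and 7 amount to nearest-neighbor matching on the centroid-distance signatures; a minimal sketch with a function name of our choosing:

```python
import numpy as np

def classify_by_signature(test_sig, train_sigs, train_labels):
    """Steps 6-7: compare the test image's centroid-distance signature against
    every training signature by Euclidean distance and return the label of the
    closest one (smallest difference)."""
    dists = [np.linalg.norm(np.asarray(test_sig) - np.asarray(s))
             for s in train_sigs]
    return train_labels[int(np.argmin(dists))]
```

With normalized signatures of equal length, the Euclidean distance directly reflects how similar two boundary profiles are, so the minimum-distance training image identifies the fruit.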
The shape feature can be used in a variety of ways to recognize leaves [48]. The
following is an example of a leaf recognition algorithm that uses the shape feature:
Step 1: First, gather some images of various sorts of leaves with varying shapes. A
leaf is depicted in Fig. 11.
Step 2: The images are classified into two parts: training and testing.
Step 3: Convert all images to binary, with the leaf pixels being 1s and the residual
pixels being 0s (Fig. 12) [43].
Step 4: Following that, the leaf contour is extracted using the Canny edge detection
algorithm (Fig. 13) [44]. It is an image processing approach that identifies points
in a digital image that have discontinuities or sharp changes in brightness.
Step 5: Calculate the seven Hu moments [49] associated with each image. Figure 14
depicts the plot of Fig. 13’s seven Hu moment values.
Step 6: Euclidean distance measurement is used to compare moments between
training and testing images.
Step 7: The test leaf image is distinguished from the training images by the smallest
difference [47].
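Steps 5–7 rest on the seven Hu moments, translation-, scale-, and rotation-invariant functions of the normalized central moments. A NumPy sketch following Hu's classical formulas (OpenCV's cv2.HuMoments would be used in practice):

```python
import numpy as np

def hu_moments(binary):
    """Compute the seven Hu moment invariants of a binary image from its
    normalized central moments eta_pq."""
    img = np.asarray(binary, dtype=float)
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00

    def eta(p, q):
        # Central moment mu_pq, normalized for scale invariance.
        mu = (((x - xc) ** p) * ((y - yc) ** q) * img).sum()
        return mu / m00 ** (1 + (p + q) / 2.0)

    e20, e02, e11 = eta(2, 0), eta(0, 2), eta(1, 1)
    e30, e03, e21, e12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    h1 = e20 + e02
    h2 = (e20 - e02) ** 2 + 4 * e11 ** 2
    h3 = (e30 - 3 * e12) ** 2 + (3 * e21 - e03) ** 2
    h4 = (e30 + e12) ** 2 + (e21 + e03) ** 2
    h5 = ((e30 - 3 * e12) * (e30 + e12) * ((e30 + e12) ** 2 - 3 * (e21 + e03) ** 2)
          + (3 * e21 - e03) * (e21 + e03) * (3 * (e30 + e12) ** 2 - (e21 + e03) ** 2))
    h6 = ((e20 - e02) * ((e30 + e12) ** 2 - (e21 + e03) ** 2)
          + 4 * e11 * (e30 + e12) * (e21 + e03))
    h7 = ((3 * e21 - e03) * (e30 + e12) * ((e30 + e12) ** 2 - 3 * (e21 + e03) ** 2)
          - (e30 - 3 * e12) * (e21 + e03) * (3 * (e30 + e12) ** 2 - (e21 + e03) ** 2))
    return np.array([h1, h2, h3, h4, h5, h6, h7])
```

Because the central moments subtract the centroid and the normalization divides out the area, the seven values are unchanged when the leaf is shifted or uniformly rescaled, which is what makes them usable for the Euclidean comparison in Step 6.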
There are two images in this case: the test image and the target image. The test
image depicts a scene of flowers in front of a window, and the target image (a
flower) is to be found using the scale-invariant feature transform (SIFT).
Step 1: Input target image (flower) – (Fig. 15a).
Step 2: Input test image (scene with cluttered objects) – (Fig. 15b).
Step 3: By using SIFT, find 100 strongest points in the target image – (Fig. 16a).
Step 4: By using SIFT, find 200 strongest points in the test image – (Fig. 16b).
Step 5: Find putatively matched points by comparing the two images – (Fig.
17).
Step 6: Calculate a point that is precisely matched – (Fig. 18).
Step 7: A polygon should be drawn around the region of exactly matching points –
(Fig. 19).
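Keypoint detection and description themselves require a library (e.g., OpenCV's SIFT implementation), but the matching of Steps 5 and 6 can be illustrated with Lowe's ratio test on precomputed descriptor arrays; the 0.8 threshold follows Lowe's paper, and the function name is ours:

```python
import numpy as np

def ratio_test_matches(desc_target, desc_test, ratio=0.8):
    """Lowe's ratio test: accept a match only when the nearest test descriptor
    is clearly closer than the second nearest. Returns (target_idx, test_idx)
    index pairs for the putative matches."""
    matches = []
    for i, d in enumerate(desc_target):
        dists = np.linalg.norm(desc_test - d, axis=1)
        order = np.argsort(dists)
        nearest, second = order[0], order[1]
        if dists[nearest] < ratio * dists[second]:
            matches.append((i, int(nearest)))
    return matches
```

Discarding ambiguous matches this way is what separates the putative matches of Step 5 from the precisely matched points kept in Step 6.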
Fig. 16 (a) 100 strongest points in the target image, (b) 200 strongest points in the test image
6 Recent Works
Many recent works on shape feature extraction in computer vision have been
reported in the literature. The same approach can be used in many recent
applications such as robotics, fault detection, autonomous vehicle management
systems, the Industry 4.0 framework, medical applications, etc. Here, we list a
few of them for reference.
Yang et al. [50] explained Fish Detection and Behavior Analysis Using Vari-
ous Computer Vision Models in Intelligent Aquaculture, and Foysal et al. [51],
Application for Detecting Garment Fit on a Smartphone Using Computer Vision
Approach. In [52–54], the authors detailed various autonomous vehicle management
systems. A comprehensive review of vehicle detection, traffic light recognition by
autonomous vehicles in the city environment, and pothole detection in the roadways
for such vehicles are also explained. In [55], Das et al. derived parking area
patterns from autonomous vehicle positions in aerial images using computer
vision with a Mask R-Convolutional Neural Network.
Devaraja et al. [56] explained computer vision-based grasping by the robotic
hands used in industries. In addition, the authors describe shape recognition
by autonomous robots in an industrial environment. In [57], the authors detailed
the computer vision-based robotic equipment used in the medical field and its
importance in surgeries. In [58], the author describes robotic underwater vehicles
that use computer vision to monitor deepwater animals. High efficiency of the
system can be attained by employing machine learning techniques along with
computer vision.
In [59], the authors detailed computer vision-enabled, support vector machine-
assisted fault detection in industrial textures. Cho et al. [60] explained fault
analysis and fault detection in a wind turbine system using an artificial neural
network along with a Kalman filter and computer vision approaches. In [61, 62],
the authors detailed fault detection in aircraft wings and sustainable
The present work focuses on the shape feature extraction techniques used in
computer vision applications. Various feature extraction techniques are also
explained in detail. Histogram-based image retrieval feature extraction approaches
used in computer vision include the Edge Histogram Descriptor and histograms
of edge directions. The eigenvector approach, unlike scale-, rotation-, or
translation-variant approaches, is particularly sensitive to changes in individual
pixel values. ARP is invariant in terms of scale and rotation. The EPNI method is
invariant in terms of scale and translation but not in terms of rotation. The color
histogram is affected by noise, but it is insensitive to rotation and translation.
Shape description and representation approaches are divided into two categories:
contour-based approaches and region-based approaches. Both sorts of approaches
are further subdivided into global and structural techniques. Although contour-based
techniques are more popular than region-based techniques, they still have significant
drawbacks. The region-based approaches can circumvent these restrictions. Shape
signatures are frequently utilized as a preprocessing step before the extraction of
other features. The most significant one-dimensional feature functions are also
presented in the current work. Some of the real-time feature extraction and object
recognition applications used in computer vision are explained in detail. In addition
to that, the latest recent works related to shape feature extraction with computer
vision are also listed.
References
1. Bhargava, A., & Bansal, A. (2021). Fruits and vegetables quality evaluation using computer
vision: A review. Journal of King Saud University-Computer and Information Sciences, 33(3),
243–257.
2. Zhang, L., Pan, Y., Wu, X., & Skibniewski, M. J. (2021). Computer vision. In Artificial
intelligence in construction engineering and management (pp. 231–256). Springer.
3. Dong, C. Z., & Catbas, F. N. (2021). A review of computer vision–based structural health
monitoring at local and global levels. Structural Health Monitoring, 20(2), 692–743.
4. Iqbal, U., Perez, P., Li, W., & Barthelemy, J. (2021). How computer vision can facilitate
flood management: A systematic review. International Journal of Disaster Risk Reduction,
53, 102030.
5. Torralba, A., Murphy, K. P., Freeman, W. T., & Rubin, M. A. (2003, October). Context-
based vision system for place and object recognition. In Computer vision, IEEE international
conference on (Vol. 2, pp. 273–273). IEEE Computer Society.
6. Liang, M., & Hu, X. (2015). Recurrent convolutional neural network for object recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3367–
3375).
7. Kortylewski, A., Liu, Q., Wang, A., Sun, Y., & Yuille, A. (2021). Compositional convolutional
neural networks: A robust and interpretable model for object recognition under occlusion.
International Journal of Computer Vision, 129(3), 736–760.
8. Alom, M. Z., Hasan, M., Yakopcic, C., Taha, T. M., & Asari, V. K. (2021). Inception recurrent
convolutional neural network for object recognition. Machine Vision and Applications, 32(1),
1–14.
9. Cisar, P., Bekkozhayeva, D., Movchan, O., Saberioon, M., & Schraml, R. (2021). Computer
vision based individual fish identification using skin dot pattern. Scientific Reports, 11(1), 1–
12.
10. Saba, T. (2021). Computer vision for microscopic skin cancer diagnosis using handcrafted and
non-handcrafted features. Microscopy Research and Technique, 84(6), 1272–1283.
11. Li, Y., Ma, J., & Zhang, Y. (2021). Image retrieval from remote sensing big data: A survey.
Information Fusion, 67, 94–115.
12. Lucny, A., Dillinger, V., Kacurova, G., & Racev, M. (2021). Shape-based alignment of the
scanned objects concerning their asymmetric aspects. Sensors, 21(4), 1529.
13. Kim, Y. W., & Oh, I. S. (2004). Watermarking text document images using edge direction
histograms. Pattern Recognition Letters, 25(11), 1243–1251.
14. Bakheet, S., & Al-Hamadi, A. (2021). A framework for instantaneous driver drowsiness
detection based on improved HOG features and Naïve Bayesian classification. Brain Sciences,
11(2), 240.
15. Heidari, H., & Chalechale, A. (2021). New weighted mean-based patterns for texture analysis
and classification. Applied Artificial Intelligence, 35(4), 304–325.
16. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International
Journal of Computer Vision, 60(2), 91–110.
17. Linde, O., & Lindeberg, T. (2012). Composed complex-cue histograms: An investigation of the
information content in receptive field based image descriptors for object recognition. Computer
Vision and Image Understanding, 116(4), 538–560.
18. Hazgui, M., Ghazouani, H., & Barhoumi, W. (2021). Evolutionary-based generation of rotation
and scale invariant texture descriptors from SIFT keypoints. Evolving Systems, 12, 1–13.
19. Shapiro, L. S., & Brady, J. M. (1992). Feature-based correspondence: An eigenvector approach.
Image and Vision Computing, 10(5), 283–288.
1 Introduction
A picture describes a scene efficiently and conveys information effectively. Human visual perception helps us interpret fine details from an image. Almost 90% of the data processed by the human brain is visual, and the brain responds to and processes visual data 60,000 times faster than any other form of data. Image processing systems require images to be represented in digital form. A digital image is a two-dimensional array of numbers, where the numbers represent the intensity values of the image at various spatial locations. These pixels possess spatial coherence that can be exploited by performing arithmetic operations such as addition and subtraction. Statistical manipulation of the pixel values helps to develop image processing techniques for a variety of applications. Most of these techniques employ feature extraction as one of their steps. A variety of features such as colour, shape, and texture can be extracted from digital images. Among these, texture features such as fine, coarse, smooth, and grained play an important role.
R. Anand ()
Department of ECE, Sri Eshwar College of Engineering, Coimbatore, India
T. Shanthi · R. S. Sabeenian
Research Member in Sona SIPRO, Department of ECE, Sona College of Technology, Salem,
India
e-mail: [email protected]; [email protected]
S. Veni
Department of Electronics and Communication Engineering, Amrita School of Engineering,
Coimbatore, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 103
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in
Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_5
104 R. Anand et al.
are performed. They first collected the Amazon dataset and then performed preprocessing to remove stop words and special characters. They applied phrase-level, single-word, and multiword feature selection and extraction techniques. Naive Bayes is used as the classifier. They concluded that Naive Bayes gives better results at the phrase level than at the single-word and multiword levels. The main drawback of this work is that only a Naive Bayes classifier was used, which cannot give ample results. Paper [8] utilized simpler algorithms, so it is easy to understand. That system gives high precision with SVM, but it cannot work properly on very large datasets; it used support vector machine (SVM), logistic regression, and decision tree methods. In paper [9], tf-idf is utilized as a supplemental experiment. It can predict ratings using a bag of words, but only a few classifiers are used, namely the root mean square error and a linear regression model. Those are some related works, and we endeavoured to make our work more efficient by selecting the best ideas from them and applying them together. In our system, we used a large amount of data to give efficient results and make better decisions. Moreover, we have utilized an active learning approach to label datasets, which can dramatically speed up many machine learning tasks. Our system additionally consists of several types of feature extraction methods. To the best of our knowledge, our proposed approach gave higher precision than the existing research works. The strengths and weaknesses of statistical approach methods for texture image classification in the proposed work are shown in Table 1.
Table 1 Strengths and weaknesses of statistical approach methods for texture image classification

Morphological operation [5]
  Strengths: Good and efficient for aperiodic image textures.
  Weaknesses: 1. Morphological operations are not applicable for periodic images.

Autocorrelation method [6]
  Strengths: 1. It overcomes illumination distortion and is robust to noise. 2. Low computational complexity.
  Weaknesses: 1. Real-time applications for large images need high computation. 2. Not suitable for all kinds of textures.

Grey-level co-occurrence matrix [7]
  Strengths: 1. Spatial relationship of pixels with ten different statistical computations. 2. Contrast, energy, homogeneity, mean, standard deviation, entropy, RMS, variance, smoothness, IDM. 3. Accuracy rate will be high.
  Weaknesses: 1. High computational time. 2. Optimum movement vector is problematic. 3. It requires a feature selection procedure. 4. Accuracy depends on the offset rotation.
2 GLCM
The Grey-Level Co-occurrence Matrix (GLCM) is a square matrix that is obtained from the input image. The dimension of the GLCM equals the number of grey levels in the input image. For example, an 8-bit image has 256 grey levels ranging from 0 to 255. For such an image, the GLCM will have 256 rows and 256 columns, with each row/column representing one of the intensity values. The second-order statistics are obtained by considering pairs of pixels related to each other by their relative position. The Grey-Level Co-occurrence Matrices provide useful mathematical statistics on the texture. The GLCM of an image depends on the direction and offset values. The direction can be any one of the eight possible directions shown in Fig. 1. The offset represents the distance between pixels: if the distance between the pixels is 1, the immediate neighbouring pixel in the chosen direction is considered. In this way, several GLCM matrices can be obtained from a single image, as shown in Fig. 1.
The GLCM is a square matrix with the same number of rows and columns and non-negative entries only. It is an N × N matrix, where N denotes the number of possible grey levels in the image. For example, a 2-bit image has four grey levels (0–3) and results in a GLCM of size 4 × 4; the rows and columns correspond to the grey values 0–3. Consider the following image f(x, y) of size 5 × 5 with its grey-level representation given in Figs. 2 and 3.
The matrix G(θ, d) = G(0°, 1) represents the GLCM. The first row corresponds to the grey value 0, and the next rows to the grey values 1, 2, and 3.

Fig. 1 Co-occurrence matrix directions: 135° [D, −D], 90° [−D, 0], 45° [−D, D]

f(x, y) =
1 0 2 2 1
1 0 0 1 2
1 3 1 1 3
0 1 1 1 3
0 2 2 1 2

Fig. 2 The intensity values and their corresponding grey levels for an image segment f(x, y) with four grey levels
Similarly, the first column corresponds to the grey value 0, and the next columns to the grey values 1, 2, and 3. The first element in the first row of the GLCM gives the count of occurrences of the grey value 0 in the 0° neighbourhood of the grey value 0. Looking at the input matrix, the pair (0, 0) occurs at only one point; hence, the first cell of the GLCM equals 1. The second element in the first row gives the count of occurrences of the grey value 1 in the 0° neighbourhood of the grey value 0. Looking at the input matrix, the pair (0, 1) occurs at two points; hence, the second element equals 2. Similarly, the third and fourth elements are calculated from the occurrences of the pairs (0, 2) and (0, 3). The second row of the GLCM is computed from the occurrences of the pairs (1, 0), (1, 1), (1, 2), and (1, 3); the third row from the pairs (2, 0), (2, 1), (2, 2), and (2, 3); and the fourth row from the pairs (3, 0), (3, 1), (3, 2), and (3, 3). The resulting GLCM matrix G(0°, 1) is given in Fig. 4.
A GLCM with a single offset is not sufficient for image analysis. For example, the GLCM computed in the 0° direction alone is not adequate to extract information from an image with vertical details. The input image may contain details in any direction; hence, GLCMs with different directions and different distance values are computed from a single image, and the average of all these matrices is used for further analysis. Every value in this matrix is divided by the total number of pixel pairs available in the input image to obtain a normalised GLCM; i.e., for an L × L image with a unit offset, each element of the averaged matrix is divided by L × (L − 1). The normalised GLCM g(m, n) can be used to extract several features from the image. These features are elaborated in the upcoming section.
f(x, y) =
1 0 2 2 1
1 0 0 1 2
1 3 1 1 3
0 1 1 1 3
0 2 2 1 2

Fig. 3 Counting the co-occurring pixel pairs of f(x, y) in the 0° direction with offset 1

G(0°, 1) =
1 2 2 0
2 3 2 3
0 2 2 0
0 1 0 0

Fig. 4 The resulting GLCM matrix G(0°, 1)
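In code, the pair counting behind G(0°, 1) can be sketched in a few lines of Python; this is an illustrative sketch (the function name and parameters are not from the chapter), counting each pair whose neighbour lies one pixel to the right:

```python
def glcm(image, drow=0, dcol=1, levels=4):
    """Count co-occurring grey-level pairs separated by the offset (drow, dcol)."""
    g = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + drow, c + dcol
            if 0 <= r2 < rows and 0 <= c2 < cols:
                g[image[r][c]][image[r2][c2]] += 1
    return g

f = [[1, 0, 2, 2, 1],
     [1, 0, 0, 1, 2],
     [1, 3, 1, 1, 3],
     [0, 1, 1, 1, 3],
     [0, 2, 2, 1, 2]]

G = glcm(f)                       # 0-degree direction, offset 1
total = sum(map(sum, G))          # 20 horizontal pixel pairs in a 5 x 5 image
g_norm = [[v / total for v in row] for row in G]
```

Dividing by the pair count, as in the last line, yields the normalised matrix g(m, n) used for the features below.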
A demonstration of the LATEX 2ε class file for EAI Endorsed Transactions 109
Let g(m, n) represent the normalised matrix with N grey levels, and let μ_x, σ_x and μ_y, σ_y be the mean and standard deviation of the marginal probability matrices P_x(m) and P_y(n), respectively:

P_x(m) = Σ_{n=0}^{N−1} g(m, n)  (1)

P_y(n) = Σ_{m=0}^{N−1} g(m, n)  (2)

The mean values of the marginal probability matrices P_x(m) and P_y(n) are given as

μ_x = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} m g(m, n)  (3)

μ_x = Σ_{m=0}^{N−1} m P_x(m)  (4)

μ_y = Σ_{n=0}^{N−1} Σ_{m=0}^{N−1} n g(m, n)  (5)

μ_y = Σ_{n=0}^{N−1} n P_y(n)  (6)

The standard deviation values of the marginal probability matrices P_x(m) and P_y(n) are given as

σ_x² = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (m − μ_x)² g(m, n)  (7)

σ_y² = Σ_{n=0}^{N−1} Σ_{m=0}^{N−1} (n − μ_y)² g(m, n)  (8)

The sum and difference marginal matrices are obtained by accumulating g(m, n) over the diagonals m + n = l and |m − n| = l:

P_{x+y}(l) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} g(m, n), m + n = l,  (9)

P_{x−y}(l) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} g(m, n), |m − n| = l.  (10)
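The marginal matrices of Eqs. (1), (2), (9), and (10) can be tabulated in one pass over g(m, n); the helper below is an illustrative sketch, and the 2 × 2 normalised matrix g is invented:

```python
def marginals(g):
    """Marginal matrices Px, Py, P_{x+y} and P_{x-y} of a normalised GLCM g."""
    n = len(g)
    px = [sum(row) for row in g]                              # Eq. (1)
    py = [sum(g[m][c] for m in range(n)) for c in range(n)]   # Eq. (2)
    p_sum = [0.0] * (2 * n - 1)      # indexed by l = m + n
    p_diff = [0.0] * n               # indexed by l = |m - n|
    for m in range(n):
        for c in range(n):
            p_sum[m + c] += g[m][c]         # Eq. (9)
            p_diff[abs(m - c)] += g[m][c]   # Eq. (10)
    return px, py, p_sum, p_diff

g = [[0.25, 0.25],
     [0.0, 0.5]]      # invented normalised 2 x 2 GLCM
px, py, p_sum, p_diff = marginals(g)
```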
2.2.1 Energy
The energy (E) is computed as the sum of squares of the elements of the GLCM. It returns a value in the range [0, 1]. An energy value of 1 indicates that the image is constant. It also reveals the uniformity of the image, as shown in Eq. 11:

Energy (E) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (g(m, n))²  (11)
2.2.2 Entropy

Entropy (En) = − Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} g(m, n) × log(g(m, n))  (12)

2.2.3 Sum Entropy

SEn = − Σ_{m=2}^{2N} P_{x+y}(m) log(P_{x+y}(m))  (13)

2.2.4 Difference Entropy

DEn = − Σ_{m=0}^{N−1} P_{x−y}(m) log(P_{x−y}(m))  (14)
2.2.5 Contrast
Contrast (C) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (m − n)² g(m, n)  (15)
2.2.6 Variance
This statistic measures heterogeneity, and it is strongly correlated with first-order statistical variables such as the standard deviation. It returns a high value for elements that differ greatly from the average value of g(m, n). It is also referred to as the sum of squares. The variance of an image g(m, n) can be calculated using the following equation, where μ indicates the mean of the input image:

Variance (V) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (m − μ)² g(m, n)  (16)

The sum variance (SV) and difference variance (DV) are defined on the marginal matrices:

SV = Σ_{m=2}^{2N} (m − SEn)² P_{x+y}(m)  (17)

DV = Σ_{m=0}^{N−1} m² P_{x−y}(m)  (18)
Homogeneity (H) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} g(m, n) / (1 + |m − n|²)  (19)
If the adjacent pixels of the input image g(m, n) are highly correlated, the image is said to be auto-correlated (the autocorrelation of the input data with itself after a one-pixel shift). Correlation measures the linear dependency between pixels at the respective locations, and it can be calculated by the following equation:

Corr = [ Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (m × n) g(m, n) − (μ_x × μ_y) ] / (σ_x × σ_y)  (20)
Root mean square (RMS) contrast measures the standard deviation of the pixel intensities. It does not depend upon the angular frequency or the spatial distribution of contrast in the input image. Mathematically, it can be expressed as

RMS Contrast (RC) = √( (1/N²) Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (I_{mn} − Ī)² ),  (21)

where I_{mn} is the intensity at pixel (m, n) and Ī is the mean intensity of the image.
Cluster shade measures the unevenness of the input matrix and gives information about the uniformity of the image. Disproportionate images result in higher cluster shade values. The cluster shade is computed using the following equation:

CS = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (m + n − μ_x − μ_y)³ × g(m, n)  (22)

Cluster prominence also measures the asymmetric nature of the image. Higher cluster prominence indicates that the image is less symmetric, while smaller variance in the grey levels of the image results in a lower cluster prominence value.

CP = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (m + n − μ_x − μ_y)⁴ × g(m, n)  (23)
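A few of the features above (Eqs. 11, 12, 15, and 19) can be computed directly from a normalised GLCM. A minimal Python sketch follows; the function and dictionary names are illustrative, and zero entries are skipped in the entropy term, as is conventional:

```python
import math

def glcm_features(g):
    """Energy (Eq. 11), entropy (Eq. 12), contrast (Eq. 15) and
    homogeneity (Eq. 19) of a normalised GLCM g(m, n)."""
    n = len(g)
    cells = [(m, c, g[m][c]) for m in range(n) for c in range(n)]
    return {
        "energy": sum(v ** 2 for _, _, v in cells),
        "entropy": -sum(v * math.log(v) for _, _, v in cells if v > 0),
        "contrast": sum((m - c) ** 2 * v for m, c, v in cells),
        "homogeneity": sum(v / (1 + abs(m - c) ** 2) for m, c, v in cells),
    }

# A uniform 2 x 2 matrix: energy 4 * 0.25^2 = 0.25, contrast 2 * 0.25 = 0.5
features = glcm_features([[0.25, 0.25], [0.25, 0.25]])
```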
Texture features of an image are calculated considering only one band at a time. Channel information can be consolidated using PCA before calculating texture features. Texture features of an image can be used for both supervised and unsupervised image classification. Random Forest [10] is a classification method that builds multiple models using bootstrapped feature sets. The algorithm includes the following stages: to construct a single tree in the ensemble, the training set is bootstrapped many times, and the fresh set is applied to create the tree. To identify the optimal split variable, a random selection of features is drawn from the training set every time a node of the tree is split. The random forest took extra time for the validation procedure, but it had acceptable performance. These methods are compared with KNN and SVM to address this problem. In the technique suggested by Shanthi et al. [1], the K-nearest neighbours (KNN) classifier is used next. Given "a" and "o" input letters in a two-dimensional feature space, the method must identify the class of "c", another feature vector to be analysed. In this scenario, it identifies the K nearest neighbours without regard to labels. Figure 3 shows the classes "a" and "o" in the image; for our purposes, imagine the number 3 next to them. The objective of the algorithm is to discover which class "c" belongs to. Since k is 3, the three nearest neighbours of "c" must be identified. Of the three adjacent points, one is an "a" while the other two are "o"; "o" has two votes and "a" has one, so class "o" will be attributed to the vector "c". When K equals 1, the class is defined by the first closest neighbour of the element. Computation time for KNN prediction is extremely long; however, training for KNN is quicker than for random forest. Despite improved training times, it takes more processing resources to compute data in higher dimensions. Lastly, this chapter examines how well these algorithms perform when compared with SVM. The KNN method identifies the observations most similar to the one we are trying to predict, and those observations serve as a reasonable proxy for an answer, since averaging the values around them helps determine the most likely response.
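The voting scheme just described can be sketched as follows; the toy points, labels, and query are invented for illustration:

```python
from collections import Counter

def knn_predict(points, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, query)), lab)
        for p, lab in zip(points, labels)
    )
    votes = Counter(lab for _, lab in by_distance[:k])
    return votes.most_common(1)[0][0]

# One 'a' and two 'o' among the three nearest points, so 'o' wins the vote.
points = [(0.0, 0.0), (1.0, 0.2), (1.1, 0.1), (5.0, 5.0)]
labels = ["a", "o", "o", "a"]
predicted = knn_predict(points, labels, query=(1.0, 0.0), k=3)
```

With k = 1 the same helper reduces to the first-closest-neighbour rule mentioned above.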
To find the answer, the algorithm locates the neighbours determined by the integer k value. Smaller values of k force the algorithm to adapt closely to the data, putting it at risk of overfitting while allowing it to fit complicated borders between classes. Bigger k values smooth out the ups and downs of the actual data and result in smoother class separators. KNN prediction takes a lot of time to compute, yet it can train in a fraction of the time of
random forest. The HSI method for training may run faster but is more demanding on memory. In this chapter, we have used the support vector machine (SVM) [11, 12] for texture image classification. This method falls under the category of supervised machine learning [9].
The support vector machine was introduced in 1992 as a supervised machine learning algorithm. It gained popularity because of its high accuracy rate and minimum error rate. SVM is one of the best examples of the "kernel methods", a key area of machine learning. The idea behind SVM is to make use of a nonlinear mapping function φ that transforms data in the input space into a feature space in such a way that the data become linearly separable, as shown in Fig. 5 [2].
The SVM then automatically discovers the optimal separating hyperplane, which is nothing but a complex decision surface. The equation of the hyperplane is derived from the line equation y = ax + b; its equation takes the form shown below [13, 14], where w and x are vectors and w^T x denotes their dot product, as shown in Eq. 24:

w^T x = 0.  (24)

Any hyperplane can be framed as the set of points x satisfying w · x + b = 0. Two such hyperplanes are chosen, and based on the values obtained, samples are classified as class 1 or class 2, as given in Eqs. 25 and 26:

w · x_i + b ≥ 1 for x_i in class 1,  (25)

w · x_i + b ≤ −1 for x_i in class 2.  (26)
Here, an optimization problem arises because the goal is to maximize the margin among all the possible hyperplanes meeting the constraints. The hyperplane with the smallest ‖w‖ is chosen because it provides the biggest margin. The optimization problem is given in Eq. 27:

minimize_{w, b}  (1/2) ‖w‖²
subject to  y^(i) (w^T x_i + b) ≥ 1.  (27)

The solution to the above equation is the pair (w, b) with the minimum possible ‖w‖. The hyperplane that satisfies the constraints is taken as the optimal hyperplane.
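For a candidate (w, b), the constraints of Eq. (27) can be checked numerically, and the distance between the two supporting hyperplanes is 2/‖w‖. The helper functions and the one-dimensional toy data below are illustrative only:

```python
def constraints_ok(w, b, samples):
    """Check the SVM constraints y_i * (w . x_i + b) >= 1 for every sample."""
    return all(
        y * (sum(wj * xj for wj, xj in zip(w, x)) + b) >= 1
        for x, y in samples
    )

def margin_width(w):
    """Distance between the hyperplanes w.x + b = 1 and w.x + b = -1: 2 / ||w||."""
    return 2 / sum(wj ** 2 for wj in w) ** 0.5

# Toy separable data on the real line: class +1 at x >= 2, class -1 at x <= -2.
samples = [((2.0,), 1), ((3.0,), 1), ((-2.0,), -1), ((-4.0,), -1)]
w, b = (0.5,), 0.0   # feasible, with margin 2 / ||w|| = 4
```

Shrinking w further, e.g. to (0.1,), violates the constraints, which is why the feasible w with the smallest norm defines the maximum-margin hyperplane.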
4 Dataset Description
The dataset from the Centre for Image Analysis, Swedish University of Agricultural Sciences and Uppsala University [3], has been used in this chapter. In total, 4480 images of 28 different texture classes were taken using a Canon EOS 550D DSLR camera, as shown in Fig. 6. Each texture class has around 160 images, of which 112 are used for training and 48 for testing, as shown in Table 2. Figure 7 shows the complete flowchart of the proposed method for texture image classification using GLCM features. As a first step, the segmented image is resized to 576 × 576.
5 Experiment Results
error. When the type-2 error is high, it implies that a greater proportion of individuals with illness are classified as healthy, which may result in serious consequences [15–17]. Table 4 illustrates the confusion matrix for a multi-class problem. For each class, the entities TP, TN, FP, and FN may be assessed using the following equations:
2. TN of Class 1

TN = Σ_{i=1}^{4} X_ii − X_11 = (X_11 + X_22 + X_33 + X_44) − X_11 = X_22 + X_33 + X_44  (28)

3. FP of Class 1

FP = Σ_{i=1}^{4} X_i1 − X_11 = (X_11 + X_21 + X_31 + X_41) − X_11 = X_21 + X_31 + X_41  (29)

4. FN of Class 1

FN = Σ_{i=1}^{4} X_1i − X_11 = (X_11 + X_12 + X_13 + X_14) − X_11 = X_12 + X_13 + X_14  (30)
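These per-class counts can be read off a confusion matrix exactly as in the equations above. The 4 × 4 matrix X below is invented for illustration, TP of class k is taken as the diagonal entry X_kk, and TN follows the chapter's Eq. (28), which keeps only the remaining diagonal entries:

```python
def class_counts(X, k):
    """TP, TN, FP and FN of class k (0-indexed) following Eqs. (28)-(30)."""
    n = len(X)
    tp = X[k][k]
    tn = sum(X[i][i] for i in range(n)) - X[k][k]   # Eq. (28): remaining diagonal
    fp = sum(X[i][k] for i in range(n)) - X[k][k]   # Eq. (29): column k off-diagonal
    fn = sum(X[k][i] for i in range(n)) - X[k][k]   # Eq. (30): row k off-diagonal
    return tp, tn, fp, fn

X = [[50, 2, 1, 0],   # rows: actual class, columns: predicted class (invented)
     [3, 45, 1, 1],
     [0, 2, 47, 1],
     [1, 0, 2, 44]]
tp, tn, fp, fn = class_counts(X, 0)
```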
These are the points that help us build our SVM. The performance of the proposed system is measured in terms of sensitivity, specificity, accuracy, precision, false positive rate, and false negative rate. Sensitivity and specificity are important measures in classification. The accuracy of the system represents its exactness with respect to classification. To be precise, measurements must be close to one another for the same object. The overall classification accuracy of the proposed system is around 99.4%, with a precision of 92.4%. The false negative rate and false positive rate are very low, at 0.003 and 0.085, respectively. The sensitivity and specificity of the system are around 91.5% and 99.7%. The texture classes 2, 4, 5, 7, 9, 12, and 19 have been classified with good accuracy and precision compared to the other classes, as shown in Table 5.
5.1.1 Sensitivity
Sensitivity is also referred to as the true positive rate (TPR), recall, or probability of detection. It is a metric for true positives and provides a precise measure of the test's completeness.
5.1.2 Specificity
Specificity is also referred to as the true negative rate (TNR). It quantifies the true negatives. Specificity helps reduce the type-1 error.
5.1.3 False Positive Rate
The false positive rate (FPR) is also known as the false alarm rate. It is the ratio of misclassified negative samples to the total negative samples (Table 5).
Parameter                   Formula                    Ideal value   Worst value
False positive rate         FP / (TN + FP)             0             1
False negative rate         FN / (TP + FN)             0             1
Precision                   TP / (TP + FP)             1             0
Negative predictive value   TN / (TN + FN)             1             0
F1-score                    2·TP / (2·TP + FP + FN)    1             0
5.1.4 False Negative Rate
The false negative rate (FNR) is also called the miss rate. It is the ratio of misclassified positive samples to the total positive samples.
A system's performance is measured in terms of its efficiency. The efficiency with which a system solves a classification problem is quantified using metrics such as sensitivity, specificity, false positive rate, false negative rate, accuracy, precision, and F1 score [18, 19]. The formulas used to compute these parameters, along with their reference values, are shown in Figs. 8 and 9; the comparison of our method is given in Table 6, and the individual class performance metrics are shown in Table 7.
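The metric formulas above can be assembled into a single helper; the function name and the per-class counts passed in are invented for illustration:

```python
def performance(tp, tn, fp, fn):
    """Per-class measures assembled from the confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),               # TPR / recall
        "specificity": tn / (tn + fp),               # TNR
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),                       # negative predictive value
        "fpr": fp / (tn + fp),                       # false alarm rate
        "fnr": fn / (tp + fn),                       # miss rate
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

m = performance(tp=45, tn=1270, fp=3, fn=3)   # invented counts for one class
```

Note that FPR and specificity sum to 1, as do FNR and sensitivity, which matches the reference values in the table above.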
6 Conclusion
Rows 20–28 of the 28-class confusion matrix:
20  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 45 3 0 0 0 0 0 0 0
21  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 44 1 0 0 0 0 0 0
22  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 46 0 0 0 0 0 0
23  0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 45 0 0 0 0 0
24  0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 42 4 0 0 0
25  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 43 3 0 0
26  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 46 0 0
27  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 44 2
28  0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 43
References
1. Shanthi, T., Sabeenian, R. S., Manju, K., Paramasivam, M. E., Dinesh, P. M., & Anand, R.
(2021). Fundus image classification using hybridized GLCM features and wavelet features.
ICTACT Journal of Image and Video Processing, 11(03), 2345–2348.
2. Veni, S., Anand, R., & Vivek, D. (2020). Driver assistance through geo-fencing, sign board
detection and reporting using android smartphone. In: K. Das, J. Bansal, K. Deep, A. Nagar, P.
Pathipooranam, & R. Naidu (Eds.), Soft computing for problem solving. Advances in Intelligent
Systems and Computing (Vol. 1057). Singapore: Springer.
3. Kylberg, G. The Kylberg Texture Dataset v. 1.0, Centre for Image Analysis, Swedish University
of Agricultural Sciences and Uppsala University, External report (Blue series) No. 35.
Available online at: http://www.cb.uu.se/gustaf/texture/
4. Anand, R., Veni, S., & Aravinth, J. (2016) An application of image processing techniques for
detection of diseases on brinjal leaves using k-means clustering method. In 2016 International
Conference on Recent Trends in Information Technology (ICRTIT). IEEE.
5. Sabeenian, R. S., & Palanisamy, V. (2009). Texture-based medical image classification of
computed tomography images using MRCSF. International Journal of Medical Engineering
and Informatics, 1(4), 459.
6. Sabeenian, R. S., & Palanisamy, V. (2008). Comparison of efficiency for texture image classification using MRMRF and GLCM techniques. International Journal of Computers Information Technology and Engineering (IJCITAE), 2(2), 87–93.
7. Haralick, R. M., Shanmugam, K., & Dinstein, I. H. (1973). Textural features for image
classification. IEEE Transactions on Systems, Man, and Cybernetics, 6, 610–621.
8. Varma, M., & Zisserman, A. (2005). A statistical approach to texture classification from single
images. International Journal of Computer Vision, 62(1–2), 61–81.
9. Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers.
Neural Processing Letters, 9(3), 293–300.
10. Shanthi, T., Sabeenian, R. S. (2019). Modified AlexNet architecture for classification of
diabetic retinopathy images. Computers and Electrical Engineering, 76, 56–64.
11. Sabeenian, R. S., Paramasivam, M. E., Selvan, P., Paul, E., Dinesh, P. M., Shanthi, T., Manju,
K., & Anand, R. (2021). Gold tree sorting and classification using support vector machine
classifier. In Advances in Machine Learning and Computational Intelligence (pp. 413–422).
Singapore: Springer.
12. Shobana, R. A., & Shanthi, D. T. (2018). GLCM based plant leaf disease detection using
multiclass SVM. International Journal For Research & Development In Technology, 10(2),
47–51.
13. Scholkopf, B., & Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press.
14. Bennett, K. P., & Demiriz, A. (1999). Semi-supervised support vector machines. In Advances
in Neural Information Processing Systems (pp. 368–374).
15. Anand, R., Shanthi, T., Nithish, M. S., & Lakshman, S. (2020). Face recognition and
classification using GoogLeNET architecture. In: Das, K., Bansal, J., Deep, K., Nagar, A.,
Pathipooranam, P., & Naidu, R. (Eds.), Soft computing for problem solving. Advances in
Intelligent Systems and Computing (Vol. 1048). Singapore: Springer.
16. Shanthi, T., Sabeenian, R. S., & Anand, R. (2020). Automatic diagnosis of skin diseases using
convolution neural network. Microprocessors and Microsystems, 76, 103074.
17. Hall-Beyer, M. (2000). GLCM texture: A tutorial. In National Council on Geographic
Information and Analysis Remote Sensing Core Curriculum 3.
18. Shanthi, T., Anand, R., Annapoorani, S., & Birundha, N. (2023). Analysis of phonocardiogram
signal using deep learning. In D. Gupta, A. Khanna, S. Bhattacharyya, A. E. Hassanien,
S. Anand, & A. Jaiswal (Eds.), International conference on innovative computing and
communications (Lecture notes in networks and systems) (Vol. 471). Springer. https://doi.org/
10.1007/978-981-19-2535-1_48
19. Kandasamy, S. K., Maheswaran, S., Karuppusamy, S. A., Indra, J., Anand, R., Rega, P., &
Kathiresan, K. (2022). Design and fabrication of flexible Nanoantenna-based sensor using
graphene-coated carbon cloth. Advances in Materials Science & Engineering.
Progress in Multimodal Affective
Computing: From Machine Learning
to Deep Learning
1 Introduction
Emotions and sentiments play a significant role in our day-to-day lives. They help in decision-making, learning, communication, and handling situations. Affective computing is a technology that aims to detect, perceive, interpret, process, and replicate emotions from given data sources using different types of techniques. The word "affect" is a synonym for "emotion." Affective computing technology is a human-computer interaction system that analyses the data captured through cameras, microphones, and sensors and infers the user's emotional state. Advances in signal processing and AI have led to the use of affective computing in medicine, industry, and academia alike for detecting and processing affective information from data sources [5]. Emotions can be recognized either from one type of data or from more than one type of data. Hence, affective computing can be classified broadly into two types: unimodal affective computing and multimodal affective computing. Figure 1 depicts an overview of affective computing.
Unimodal systems are those in which emotions are recognized from a single
type of data. Generally, human beings rely on multimodal information more than
unimodal information, because one can better understand a person's intention by
watching his or her facial expression while he or she is speaking. In this case, the
audio and video data together provide more information than either type of data
alone. For example, during an online class, the teacher can interpret
M. Chanchal ()
Department of Computer Science and Engineering, Amrita School of Engineering, Coimbatore,
Amrita Vishwa Vidyapeetham, Coimbatore, India
B. Vinoth Kumar
Department of Information Technology, PSG College of Technology, Coimbatore, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 127
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in
Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_6
more accurately whether the students have understood the class by both looking at
the students' expressions and asking for their feedback, rather than by asking only
for their feedback.

Fig. 1 Overview of affective computing: data sources from a human (image, audio,
video, and physiological signals) feed the stages of sensing the human affect
response, recognizing the affect response, understanding and modelling affect, and
emotive affect expression.
The way people express their opinion varies from person to person. One person
may express his or her opinion more verbally, while another may express it through
facial expression [12]. Thus, a model that can interpret emotion for any type of
person is required, and this is where multimodal affective computing plays a major
role. Unimodal systems are the building blocks of multimodal systems. A
multimodal system outperforms a unimodal one because more than one type of
data is used for interpretation. The multimodal affective computing structure is
presented in Fig. 2.
To date, only very limited survey analyses have been done on multimodal affective
computing, and previous studies do not concentrate specifically on machine
learning and deep learning approaches. With the advancement in AI techniques, a
number of machine learning and deep learning algorithms can be applied to
multimodal affective computing. The objective of this chapter is to provide a clear
idea of the various machine learning and deep learning methods used for
multimodal affect computing. In addition, details about the various datasets,
modalities, and fusion techniques are elaborated. The remaining part
of this chapter is organized as follows: Sect. 2 presents the available datasets,
Sect. 3 elaborates the various features used for affect recognition, Sect. 4 explains
the various fusion techniques, Sect. 5 describes the machine learning and deep
learning techniques for multimodal affect recognition, Sect. 6 provides a
discussion, and finally, Sect. 7 concludes the chapter.
2 Available Datasets
In the literature, two types of datasets are found: publicly available datasets and
datasets collected from subjects based on a predecided concept. In the latter,
subjects are selected based on the tasks that need to be performed, and the
respective data are collected for further processing. This section describes the
publicly available datasets for multimodal affective computing. The various kinds
of datasets for multimodal affect analysis are discussed in Table 1.
The DEAP dataset [1] was collected from 32 subjects who watched 40
emotion-stimulating video clips, each 1 min long. Based on that, EEG signals and
peripheral physiological signals (PPS) were captured; the PPS include both
electromyographic (EMG) and EOG data. It annotated four affective dimensions,
namely valence, arousal, liking, and dominance, each on a scale of 1–9.
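As an illustration, DEAP-style 1–9 self-assessment ratings are commonly binarized at the scale midpoint before classification. The sketch below reflects that common preprocessing convention, not anything described in this chapter:

```python
def binarize_ratings(ratings, midpoint=5.0):
    # Hypothetical preprocessing: map each 1-9 self-assessment rating to a
    # low (0) / high (1) class by thresholding at the scale midpoint.
    return {dim: [1 if v > midpoint else 0 for v in vals]
            for dim, vals in ratings.items()}
```

This turns the regression-style dimensional labels into binary targets, which is how many of the classification accuracies reported later in this chapter are obtained.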
The AMIGOS database [4] was collected from subjects in two different
experimental settings, mainly for mood, personality, and affect research. In the
first setting, 40 subjects watched 16 short videos, each lasting between 51 and
150 s. In the second setting, some subjects watched four long videos in different
scenarios, both individually and in groups. Wearable sensors were used to capture
the EEG, ECG, and GSR signals in this dataset. It also contains face and depth
data that were collected using separate equipment.
male and 41.6% female. The duration of the video clips ranges from 1 to 19 s,
with an average duration of 3.3 s.
The SEED IV dataset [30] contains four annotated emotions: happy, sad, fear, and
neutral. Forty-four subjects were used, of whom 22 were female college students.
While watching the film clips, they were asked to assess their emotions as sad,
happy, fear, or neutral, with ratings from −5 to 5 on two dimensions: arousal and
valence. The valence scale ranges from sad to happy, and the arousal scale ranges
from calm to excited. In the end, 72 film clips that had the highest match among
the subjects were selected, each 2 min long.
It is a subset of the audiovisual depression language corpus [9]. The dataset has
300 video clips, which were recorded using Web cameras and microphones while
people were interacting with computers. One to four recordings of each subject
were taken, with a gap of 2 weeks between recordings. The length of the video
clips is between 6 s and 4 min. The dataset contains subjects aged 18–63, with an
average age of 31.5 years. The BDI-II depression scale ranges from 0 to 63, where
0–10 is normal, 11–16 is mild depression, 17–20 is borderline, 21–30 is moderate
depression, 31–40 is severe depression, and above 40 is extreme depression. The
highest score recorded was 45.
The Sentiment Analysis in the Wild (SEWA) dataset [26] contains audio and video
recordings collected from Web cameras and microphones, annotated with natural
emotion dimensions such as arousal and valence. It includes a total of 64 subjects
with ages ranging from 18 to 60 years, split into a training set of 36 subjects, a
validation set of 14 subjects, and a testing set of 16 subjects. The subjects were
paired (a total of 32 pairs), made to watch commercial videos, and asked to discuss
the content of the video with their partner for at most 3 min. The dataset includes
text, audio, and video data. Six German-speaking annotators (three males and
three females) annotated the dataset for arousal and valence.
The AVEC 2018 dataset is an extension of the AVEC 2017 database [7]. AVEC
2017 is like the SEWA dataset of German culture, with 64 subjects: 36 for
training, 14 for validation, and 16 for testing. In the AVEC 2018 dataset, the
testing set is extended with new subjects of Hungarian culture in the same age
range as the German subjects. This dataset includes both audio and video
recordings. Three dimensions were annotated: arousal, valence, and preference for
the commercial products, each on a scale ranging from −1 to +1. The duration of
the recordings is 40 s to 3 min, and the emotions were annotated every 100 ms.
The Distress Analysis Interview Corpus depression dataset [22] includes clinical
interviews used for the diagnosis of psychological conditions such as anxiety,
depression, and posttraumatic stress disorder. It includes audio and video
recordings and questionnaire responses from interviews conducted by a virtual
interviewer called Ellie, controlled by a human interviewer in another room. It
contains 189 interview sessions. Each session contains the audio file of the
interview, the 68 facial points of the subject, Histogram of Oriented Gradients
(HoG) facial features, head pose, eye features, a file of continuous facial actions,
and a file containing the subject's voice and the interview transcript. All features
except the transcript file are time-series data.
The University of Virginia (UVA) Toddler dataset [23] has 192 videos, each
45–60 min long, collected from 61 child care centers with toddlers 2–3 years old.
The videos were recorded using a digital camera with an integrated microphone.
Each video covers a day of preschool, including individual and group activities,
outdoor play, and shared meals, with activities such as singing, reading, and
playing with blocks and toys. Each session includes an average of 1.7 teachers and
7.59 students. The dataset includes video, audio along with background noise, and
head and body pose.
The Measures of Effective Teaching (MET) dataset [23] is a Classroom
Assessment Scoring System (CLASS)-coded video dataset. It includes 16,000
videos of 3,000 teachers teaching language, mathematics, arts, and science in both
middle and elementary schools across six districts of the USA. The data were
collected using 360° cameras with integrated microphones, placed in the center of
each classroom to capture both the teachers and the students properly.
Fig. 3 Modality categories acquired from a subject: physiological modalities
(EEG, ECG, GSR, PPS) and behavioral modalities (audio, video, textual, facial
expression).
computing. The modalities acquired from a subject fall into two broad categories:
physiological and behavioral. In this section, the primary focus is on the detection
of audio, visual, textual, facial expression, and biological signals, along with the
corresponding techniques. Of these modalities, audio, visual, textual, and facial
expression fall under the behavioral category, while the biological signals are the
physiological modalities. Figure 3 shows the modality categories.
Audio is one medium for capturing the emotions of a user. The openSMILE
toolkit is a popular tool for extracting audio features such as pitch, utterance
intensity, bandwidth, pause duration, and perceptual linear predictive (PLP)
coefficients [19]. Mel-frequency cepstral coefficients (MFCC) [28] are the most
popular audio features. Nowadays, deep neural networks are increasingly used for
better extraction of audio features.
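As a rough sketch of how MFCC features are computed (framing, power spectrum, mel filterbank, log compression, DCT), the NumPy-only implementation below may help. The parameter values (frame size, hop, filter count) are illustrative assumptions; a production system would use a library such as openSMILE or librosa:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    # Frame the signal (assumes len(signal) >= n_fft) and apply a Hann window.
    frames = np.array([signal[s:s + n_fft] * np.hanning(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank spanning 0 .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, ce, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, ce):
            fbank[i - 1, k] = (k - lo) / max(ce - lo, 1)
        for k in range(ce, hi):
            fbank[i - 1, k] = (hi - k) / max(hi - ce, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II over the mel axis keeps the first n_mfcc coefficients.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return log_mel @ basis.T  # shape: (n_frames, n_mfcc)
```

The resulting per-frame coefficient vectors are what downstream classifiers in the surveyed systems typically consume, often after averaging over the utterance.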
DenseNet, VGG Face, MobileNet, and HRNet can also be used to extract better
visual features [25].
Textual features play a vital role in affect computing. They are of two types:
Crowd-Sourced Annotation (CSA) and DISfluency and Non-verbal Vocalization
(DIS-NV). DIS-NV features are obtained by manual annotation. CSA features are
extracted by removing stop words like "a," "and," "the," etc. and then lemmatizing
the remaining words using the Natural Language Toolkit [19]. Part-of-Speech
(PoS) tags, n-gram features, and TF-IDF (Term Frequency-Inverse Document
Frequency) are useful features for emotion recognition. Google English word
embeddings (Word2Vec) [12] and Global Vectors (GloVe) are also used to extract
textual features.
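The CSA-style preprocessing and TF-IDF weighting described above can be sketched as follows; the stop-word set here is a tiny illustrative subset (NLTK provides a full list), and lemmatization is omitted for brevity:

```python
import math
from collections import Counter

# Illustrative stop-word subset; NLTK supplies a much fuller list.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to"}

def tfidf(docs):
    # CSA-style preprocessing: lowercase, tokenize, drop stop words.
    toks = [[w for w in d.lower().split() if w not in STOP_WORDS] for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for t in toks for w in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        scores.append({w: (c / len(t)) * math.log(n_docs / df[w])
                       for w, c in tf.items()})
    return scores
```

Words that occur in every document (such as "sat" in the test below) get a weight of zero, which is exactly the down-weighting of uninformative terms that makes TF-IDF useful for emotion recognition.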
Facial expressions can be captured using the AdaBoost algorithm with Haar-like
features [8]. The Chehra algorithm can be used to locate the facial points in an
image frame [11]. From these facial points, face features can be further extracted
using the Facial Action Unit (AU) recognition algorithm [17]. LBP-TOP can be
used to extract face features from pictures [14]. Facial landmarks can be detected
in an image using OpenFace, an open-source tool [22].
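To make the LBP idea concrete, the sketch below computes basic 8-neighbour Local Binary Pattern codes and a histogram feature for a single grayscale frame. LBP-TOP extends the same operator to three orthogonal planes of a video volume; this is an illustrative implementation, not the one used in [14]:

```python
import numpy as np

def lbp(image):
    # 8-neighbour LBP codes for the interior pixels of a grayscale frame:
    # each neighbour >= centre contributes one bit to an 8-bit code.
    h, w = image.shape
    center = image[1:-1, 1:-1]
    code = np.zeros((h - 2, w - 2), dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = image[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (neigh >= center).astype(np.uint8) << bit
    return code

def lbp_histogram(image):
    # Normalized 256-bin histogram of LBP codes, used as a texture feature.
    hist = np.bincount(lbp(image).ravel(), minlength=256)
    return hist / hist.sum()
```

The histogram, rather than the raw code image, is what is normally fed to a classifier, since it is robust to small misalignments of the face region.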
Multimodal affect computing involves the fusion of the various captured
modalities. To perform the analysis, the modalities are combined using various
fusion techniques. Fusing multiple data sources provides richer information and
thus achieves
results with better accuracy. There are two main levels of fusion: feature-level
fusion (early fusion) and decision-level fusion (late fusion). There are also other
fusion techniques, such as hierarchical fusion, model-level fusion, and score-level
fusion. This section describes the various fusion techniques.
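The two main fusion levels can be sketched in a few lines; the functions below are illustrative of the general pattern only, not of any specific system surveyed in this chapter:

```python
import numpy as np

def early_fusion(audio_feat, video_feat):
    # Feature-level (early) fusion: concatenate per-modality feature
    # vectors into one joint vector before a single classifier sees it.
    return np.concatenate([audio_feat, video_feat])

def late_fusion(audio_probs, video_probs, w_audio=0.5):
    # Decision-level (late) fusion: each modality gets its own classifier;
    # their class-probability outputs are combined afterwards.
    fused = (w_audio * np.asarray(audio_probs)
             + (1 - w_audio) * np.asarray(video_probs))
    return fused / fused.sum()
```

Early fusion lets the classifier model cross-modal interactions but requires synchronized features; late fusion is simpler and tolerates missing modalities, which is why the surveyed papers so often compare the two.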
Hierarchical fusion techniques use different multimodal feature sets at different
levels of a hierarchy [19]. For example, features such as perceived emotion
annotation types are used in the lower layers of a model, whereas abstract features
such as text, audio, or video features are used in the higher layers. This method
fuses two-stream networks at different levels of the hierarchy to improve the
performance of emotion recognition.
Eun-Hye Jang et al. [10] proposed a method for fear-level detection using
physiological measures such as skin conductance level and response (SCL, SCR),
heart rate (HR), pulse transit time (PTT), fingertip temperature (FT), and
respiratory rate (RR). The task was performed on data collected from 230 subjects
who were asked to watch fear-inducing video clips. Correlation and linear
regression among the physiological measures were performed to check the fear
intensity, using ML techniques such as the nonparametric Spearman's rank
correlation coefficient. SCR and HR were positively correlated with the fear
intensity, whereas SCL, RR, and FT were negatively correlated. The method
showed an accuracy of 92.5% on fear-inducing clips. Oana Balan et al. [2]
proposed an automated fear-level detection and acrophobia virtual therapy system.
It used galvanic skin response (GSR), heart rate (HR), and electroencephalography
(EEG) values from subjects who played an acrophobia game while undergoing in
vivo therapy and virtual reality therapy. Two classifiers were used: one to
determine the present fear level and another to determine the game level to be
played next. ML techniques such as Support Vector Machine, Random Forest,
k-Nearest Neighbors,
Table 2 (continued)
Reference | Dataset | Feature used | Models used | Result obtained
Sandeep Nallan Chakravarthula et al. [3] | Data collected from 62 couples | Acoustic, lexical, and behavioral features | SVM | Recall% better by 13–20%
Nathan L. Henderson et al. [6] | Data collected from 119 subjects | Posture data and electrodermal activity data | SVM and neural network | Kappa score of multimodal was better
Papakostas M et al. [21] | Data collected from 45 subjects | Visual and physiological information | RF, Gradient Boosting classifier, and SVM classifier | F1 score of multimodal was better
Anand Ramakrishnan et al. [23] | UVA toddler dataset and MET dataset | Audio features and face images of both the teacher and students | ResNet, Pearson correlation, and Spearman correlation | ResNet correlation values: positive – 0.55, negative – 0.63; Pearson correlation: positive – 0.36, negative – 0.41; Spearman correlation: positive – 0.48, negative – 0.53
Dongmin Shin et al. [24] | Data collected from 30 subjects | EEG and ECG signals | Bayesian network (BN), SVM, and MLP | BN had the highest accuracy (98.56%)
SCL, SCR skin conductance level and response, HR heart rate, PTT pulse transit time, FT fingertip
temperature, RR respiratory rate, GSR galvanic skin response, EEG electroencephalography, ECG
electrocardiographic, SVM Support Vector Machine, RF Random Forest, KNN k-Nearest Neighbors
Linear Discriminant Analysis, and four deep neural network models were used.
The models were compared based on accuracy: the DNN model had the highest
accuracy of 79.12% for the player-independent modality, and SVM had an
accuracy of 89.5% for the player-dependent modality on two scales, whereas for
four scales, the highest accuracies were obtained by KNN (52.75%) and SVM
(42.5%).
Seul-Kee Kim et al. [13] proposed a method to determine the fear of crime using
multimodality, based on data collected from subjects who were shown clips of
real pedestrian environments. Electroencephalographic (EEG), electrocardiographic
(ECG), and galvanic skin response (GSR) signals were used as features. To
compare the difference in fear of crime between the two groups (i.e., the Low Fear
of crime Group (LFG) and the High Fear of crime Group (HFG)), techniques such
as independent t-tests or Mann-Whitney U tests were used. To compare the fear of
crime across the video clips shown to the subjects, ANOVAs or Kruskal-Wallis
tests were used. The values were compared at a significance level of p < 0.05.
Cheng-Hung Wang et al. [27] proposed a method for multimodal
Table 3 (continued)
Reference | Dataset | Feature used | Models used | Result obtained
Wei-Long Zheng et al. [30] | Data collected from 44 subjects | EEG and eye movements | Bimodal deep auto-encoder (BDAE) and SVM | Accuracy – 85.11%
Shiqing Zhang et al. [29] | RML database, eNTERFACE05 database, and BAUM-1s database | Audio and video features | 3D-CNN with DBN | Accuracy: RML database (80.36%), eNTERFACE05 database (85.97%), and BAUM-1s (54.57%)
Eesung Kim et al. [12] | IEMOCAP dataset | Acoustic and lexical features | Deep neural network (DNN) | WAR – 66.6, UAR – 68.7
Huang Jian et al. [7] | AVEC 2018 dataset | Visual, acoustic, and textual features | LSTM-RNN | Arousal – 0.599–0.524, Valence – 0.721–0.577, Liking – 0.314–0.060
Qureshi et al. [22] | DAIC-WOZ depression dataset | Acoustic, visual, and textual features | Attention-based fusion network with deep neural network (DNN) | Accuracy – 60.61%
Luntian Mou et al. [18] | Data collected from 22 subjects | Eye features and vehicle and environmental data | Attention-based CNN-LSTM network | Accuracy – 95.5%
Panagiotis Tzirakis et al. [26] | SEWA dataset | Text, audio, and video features | Attention-based fusion strategies | Arousal – 69%, Valence – 78.3%
GSR galvanic skin response, EEG electroencephalography, ECG electrocardiographic, SVM Support
Vector Machine, LSTM Long Short-Term Memory, DBN Deep Belief Network, RNN Recurrent
Neural Network, SVR Support Vector Regression
emotion computing for a tutoring system. It used textual features and facial
expressions collected from 136 subjects. The technique used was the t-test, and
Cohen's d was used to determine the effect size. This model compared the test
results of a normal Internet teaching group with those of an affective teaching
group. A pretest and a posttest were conducted for both groups, and it was found
that the posttest of the emotional teaching group produced a moderate-to-high
effect size (0.71) and a near-significant value.
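Cohen's d, the effect-size measure used in the study above, can be computed from two independent groups as follows (a generic textbook sketch, not the authors' code):

```python
import math
from statistics import mean, stdev

def cohens_d(group1, group2):
    # Effect size: difference of group means divided by the pooled
    # sample standard deviation.
    n1, n2 = len(group1), len(group2)
    pooled = math.sqrt(((n1 - 1) * stdev(group1) ** 2 +
                        (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2))
    return (mean(group1) - mean(group2)) / pooled
```

By the usual convention, d around 0.5 is a medium effect and 0.8 or above a large one, which is how the reported value of 0.71 is read as "moderate to high."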
Jose Maria Garcia-Garcia et al. [5] proposed a multimodal affect computing
method to improve the user experience of an educational software application.
Facial expressions, keystrokes, and speech were the features used. The t-test was
used to compare the means of all the datasets and to test the null
hypothesis. The test was done using two types of systems: one with an emotion
recognition application and one without. The System Usability Scale (SUS) score
was used to determine which system performs better, and it was found that the
system with emotion recognition had a better SUS score, required fewer attempts
(60% less), and used less help. Shahla Nemati et al. [20] proposed a hybrid latent
space data fusion technique for emotion recognition. Video, audio, and text
features from the DEAP dataset were used, with SVM and Naive Bayes as
classifiers. Feature-level fusion and decision-level fusion are employed in this
model using Marginal Fisher Analysis (MFA), cross-modal factor analysis (CFA),
and canonical correlation analysis (CCA). In feature-level fusion, the SVM
classifier outperforms the Naive Bayes classifier, but in decision-level fusion, the
result mainly depends on the type of classifier.
Javier Marín-Morales et al. [16] proposed a method for emotion recognition using
brain and heartbeat dynamics, namely electroencephalography (EEG) and heart
rate variability (HRV). The data were collected from a total of 60 subjects.
SVM-RFE and leave-one-subject-out (LOSO) cross-validation techniques were
used for emotion recognition. Two predictions were made, one for arousal and
another for valence, with features extracted from HRV, EEG band power, and
EEG MPS. The arousal dimension attained an accuracy of 75%, and valence an
accuracy of 71.21%. Li Ya et al. [14] proposed a multimodal emotion recognition
challenge using the audio and video features of the CHEAVD 2.0 dataset. An
SVM classifier was used for emotion recognition, and two fusion techniques were
compared: decision-level fusion and feature-level fusion. Of the two, decision-level
fusion, with 35.7% in MAP, was better than feature-level fusion, which had only
21.7% in MAP. Its results were also compared with the individual feature
predictions: audio alone and video alone achieved 39.2% and 21.7%, respectively.
Deger Ayata et al. [1] proposed a music recommendation system based on
emotions, using the galvanic skin response (GSR) and photoplethysmography
(PPG) signals obtained from the 32 subjects of the DEAP emotion dataset.
Features are extracted from these signals and fused using a feature-level fusion
technique. The classifiers are fed with the feature vector to obtain the arousal and
valence values; KNN, Random Forest, and decision tree methods are used for
emotion identification. The arousal and valence accuracies were compared using
only the GSR signal, only the PPG signals, and the multimodal features, and it
was found that the fused method had better accuracy for both arousal (72.06%)
and valence (71.05%). Asim Jan et al. [9] proposed a method for automatic
depression-level analysis using audio and visual features. Two methods were
compared. The Feature Dynamic History Histogram (FDHH) algorithm is a fusion
technique that produces a dynamic feature vector. The Motion History Histogram
(MHH) is used to obtain the features of the visual data, which are then fused with
the audio data. PLS regression and LR techniques were used to determine the
correlation between the feature space and the depression scale. On comparison,
FDHH was better, with lower MAE and RMSE values.
Sandeep Nallan Chakravarthula et al. [3] proposed suicidal risk prediction among
military couples using the conversations between the couples. It used acoustic,
lexical, and behavioral features from the couples' conversations, collected from a
total of 62 couples (124 people). The model was used to check three scenarios:
none, ideation, and attempt. The recall% of the proposed system was 13–20%
better than chance. Principal Component Analysis (PCA) was performed to retain
only the important features, and a Support Vector Machine was used as the
classifier for risk prediction.
Nathan L. Henderson et al. [6] proposed a method for affect detection in a
game-based learning environment. Posture data were used along with Q-sensors to
obtain electrodermal activity (EDA) data. The data were collected from a total of
119 subjects who were involved in TC3Sim training. Two types of fusion were
tested: feature-level and decision-level fusion. Based on these data, classifiers such
as Support Vector Machine and a neural network were used to determine the
students' affective states. The results were compared using the Kappa score for
EDA data only, posture data only, and the multimodal data. The classifier
performance improved when the EDA data were combined with the posture data.
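The Kappa score used for this comparison is Cohen's kappa, the agreement between predicted and true labels corrected for chance; a minimal sketch:

```python
import numpy as np

def cohens_kappa(y_true, y_pred):
    # Build the confusion matrix over the observed label set.
    labels = sorted(set(y_true) | set(y_pred))
    idx = {lab: i for i, lab in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)))
    for t, p in zip(y_true, y_pred):
        cm[idx[t], idx[p]] += 1
    total = cm.sum()
    p_obs = np.trace(cm) / total                      # observed agreement
    p_exp = (cm.sum(axis=1) @ cm.sum(axis=0)) / total ** 2  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)
```

A kappa of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean worse than chance, which makes it a stricter comparison metric than raw accuracy on imbalanced affect labels.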
Papakostas M et al. [21] proposed a method for understanding and categorizing
driving distraction using visual and physiological information. The data were
collected from 45 subjects who were exposed to four different distractions (three
cognitive and one physical). Both early fusion and late fusion were tested, for two
classes (mental and physical distraction) and four classes (texting, cognitive task,
listening to the radio, and GPS interaction). The two-class and four-class test
results were compared for visual features alone, physiological features alone, early
fusion, and late fusion. Classifiers such as Random Forest (RF) with about 100
decision trees, a Gradient Boosting classifier, and SVM classifiers with linear and
RBF kernels were used to determine the driver distraction. In both the two-class
and four-class settings, the visual-features-only performance was about 15% lower
in F1 score and thus cannot be used in stand-alone mode.
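The F1 score used in these comparisons is the harmonic mean of precision and recall; a minimal binary-class sketch:

```python
def f1_score(y_true, y_pred, positive=1):
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Because it balances precision against recall, F1 is a more informative comparison than accuracy when the distraction classes are unevenly represented.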
Anand Ramakrishnan et al. [23] proposed a method for automatic classroom
observation. It used audio features and face images of both the teacher and the
students from the UVA Toddler dataset and the MET dataset to determine the
positive and negative aspects of the Classroom Assessment Scoring System
(CLASS). A ResNet classifier was used, and Pearson and Spearman correlations
were used for evaluation. Using ResNet, the correlation values were 0.55 and 0.63
for the positive and negative aspects. The Pearson correlation resulted in values of
0.36 and 0.41 on the positive and negative aspects, respectively, for the UVA
dataset. On the MET dataset, the Spearman correlation was compared with the
Pearson correlation, and the Spearman values were better for both positive and
negative aspects (0.48 and 0.53). Dongmin Shin et al. [24] developed an emotion
recognition system using EEG and ECG signals. It recognized six types of
feelings: amusement, fear, sadness, joy, anger, and disgust. Noise was removed
from the signals to create the data table. The classifier used is the Bayesian
network (BN) classifier, which was compared with MLP and SVM. The accuracy
of all three classifiers was evaluated for EEG signals alone and for EEG combined
with ECG signals. The BN result on the multimodal data had the highest accuracy
of 98.56%, a 35.78% increase in accuracy.
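The Pearson and Spearman correlations used to evaluate these models can be sketched as follows; Spearman is simply Pearson applied to ranks, and this simple ranking assumes no tied values:

```python
import numpy as np

def pearson(x, y):
    # Pearson correlation between two equal-length sequences.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def spearman(x, y):
    # Spearman correlation: Pearson applied to the ranks of the data,
    # making it sensitive only to monotone relationships.
    rank = lambda v: np.argsort(np.argsort(np.asarray(v))).astype(float)
    return pearson(rank(x), rank(y))
```

Spearman's rank-based view explains why it can beat Pearson on the MET dataset: a model whose scores track the CLASS codes monotonically but nonlinearly still gets full credit.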
Michal Muszynski et al. [19] proposed a method for recognizing the emotions
induced when watching movies. Audiovisual features, lexical features, and
physiological reactions such as galvanic skin response (GSR) and ACCeleration
signals (ACC) from the LIRIS-ACCEDE dataset were used. To determine the
emotion from the multimodal signals, LSTM, DBN, and SVR models were
compared against each other for arousal and valence on the basis of MSE, Pearson
correlation coefficient (CC), and concordance correlation coefficient (CCC), for
both unimodal and multimodal inputs. LSTM outperformed SVR and DBN with
MSE (A – 0.260, V – 0.070), CC (A – 0.251, V – 0.266), and CCC (A – 0.111,
V – 0.143), where A stands for arousal and V for valence. Joaquim Comas et al.
[4] proposed a method for emotion recognition using facial and physiological
signals such as ECG, EEG, and GSR from the AMIGOS dataset. Deep learning
techniques such as the Convolutional Neural Network (CNN) are used for emotion
recognition. A bio multimodal network (BMMN) is used to estimate the affective
state from features extracted by a Bio Auto-encoder (BAE). Three networks were
tested: BMMN, which uses the features directly; BMMN-BAE1, which uses only
the latent features extracted by the BAE; and BMMN-BAE2, which uses the latent
features along with the essential features. The BMMN-BAE2 model outperformed
all the other models, with an accuracy of 87.53% for arousal and 65.05% for
valence.
Jiaxin Ma et al. [15] proposed an emotion recognition system using the EEG
signals and physiological signals of the DEAP dataset. The dataset was evaluated
with a deep LSTM network, a residual LSTM network, and a Multimodal Residual
LSTM (MM-ResLSTM) network for both the arousal and valence dimensions.
The MM-ResLSTM outperformed the other two methods with an accuracy of
92.87% for arousal and 92.30% for valence. The proposed method was also tested
against state-of-the-art methods such as SVM, MESAE, KNN, LSTM, BDAE, and
DCCA; among all the methods, MM-ResLSTM had the best accuracy. Panagiotis
Tzirakis et al. [26] proposed a method for emotion recognition using the audio and
video features of the RECOLA dataset. The audio and video features were
extracted using ResNet, and the extracted features were used for emotion
recognition with an LSTM network. The proposed model was compared against
other state-of-the-art methods such as the Output-Associative Relevance Vector
Machine Staircase Regression (OA RVM-SR) and the strength modeling system
proposed by Han et al. for both arousal and valence prediction. The proposed
method outperformed all other methods, scoring 78.9% for arousal and 69.1% for
valence using raw audio and video signals, and 78.8% for arousal and 73.2% for
valence using raw audio signals with raw and geometric video signals.
Seunghyun Yoon et al. [28] proposed a multimodal speech emotion recognition
system using the text and audio features of the IEMOCAP dataset. The model
identifies four emotions: happy, sad, angry, and neutral. The Multimodal Dual
Recurrent Encoder (MDRE), containing two RNNs, is used for the prediction of
speech emotions. The proposed model is compared against the Audio Recurrent
Encoder (ARE) and the Text Recurrent Encoder (TRE) using the weighted
average precision (WAP) score. The MDRE model had a better WAP value of
0.718, with accuracy ranging from 68.8% to 71.8%. Trisha Mittal et al. [17]
proposed a multiplicative multimodal emotion recognition (M3ER) system that
uses facial, text, and speech features, evaluated on the IEMOCAP and
CMU-MOSEI datasets. Deep learning models were used for feature extraction to
remove ineffective signals, and finally an LSTM is used for emotion classification.
The results of M3ER were compared with existing SOTA methods using the F1
score and the Mean Accuracy (MA) score. Applying a modality check to
ineffective modalities of the dataset gives an increase of 2–5% in F1 and 4–5% in
MA, and adding a proxy feature regeneration step leads to a further increase of
2–7% in F1 and 5–7% in MA for the M3ER model, which was better than the
SOTA models.
Siddharth et al. [11] proposed multimodal affective computing using EEG, ECG,
GSR, and frontal videos of the subjects from the AMIGOS dataset. The features
are extracted using the CNN-VGG network. An Extreme Learning Machine
(ELM) with 10-fold cross-validation and a sigmoid function was used to learn the
dimensions of arousal, valence, liking, and dominance on a scale of 1–9. The
features were tested for emotion classification both individually and as
multimodal combinations. Combining EEG and frontal videos gave an accuracy
of 52.51%, better than the accuracy obtained with the individual features, while
combining GSR and ECG gave an accuracy of 38.28%. Wei-Long Zheng et al.
[30] proposed a model for emotion recognition using EEG signals and eye
movements, with data collected from 44 subjects. A bimodal deep auto-encoder
(BDAE) was used to extract the shared features of the EEG and eye movements,
using one Restricted Boltzmann Machine (RBM) for the EEG and another for the
eye movements. Finally, an SVM was used as the classifier for emotion
classification. The model was tested for accuracy on the individual features and
on the multimodal features. The multimodal approach had an accuracy of 85.11%,
better than EEG signals alone (70.33%) and eye movements alone (67.82%).
Shiqing Zhang et al. [29] proposed a method for emotion recognition using audio
and visual features. The model was tested on the RML database, the acted
eNTERFACE05 database, and the spontaneous BAUM-1s database. A CNN and a
3D-CNN are used to capture the audio and video features, respectively, and the
outputs of these networks are fed to a DBN-based fusion network to produce the
fused features. A linear SVM is used as the classifier for emotion classification.
The model was compared with unimodal features and with different fusion
methods such as feature-level, score-level, and FC fusion on all three datasets.
Among all, the proposed method with the DBN achieved the highest accuracy on
all three datasets: the RML database (80.36%), the eNTERFACE05 database
(85.97%), and BAUM-1s (54.57%). Eesung Kim et al. [12] proposed a method for
emotion recognition using the acoustic and lexical features of the IEMOCAP
dataset. Emotion recognition was compared using the weighted average recall
(WAR) and UAR. A deep neural network (DNN) is used both for feature
extraction and as the classifier.
146 M. Chanchal and B. Vinoth Kumar
The proposed model was compared with the results obtained using only lexical
features and also few state-of-the-art methods like LLD+MMFCC+BOWLexicon,
LLD+BOWCepstral+GSVmean+BOW+eVector, LLD+mLRF, and Hierarchical
Attention Fusion Model. Of all the models, the proposed model had the highest WAR and UAR values of 66.6 and 68.7, respectively. Huang Jian et al. [7] proposed a model for emotion recognition using visual, acoustic, and textual features taken from the AVEC 2018 dataset. The features are extracted and fused using both feature-level and decision-level fusion, and the two are compared. An LSTM-RNN is trained on the extracted features to perform emotion classification. The comparison covers unimodal variants (visual only, audio only, and textual only), but the multimodal features gave better prediction of emotions such as arousal, valence, and liking. The German part of the dataset performed well with the proposed multimodal model, with values of 0.599–0.524 for arousal, 0.721–0.577 for valence, and 0.314–0.060 for liking; for the Hungarian part, the textual features performed best. Qureshi et al. [22] proposed a method for estimation
of depression level in an individual using multiple modalities: acoustic, visual, and textual features. The features were extracted from the DAIC-WOZ depression dataset. An attention-based fusion network is used, and a deep neural network (DNN) classifies depression on the PHQ-8 score scale. RMSE, MAE, and accuracy were used to evaluate Depression Level Regression (DLR) and Depression
Level Classification (DLC). To test the multimodality, two models based on single-task representation learning (ST-DLR-CombAtt and ST-DLC-CombAtt) and two based on multitask representation learning (MT-DLR-CombAtt and MT-DLC-CombAtt) were used. The multimodal approach had a better classification accuracy of 60.61%.
Luntian Mou et al. [18] proposed a model for determining driver stress level using eye features together with vehicle and environmental data. The data were collected from a total of 22 subjects, and the stress level was classified into three classes: low, medium, and high. An attention-based CNN-LSTM network is used as the classifier. The proposed model is compared with other state-of-the-art multimodal methods that use handcrafted features, and also with some unimodal methods. The attention-based CNN-LSTM network outperformed all the other
state-of-the-art methods with an accuracy of 95.5%. Panagiotis Tzirakis et al. [26] proposed affect computing using the text, audio, and video features from the SEWA dataset. The Concordance Correlation Coefficient (ρc) was used as the evaluation metric, measuring the agreement between predictions and ground truth through their correlation and mean square difference. The proposed model was evaluated on the SEWA dataset with each single feature alone and with different fusion strategies, namely concatenation, hierarchical attention, self-attention, residual self-attention, cross-modal self-attention, and cross-modal hierarchical self-attention. The two emotions tested for are arousal and valence. The model was also compared with a few state-of-the-art methods and outperformed them for text, visual, and multimodal features.
Progress in Multimodal Affective Computing: From Machine Learning to Deep. . . 147
6 Discussion
The advancement in human computer interaction has led to the use of multimodal
analysis from unimodal analysis for affective computing. Also, the use of more
modality for affect detection can be more appropriate rather than using only single
feature. Earlier, only still images were used for affective computing. Nowadays,
the advancement in technology has led to the usage of audio and video formats for
affect detection. From the above study, it can be seen that a number of datasets are
available for audio, video, and textual data. But very few datasets are available for
biological signals. Most of the biological signals are obtained by getting the data
directly from the subjects based on the experiment that needs to be performed. In
multimodal affective computing, fusion techniques play a major role. A number of
fusion techniques has been discussed in the above section. Since more than one
modality are used in multimodal affective computing, fusion techniques are applied
to those technique. The fusion techniques are determined based on the dataset and
the model that is selected. The most commonly used fusion technique is feature-
level fusion.
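The distinction between feature-level (early) and decision-level (late) fusion can be sketched abstractly. The modality names, feature dimensions, and class scores below are hypothetical placeholders, not values from any of the surveyed systems:

```python
def feature_level_fusion(audio_feats, video_feats):
    """Early fusion: concatenate per-modality feature vectors
    into one joint representation before classification."""
    return audio_feats + video_feats  # list concatenation

def decision_level_fusion(per_modality_scores):
    """Late fusion: average the class-probability scores that
    each unimodal classifier produced independently."""
    n = len(per_modality_scores)
    num_classes = len(per_modality_scores[0])
    return [sum(scores[c] for scores in per_modality_scores) / n
            for c in range(num_classes)]

# Hypothetical toy inputs: 3-dim audio and 2-dim video features,
# and two unimodal classifiers scoring 3 emotion classes.
fused_features = feature_level_fusion([0.2, 0.5, 0.1], [0.7, 0.3])
fused_decision = decision_level_fusion([[0.6, 0.3, 0.1],
                                        [0.4, 0.4, 0.2]])
```

Early fusion lets the classifier learn cross-modal correlations, while late fusion keeps each modality's classifier independent, which is why the choice depends on the dataset and model as noted above.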
One problem with the publicly available datasets is that they contain only posed or acted expressions, so choosing an appropriate dataset is a challenging task; in many cases, more naturalistic data are used. The features extracted from these data may be numerous, and hence feature reduction is essential: only nonredundant and relevant features are required for further processing, and reducing them increases the speed of the affect computation algorithm. Also, an appropriate classification algorithm needs to be selected based on the dataset.
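As a concrete illustration of such feature reduction, a simple variance filter can drop near-constant feature columns before classification. This is a generic, hypothetical sketch, not a method taken from any of the surveyed papers:

```python
def variance_filter(feature_matrix, threshold):
    """Keep only feature columns whose variance exceeds the threshold;
    near-constant (redundant) features carry little affective signal."""
    n = len(feature_matrix)
    dims = len(feature_matrix[0])
    keep = []
    for d in range(dims):
        col = [row[d] for row in feature_matrix]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        if var > threshold:
            keep.append(d)
    return [[row[d] for d in keep] for row in feature_matrix]

# Toy 3-sample matrix: the middle feature is constant and gets dropped.
reduced = variance_filter([[1.0, 5.0, 0.2],
                           [3.0, 5.0, 0.9],
                           [2.0, 5.0, 0.4]], threshold=0.01)
```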
Initially, a number of machine learning techniques were applied for affective computing, but advances in AI have led to the use of deep learning techniques. From our literature survey, it is clear that most studies on emotion recognition used modalities like audio, video, or textual data, whereas application-oriented studies such as stress-level detection, fear-level detection, or work in the education sector relied more on biological signals. This is because the physiological response of the human body determines these kinds of states much better than audio, video, and textual data alone. The physiological responses were collected using sensors. For affect computation, some methods used manually extracted features, whereas others used deep learning techniques for feature extraction.
As the survey demonstrates, there are various research challenges in multimodal affective computing. One important direction is to focus on application-oriented studies that could be helpful in real-world settings. Manually extracted features and features extracted by deep learning can be compared to determine which gives the better result in affect computation. Another aspect of future work is the use of biological signals for affect computation: biological signals reveal more about a person and can therefore be especially helpful in the medical field. If these biological signals are combined with other modalities, it would be a major advance in many areas of medical research.
7 Conclusion
This chapter gave a brief overview of affective computing and how emotions are recognized. Unimodal and multimodal affective computing were briefly introduced, and the available datasets, their modalities, and the emotions covered in each were surveyed. In addition, the various features used for affect recognition and the fusion techniques were elaborated. The machine learning and deep learning techniques for affect recognition were explained, along with a discussion of the features used and the techniques against which each proposed methodology was compared. A few challenges in this research field were also identified: using real-time datasets, investigating more thoroughly before capturing data, better understanding model selection, and extending the research to application-oriented studies.
References
1. Ayata, D., Yaslan, Y., & Kamasak, M. E. (2018). Emotion based music recommendation system
using wearable physiological sensors. IEEE Transactions on Consumer Electronics, 64(2),
196–203.
2. Bălan, O., Moise, G., Moldoveanu, A., Leordeanu, M., & Moldoveanu, F. (2020). An
investigation of various machine and deep learning techniques applied in automatic fear level
detection and acrophobia virtual therapy. Sensors, 20(2), 496.
3. Chakravarthula, S. N., Nasir, M., Tseng, S. Y., Li, H., Park, T. J., Baucom, B., et al. (2020,
May). Automatic prediction of suicidal risk in military couples using multimodal interaction
cues from couples conversations. In ICASSP 2020–2020 IEEE international conference on
Acoustics, Speech and Signal Processing (ICASSP) (pp. 6539–6543). IEEE.
4. Comas, J., Aspandi, D., & Binefa, X. (2020, November). End-to-end facial and physiological
model for affective computing and applications. In 2020 15th IEEE international conference
on Automatic Face and Gesture Recognition (FG 2020) (pp. 93–100). IEEE.
5. Garcia-Garcia, J. M., Penichet, V. M., Lozano, M. D., Garrido, J. E., & Law, E. L. C. (2018). Multimodal affective computing to enhance the user experience of educational software applications. Mobile Information Systems, 2018.
6. Henderson, N. L., Rowe, J. P., Mott, B. W., & Lester, J. C. (2019). Sensor-based data fusion for
multimodal affect detection in game-based learning environments. In EDM (workshops) (pp.
44–50).
7. Huang, J., Li, Y., Tao, J., Lian, Z., Niu, M., & Yang, M. (2018, October). Multimodal
continuous emotion recognition with data augmentation using recurrent neural networks. In
Proceedings of the 2018 on audio/visual emotion challenge and workshop (pp. 57–64).
8. Huang, Y., Yang, J., Liao, P., & Pan, J. (2017). Fusion of facial expressions and EEG for
multimodal emotion recognition. Computational Intelligence and Neuroscience, 2017, 1.
9. Jan, A., Meng, H., Gaus, Y. F. B. A., & Zhang, F. (2017). Artificial intelligent system for
automatic depression level analysis through visual and vocal expressions. IEEE Transactions
on Cognitive and Developmental Systems, 10(3), 668–680.
10. Jang, E. H., Byun, S., Park, M. S., & Sohn, J. H. (2020). Predicting individuals’ experienced
fear from multimodal physiological responses to a fear-inducing stimulus. Advances in
Cognitive Psychology, 16(4), 291.
11. Jung, T. P., & Sejnowski, T. J. (2018, July). Multi-modal approach for affective computing. In
2018 40th annual international conference of the IEEE Engineering in Medicine and Biology
Society (EMBC) (pp. 291–294). IEEE.
12. Kim, E., & Shin, J. W. (2019, May). Dnn-based emotion recognition based on bottleneck
acoustic features and lexical features. In ICASSP 2019-2019 IEEE international conference
on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6720–6724). IEEE.
13. Kim, S. K., & Kang, H. B. (2018). An analysis of fear of crime using multimodal measurement.
Biomedical Signal Processing and Control, 41, 186–197.
14. Li, Y., Tao, J., Schuller, B., Shan, S., Jiang, D., & Jia, J. (2018, May). Mec 2017: Multimodal
emotion recognition challenge. In 2018 first Asian conference on Affective Computing and
Intelligent Interaction (ACII Asia) (pp. 1–5). IEEE.
15. Ma, J., Tang, H., Zheng, W. L., & Lu, B. L. (2019, October). Emotion recognition using
multimodal residual LSTM network. In Proceedings of the 27th ACM International conference
on multimedia (pp. 176–183).
16. Marín-Morales, J., Higuera-Trujillo, J. L., Greco, A., Guixeres, J., Llinares, C., Scilingo, E.
P., et al. (2018). Affective computing in virtual reality: Emotion recognition from brain and
heartbeat dynamics using wearable sensors. Scientific Reports, 8(1), 1–15.
17. Mittal, T., Bhattacharya, U., Chandra, R., Bera, A., & Manocha, D. (2020, April). M3er:
Multiplicative multimodal emotion recognition using facial, textual, and speech cues. In
Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 02, pp. 1359–1367).
18. Mou, L., Zhou, C., Zhao, P., Nakisa, B., Rastgoo, M. N., Jain, R., & Gao, W. (2021). Driver
stress detection via multimodal fusion using attention-based CNN-LSTM. Expert Systems with
Applications, 173, 114693.
19. Muszynski, M., Tian, L., Lai, C., Moore, J., Kostoulas, T., Lombardo, P., et al. (2019).
Recognizing induced emotions of movie audiences from multimodal information. IEEE
Transactions on Affective Computing, 12, 36–52.
20. Nemati, S., Rohani, R., Basiri, M. E., Abdar, M., Yen, N. Y., & Makarenkov, V. (2019). A
hybrid latent space data fusion method for multimodal emotion recognition. IEEE Access, 7,
172948–172964.
21. Papakostas, M., Riani, K., Gasiorowski, A. B., Sun, Y., Abouelenien, M., Mihalcea, R.,
& Burzo, M. (2021, April). Understanding driving distractions: A multimodal analysis on
distraction characterization. In 26th international conference on Intelligent User Interfaces
(pp. 377–386).
22. Qureshi, S. A., Saha, S., Hasanuzzaman, M., & Dias, G. (2019). Multitask representation
learning for multimodal estimation of depression level. IEEE Intelligent Systems, 34(5), 45–
52.
23. Ramakrishnan, A., Zylich, B., Ottmar, E., LoCasale-Crouch, J., & Whitehill, J. (2021). Toward
automated classroom observation: Multimodal machine learning to estimate class positive
climate and negative climate. IEEE Transactions on Affective Computing.
24. Shin, D., Shin, D., & Shin, D. (2017). Development of emotion recognition interface using
complex EEG/ECG bio-signal for interactive contents. Multimedia Tools and Applications,
76(9), 11449–11470.
25. Tzirakis, P., Chen, J., Zafeiriou, S., & Schuller, B. (2021). End-to-end multimodal affect
recognition in real-world environments. Information Fusion, 68, 46–53.
26. Tzirakis, P., Trigeorgis, G., Nicolaou, M. A., Schuller, B. W., & Zafeiriou, S. (2017). End-to-
end multimodal emotion recognition using deep neural networks. IEEE Journal of Selected
Topics in Signal Processing, 11(8), 1301–1309.
27. Wang, C. H., & Lin, H. C. K. (2018). Emotional design tutoring system based on multimodal
affective computing techniques. International Journal of Distance Education Technologies
(IJDET), 16(1), 103–117.
28. Yoon, S., Byun, S., & Jung, K. (2018, December). Multimodal speech emotion recognition
using audio and text. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 112–
118). IEEE.
29. Zhang, S., Zhang, S., Huang, T., Gao, W., & Tian, Q. (2017). Learning affective features with
a hybrid deep model for audio–visual emotion recognition. IEEE Transactions on Circuits and
Systems for Video Technology, 28(10), 3030–3043.
30. Zheng, W. L., Liu, W., Lu, Y., Lu, B. L., & Cichocki, A. (2018). Emotionmeter: A multimodal
framework for recognizing human emotions. IEEE transactions on cybernetics, 49(3), 1110–
1122.
Content-Based Image Retrieval Using
Deep Features and Hamming Distance
1 Introduction
Rapid evolution of smart devices and social media applications resulted in large
volume of visual data. The exponential growth of visual content demands for models
that can effectively index and retrieve relevant information according to the user’s
requirement. Image retrieval is broadly used in Web services to search for similar
images. Text-based queries are used to retrieve images during its early ages [6],
which required large scale of manual annotations. In this context, content-based
image retrieval systems gained popularity as one of the hot research topic since
1990s. Content-based image retrieval systems use visual features of an image to
retrieve similar images from the database. Most of the state-of-the-art CBIR models
extract low-level feature representations such as color descriptors [1, 2], shape
descriptors [3, 4], and texture descriptors [5, 8] for image retrieval. Usage of low-
level image features causes CBIR to be a heuristic technique. The major drawback
of classical CBIR models is the semantic gap between the feature representation
and user’s retrieval concept. The low-level semantics fails to reduce the semantic
gap when the image database grows large. Also, similarities among different classes
increase as the dimensionality of the database increases. Figure 1 further illustrates
the shortcomings of the classical CBIR models, which use low-level image features.
In Fig. 1, both a and b shares similar texture and color although they belong to two
different classes (Africa and beach). Figure 1c, d have different color and texture
even though they belong to same class (mountains). Figure 1e, f shows images from
the same class (Africa) with different shape, texture, and color.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 151
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in
Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_7
152 R. T. Akash Guna and O. K. Sikha
Fig. 2 Illustration of a general CBIR architecture [21]
improved the accuracy of state-of-the-art CBIR systems, since the base model is heavily trained on a large volume of image data. Lin Kevin et al. [22] generated binary hash codes using CNNs for fast image retrieval; this method was highly scalable to increases in dataset size. Putzu [25] in 2020 introduced a CBIR system using a relevance feedback mechanism, where users are expected to give feedback on misclassifications in the retrieval results. Based on this user-level feedback, the CBIR model alters its parameters or similarity measures to achieve better accuracy. The major drawback of relevance-feedback-based CBIR models is that their accuracy depends entirely on the feedback provided by the user: if the user fails to give proper feedback, the system may fail. Some common applications of CBIR can be found in [32–36].
The primary objective of this chapter is to investigate the effectiveness of high-level semantic features computed by deep learning models for image retrieval. The major contributions of this chapter are as follows:
• Transfer-learned deep features are used as a high-level image representation for CBIR.
• The applicability of Hamming distance as a distance metric for deep feature vectors is explored.
• Clustering the dataset before retrieval to speed up retrieval is experimented with.
The organization of the chapter is as follows. Section 2 describes the background of CNNs. The proposed model is detailed in Sect. 3. Sections 4 and 5 detail the dataset used for experimentation and the results obtained, respectively. Finally, the chapter concludes with Sect. 6.
2 Background of CNN
Convolutional Neural Networks (CNNs) are deep learning networks that detect visual patterns present in input images. Fukushima introduced the concept of CNNs in 1980 under the name "Neocognitron" [7], since it resembled the working of cells. Figure 3 shows the basic architecture of a simple CNN. Deep neural networks are capable of computing high-level features that can distinguish objects more precisely than classical features. Since a CNN "doesn't need a teacher," it automatically finds features suitable for distinguishing and retrieving images. A basic CNN can have six layers as shown in Fig. 3: input layer, convolutional layer, ReLU layer, pooling layer, dense layer, and output classification layer.
1. Input layer: This layer holds the raw input image data for processing.
2. Convolution layer: It extracts features from the input image by convolving it with filters of various sizes. Hyperparameters like stride (the number of pixels a kernel/filter skips) can be tuned for better accuracy.
3. ReLU layer: The Rectified Linear Unit layer acts as a thresholding layer that converts any value less than zero to 0.
4. Pooling layer: It reduces the feature map dimension to avoid the possibility of overfitting.
5. Dense layer: The feature map obtained from the pooling layer is flattened into a vector, which is then fed into the final classification layer.
6. Output/classification layer: Predicts the final class of the input image. Here, the number of neurons equals the number of classes.
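Items 3 and 4 of the list above can be illustrated with a minimal pure-Python sketch on a toy 4×4 feature map (no deep learning framework assumed; the map values are made up):

```python
def relu(feature_map):
    """ReLU layer: every value below zero becomes 0 (item 3 above)."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Pooling layer: take the maximum of each non-overlapping
    2x2 window, halving both spatial dimensions (item 4 above)."""
    pooled = []
    for i in range(0, len(feature_map), 2):
        pooled.append([max(feature_map[i][j], feature_map[i][j + 1],
                           feature_map[i + 1][j], feature_map[i + 1][j + 1])
                       for j in range(0, len(feature_map[0]), 2)])
    return pooled

# Hypothetical 4x4 activation map produced by a convolution layer.
fmap = [[-1.0, 2.0, 0.5, -3.0],
        [ 4.0, 0.0, -0.5, 1.0],
        [ 3.0, -2.0, 2.5, 0.0],
        [-1.0, 1.0, 0.5, 2.0]]
out = max_pool_2x2(relu(fmap))  # 4x4 map -> 2x2 map
```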
3 Proposed Model
This section describes the proposed CNN-based model for high-level feature
extraction and image retrieval in detail. The CNN model being used is described
first followed by the methodology to extract feature vectors from the model, and
then the techniques used to retrieve similar images are described. In this work, a
reduced InceptionV3 Network [9] is chosen as the feature extractor. The generalized
architecture of the proposed CBIR framework is shown in Fig. 4, and the steps
followed are described below:
1. The high-level feature representation for the database images is calculated by
passing it through the pretrained CNN (Inception V3) model.
2. The features are then clustered into N clusters using an adaptive K-means
clustering algorithm.
3. For a query image, the feature vector is calculated by passing it into the pretrained
CNN model.
4. The least distant cluster of the feature vector is found.
5. The least distant images are retrieved from that cluster using similarity measures.
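The five steps can be condensed into a retrieval skeleton. The 2-D feature vectors, image names, and precomputed centroids below are hypothetical stand-ins for the Inception V3 features and the adaptive K-means clusters described above:

```python
def euclidean(a, b):
    """Distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def retrieve(query_vec, centroids, clustered_db, top_k=2):
    """Steps 4-5: pick the least distant cluster for the query
    feature vector, then rank that cluster's images by distance."""
    nearest = min(range(len(centroids)),
                  key=lambda c: euclidean(query_vec, centroids[c]))
    ranked = sorted(clustered_db[nearest],
                    key=lambda item: euclidean(query_vec, item[1]))
    return [name for name, _ in ranked[:top_k]]

# Toy database: two clusters of 2-D feature vectors (steps 1-2 assumed done).
centroids = [[0.0, 0.0], [10.0, 10.0]]
clustered_db = {0: [("img_a", [0.1, 0.2]), ("img_b", [1.5, 0.5])],
                1: [("img_c", [9.0, 11.0]), ("img_d", [10.5, 9.5])]}
hits = retrieve([0.3, 0.1], centroids, clustered_db)
```

Searching only the nearest cluster rather than the whole database is what makes the clustering step pay off in retrieval time.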
The proposed model uses a pretrained Inception network to extract high-level feature representations for the candidate images. The model is transfer-learned with weights pretrained for ImageNet [24] classification. Transfer learning [14] is a well-known technique through which a model pretrained on a similar task is used to initialize another model; the transferred weights improve the quality of the captured features within a short training time. The model is initially trained as a classification model, so adding the transferred weights gave a clear boost in the results produced.
The proposed model consists of a chain of ten inception blocks. The input tensor to an inception block is convolved along four different paths, and the outputs of those paths are concatenated together. This model is chosen owing to its capability to derive different features from a single input tensor. An inception block has four different paths: path 1 consists of three convolutional blocks, path 2 of two convolutional blocks, path 3 of an average pooling layer followed by a convolutional block, and path 4 of a single convolutional block. A convolutional block has a convolutional layer, a batch normalization layer [12], and a ReLU [13] activation layer stacked in that order. The architecture of an inception block is visualized in Fig. 5. Following the inception blocks, a global max pooling layer and three deep layers are present. The dimension of the final deep layer equals the number of classes present in the image database used for retrieval. The final layer uses a SoftMax activation function that normalizes the values of the final deep layer to a scale of 0–1.
The deep layers of the Inception network were explored as a feature descriptor. The feature vectors obtained from deep layers 1–3 are denoted DF1, DF2, and DF3, respectively. Figure 6 depicts the feature extraction from the deep layers of the Inception network. Ji Wan et al. compared the effectiveness of feature vectors extracted from the deep layers (1–3) of the Inception network in [1]; their study concludes that the penultimate layer (DF2) produces better results than DF1 and DF3. Inspired by this work, the feature vector extracted from DF2 is used here. Figure 8 shows the intermediate results obtained when computing deep features from intermediate layers of the Inception network for all classes of Wang's dataset. The extracted feature vectors are then stored as CSV files.
3.3 Clustering
The feature vectors obtained from the Inception model are then fed into a clustering module to perform an initial grouping. The K-means [15] clustering algorithm is used to cluster the extracted features; the objective of introducing clustering is to reduce the search time for a query image. K-means is an iterative algorithm that separates the feature vectors into K nonoverlapping groups using an expectation-maximization approach. It uses Euclidean distance to measure the distance between a feature vector and the centroid of each cluster, and assigns the feature vector to the least distant cluster, whose centroid is then updated to account for the newly added vector. The mathematical representation of the K-means objective is:
$$J = \sum_{i=1}^{m} \sum_{j=1}^{K} w_{ij} \left\| x_i - \mu_j \right\|^2 \qquad (1)$$

where $J$ is the objective function; $w_{ij} = 1$ if feature vector $x_i$ belongs to cluster $j$, and $w_{ij} = 0$ otherwise; and $\mu_j$ is the centroid of cluster $j$.
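Equation (1) can be evaluated directly once the assignments and centroids are known. The tiny 1-D feature set below is purely illustrative, not data from the chapter:

```python
def kmeans_objective(points, assignments, centroids):
    """Sum of squared distances of each feature vector to its
    assigned centroid -- the objective J of Eq. (1)."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, centroids[j]))
               for p, j in zip(points, assignments))

points = [[1.0], [2.0], [9.0], [10.0]]
assignments = [0, 0, 1, 1]   # w_ij = 1 encoded as a cluster index per point
centroids = [[1.5], [9.5]]   # mu_j of each cluster
J = kmeans_objective(points, assignments, centroids)
```

Each K-means iteration reassigns points (the expectation step) and recomputes centroids (the maximization step) so that J never increases.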
Fig. 6 Sample feature vectors DF1, DF2, and DF3 extracted after the convolutional layers
Relevant images from the database are retrieved by calculating the distance between
the input image feature vector and feature vectors stored in the database. This
work compares two distance metrics, Euclidean distance and Hamming distance,
for calculating the similarity.
Euclidean distance [16] represents the shortest distance between two points. It is given as the square root of the sum of squared differences:

$$E.D. = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (2)$$

where $n$ is the dimension of the feature vectors, and $x_i$ and $y_i$ are elements of the feature vectors $x$ and $y$, respectively.
Hamming distance [17] measures the similarity between two feature vectors: it is the number of positions at which the corresponding elements differ.

$$H.D. = \sum_{i=1}^{n} \mathbb{1}(x_i \neq y_i) \qquad (3)$$

where $n$ is the dimension of the feature vectors; $x_i$ and $y_i$ are elements of the feature vectors $x$ and $y$, respectively; and the indicator $\mathbb{1}(x_i \neq y_i)$ is 1 if $x_i$ and $y_i$ differ and 0 if they are equal.
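Equations (2) and (3) translate directly into code. Note that Hamming distance, as used here, presumes element-wise comparable (e.g., binarized) feature vectors; the sample vectors are hypothetical:

```python
def euclidean_distance(x, y):
    """Eq. (2): square root of the summed squared differences."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

def hamming_distance(x, y):
    """Eq. (3): count of positions where the vectors differ."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

d_euc = euclidean_distance([3.0, 4.0], [0.0, 0.0])    # -> 5.0
d_ham = hamming_distance([1, 0, 1, 1], [1, 1, 1, 0])  # -> 2
```

Euclidean distance weighs how far values differ, whereas Hamming distance only counts whether they differ, which is why the two metrics can rank retrieved images differently.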
4 Dataset Used
This section describes the datasets used for experimentation. Wang's dataset, a tailor-made dataset for content-based image analysis, and its larger version, the COREL-10000 dataset, are chosen for the analysis. Wang's dataset consists of 1000 images divided into 100 images per class; its classes are African tribe, beach, bus, dinosaur, elephant, flower, food, mountain, Rome, and horses.
The COREL dataset comprises 10,000 images downloaded from the COREL photo gallery and is widely used for CBIR applications [18–20]. It comprises 100 classes with 100 images in each class. Figure 7 shows sample images from Wang's dataset and the COREL dataset (Fig. 8).
Since we have more than one category of images, we use the average precision, defined as:

$$\text{Average precision} = \frac{\sum_{k=0}^{n} \text{Precision}[k]}{\text{Number of categories } (n)}$$
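The averaging above can be sketched as follows; the per-class precision values are hypothetical:

```python
def average_precision(per_class_precision):
    """Mean of the per-class precision values, as defined above."""
    return sum(per_class_precision) / len(per_class_precision)

# Four hypothetical class-wise precisions averaged into one score.
ap = average_precision([1.0, 0.95, 0.9, 0.95])
```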
Fig. 8 The features that were computed by the intermediate convolutional layers of the feature
extractor CNN
Retrieving 40 images per class from Wang's dataset using Euclidean distance gives an average precision of 0.946, with seven of the 10 classes retrieved at a precision greater than 0.95. Retrieving 40 images from the COREL dataset gives an average precision of 0.961, with 92 classes above a precision of 0.95. An average precision of 0.944 is achieved when retrieving 50 images per class from Wang's dataset using Euclidean distance, with seven of the 10 classes above 0.95; for the COREL dataset at 50 images, the average precision is 0.955, with 91 classes above the threshold of 0.95.
Retrieval of 60 images per class from Wang's dataset gives an average precision of 0.94, with 5 of the 10 classes above 0.95. Retrieval of 60 images from the COREL dataset gives an average precision of 0.952, with 87 of the 100 classes above 0.95.
When retrieving 70 images per class from Wang's dataset, Euclidean distance achieves an average precision of 0.932, with 5 classes above a precision of 0.95. On the COREL dataset, the average precision is 0.95, with 84 of the 100 classes above 0.95.
Retrieval of 40 images per class from Wang's dataset using Hamming distance gives an average precision of 0.957, with eight classes retrieved at a precision above 0.95. Retrieving the same number of images from the COREL dataset gives an average precision of 0.946, with 90 classes above 0.95. When 50 images per class are retrieved from Wang's dataset using Hamming distance, the average precision is 0.956, with eight classes above 0.95; from the COREL dataset, the average precision is 0.943, with 90 classes above 0.95.
Retrieval of 60 images per class from Wang's dataset gives an average precision of 0.942, with eight classes above 0.95; retrieving the same number from the COREL dataset gives an average precision of 0.941, with 86 of the 100 classes above 0.95. When retrieving 70 images per class from Wang's dataset, Hamming distance achieves an average precision of 0.926, with 7 classes above 0.95; on the COREL dataset, it produces an average precision of 0.94, with 85 classes above the threshold of 0.95.
Table 1 summarizes the average precision obtained using Euclidean and Hamming distance on images from Wang's dataset and the COREL dataset. From the table, it is evident that the deep features obtained from the proposed model are effective for image retrieval. Moving from retrieving 40 images to 70 images from Wang's dataset using Euclidean distance reduces the average precision by 1.4%, while the number of classes with precision above 0.95 drops from 7 to 5. For the same transition on the COREL dataset, the precision falls by 1.1%, and the number of classes above the threshold drops from 92 to 84. Hamming distance shows a change of 3.1% on Wang's dataset, but the number of classes above the threshold only drops from 8 to 7.
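The drops quoted in this paragraph appear to be absolute differences in average precision expressed in percentage points; they can be recomputed from the Table 1 values:

```python
def precision_drop_points(p_start, p_end):
    """Absolute drop in average precision, in percentage points."""
    return (p_start - p_end) * 100.0

# Table 1 values for 40 vs 70 retrieved images.
wang_euc = precision_drop_points(0.946, 0.932)   # about 1.4 points
corel_euc = precision_drop_points(0.961, 0.95)   # about 1.1 points
wang_ham = precision_drop_points(0.957, 0.926)   # about 3.1 points
```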
Table 1 Average precision obtained for image retrieval using Euclidean distance and Hamming
distance for images from Wang’s dataset and COREL dataset
                            Euclidean distance            Hamming distance
Number of images retrieved  40     50     60     70       40     50     60     70
Wang's dataset              0.946  0.944  0.944  0.932    0.957  0.956  0.942  0.926
COREL dataset               0.961  0.95   0.95   0.95     0.946  0.943  0.941  0.95
Fig. 9 Graphical representation of the decrease in precision (percentage of change) as the number of images retrieved increases from 50 to 70, for the COREL dataset using Euclidean (corel euc) and Hamming (corel ham) distances
On the retrieval of 10 images from each class, as shown in Fig. 11A, B, both of the distance metrics showed a 100 percent retrieval precision. In the retrieved images, a few images were commonly retrieved by both of the metrics, and a few
166 R. T. Akash Guna and O. K. Sikha
Fig. 10 Graphical representation of precision obtained for the COREL dataset (a) and for Wang's dataset (b). (a) Precision on retrieval of N images using (A) Euclidean distance and (B) Hamming distance from the COREL dataset; the X axis represents the COREL classes and the Y axis represents precision. (b) Precision on retrieval of N images using (A) Euclidean distance and (B) Hamming distance from Wang's dataset; the X axis represents the Wang's dataset classes and the Y axis represents precision
classes had a higher number of similar images while other classes had minimal similarity barring the input image. The number of same images retrieved by the different metrics from each class is visualized in Fig. 12. These similarities show that some classes, like beaches and mountains, have certain internal clusters with unique features that make the retrieval more efficient. Although classes like Rome, flowers, dinosaurs, and horses were retrieved with a precision of 100%, the number of same images retrieved showed that these classes have inseparable images within the
classes. The primary goal of content-based image retrieval is to retrieve the images most similar in content. Looking at the images retrieved using Hamming distance and Euclidean distance, we found certain subtle differences between them, and Hamming distance gave higher precision than Euclidean distance as the number of images increased, as shown in Fig. 8.
Difference in Horse Class:
Consider the retrieval of the horse class as in Fig. 11A–H and Fig. 11B–H. The query image has two horses, a white horse with a brown foal and a brown horse. All of the images retrieved by Hamming distance contained the same horse and foal (refer to Fig. 11A–H), but when retrieved using Euclidean distance, images containing multiple horses and horses of different colors were
also retrieved (refer to Fig. 11B–H). Figure 13 shows the distribution of subclasses of the retrieved horse images.
Difference in Flower Class:
The input image provided for retrieval from the flowers class had red petals, as in Fig. 11A–G, with leaves visible in the background. The images retrieved using Hamming distance had red or reddish-pink petals, and leaves were visible in all of them (refer to Fig. 11A–H). The images retrieved by Euclidean distance showed a wider variety of colors such as red, reddish-pink, pink, orange, and yellow. Leaves were visible in all retrieved images, but not in as substantial an amount as in the input image and the images retrieved using Hamming distance (refer to Fig. 11B–H). Figure 14 shows the number of retrieved images belonging to each subcategory of the flower class.
Difference in Dinosaurs Class:
The dinosaur class has two major subclusters that differ only by orientation: the dinosaurs in the first cluster face to the left while the others face to the right. The input image provided to the retrieval system had a dinosaur oriented toward the right. One major characteristic of this dinosaur is its long neck. All the images retrieved by Hamming distance were oriented toward the right, and it can also be noticed that all the dinosaurs retrieved had long necks (refer to Fig. 11A–G). A handful of images retrieved by Euclidean distance were either oriented toward the left or had a shorter neck (refer to Fig. 11B–G). Figure 15 shows the statistics of the number of images from each subclass of the dinosaurs category.
Difference in Rome Class:
The input image for the Rome class contained an image of the Colosseum. Hamming distance was able to retrieve only one image of the Colosseum out of the 10 images retrieved from the Rome category (refer to Fig. 11A–I), whereas Euclidean distance was able to retrieve more images of the Colosseum from the Rome class (refer to Fig. 11B–I). The statistics of the number of images containing the Colosseum retrieved by Euclidean and Hamming distance are shown in Fig. 16.
Fig. 11 Results on retrieving 10 images using a sample image from each category of Wang's dataset. Results for each class contain the input image and 10 retrieved images. The classes are represented in the order: (A) Africa (B) beach (C) mountains (D) bus (E) dinosaurs (F) elephants (G) flowers (H) horses (I) Rome (J) food
Fig. 12 Number of same images retrieved by both metrics from each class of Wang's dataset (Africa, beach, mountains, bus, dinosaurs, elephants, flowers, horses, Rome, food)
Fig. 13 Number of retrieved horse images per subgroup (similar, multiple horses, wrong colour)
Fig. 14 Number of retrieved flower images per subgroup of petal colour (red, reddish-pink, pink, orange, yellow)
To further evaluate the effectiveness of deep features for image retrieval, the obtained results are compared against state-of-the-art CBIR models with classical features reported in the literature. The CBIR models proposed by Lin et al. [27], Irtaza et al. [26], Wang et al. [28], and Walia et al. [29, 30] are considered for comparison. The CNN-based CBIR system proposed by Hamreras et al. [31] is also compared with the proposed model. Table 3 compares the proposed deep feature-based CBIR model with state-of-the-art classical feature-based models in terms of precision. From the table, it can be inferred that deep feature-based image retrieval produced better results than the other models under consideration across all the classes. Table 4 compares the average precision achieved by the proposed model against the five SOA models under consideration. The dinosaurs class was retrieved with an average precision of 99.1, which was the maximum average precision obtained among the
Fig. 15 Number of retrieved dinosaur images per subgroup (right+tall, left+tall, right+short, left+short)
Fig. 16 Number of retrieved Rome images with and without the Colosseum
SOA models. While the horses, flowers, and bus classes have an average precision of 81.0, 84.95, and 73.85, the Rome, food, elephants, beach, and mountain classes have an average precision of less than 60%. When compared to the SOA models, our proposed model produces a greater average precision for all 10 classes of Wang's dataset (Africa, beach, bus, dinosaurs, elephants, horses, food, mountain, flowers, Rome).
Table 5 compares the recall results of state-of-the-art CBIR models with the proposed model. From the table, it is evident that the proposed model outperforms all of the state-of-the-art models, giving perfect results on retrieving 20 images from each class of Wang's dataset. Table 6 presents the average recall achieved by each class for the SOA models. The dinosaurs class was retrieved with an average recall of 19.82, which was the maximum average recall obtained among the SOA models. While the horses, flowers, and bus classes have an average recall of
Table 3 Precision comparison of the proposed CBIR model with state-of-the-art models on retrieving 20 images
Wang’s database Lin et al. [27] Irtaza et al. [26] Wang et al. [28] Walia et al. [29] Walia et al. [30] Hamreras et al. [31] Proposed model
African 55.5 53 80.5 41.25 73 93.33 100
Beach 66 46 56 71 39.25 90 100
Rome 53.5 59 48 46.75 46.25 96.67 100
Bus 84 73 70.5 59.25 82.5 100 100
Dinosaurs 98.25 99.75 100 99.5 98 100 100
Elephants 63.75 51 53.75 62 59.25 100 100
Flowers 88.5 76.75 93 80.5 86 96.67 100
Horse 87.25 70.25 89 68.75 89.75 100 100
Mountain 48.75 62.5 52 69 41.75 83.83 100
Food 68.75 70.75 62.25 29.25 53.45 96.83 100
Average 71.425 66.2 67.2 60.91 66.92 95.73 100
16.2, 16.99, and 14.77, Rome, food, elephants, beach, and mountain classes have an
average recall of less than 12.
When compared to the SOA models, our proposed model produces a greater
average recall for all the 10 classes of Wang’s dataset (Africa, beach, bus, dinosaurs,
elephants, horses, food, mountain, flowers, Rome).
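The precision and recall figures reported in these tables follow the standard retrieval definitions, with the note that Table 5 appears to express recall as the number of relevant images among the 20 retrieved rather than as a ratio. A minimal sketch with an illustrative query:

```python
def precision_recall_at_n(retrieved_labels, query_label, class_size):
    """Precision and recall for a single query when N images are retrieved.

    precision = relevant retrieved / number retrieved
    recall    = relevant retrieved / total relevant images (class_size)
    """
    relevant = sum(1 for label in retrieved_labels if label == query_label)
    precision = relevant / len(retrieved_labels)
    recall = relevant / class_size
    return precision, recall

# Illustrative query from the "horse" class: 18 of 20 retrieved images
# are horses, and the class is assumed to hold 100 images in total.
p, r = precision_recall_at_n(["horse"] * 18 + ["beach"] * 2, "horse", 100)
# p = 0.9, r = 0.18
```

Averaging these per-query values over all queries of a class gives the per-class figures, and averaging over classes gives the dataset-level averages reported above.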
6 Conclusion
Table 5 Recall comparison of the proposed CBIR model with state-of-the-art models on retrieving 20 images
Wang’s database Lin et al. [27] Irtaza et al. [26] Wang et al. [28] Walia et al. [29] Walia et al. [30] Hamreras et al. [31] Proposed model
African 11.1 10.6 16.1 8.25 14.6 18.6 20
Beach 13.2 9.2 11.2 14.2 7.85 18.0 20
Rome 10.7 11.8 9.6 9.35 9.25 19.33 20
Bus 16.8 14.6 14.1 11.85 16.5 20 20
Dinosaurs 19.65 19.95 20 19.9 19.6 20 20
Elephants 12.75 10.2 10.75 12.4 11.85 20 20
Flowers 17.7 15.35 18.6 16.1 17.2 19.33 20
Horse 17.45 14.05 17.8 13.75 17.95 20 20
Mountain 9.75 12.5 10.4 13.8 8.35 16.6 20
Food 13.75 14.15 12.45 5.85 10.69 19.33 20
Average 14.285 13.24 14.1 12.545 13.384 19.125 20
large and diverse within classes. This leads to the retrieval of more similar images in content-based image retrieval.
7 Future Works
Future work includes evaluating the effectiveness of the proposed model on microscopic images and exploring preprocessing techniques to enhance the proposed model when applied to a medical dataset.
References
1. Pass, G., & Zabih, R. (1996). Histogram refinement for content-based image retrieval. In
Proceedings third IEEE workshop on applications of computer vision. WACV’96 (pp. 96–102).
https://doi.org/10.1109/ACV.1996.572008
2. Konstantinidis, K., Gasteratos, A., & Andreadis, I. (2005). Image retrieval based on fuzzy color
histogram processing. Optics Communications, 248(4–6), 375–386.
3. Jain, A. K., & Vailaya, A. (1996). Image retrieval using color and shape. Pattern Recognition,
29(8), 1233–1244.
4. Folkers, A., & Samet, H. (2002). Content-based image retrieval using Fourier descriptors on a
logo database. In Object recognition supported by user interaction for service robots (Vol. 3).
IEEE.
5. Manjunath, B. S., & Ma, W. Y. (1996). Texture features for browsing and retrieval of image
data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 837–842. https://doi.org/10.1109/34.531803
6. Hörster, E., Lienhart, R., & Slaney, M. (2007). Image retrieval on large-scale image databases.
Proceedings of the 6th ACM international conference on Image and video retrieval.
7. Fukushima, K., & Miyake, S. (1982). Neocognitron: A self-organizing neural network model
for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets
(pp. 267–285). Springer.
8. Haralick, R. M., Shanmugam, K., & Dinstein, I. H. (1973). Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, 6, 610–621.
9. Szegedy, C., et al. (2016). Rethinking the inception architecture for computer vision. Proceed-
ings of the IEEE conference on computer vision and pattern recognition.
10. LeCun, Y., et al. (1998). Gradient-based learning applied to document recognition. Proceedings
of the IEEE, 86(11), 2278–2324.
11. Wan, J., et al. (2014). Deep learning for content-based image retrieval: A comprehensive study.
Proceedings of the 22nd ACM international conference on Multimedia.
12. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by
reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
13. Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines.
ICML.
14. Pan, S. J., & Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on Knowledge
and Data Engineering, 22(10), 1345–1359.
15. Likas, A., Vlassis, N., & Verbeek, J. J. (2003). The global k-means clustering algorithm.
Pattern Recognition, 36(2), 451–461.
16. Danielsson, P.-E. (1980). Euclidean distance mapping. Computer Graphics and Image Pro-
cessing, 14(3), 227–248.
17. Norouzi, M., Fleet, D. J., & Salakhutdinov, R. R. (2012). Hamming distance metric learning.
In Advances in neural information processing systems.
18. Wang, J. Z., Li, J., & Wiederhold, G. (2001). SIMPLIcity: Semantics-sensitive integrated
matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 23(9), 947–963.
19. Tao, D., et al. (2006). Direct kernel biased discriminant analysis: A new content-based image
retrieval relevance feedback algorithm. IEEE Transactions on Multimedia, 8(4), 716–727.
20. Bian, W., & Tao, D. (2009). Biased discriminant Euclidean embedding for content-based image
retrieval. IEEE Transactions on Image Processing, 19(2), 545–554.
21. Haldurai, L., & Vinodhini, V. (2015). Parallel indexing on color and texture feature extraction
using R-tree for content based image retrieval. International Journal of Computer Sciences and
Engineering, 3, 11–15.
22. Lin, K., et al. (2015). Deep learning of binary hash codes for fast image retrieval. Proceedings
of the IEEE conference on computer vision and pattern recognition workshops.
23. Babenko, A., et al. (2014). Neural codes for image retrieval. In European conference on
computer vision. Springer.
24. Chollet, F., et al. (2015). Keras. https://keras.io.
25. Putzu, L., Piras, L., & Giacinto, G. (2020). Convolutional neural networks for relevance
feedback in content based image retrieval. Multimedia Tools and Applications, 79(37), 26995–
27021.
26. Irtaza, A., Jaffar, M. A., Aleisa, E., & Choi, T.-S. (2014). Embedding neural networks for semantic association in content based image retrieval. Multimedia Tools and Applications, 72(2), 1911–1931.
27. Lin, C.-H., Chen, R.-T., & Chan, Y.-K. (2009). A smart content-based image retrieval system based on color and texture feature. Image and Vision Computing, 27(6), 658–665.
28. Wang, X.-Y., Yu, Y.-J., & Yang, H.-Y. (2011). An effective image retrieval scheme using color, texture and shape features. Computer Standards & Interfaces, 33(1), 59–68.
29. Walia, E., & Pal, A. (2014). Fusion framework for effective color image retrieval. Journal of Visual Communication and Image Representation, 25(6), 1335–1348.
30. Walia, E., Vesal, S., & Pal, A. (2014). An effective and fast hybrid framework for color image retrieval. Sensing and Imaging, 15(1), 93.
31. Hamreras, S., et al. (2019). Content based image retrieval by convolutional neural networks.
In International work-conference on the interplay between natural and artificial computation.
Springer.
32. Sikha, O. K., & Soman, K. P. (2021). Dynamic Mode Decomposition based salient edge/region
features for content based image retrieval. Multimedia Tools and Applications, 80, 15937.
33. Akshaya, B., Sri, S., Sathish, A., Shobika, K., Karthika, R., & Parameswaran, L. (2019).
Content-based image retrieval using hybrid feature extraction techniques. In Lecture notes in
computational vision and biomechanics (pp. 583–593).
34. Karthika, R., Alias, B., & Parameswaran, L. (2018). Content based image retrieval of remote
sensing images using deep learning with distance measures. Journal of Advanced Research in
Dynamical and Control System, 10(3), 664–674.
35. Divya, M. O., & Vimina, E. R. (2019). Performance analysis of distance metric for con-
tent based image retrieval. International Journal of Engineering and Advanced Technology
(IJEAT), 8(6), 2249.
36. Byju, A. P., Demir, B., & Bruzzone, L. (2020). A progressive content-based image retrieval
in JPEG 2000 compressed remote sensing archives. IEEE Transactions on Geoscience and
Remote Sensing, 58, 5739–5751.
Bioinspired CNN Approach for
Diagnosing COVID-19 Using Images
of Chest X-Ray
1 Introduction
COVID-19, caused by a novel virus, was identified in December 2019 in Wuhan, China [1]. The virus is a member of the coronavirus family; however, it is more virulent and hazardous than the other coronaviruses [2]. Many nations can administer COVID-19 tests to only a small group of people due to limited diagnostic facilities. Although there have been significant attempts to develop a feasible method for diagnosing COVID-19, a key stumbling block continues to be the limited health care offered in many nations. There is also a pressing need to develop a simple and easy way to identify and diagnose COVID-19. As the number of patients afflicted with this virus grows by the day, physicians are finding it increasingly difficult to complete the clinical diagnosis in the limited time available [3]. One of the most significant areas of study is clinical image processing, which provides identification and prediction models for a number of diseases, including the MERS coronavirus and COVID-19, among many others. Imaging techniques have increasingly gained prominence, and interpreting these images requires knowledge and numerous methods to improve, simplify, and guide proper treatment [4]. Numerous efforts have been made to use computer vision and artificial intelligence techniques to establish an efficient and quick technique to detect infected patients earlier on.
For example, digital image processing with a supervised learning technique has been developed for COVID-19 identification using fundamental genetic fingerprints for quick virus categorization [5]. Using a deep learning method, a fully automatic framework has been created to diagnose COVID-19 from chest X-rays [6]. The information was acquired from clinical sites in order to effectively diagnose COVID-19 and
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 181
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in
Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_8
182 P. M. Bala et al.
lung loss. As per the WHO (World Health Organization), COVID-19, similar to MERS, also causes pores in the chest, giving them a "hexagonal appearance." One of the ways of monitoring pneumonia is digital chest imaging. Machine learning (ML)-based image analytical techniques for the recognition, measurement, and monitoring of MERS-CoV (Middle East respiratory syndrome coronavirus) were created to discriminate between individuals with coronavirus and those without. A deep learning method to autonomously segment lung and disease regions using chest radiography has been developed. To create an early model for detecting COVID-19, influenza, and bacterial pneumonia in healthy cases, image data and in-depth training methodologies have been used. In their investigation, the authors created a deep neural approach based on COVID-19 radiography alterations in X-ray images that can bring out the visual features of COVID-19 prior to pathologic tests, thus reducing critical time for illness detection. MERS features like pneumonia can be seen on chest X-ray images and computed tomography scans, according to the authors' study. Data mining approaches to discriminate between MERS and common influenza based on X-ray pictures were used in the research. The clinical features of 40 COVID-19 participants, indicating that coughing, severe chronic fatigue, and weariness were common initial symptoms, have been evaluated. All 40 patients were determined to have influenza, and the chest X-ray examination was abnormal. The author team identified the first signs of actual COVID-19 infection at Hong Kong University [17]. The authors proposed a statistical methodology to predict the actual number of instances of COVID-19 discovered during January 2020. They came to the conclusion that there were 469 unregistered instances between January 1 and January 14, 2020. They also stated that the number of instances had increased.
Using information from 555 Chinese people relocated from Wuhan on the 29th and 31st of January 2020, the authors suggested a COVID-19 disease rate predictive model for China. According to their calculations, the anticipated infection rate is 9.6 percent, with a death rate of 0.2 percent to 0.5 percent. Unfortunately, the number of Asian citizens moved from Wuhan is insufficient to assess illness and death. A mathematical method to identify the chance of infection with COVID-19 has been proposed; furthermore, it estimated that the maximum would be attained after 2 weeks. The prediction of persistent human dissemination of COVID-19 from 48 patients was based on information from Thompson's (2020) research [18]. The researchers created a prototype COVID-19 mortality-risk calculation. For two different situations, the percentages are 5.0 percent and 8.3 percent, and the basic reproduction number was calculated to be 2.0 and 3.3, respectively. COVID-19 could cause an outbreak, according to these projections. X-ray imaging is utilized in national health systems to check for fractures, bone dislocations, respiratory problems, influenza, and malignancies. Computed tomography is a type of advanced X-ray imaging that evaluates the fine internal structure of the body and provides sharper images of soft internal organs and tissues. CT is faster, better, more dependable, and less harmful than X-rays. Mortality can increase if COVID-19 infection is not detected and treated immediately.
2 Related Work
The use of chest X-ray images has grown commonplace in recent years. A chest X-ray is used to evaluate a patient's respiratory status, including the evolution of the infection and any accident-related wounds. In comparison to CT scan images, chest X-rays have shown encouraging outcomes in the period of COVID-19. Moreover, due to the domain's rapid growth, academics have become less aware of advances across many techniques, and as a result, knowledge of different algorithms is waning. As a consequence, artificial neural networks, particle swarm optimization, the firefly algorithm, and evolutionary computing dominate the research on bioinspired technology. The researchers then investigated and discussed several techniques relevant to the bioinspired field, making it easier to select the best matching algorithm for each research problem [17]. Big data can be found in practically every
industry. Furthermore, the researchers of this work emphasize the significance of using an information technology approach rather than existing data processing techniques such as textual data and neural networks [19]. A fuzzy logic learning tree approach is utilized in this study to improve image storage and retrieval performance [20]. The researchers' purpose is to provide the concept of image recommendations from friends (IRFF) and a comprehensive methodology for it. The significance of generative adversarial networks and their numerous applications in the domain of background subtraction has been discussed by the author of this research. Health care, outbreaks, face recognition, traffic management, image translation, image analysis, and 3D image production are some of the uses of GANs that have been discovered [21]. In interacting with radiographic images, state-of-the-art computing and machine learning have investigated a number of choices to make diagnoses. The rapid growth of deep neural networks and their benefits to the health-care industry has been unstoppable since 1985. Class activation-based techniques have been paired with DNNs to aid the identification of COVID-19. Deep learning techniques have been operating interactively to assist in the analysis of COVID-19. Aside from the time restrictions, deep neural networks (DNNs) are providing confidence in the analysis of COVID-19 utilizing chest X-ray data, with no missed negative cases. The main advantage of a DNN is that it detects vital properties without the need for human intervention.
Understanding the present condition, and irrespective of COVID-19 confirmation, it is critical to diagnose COVID-19 in a timely manner so that diagnosed COVID-19 patients can be kept free of additional respiratory infection. Image categorization and information extraction play a significant role in chest X-ray and diagnostic imaging procedures. For autonomous pulmonary classification, a deep convolutional neural network is needed to retrieve significant information and segment the pulmonary region more accurately. In this article, an SRGAN+VGG framework is designed, in which a deep neural network named after the Visual Geometry Group (VGG16) is used to recognize COVID-19-positive and COVID-19-negative results from chest X-ray images, and a deep learning model is used to rebuild those chest X-ray images at excellent quality. A convolutional neural network known as Decompose, Transfer, and Compose (DeTraC) was employed for the categorization of chest X-ray pictures with COVID-19 illness. The DNN investigates the image dataset's category boundaries through a class decomposition methodology to deal with any irregularities. Multiple pretrained models, such as VGG16 and ResNet, have been deployed for the categorization of COVID-19 chest X-ray pictures against normal chest X-ray images and ones impacted with influenza using a supervised learning process. A dense convolutional network was employed in this study to improve the outcomes of COVID-19 illness detection utilizing the suggested bioinspired CNN model.
COVID-19 was diagnosed based on two sets of chest X-ray images collected from different sources. Joseph Paul Cohen and colleagues [22] were able to enlarge the COVID-19 chest X-ray data collection by using images from various publicly available sources. Four hundred ninety-five of them have been identified as having COVID-19 findings. Figure 1 shows the distribution of the dataset, which has 950 images in it as of the time of this writing. COVID-19 findings were present in 53.3 percent of the images, while normal-healthy X-ray findings were present in 46.7 percent. The standard X-ray images of a healthy chest were provided by Paul Mooney, who independently collected them after reading an article in the same journal
Fig. 1 Distribution of COVID-19 and normal/regular images in the dataset
written by Thomas Kermany and his colleagues [23]. This dataset contains a total of 1341 regular and healthy photos, of which roughly one-third were chosen at random. It is essential to balance the number of normal-healthy chest radiographs included, because learning with unequal datasets is biased: an imbalanced dataset favors the classes with more samples, limiting the images that can be used. All normal or healthy X-rays fall into one class, while COVID-19 X-rays form a separate class. Considering gender, there are 346 males and 175 females among the COVID-19 patients. The results show that 88 of the COVID-19-positive patients were between the ages of 20 and 40, the largest group of 175 patients were between 41 and 61 years old, and COVID-19 was detected in 172 patients who were between the ages of 62 and 82 years old.
Even medical specialists may find the X-ray images challenging to interpret, and we propose that our approach might be able to help them. PyTorch is utilized for the model's development and implementation. Tensor computations are an essential element in developing deep learning algorithms, and PyTorch is a deep learning library with tensor computation functionality. It is developed by Facebook's AI Research Lab, where it is in active use. The explosion of interest in this subject among researchers has resulted in the advance of several leading-edge applications, including NLP and computer vision, across the entire field of deep learning. One of the most important goals of using PyTorch here is to develop a model for image classification that diagnoses COVID-19 from chest X-rays. Image classification models could be of considerable interest to clinicians, particularly those who utilize X-ray imaging.
Two types of views have been chosen to distinguish between the lung image scenario and the infection affecting it: posteroanterior (PA) and anteroposterior (AP). In the posteroanterior (PA) view, the X-ray of the patient's chest is taken from the posterior to the anterior of the patient's upper body. With chest X-rays, the term "anteroposterior" refers to an X-ray taken from the patient's anterior to posterior. Before the labeling process can begin, the images must first be scaled to accommodate data augmentation. It is necessary to scale the image before preprocessing (Fig. 2) because it will be subjected to several transformations.
Compared to the chest X-ray images of regular patients, which are readily available, the COVID-19 data is limited. K-fold cross-validation is one of the methods that can be used to handle skewed datasets, that is, datasets in which the amount of data differs significantly from one class to another. While staging the data throughout the study, all of the images were sampled in proportion to the data in order to avoid overfitting to the datasets. The following data transformations are applied: data augmentation and preprocessing.
The next step involves loading the dataset with images that are positive for COVID-19 and images that are normal in appearance. Data augmentation is the process of creating novel data from the available data using a few simple image manipulation methods. By including augmentation, the model's generalization is improved, and the risk of overfitting to the training data is reduced. Using image augmentation, additional information can be added to the dataset without manually collecting new images. PyTorch's torchvision library can be used to accomplish all of these tasks. In addition to data transformation and handling, torchvision offers predefined, state-of-the-art deep learning models. Image augmentation techniques include image rotation, image shifting, image flipping, and image noising, among others. The training and validation image transforms start from only a small amount of data and result in the creation of additional data. Pretrained models always require preprocessed input images (Fig. 3).
The datasets are first read into PIL images (Python Imaging Library format), to which a sequence of transformations is then applied. ToTensor converts a PIL image of shape (H, W, C) with values in the range [0, 255] to a FloatTensor of shape (C, H, W) with values in the range [0, 1]. Images are then normalized using a mean of 0.5 and a standard deviation of 0.5.
Input = (Input − μ) / Standard deviation    (1)

Input = (Input − 0.5) / 0.5
In this case, μ is equal to 0.5. The number of channels is denoted by C, the height by H, and the width by W; H and W must be at least 224 for the pretrained models. Normalization can also use per-channel statistics, with the channel means being [0.484, 0.455, 0.405] and the standard deviations [0.228, 0.225, 0.226], respectively. In
this case, the CIFAR dataset is used for the normalization procedure. CIFAR is a collection of labeled images that is commonly used to train deep learning algorithms and is among the best-known benchmark datasets for image classification in machine learning and computer vision. In total, the collection contains more than 1.2 million images in 10,000 different categories, which can be searched by keyword. Data of this size and complexity cannot be handled by a CPU alone, so the dataset is loaded onto more capable hardware.
Bioinspired CNN Approach for Diagnosing COVID-19 Using Images of Chest X-Ray 189
During training and validation of the model, the dataset is divided in an 80/20 ratio to avoid skewed evaluation. The images in each folder are tagged with the class name of the folder in which they are located. The DataLoader then loads the labeled images, together with their class names, into memory. This divides the dataset into two distinct classes: one for regular, healthy X-rays and the other for the dangerous COVID-19 X-rays. In either case, data is loaded onto CUDA (the graphics processing unit) or the CPU before moving on to model definition. Torchvision is a sub-package that supports deep learning image classification, object detection, image segmentation, etc. For image processing, torchvision offers pretrained, built-in deep learning models.
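The split-and-load steps above can be sketched as follows. The tensors here are random stand-ins for the X-ray dataset; in practice torchvision's ImageFolder would label each image with the name of the folder that contains it.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in dataset: 100 fake RGB "X-rays" with binary labels
# (0 = normal, 1 = COVID-19).
images = torch.rand(100, 3, 64, 64)
labels = torch.randint(0, 2, (100,))
dataset = TensorDataset(images, labels)

# 80/20 train/validation split, as described in the text.
n_train = int(0.8 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])

# Batches are moved to CUDA when available, otherwise to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
loader = DataLoader(train_set, batch_size=32, shuffle=True)
xb, yb = next(iter(loader))
xb, yb = xb.to(device), yb.to(device)
print(len(train_set), len(val_set), xb.shape[0])   # 80 20 32
```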
The CNN model is a form of neural network that extracts higher-level interpretations from image input. Unlike traditional image processing, in which the user must specify the feature representations, a CNN takes the originally captured image, trains on it, and then separates out the characteristics for improved categorization. Deep learning is strongly inspired by the structure of the brain. Signal and image processing modalities such as MRI, CT, and X-ray are common targets for applying deep learning to images, as described in Fig. 4. Deep learning models are configured by the CNN model parameters using deep feature extraction techniques.
The visual system of the human brain inspired CNNs. The goal of CNNs is to
enable computers to see the world in the same way that humans do. Image identification and interpretation, image segmentation, and natural language processing
can all benefit from CNNs in this fashion [24]. CNNs feature convolutional, max
pooling, and nonlinear activation layers and are a type of deep neural network.
The convolutional layer, which is considered a CNN’s core layer, performs the
“convolution” action that gives the network its name.
Convolutional neural networks are similar to classic machine learning models. As Fig. 5 describes, the convolution layers occupy the odd-numbered positions and the sharing (subsampling) layers the even-numbered positions, excluding the input and output layers. Figure 6 describes the CNN, which has eight layers linked to sharing layers through kernels; the group (batch) dimension is 100, and training is limited to 100 epochs.
Fig. 5 A network with an input layer (x1, ..., xp), hidden layers, and an output layer (y0, ..., yq), covering preprocessing and classification. Fig. 6 A CNN with three convolution blocks (kernel size 3; feature maps 8, 6, and 4), each followed by a sharing layer with stride 2, producing the COVID-19 image as output
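A minimal sketch of the Fig. 6 architecture as described in the text (kernel size 3; feature maps 8, 6, and 4; stride-2 sharing layers): the sharing layers are assumed here to be max pooling, and the activation function and classifier head are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

# Three 3x3 convolution blocks with 8, 6, and 4 feature maps, each followed
# by a stride-2 subsampling ("sharing") layer, ending in a two-class head
# (normal vs. COVID-19).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2, stride=2),
    nn.Conv2d(8, 6, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2, stride=2),
    nn.Conv2d(6, 4, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2, stride=2),
    nn.Flatten(),
    nn.LazyLinear(2),   # classifier head (assumed); 2 classes
)

x = torch.rand(1, 3, 224, 224)
print(model(x).shape)   # torch.Size([1, 2])
```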
The cuckoo algorithm is realized to determine the regions of interest in the COVID-19 X-ray images. Here, the cuckoo-based hash function (CHF) is a contour metaheuristic process that utilizes constant variance as its method of search. Modeled on cuckoos searching for the strongest nest in which to lay their eggs, the method is used to obtain candidate pixel locations. Almost every pixel Pi is a candidate destination for applying the information gain and can potentially be selected from the pixel locations that meet the function's criteria. For the purposes of medical imaging, the X-ray pixel intensities Pi are assumed to be probable locations. Efficiency is greatly increased by placing the initial egg-nesting cuckoos throughout the whole X-ray spatial domain. In CHF, the intention to move to a destination is represented with a probability less than 1 in order to ensure that the total number of regions to assess remains constant. Additionally, an arbitrary position in the image is drawn and assigned.
The pixel selection over the X-ray image based on cuckoo hash functions is
modeled as:
Pi^(Ti+1) = Pi^(Ti) + ϕ · Loc(ρ, σ, τ)    (2)

ρ = Least( (ϕHu1(Pi) + ϕHu2(Pi))/2, (ϕSa1(Pi) + ϕSa2(Pi))/2, (ϕBr1(Pi) + ϕBr2(Pi))/2 )    (3)

σ = Least( (ωHu1(Pi) + ωHu2(Pi))/2, (ωSa1(Pi) + ωSa2(Pi))/2, (ωBr1(Pi) + ωBr2(Pi))/2 )    (4)

τ = Least( (ϑHu1(Pi) + ϑHu2(Pi))/2, (ϑSa1(Pi) + ϑSa2(Pi))/2, (ϑBr1(Pi) + ϑBr2(Pi))/2 )    (5)
where Ti stands for the event time period, Pi^Ti represents the chosen pixel location, ϕ defines the measured normal variance distance, Loc(ρ, σ, τ) expresses the location of the current pixel in terms of rows and columns, Hu denotes the hue value of the pixel location, Sa its saturation value, and Br its brightness value.
ϕHu1 = (Mean(Hu1(Pi)) − Hu(Pi)) / width    (6)

ϕHu2 = (Mean(Hu2(Pi)) − Hu(Pi)) / height    (7)

ωHu1 = Mean(Hu1(Pi)) (ϕHu1(Pi) − Hu(Pi)) / width    (8)

ωHu2 = Mean(Hu2(Pi)) (ϕHu2(Pi) − Hu(Pi)) / height    (9)

ϑHu1 = Mean(Hu1(Pi)) (ωHu1(Pi) − Hu(Pi)) / width    (10)

ϑHu2 = Mean(Hu2(Pi)) (ωHu2(Pi) − Hu(Pi)) / height    (11)

ϕSa1 = (Mean(Sa1(Pi)) − Sa(Pi)) / width    (12)

ϕSa2 = (Mean(Sa2(Pi)) − Sa(Pi)) / height    (13)

ωSa1 = Mean(Sa1(Pi)) (ϕSa1(Pi) − Sa(Pi)) / width    (14)

ωSa2 = Mean(Sa2(Pi)) (ϕSa2(Pi) − Sa(Pi)) / height    (15)

ϑSa1 = Mean(Sa1(Pi)) (ωSa1(Pi) − Sa(Pi)) / width    (16)

ϑSa2 = Mean(Sa2(Pi)) (ωSa2(Pi) − Sa(Pi)) / height    (17)

ϕBr1 = (Mean(Br1(Pi)) − Br(Pi)) / width    (18)

ϕBr2 = (Mean(Br2(Pi)) − Br(Pi)) / height    (19)

ωBr1 = Mean(Br1(Pi)) (ϕBr1(Pi) − Br(Pi)) / width    (20)

ωBr2 = Mean(Br2(Pi)) (ϕBr2(Pi) − Br(Pi)) / height    (21)

ϑBr1 = Mean(Br1(Pi)) (ωBr1(Pi) − Br(Pi)) / width    (22)

ϑBr2 = Mean(Br2(Pi)) (ωBr2(Pi) − Br(Pi)) / height    (23)
We realize the vital points of the X-ray image using the feature vector for pixels Pi at integer positions. The notion of conditional variance is used to enhance the stochastic search of the proposed CHF. At each phase, the distance of the normal distribution is set using a unique dissemination, calculated as
Loc(ρ, σ, τ) = (τ/σ) · lim_{ρ→∞} (1 + 1/ρ)^{2(ρ−σ)^{5/2}},  if 0 < ρ < σ < τ < ∞;  0, if the location condition is not satisfied.    (24)
The hypothesis for which a pixel in the X-ray image will be deselected is based
on the following formula:
Based upon the above equational model, the proposed cuckoo-based hash search
algorithm works as follows:
Algorithm 1 Cuckoo-Based Hash Search
1. Evaluate and obtain the dissemination points from the X-ray image using Eqs.
(2) to (23).
2. Declare the variance values for CHF: Loc(ρ , σ , τ ).
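One update step of the cuckoo-based hash search (Eq. 2) can be sketched as below. This is an illustration only: the feature callables are random stand-ins for the per-pixel quantities of Eqs. (6)-(23), and the Loc term is a simplified stand-in for Eq. (24).

```python
import random

def chf_update(p, phi, feat):
    """One position update, Eq. (2): Pi(Ti+1) = Pi(Ti) + phi * Loc(rho, sigma, tau)."""
    def avg(pair, q):                     # (f1(q) + f2(q)) / 2, as in Eqs. (3)-(5)
        f1, f2 = pair
        return (f1(q) + f2(q)) / 2.0

    rho = min(avg(feat["phiHu"], p), avg(feat["phiSa"], p), avg(feat["phiBr"], p))   # Eq. (3)
    sigma = min(avg(feat["omHu"], p), avg(feat["omSa"], p), avg(feat["omBr"], p))    # Eq. (4)
    tau = min(avg(feat["thHu"], p), avg(feat["thSa"], p), avg(feat["thBr"], p))      # Eq. (5)

    # Simplified stand-in for Eq. (24): nonzero only when 0 < rho < sigma < tau.
    loc = tau / sigma if 0 < rho < sigma < tau else 0.0
    return p + phi * loc                                                             # Eq. (2)

rng = random.Random(0)
names = ["phiHu", "phiSa", "phiBr", "omHu", "omSa", "omBr", "thHu", "thSa", "thBr"]
feat = {n: ((lambda q: rng.random()), (lambda q: rng.random())) for n in names}
new_p = chf_update(10.0, phi=0.5, feat=feat)
print(new_p)
```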
Training accuracy, validation accuracy, training loss, and validation loss are among the measures used to assess the model's performance. Accuracy is crucial when evaluating classification models: it is the ratio of correct predictions to the total number of predictions. Training accuracy is the accuracy measured on the instances the model was trained on. The accuracy of each epoch was recorded, and a final chart was generated; the graphs show that as the number of epochs grows, the model's accuracy grows as well, so training accuracy should be excellent. The greatest training accuracy recorded here is 99.14 percent. Training accuracy is critical in this setting, since the goal is to detect a larger number of valid COVID-19 instances. As illustrated in Fig. 8, the training accuracy indicates that the model is acquiring all the features correctly and that there will be minimal misinterpretation. Validation accuracy, also called testing accuracy, is the accuracy determined on data the model has not been trained on; it is evaluated only on sets the model has not seen, and its purpose is to determine how well the model generalizes. Validation accuracy ought to be lower than or equal to training accuracy; when there is a significant gap between validation and training accuracy, the model can be said to be overfitting. The figure illustrates the validation accuracy at each epoch, which is somewhat higher than the training accuracy. Validation accuracy is important in this context to accurately describe the test data, since both classes are significant: it essentially expresses how well the program will identify COVID-19 and regular cases in chest images. As a result, the system should be able to categorize a wider range of correct cases, as shown in Fig. 9. The error incurred on the training data is referred to as training loss. The loss is a metric that shows how poorly the model performs. When a
Fig. 8 Training accuracy per epoch, rising from about 0.92 to 0.961 over 12 epochs. Fig. 9 Validation accuracy per epoch, between about 0.965 and 0.99 over 10 epochs. Fig. 10 Training loss per epoch, falling from about 0.08 toward 0 over 12 epochs
model predicts exactly, its loss is zero; otherwise, the loss is larger. The basic goal of training a model is to reach a minimal value of the loss function. After every loss computation, the weights are updated to reduce the loss, so it is evident that the training loss must be kept to a minimum: the smaller the loss, the more accurate the system. The model's loss is observed at each epoch, and as the number of epochs increases, the loss decreases, although it rises slightly in certain circumstances. Training loss thus indicates how effectively the system is learning in each cycle and which parameters contribute, because the model then makes fewer errors in the next cycle and can successfully differentiate between the COVID-19 and regular cases. As demonstrated in Fig. 10, a lower loss indicates that the model is indeed very efficient and makes fewer mistakes in categorizing COVID-19 and regular instances. The validation loss is nearly identical to the training loss, except that it is computed on the validation set; during validation, the weights do not change. Validation loss should be comparable to training loss: the system is overfitting if the validation loss is much higher than the training loss, and underfitting if the validation loss is lower than the training loss. Validation loss must be kept to a minimum, and a tiny amount of overfitting can be tolerated. As illustrated in Fig. 11, the graph demonstrates the validation loss for each epoch, and the validation loss decreases as the number of epochs grows. Validation loss indicates how much error the system makes when identifying the COVID-19 and regular cases in the testing data: when the validation loss is minimal, the system makes fewer mistakes when categorizing the COVID-19 and regular cases.
Fig. 11 Validation loss per epoch, decreasing from about 0.02 toward 0 over 12 epochs
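The accuracy measure described above (correct predictions over total predictions) can be sketched as follows; the logits and labels here are stand-ins for a model's validation outputs.

```python
import torch

# Accuracy per batch: fraction of samples whose argmax class matches the label.
def accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    preds = logits.argmax(dim=1)            # predicted class per sample
    return (preds == labels).float().mean().item()

logits = torch.tensor([[2.0, 0.1], [0.3, 1.5], [0.9, 0.2], [0.1, 0.8]])
labels = torch.tensor([0, 1, 1, 1])
print(accuracy(logits, labels))             # 0.75
```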
6 Conclusion
With the COVID-19 outbreak, a global crisis emerged and affected people worldwide. Keeping up with the demand for medical supplies and testing kits is nearly impossible, even for the most developed countries. Because not enough testing kits are available, many infections go undiscovered, allowing COVID-19 infections to rise. Prevention is paramount when it comes to reducing spread and mortality rates. The computer-aided diagnosis system currently under development would use radiographs of patients' chest X-rays to predict COVID-19. COVID-19 diagnosis methods have gained much attention, with many researchers devoting their time to the problem.
The use of deep neural networks gives our approach a more comprehensive understanding of the spread and treatment of COVID-19. Deep feature extraction and the CSA approach were applied to identify coronavirus infection in chest X-ray images from the GitHub and Kaggle repositories. To extract the concepts, 11 pretrained CNN models serve as classifiers for the CSA. Statistical analysis is performed to identify the most effective classification pattern. The training and validation accuracy are improved using the cuckoo-based function. The accuracy of the proposed COVID-19 disease classification model is 98.54 percent.
References
1. Singh, A. K., Kumar, A., Mufti Mahmud, M., Kaiser, S., & Kishore, A. (2021). COVID-19
infection detection from chest X-ray images using hybrid social group optimization and support
vector classifier. Cognitive Computation, 1–13.
2. Dhiman, G., Chang, V., Singh, K. K., & Shankar, A. (2021). Adopt: Automatic deep learning
and optimization-based approach for detection of novel coronavirus covid-19 disease using
x-ray images. Journal of Biomolecular Structure and Dynamics, 1–13.
3. Anter, A. M., Oliva, D., Thakare, A., & Zhang, Z. (2021). AFCM-LSMA: New intelligent
model based on Lévy slime mould algorithm and adaptive fuzzy C-means for identification
of COVID-19 infection from chest X-ray images. Advanced Engineering Informatics, 49,
101317.
4. Altan, A., & Karasu, S. (2020). Recognition of COVID-19 disease from X-ray images by
hybrid model consisting of 2D curvelet transform, chaotic salp swarm algorithm and deep
learning technique. Chaos, Solitons & Fractals, 140, 110071.
5. Elaziz, M. A., Hosny, K. M., Salah, A., Darwish, M. M., Songfeng, L., & Sahlol, A. T. (2020).
New machine learning method for image-based diagnosis of COVID-19. PLoS One, 15(6),
e0235187.
6. Dev, K., Khowaja, S. A., Bist, A. S., Saini, V., & Bhatia, S. (2021). Triage of potential
COVID-19 patients from chest X-ray images using hierarchical convolutional networks. Neural
Computing and Applications, 1–16.
7. Dhiman, G., Kumar, V. V., Kaur, A., & Sharma, A. (2021). DON: Deep learning and
optimization-based framework for detection of novel coronavirus disease using X-ray images.
Interdisciplinary Sciences: Computational Life Sciences, 1–13.
8. Kavitha, S., & Inbarani, H. (2021). Bayes wavelet-CNN for classifying COVID-19 in chest
X-ray images. In Computational vision and bio-inspired computing (pp. 707–717). Springer.
9. Pathan, S., Siddalingaswamy, P. C., & Ali, T. (2021). Automated detection of Covid-19 from
chest X-ray scans using an optimized CNN architecture. Applied Soft Computing, 104, 107238.
10. El-Kenawy, El-Sayed, M., Mirjalili, S., Ibrahim, A., Alrahmawy, M., El-Said, M., Zaki, R.
M., & Metwally Eid, M. (2021). Advanced meta-heuristics, convolutional neural networks,
and feature selectors for efficient COVID-19 X-ray chest image classification. IEEE Access, 9,
36019–36037.
11. Alorf, A. (2021). The practicality of deep learning algorithms in COVID-19 detection:
Application to chest X-ray images. Algorithms, 14(6), 183.
12. Vrbančič, G., Pečnik, Š., & Podgorelec, V. (2020). Identification of COVID-19 X-ray images
using CNN with optimized tuning of transfer learning. In 2020 International Conference on
INnovations in Intelligent SysTems and Applications (INISTA) (pp. 1–8). IEEE.
13. Bahgat, W. M., Balaha, H. M., AbdulAzeem, Y., & Badawy, M. M. (2021). An optimized
transfer learning-based approach for automatic diagnosis of COVID-19 from chest x-ray
images. PeerJ Computer Science, 7, e555.
14. Rajpal, S., Lakhyani, N., Singh, A. K., Kohli, R., & Kumar, N. (2021). Using handpicked
features in conjunction with ResNet-50 for improved detection of COVID-19 from chest X-ray
images. Chaos, Solitons & Fractals, 145, 110749.
15. Toğaçar, M., Ergen, B., & Cömert, Z. (2020). COVID-19 detection using deep learning models
to exploit Social Mimic Optimization and structured chest X-ray images using fuzzy color and
stacking approaches. Computers in Biology and Medicine, 121, 103805.
16. Balachandar, A., Santhosh, E., Suriyakrishnan, A., Vigensh, N., Usharani, S., & Manju Bala, P.
(2021). Deep learning technique based visually impaired people using YOLO V3 framework
mechanism. In 2021 3rd International Conference on Signal Processing and Communication
(ICPSC) (pp. 134–138). IEEE.
17. Gopalakrishnan, A., Manju Bala, P., & Ananth Kumar, T. (2020). An advanced bio-inspired
shortest path routing algorithm for SDN controller over VANET. In 2020 International
Conference on System, Computation, Automation and Networking (ICSCAN) (pp. 1–5). IEEE.
18. Thompson, R. N. (2020). Novel coronavirus outbreak in Wuhan, China, 2020: Intense
surveillance is vital for preventing sustained transmission in new locations. Journal of Clinical
Medicine, 9(2), 1–8.
19. Ucar, F., & Korkmaz, D. (2020). COVIDiagnosis-Net: Deep Bayes-SqueezeNet based diagno-
sis of the coronavirus disease 2019 (COVID-19) from X-ray images. Medical Hypotheses, 140,
109761.
20. Shams, M. Y., Elzeki, O. M., Elfattah, M. A., Medhat, T., & Hassanien, A. E. (2020). Why
are generative adversarial networks vital for deep neural networks? A case study on COVID-
19 chest X-ray images. In Big data analytics and artificial intelligence against COVID-19:
Innovation vision and approach (pp. 147–162). Springer.
21. Pereira, R. M., Bertolini, D., Teixeira, L. O., Silla Jr, C. N., & Costa, Y. M. G. (2020).
COVID-19 identification in chest X-ray images on flat and hierarchical classification scenarios.
Computer Methods and Programs in Biomedicine, 194, 105532.
22. Cohen, J. P., Morrison, P., & Dao, L. (2020). COVID-19 image data collection.
arXiv:2003.11597.
23. Kermany, D., Zhang, K., & Goldbaum, M. (2018). Labeled optical coherence tomography
(OCT) and chest X-ray images for classification. Mendeley data, 2.
24. Reshi, A. A., Rustam, F., Mehmood, A., Alhossan, A., Alrabiah, Z., Ahmad, A., Alsuwailem,
H., & Choi, G. S. (2021). An efficient CNN model for COVID-19 disease detection based on
X-ray image classification. Complexity, 2021, 1.
Initial Stage Identification of COVID-19
Using Capsule Networks
1 Introduction
Coronavirus disease is an exceedingly infectious outbreak that has spread rapidly worldwide, with common symptoms such as fever, toxicity, myalgia, or weariness. COVID-19 is treated differently depending on the degree of the illness, although typically antibiotics, cough medicines, antipyretics, and painkillers are effective [1]. The initial phase of treating any disease is its discovery. COVID-19 is mostly detected by a swab test along with an X-ray or CT scan of the lung [2]. Among these medical assessments, chest X-rays are the cheapest for a typical person. A major barrier to adopting radiographic imaging is the scarcity of adequately qualified radiologists who can interpret X-ray images in a timely and reliable manner. Artificial intelligence (AI) has been widely used to speed up biological research. AI is frequently employed with deep learning algorithms in numerous applications such as image detection, data categorization, and image segmentation [3, 4]. Pneumonia can occur in individuals infected with COVID-19 when the virus progresses to the lungs. Many in-depth studies have detected the condition using chest X-ray imagery [5]. Computed tomography and X-ray imaging play a vital role in the early diagnosis and treatment of COVID-19 illness [4]. Because X-ray images are cheaper and faster and involve less radiation exposure, they are preferred to CT images [3, 5]. However, pneumonia cannot be diagnosed mechanically: white spots on X-ray images must be analyzed by a specialist and explained in depth. However, these patches might be mistaken
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 203
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in
Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_9
204 S. Ganesan et al.
2 Literature Review
Over the years, humankind has seen a variety of pandemics, some more damaging to humans than others. We now face the COVID-19 coronavirus, a new unseen opponent, and this is a difficult war. The COVID-19 pandemic continues to have a disastrous impact on the health and well-being of the global population, with people infected by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The purpose of this literature review is to provide the latest results on the basic science of SARS-CoV-2. On 31 December 2019, the Wuhan Health Commission in Hubei, China, warned the National Health Commission, the China CDC, and the WHO of a cluster of 27 pneumonia cases of undetermined origin [6]. These patients experienced a host of symptoms, including fever, dyspnea, and dry cough, and radiographic examinations found bilateral ground-glass opacity in the lungs. Due to its high population density and proximity to a market that sold live animals, Wuhan became the hub of the human–animal connection. Furthermore, the fast transmission in Wuhan was supported by the absence of early containment, owing to the failure to accurately trace the exposure history in early patient cases. This led to the World Health Organization's (WHO) announcement of the viral pneumonia outbreak on 30 January 2020. On 11 March 2020, the WHO declared coronavirus disease 2019 (COVID-19) a pandemic, based on the global logarithmic increase in cases. On January 7, 2020, the China CDC had identified the virus, known as the novel coronavirus 2019 (2019-nCoV). The SARS-CoV-2 virus is closely related to the 2002 SARS coronavirus (SARS-CoV-1). A variety of distinct coronaviruses can cause the common cold; such a coronavirus can become a dangerous infectious virus if it discovers a mammalian reservoir that provides an appropriate cellular environment in which to multiply and to acquire a series of advantageous genetic changes. Similar to the SARS-CoV-1 and MERS-CoV viruses, the origin of the SARS-CoV-2 genome has been traced to bats [7]. Coronavirus disease 2019 (COVID-19) is the outcome of a SARS-CoV-2 viral infection. The COVID-19 symptomatology was thoroughly studied in the joint WHO–China publication on COVID-19 [8]. In 85% of instances, COVID-19 patients develop pyrexia over the course of their condition, whereas only 45% are initially febrile [9]. Cough is also seen in 67.7% of the patients, and 33.4% produce sputum. Dyspnea, sore throat, and nasal congestion were noted in 18.6%, 13.9%, and 4.8%, respectively [10]. Constitutional symptoms, such as muscle or bone pain, chills, and headache, are found in 14.8%, 11.4%, and 13.6% of the patients, respectively [9]. Gastrointestinal symptoms such as nausea or vomiting and diarrhea are recorded in 5% and 3.7% of the patients, correspondingly. Even in many rich countries, the health system is on the point of collapse as demand for intensive care units simultaneously increases; the number of patients in intensive care units with COVID-19 pneumonia is growing. In recent years, deep learning algorithms have continued to show remarkable results both in medical image processing and in a number of other fields, and tests are carried out to obtain meaningful findings from medical data using deep learning algorithms. Effective screening of infected people using a primary screening methodology, i.e., radiological imaging with chest X-rays, is a major milestone in the struggle against COVID-19. Early studies showed that COVID-19 patients exhibit abnormalities in their chest X-ray imagery. As a result, a wide range of deep learning approaches for artificial intelligence (AI) have been created, with encouraging findings regarding accuracy in detecting COVID-19-infected persons using thoracic X-rays. However, these advanced AI systems have remained closed source and are not openly available to the scientific community for further study and development [10]. As an automated prediction technique for COVID-19, a deep convolutional network based on pretrained transfer-learning models and chest X-ray images has been developed. Deep learning is a machine learning subdiscipline inspired by the structure of the brain. Pretrained models such as ResNet50, InceptionV3, and Inception-ResNetV2 were used to increase prediction accuracy on small X-ray datasets [20]: (i) the recommended models have a fully end-to-end design that removes the necessity for human feature extraction and selection; (ii) ResNet50 is shown to be the most effective of the three pretrained models; (iii) chest X-ray photos are the best tool to identify COVID-19; and (iv) pretrained models have been proven to generate great results on limited datasets [21, 22]. Washing hands for at least 20 seconds after visiting public places is one of the preventative strategies advised by the World Health Organization; soap, or hand sanitizer with at least 60% ethanol, should be used [11]. It is also a good idea to keep your hands away from the T-zone of the face (eyes, nose, and mouth), as the virus penetrates the upper respiratory system. Avoid interaction with those
3 Dataset Description
Fig. 1 Zero and affine transformed X-ray image. (a) Zero degree centered X-ray. (b) Affine
transformed X-ray
for pretraining is that the nature of the pictures in that dataset (natural images) is completely different from that of the COVID-19 X-ray dataset. It is believed that using a model pretrained on comparable X-ray images would give COVID-CAPS a superior boost. For pretraining with an external dataset, the whole COVID-CAPS model is initially trained on the external data, with the number of final Capsules set to the number of output classes in the external set. To fine-tune the model on the COVID-19 dataset, the final Capsule layer is replaced with two Capsules representing positive and negative COVID-19 instances. All of the other Capsule layers are fine-tuned, whereas the conventional layers are kept at the pretraining weights. In our work, we used the Kaggle COVID-19 Radiography database, which consists of two classes, viz. COVID-19 and Normal, with 208 samples from each class for training, i.e., 416 training samples altogether. For testing, we created two sets of data, one with 10° of rotation and one with 30° of rotation. Each test set uses 32 samples: 23 rotated by the corresponding angle and 9 that are not rotated. Figure 1a shows a zero-degree centered X-ray image, and Fig. 1b shows an affine-transformed X-ray image.
4 Methodology
The convolutional layer computes a simple dot product between a defined region of the image and a collection of kernel functions called filters [15]. In most situations, an image of dimensions M × N × C is used as input: M and N are the length and breadth of the picture, and the number of color channels is usually C = 3 (red, green, and blue). The feature map is the output of the convolution layer, and its dimensions, W × Q × K, are given by Eq. 1. The W and Q values depend on parameters such as the filter size (F), zero padding (P), stride (S), and number of filters (K) [16]:

W = (M − F + 2P) / S + 1    (1)
Filter size (F) indicates the size of the filter or kernel used for convolution. Kernel functions are often odd-sized square matrices, such as 3 × 3, 5 × 5, or 7 × 7. An odd value is desirable so that the center of the kernel matrix can be placed on the pixel being operated on.
4.1.2 Stride(S)
A filter or kernel matrix must be translated across the input matrix vertically from top to bottom and horizontally from left to right, covering all the elements of the input matrix [15]. Stride controls this translational movement: it represents the number of steps the kernel matrix takes at each move. For example, applying a 3 × 3 kernel function over a 7 × 7 input with stride S = 2 results in an output of size 3 × 3.
map to half of its original size. For example, a 2 × 2 pooling filter applied to a 224 × 224 × 64 feature map reduces the output feature map to 112 × 112 × 64; the depth of the feature map remains unaltered.
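The halving described above can be verified directly in PyTorch (an illustration):

```python
import torch
import torch.nn as nn

# 2x2 max pooling halves the spatial size while keeping the depth unchanged.
pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.rand(1, 64, 224, 224)
print(pool(x).shape)    # torch.Size([1, 64, 112, 112])
```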
The ReLU function returns values in the range [0, ∞) [19, 20]. It is the most often utilized activation method: the negative values in the input matrix are fully deleted, so this activation may be implemented with a simple thresholding scheme. This function reduces the amount of calculation by avoiding the simultaneous activation of all neurons, as indicated in Eq. 2. Compared to the sigmoid and tangent functions, its convergence is also quicker. When the gradient is zero, this function cannot update all of the weights during backpropagation, and when employed with a fast learning rate, it results in a higher number of dead neurons [17].
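ReLU's thresholding behavior, as an illustration in PyTorch:

```python
import torch
import torch.nn.functional as F

# ReLU zeroes all negative inputs (simple thresholding) and passes
# non-negative values through unchanged.
x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
y = F.relu(x)
print(y)    # negatives become 0; 1.5 and 3.0 pass through
```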
The art and science of teaching computers to make judgments based on data
without being explicitly programmed is known as machine learning. In general,
there are three forms of learning in machine learning: supervised, unsupervised,
and reinforcement learning. We employed a convolutional neural network based
on supervised machine learning in this chapter. The supervised machine learning
model [8] is trained on a labeled dataset to predict the result of the sample data.
The suggested methodology for identifying different pneumonia instances, such as
viral and bacterial pneumonia in COVID-19, is described in the next section.
5 Proposed Work
Convolutional neural networks (CNNs) form the basic computing structures for
deep learning architectures. While CNNs try to grasp the generalized features of
an image, they do not focus on geometrical or orientational information such as
relative size of the features, angle of rotation of the features, etc. The idea of Capsule Networks is to capture such information explicitly, wherein the dimension of each Capsule equals the number of such orientational attributes captured. Each Capsule is a vector (a set of feature maps in the case of multi-dimensional data), where the size of the vector (the number of feature maps) that constitutes a Capsule is equal to its dimension.
is equal to its dimension. An encoder and a decoder are the two most important
components of a Capsule Network, with a total of six layers. The first three
layers constitute the encoder, which accepts the input picture and transforms it into a
16-dimensional vector. The convolutional layer, which is the first
layer of the encoder, extracts the picture’s fundamental characteristics. The Primary
Caps Network is the second layer, and it takes those fundamental traits and looks
for more intricate patterns among them. It might detect the spatial link between
specific strokes, for example. In the Primary Caps Network, different datasets
have varying numbers of Capsules; for example, the MNIST dataset includes 32
Capsules. The Digit Caps Network is the third layer, and the number of Capsules
in it changes as well. The encoder produces a 16-dimensional vector that is sent
to the decoder after these layers. The decoder is made up of three levels. It takes
the 16-dimensional vector and uses that data to try to rebuild the same image
from scratch. The network becomes more resilient because of its ability to generate
Initial Stage Identification of COVID-19 Using Capsule Networks 211
[Figure: block diagram of the Capsule Network, showing successive d-dimensional Capsule layers (of sizes B, C, and D) connected by routing, ending in an FCCaps layer]
predictions based on its prior knowledge. The input image or a feature map first
undergoes the process of forming Capsules using a transformation matrix W, and
the obtained prediction vectors Û are combined to compute the CapsNet output, S. Finally,
a squash function is applied to obtain the activity vector, V. This ensures that the
output vector's length is driven toward zero when it is small and toward one when it is
large. The block diagram for the Capsule Network is given in Fig. 2.
$\hat{u}_{j|i} = W_{ij}\, u_i$ (3)

$s_j = \sum_i c_{ij}\, \hat{u}_{j|i}$ (4)

$v_j = \dfrac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \dfrac{s_j}{\|s_j\|}$ (5)
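The squash nonlinearity of Eq. (5) can be sketched in NumPy (illustrative, with a small `eps` term added for numerical safety; not the chapter's code):

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash of Eq. (5): preserves a vector's direction while mapping
    its length into [0, 1). eps guards against division by zero."""
    norm_sq = float(np.sum(s ** 2))
    return (norm_sq / (1.0 + norm_sq)) * (s / np.sqrt(norm_sq + eps))

short = squash(np.array([0.01, 0.0]))   # small input -> length near 0
long_ = squash(np.array([100.0, 0.0]))  # large input -> length near 1
print(np.linalg.norm(short), np.linalg.norm(long_))
```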
3 × 3, and the second convolutional layer with 256 filters of kernel size 3 × 3.
The extracted features are passed into a Primary Caps layer that creates Capsules
of dimension 8 by using a convolution of kernel size 9 × 9. These Capsules
are passed on to the Digit Caps layer where the Dynamic Routing is carried out
with routings for each update to be 3. The generated output called the activity
vector gives a 16-dimensional output corresponding to each class. The magnitude of
each of these vectors is passed through a SoftMax layer to obtain the classification
probabilities. A detailed block diagram for our proposed model is given in Fig. 3.
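The final step, converting per-class activity-vector magnitudes into probabilities via softmax, can be sketched as follows (the `class_probabilities` helper and the random vectors are hypothetical stand-ins, not model outputs):

```python
import numpy as np

def class_probabilities(activity_vectors):
    """Hypothetical helper: one 16-D activity vector per class; the vector
    magnitudes are pushed through a softmax to get class probabilities."""
    magnitudes = np.linalg.norm(activity_vectors, axis=1)
    exp = np.exp(magnitudes - magnitudes.max())  # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2, 16))  # stand-ins for the Normal / COVID-19 classes
probs = class_probabilities(vectors)
print(probs, probs.sum())
```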
The architecture was decided based on the experimental results
given in Table 1. Experiments were conducted for 52 epochs with a batch size of 32
and an Adam Optimizer with a learning rate of 0.0001. Due to the increased training
loss associated with increasing the number of epochs, the training was limited to 52
epochs. Initially, 416 training samples and 32 testing samples of dimension
128 × 128 × 1 were used with a batch size of 32. The highest average is
with 512 filters in Conv1 and 256 filters in Conv2, and the corresponding number
of learnable parameters is 1,935,136. We used the Adam optimizer with a learning
rate of 10⁻³, 100 epochs, and a batch size of 16. The training dataset (described
in Sect. 4) was separated into two parts: training (70%) and validation (30%), with
the training set used to train the model and the validation set used to choose the
best model. The chosen model is then assessed on the testing set. To report
performance, the following four measures are used: accuracy, sensitivity, specificity,
and AUC (area under the curve). The results are presented next. We utilized the same
dataset as Reference [10] to carry out
our studies. This information was gathered from two publicly available sources
[19, 20]. Datasets of chest X-rays are accessible. Normal and COVID-19 are two
separate labels in the produced dataset. As the title suggests, the primary purpose of
this research is to discover COVID-19 positive patients. We divided the labels into
two categories: positive and negative.
The confusion matrix is used to evaluate the efficacy of any model. The number
of properly predicted outcomes and the number of wrongly predicted outcomes are
separated into classes in the confusion matrix, which represents the performance of
a classifier.
$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (6)
5.3.2 Precision
Precision is the percentage of correct positive predictions in relation to the total
number of positive predictions.
$\text{Precision} = \dfrac{TP}{TP + FP}$ (7)
5.3.3 Recall
It is the percentage of correct positive predictions made out of the total number of
samples in a given class. It is also known as the true positive rate (TPR).
$\text{Recall} = \dfrac{TP}{TP + FN}$ (8)
5.3.4 F1-Score
F1-score denotes the harmonic mean between the recall and precision values.
$F1 = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ (9)
The false positive rate (FPR) is defined as

$\text{FPR} = \dfrac{FP}{FP + TN}$ (10)
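The metrics of Eqs. (6)-(10) can be computed directly from confusion-matrix counts; the sketch below uses illustrative counts, not results from this chapter:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Eqs. (6)-(10) evaluated from raw confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                       # true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)                          # false positive rate
    return accuracy, precision, recall, f1, fpr

# Illustrative counts only, not results from this chapter:
print(confusion_metrics(tp=40, tn=45, fp=5, fn=10))
```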
critical because the better the sensitivity, the less likely it is that COVID-19 patients
would be mistakenly identified. DenseNet121 identified 22 COVID-19 patients
as Normal, whereas ResNet50 identified 24 COVID-19 patients as Normal, so
ResNet50 produced more false detections than DenseNet121. This indicates
that DenseNet121 outperforms ResNet50 in identifying COVID-19, which might be
attributed to DenseNet121's ability to harvest deeper features through
network-level deepening. DenseNet121 is also superior to ResNet50 because of its
strong feature transfer and feature reuse abilities.
Even so, DenseNet121 alone does not deliver sufficient results. Table 1 further
shows that the combination of a CNN and a Capsule Network is significantly
better at identifying COVID-19 than the CNN alone. DenseNet121 is combined
with CapsNet to form DenseNet121-CapsNet, and ResNet50 is combined with CapsNet to form
ResNet50-CapsNet, and these two models are tested for their ability to identify COVID-19.
When combined, these two frameworks outperform the ResNet50 and DenseNet121
models. As seen in Fig. 5, both CapsNet and ResNet50 are very sensitive to COVID-
19 recognition. ResNet 50, on the other hand, misses five COVID-19 patients, but
CapsNet misses only three, implying that CapsNet is better. Our research aims to
improve the detection sensitivity of COVID-19 while also determining how long
it takes to train. Under the same settings and with the same testing equipment, the
same dataset is utilized to train ResNet 50 and CapsNet. Training ran for 30 epochs in all, with
ResNet 50 taking 12 hours, 47 minutes, and 56 seconds and CapsNet taking 6
hours, 3 minutes, and 24 seconds. CapsNet thus requires less than half the training time
of ResNet 50, a considerable time saving.
Using the aforementioned dataset, Table 6 shows that COVID-CAPS obtained an
accuracy of 62.50%, a precision of 57.50%, a recall of 62.50%, and an F1-score
of 62.50% for 30-degree rotation, and COVID-CAPS obtained an accuracy of
68.4%, a precision of 67.60%, a recall of 68.75%, and an F1-score of 68.75% for
10-degree rotation. False positive instances were examined further to see which
categories are most likely to be misclassified as COVID-19. Normal instances
account for 54% of false positives, while COVID patients account for just 17%.
We compare our results to those of Reference [12],
which employed a binarized version of the same dataset, as given in Figs. 4 and 5. In
terms of precision and specificity, COVID-CAPS exceeds its equivalent. The model
suggested in Reference [12], which comprises 23 million trainable parameters,
has a greater sensitivity. Reference [6] presents a further study on the binarized version
of the same X-ray images. We did not compare the COVID-CAPS performance
to this study since the negative label only comprises normal patients (as opposed
to all normal, bacterial, and non-COVID viral cases being labeled as negative).
[Figure: statistical measures (accuracy, precision, recall, and F1-score) of the proposed architecture for different combinations of Conv1 and Conv2 filter sizes under rotation testing]
Fig. 5 Statistical measures of the proposed architecture with 30-degree rotation testing
[Figure: a Normal test X-ray with predicted scores: Normal 100%, Covid 0%]
6 Conclusion
In this chapter, we explore the use of CapsNet for capturing rotational variations
in COVID chest X-ray data to mimic real-world scenarios. We train our model with
chest X-ray images from the Kaggle COVID-19 Radiography dataset that are all
centered at 0°. In order to verify that the orientation details are efficiently captured
by CapsNet, we test our model with two types of testing data: 10° rotated and
[Figure: a Covid test X-ray with predicted scores: Normal 0%, Covid 100%]
[Figure: learning curve of image-wise accuracy vs. training time in seconds, showing train values, validation values, and the best saved model; the curve is plotted using a 50% score threshold for all classes in multi-label classification]
30° rotated. Based on the results, it is clear that CapsNet is able to capture affine-transformed
data even though it has not been trained with such data. Comparing the
accuracy with standard architectures, it is evident that CapsNet is more efficient
for COVID-19 chest X-ray classification, especially considering the computational
efficiency of the proposed architecture. As future scope, we wish to observe the
performance of CapsNet on a larger dataset to be able to generalize the results.
Also, modifying the architecture to include more convolutional layers might
help extract a different set of features for better classification.
[Figure: cross-entropy loss vs. training time in seconds]
[Figure: precision-recall curves over all classes, panels (a) and (b)]
Fig. 10 (a) and (b) show the AUC curves for both training and validation of the images
References
1. Guan, W., Ni, Z., Hu, Y., Liang, W., Ou, C., He, J. X., Liu, L., Shan, H., Lei, C.L., Hui, D.S.,
Du, B., Li, L. J., Zeng, G., Yuen, K. Y., Chen, R. C., Tang, C. L., Wang, T., Chen, P. Y., Xiang,
J., . . . , Zhong, N. S. (2020). Clinical characteristics of coronavirus disease 2019 in China. New
England Journal of Medicine, 382(18), 1708–20.
2. Corman, V. M., Landt, O., Kaiser, M., Molenkamp, R., Meijer, A., Chu, D. K. W., Bleicker, T.,
Brünink, S., Schneider, J., Schmidt, M. L., Mulders, D. G. J. C., Haagmans, B. L., van der Veer,
B., van den Brink, S., Wijsman, L., Goderski, G., Romette, J.-L., Ellis, J., Zambon, M., . . . ,
Drosten, C. (2020). Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR.
Eurosurveillance, 25(3), 2000045.
3. Togacar, M., Ergen, B., & Cömert, Z. (2020). Application of breast cancer diagnosis based on
a combination of convolutional neural networks, ridge regression and linear discriminant anal-
ysis using invasive breast cancer images processed with autoencoders. Medical Hypotheses,
135, 109503.
4. Liu, X., Deng, Z., & Yang, Y. (2019). Recent progress in semantic image segmentation.
Artificial Intelligence Review, 52(2), 1089–1106.
5. Jaiswal, A. K., Tiwari, P., Kumar, S., Gupta, D., Khanna, A., & Rodrigues, J. J. (2019).
Identifying pneumonia in chest X-rays: A deep learning approach. Measurement, 145, 511–
518.
6. Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic routing between capsules. In Advances
in Neural Information Processing Systems (pp. 3856–3866).
7. Apostolopoulos, I. D., & Mpesiana, T. A. (2020). COVID-19: Automatic detection from
x-ray images utilizing transfer learning with convolutional neural networks. Physical and
Engineering Sciences in Medicine, 43, 635–40. https://doi.org/10.1007/s13246-020-00865-4
8. Kumar, G. S. P., Variyar, V. S., & Soman, K. P. (2019). Preprocessing techniques and area
estimation of ECG from a wireless ECG patch. Journal of Physics: Conference Series, 1362,
012107. https://doi.org/10.1088/1742-6596/1362/1/012107
9. Hemdan, E. E., Shouman, M. A., & Karar, M. E. (2020). COVIDX-Net: a framework of deep
learning classifiers to diagnose COVID-19 in X-ray images. arXiv:2003.11055.
10. Narin, A., Kaya, C., & Pamuk, Z. (2020). Automatic detection of coronavirus disease (COVID-
19) using X-ray images and deep convolutional neural networks. arXiv:2003.10849.
11. Wang, L., & Wong, A. (2020). COVID-Net: a tailored deep convolutional neural network
design for detection of COVID-19 cases from chest radiography images. arXiv:2003.09871.
12. Sethy, P. K., & Behera, S. K. (2020). Detection of coronavirus disease (COVID-19) based on
deep features. https://doi.org/10.20944/preprints202003.0300
13. Afshar, P., Heidarian, S., Naderkhani, F., Oikonomou, A., Plataniotis, K. N., & Mohammadi,
A. (2020). COVID-CAPS: A capsule network-based framework for identification of COVID-19
cases from X-ray images. arXiv:2004.02696.
14. Mobiny, A., Cicalese, P. A., Zare, S., Yuan, P., Abavisani, M., Wu, C. C., Ahuja, J., de Groot,
P. M., & Van Nguyen, H. (2020). Radiologist-level COVID-19 detection using CT scans with
detail-oriented capsule networks. arXiv:2004.07407.
15. Pranav, J. V., Anand, R., Shanthi, T., Manju, K., Veni, S., & Nagarjun, S. (2020). Detection
and identification of COVID-19 based on chest medical image by using convolutional neural
networks. International Journal of Intelligent Networks, 1, 112–118.
16. Zheng, C., Deng, X., Fu, Q., Zhou, Q., Feng, J., Ma, H., Liu, W., & Wang, X. (2020). Deep
learning-based detection for COVID-19 from chest CT using weak label. medRxiv. https://doi.
org/10.1101/2020.03.12.20027185
17. Song, Y., Zheng, S., Li, L., Zhang, X., Zhang, X., Huang, Z., Chen, J., Zhao, H., Wang, R.,
Chong, Y., Shen, J., Zha, Y., & Yang, Y. (2020). Deep learning enables accurate diagnosis of
novel coronavirus (COVID-19) with CT images. medRxiv. https://doi.org/10.1101/2020.02.23.
20026930
18. Shanthi, T., Anand, R., Annapoorani, S., & Birundha, N. (2023). Analysis of phonocardiogram
signal using deep learning. In D. Gupta, A. Khanna, S. Bhattacharyya, A. E. Hassanien,
S. Anand, & A. Jaiswal (Eds.), International conference on innovative computing and
communications (Lecture notes in networks and systems) (Vol. 471). Springer. https://doi.org/
10.1007/978-981-19-2535-1_48
19. Anand, R., Sowmya, V., Menon, V., Gopalakrishnan, A., & Soman, K. P. (2021). Modified
VGG deep-learning architecture for COVID-19 classification using chest radiography images.
Biomedical and Biotechnology Research Journal (BBRJ), 5(1), 43.
20. Ramakrishnan, R., Vadakedath, A., Modi, A. J., Sajith Variyar, V. V., Sowmya, V., Gopalakr-
ishnan, E. A., & Soman, K. P. (2023). CT image enhancement using Variational mode
decomposition for AI-enabled COVID classification. In Artificial Intelligence on Medical Data
(pp. 27–37). Springer.
21. Pandianchery, M. S., Sowmya, V., Gopalakrishnan, E. A., & Soman, K. P. (2022). Long short-
term memory-based recurrent neural network model for COVID-19 prediction in different
states of India. In Emerging Technologies for Combatting Pandemics (pp. 245–270). Auerbach
Publications.
22. Kandasamy, S. K., Maheswaran, S., Karuppusamy, S. A., Indra, J., Anand, R., Rega, P., et al.
(2022). Design and fabrication of flexible Nanoantenna-based sensor using graphene-coated
carbon cloth. Advances in Materials Science & Engineering.
Deep Learning in Autoencoder
Framework and Shape Prior for Hand
Gesture Recognition
1 Introduction
For the past several years, humans have interacted with computers and machines
using wired devices such as the keyboard and mouse, as these have been adequate
for performing most tasks. The advancement in science and technology has led to
the invention of complex and embedded systems requiring faster interfacing
devices. These input devices are well established but have limitations when it
comes to naturalness and speed of human-machine interaction. Visual interpretation
of gestures can be useful in preserving the naturalness and ease of interaction.
Gesture is a type of nonverbal communication where a human being communicates
with the help of physical positions or movements of any body part either in place
of or in concurrence with speech. Hand gestures include physical positioning or
movement of fingers and sometimes the entire hand. Gesture recognition is a
subject in the computer vision and communication fields that aims to recognize
human gestures using mathematical computation. A gesture recognition system can
B. N. Subudhi
Department of Electrical Engineering, Indian Institute of Technology Jammu, Nagrota, Jammu,
India
T. Veerakumar () · S. R. Harathas · R. Prabhudesai
Department of Electronics and Communication Engineering, National Institute of Technology
Goa, Farmagudi, Ponda, Goa, India
e-mail: [email protected]
V. Kuppili
Department of Computer Science and Engineering, National Institute of Technology Goa,
Farmagudi, Ponda, Goa, India
V. Jakhetiya
Department of Computer Science & Engineering, Indian Institute of Technology Jammu,
Nagrota, Jammu, India
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 223
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in
Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_10
224 B. N. Subudhi et al.
2 State-of-the-Art Techniques
Krueger et al. [3] considered work on artificial reality, one of the fundamental
studies in which the user tried to communicate with the digital world. Oka et al.
[4] highlighted the various techniques that users can employ to communicate with
the outside world and also discussed fingertip movements in images. They showed a
methodology for recognizing symbolic gestures based on fingertip motions. The
authors used explicit color markers and invasive devices to detect the finger
movements. They were able to detect the movements even in complicated backgrounds,
and the detection was independent of the illumination. Invasive devices are
hardware devices placed on the human body (in this case, they were directly
placed on the hand to detect the
gesture). Mitra and Acharya [5] worked on gesture recognition by interpreting
the various signs made by an individual involving body parts like the hands,
arms, and face, with importance given to hand and facial gestures. The authors
used the hidden Markov model (HMM), which is mostly used to handle
spatiotemporal variability. Freeman and Roth [6] gave a view of histograms of
the oriented gradients for recognizing hand gestures and presented a method to
recognize hand gestures developed by McConnell [7]. The said scheme uses local
orientation histograms. The features were extracted from histograms of oriented
gradients, which were further used to classify and divide the gestures into various
classes. Stergiopoulou and Papamarkos [8] used a neural network to recognize hand
gestures and used the YCbCr color model instead of RGB. The advantage of the
YCbCr model is that the hand can be segmented even if the lighting conditions are
poor, and it can be used for complex backgrounds. Atsalakis et al. [9] proposed a
theory for color estimation and color reduction. Both color and spatial features are
given as input to the above techniques, and similarity functions are used for vector
comparison. Thus, the above algorithm can be applied to any model independent of
the background. Chen et al. [10] worked on a hand gesture recognition system that consisted
of four stages. The first stage included acquiring the image from the camera and
tracking the moving hand to obtain the segmented hand region. This was followed
by extraction of features from the segmented hand image. Feature extraction
included the Fourier descriptors to obtain the spatial features. The temporal features
were obtained from the motion analysis of the hand gesture. The feature vector
consists of the temporal and the spatial features that were obtained in the previous
steps. The HMM was used for hand gesture recognition. An accuracy of 90% was obtained for 20
classes. In the past years, the research community has witnessed a substantial amount
of work on hand gesture recognition based on neural networks. One of the
significant works in hand recognition was proposed by Symeonidis [11], which uses a
histogram of oriented gradients to obtain the feature vector. Malik [12] developed a
method that captures and recognizes hand gestures. The HMM is used for gesture
recognition. Using an HMM, it is difficult to detect the initial and final points, so
the Baum-Welch algorithm is used where the unknown features of the HMM can be
obtained. Hasan [13] put forward a hand gesture recognition technique using depth
camera. Depth is defined as the distance of the imaged object from the camera.
The features used for classifying the images into different classes are area and the
orientation. To find the depth of the image, a Kinect sensor can be used to track
the body movement of the person. A real-time system was proposed by Tang
et al. [14] to recognize the gestures. Both color and depth are used as features of
the image and classified using deep neural networks. Use of principal component
analysis for hand gesture recognition is also well cited [15].
Gesture recognition with skin color detection has gained popularity in the last
decade [16]. A gesture recognition scheme is proposed by Kawulok et al. [17] where
a new skin detection algorithm is used in color images with spatial analysis by newly
proposed texture-based discriminative skin-presence features. The feature fusion
scheme is also explored in this regard for gesture recognition. Yao and Fu [18]
proposed hand gesture recognition scheme, where a semiautomatic labeling is used
in RGB color and depth data to label hand patches. The 3D contour model is used to
acquire the hand region, and nearest neighbor method is used for correspondence. A
new superpixel-based hand gesture recognition scheme is developed by Wang et al.
[19], where the skeleton information from Kinect is used for extraction of the hand
region. The textures and depths are represented in the form of superpixels to retain
the shape information. The earth mover's distance is used to measure the dissimilarity
between the hand gestures. Chen et al. [20] proposed a gesture recognition system
using an interactive image segmentation scheme. The authors used a Gaussian
mixture model for image modeling, and the expectation-maximization algorithm is used
to learn its parameters. A Gibbs random field is used for image segmentation.
[Figure: block diagram of the proposed system: input hand gesture image, preprocessing, feature extraction, classification using deep NN, and recognized gesture output]
feature vectors are used to train the model using a deep learning framework,
based on which new hand gesture images from the test set are classified into a
predefined finite number of classes.
3.1 Preprocessing
A good hand gesture recognition system has to tackle many difficulties like
illumination variations, complex backgrounds, and scaling. Illumination variations
may badly affect the extracted hand skin region due to different lighting
conditions. In complex backgrounds, images contain other objects in the scene along
with hand pixels. These objects may contain color similar to skin that leads to the
problem of misclassification. The hand poses have different sizes in the gesture
image, which gives rise to a scaling effect. Hand gesture images vary with respect to size
and the quality, depending on the image capturing source. Hand gesture images are
also affected by the background and the lighting conditions. The background pixels
may sometimes resemble the hand pixels, which makes it difficult to recognize
the gesture. Image of the same hand gesture taken in different lighting conditions
may give different results. Therefore, it is necessary to convert all the images into a
suitable form where the hand pixels are easily separable from the non-hand pixels.
In the proposed scheme, we have used the following steps for preprocessing: color
conversion, background removal, bounding box, and resizing.
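Part of this chain (thresholding, bounding box, resizing) can be sketched with NumPy alone; this is a simplified stand-in in which a fixed threshold replaces Kapur's thresholding, the morphological cleanup is omitted, and `preprocess` is a hypothetical helper:

```python
import numpy as np

def preprocess(gray, thresh=0.5, out_size=32):
    """Simplified preprocessing chain: threshold to a binary hand mask,
    crop to the bounding box of hand pixels, then resize by
    nearest-neighbor sampling."""
    mask = (gray > thresh).astype(np.uint8)       # hand = 1, background = 0
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    crop = mask[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    # Nearest-neighbor resize to a fixed output size.
    ri = np.arange(out_size) * crop.shape[0] // out_size
    ci = np.arange(out_size) * crop.shape[1] // out_size
    return crop[np.ix_(ri, ci)]

img = np.zeros((100, 100))
img[20:60, 30:80] = 1.0                           # synthetic "hand" blob
out = preprocess(img)
print(out.shape)  # (32, 32)
```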
To recognize the hand gesture, it is necessary to separate the hand pixels from
non-hand pixels. This is done by thresholding the YCbCr image. This results in
separation of hand pixels from non-hand pixels. After obtaining the segmented form
of the hand gesture from the YCbCr color model, various morphological operations,
namely dilation and erosion, are used to obtain a properly preprocessed image.
The background is black in color while the hand is white. Figure 3 shows three
examples of gesture images, gesture 2, gesture 3, and gesture 5, where the RGB
to YCbCr color converted images are thresholded by Kapur’s thresholding scheme
[21]. The obtained image may have isolated holes and noisy object pixels. A
combined operation of dilation and erosion is used on the thresholded images to
remove noisy object pixels, which is shown in Fig. 3. From this figure, it is possible
to observe that segmentation of the hand gesture is very clear.
The bounding box is used to find a minimum box containing the maximum
percentage of hand pixels in gesture image. The bounding box has edges parallel
to the Cartesian coordinate axis. It is mathematically expressed as an array of
coordinate axis of the edges of the box. The hand image is then cropped to overcome
the effects of hand size and camera depth. The images are then resized to a particular
size that seems optimal for offering enough data. Figure 4 shows output of the hand
gesture after using bounding box for two images of gesture 4.
Fig. 2 Conversion of RGB to YCbCr color space and YCbCr histogram separately for images:
gesture 1, gesture 3, and gesture 5
Fig. 3 Segmentation of hand images for background removal for images: gesture 2, gesture 3, and
gesture 5
Fig. 5 Extracted HOG feature image and plot for images: gesture 1 and gesture 2
3.3 Classification
The gradient descent method is mainly employed in neural networks, and the
framework is commonly known as the multilayer back propagation method. Here,
the error function (obtained by taking the difference between the target vector
and the predicted output vector) is to be minimized to obtain the optimal weights
connecting the neurons. It can be defined in simple words as the method that is
used to compute the gradients of the error function with respect to predefined
parameters like the learning rate and number of epochs, and the aim is to minimize
the error. In back propagation, the errors computationally precede “backwards,” and
the weights are changed depending on how it will minimize the error with respect to
the computations performed to compute the loss. In the back propagation algorithm,
given the error in the output of the network, it tries to modify or distribute the
weights in the network so that error is obtained as minimum as possible [26].
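The weight-update rule can be illustrated with a single linear neuron and squared error (a toy example, not the chapter's network):

```python
# Toy gradient-descent update for a single linear neuron with squared
# error E = (t - w * x) ** 2; this is not the chapter's network.
w, lr = 0.0, 0.1
x, t = 1.0, 2.0              # input and target
for _ in range(100):
    error = t - w * x
    grad = -2.0 * error * x  # dE/dw
    w -= lr * grad           # back-propagation-style weight update
print(round(w, 4))  # converges toward the target weight 2.0
```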
Deep learning [27] is a type of machine learning algorithm that uses more than
one hidden layer. It involves a cascade of many layers of nonlinear processing units to
obtain the optimal features. Deep learning is mainly used in classification and pattern
analysis, learning multiple levels of features where each consecutive layer uses the
output of the previous layer as its input. In the most general case, there are two sets
of neurons consisting of nodes as defined. The first set of neurons receives the input
from the input layer, which passes the modified input to the other set of neurons.
Deep neural network consists of multiple processing layers between the input layer
and the output layer with multiple linear and nonlinear conversions. Architecture of
deep neural network is provided in Fig. 6.
Deep neural network can be viewed as an artificial neural network that consists
of many hidden layers between the input and the output nodes. It divides the desired
complicated mapping into a series of nested multiple simple mappings each defined
by a separate layer. The input layer is called the visible layer as the original input is
known to us. This is followed by a sequence of hidden layers and an output layer. In
a gesture data, various shapes of fingers and palm express the human gesture. Hence,
such data has various factors of variation in the dataset. Deep learning algorithms
have already established their effectiveness in capturing the statistical variations in the
[Figure: deep neural network with an input layer, hidden layers, and an output layer]
data and are hence able to easily discriminate the important variations as different
gestures for recognition.
In the proposed scheme, the considered deep neural network architecture is a
feed-forward network, which has more than one layer of hidden units between the
input and output layers. In the proposed scheme, each neuron in the hidden layer is
indexed by j, and a logistic (sigmoid) function f is used. This is expressed as
$y_j = f(x_j) = \dfrac{1}{1 + e^{-x_j}}$ (1)
where $y_j$ is the output of a neural unit and $x_j$ is its input. The input $x_j$ is
defined as
$x_j = b_j + \sum_i y_i w_{ij}$ (2)
where $b_j$ is the bias for the $j$th neuron and $w_{ij}$ is the weight connecting
the $i$th neuron to the $j$th neuron. For each gesture class, the output unit $j$
converts its input into a class probability $\Pr_j$ using the softmax function:
$\Pr_j = \dfrac{e^{x_j}}{\sum_k e^{x_k}}$ (3)
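Eqs. (1)-(3) together define a forward pass, which can be sketched as follows (with illustrative weights and layer sizes, not the trained network):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """One hidden layer of logistic units (Eqs. (1)-(2)) followed by a
    softmax output layer (Eq. (3))."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))     # y_j = f(x_j)
    logits = h @ W2 + b2
    exp = np.exp(logits - logits.max())          # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(1)
x = rng.normal(size=4)                           # toy 4-D feature vector
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)
probs = forward(x, W1, b1, W2, b2)
print(round(float(probs.sum()), 6))  # 1.0
```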
Fig. 7 Diagram of an autoencoder (inputs x1-x5, hidden code z1-z3, reconstructions x'1-x'5)
network that applies back propagation by defining the target values to be the same as the
inputs. The autoencoder tries to learn a function that makes the output similar to the
input and is often trained using back propagation.
The architecture of the autoencoder is given in Fig. 7. The complexity of the
network is reduced by requiring the dimension of the output to be the same as
that of the input, i.e., x' = x. Thus, the objective function is defined
as
$J = \sum_{n=1}^{N} \|x(n) - \hat{x}(n)\|^2$ (4)
where $\hat{x}(n) = Q^{-1} Q\, x(n)$, and Q is the full transformation matrix. Hence, the
modified objective function is written as
$J = \sum_{n=1}^{N} \|x(n) - Q^{-1} Q\, x(n)\|^2$ (5)
n=1
In the training phase of the autoencoder, for input x, the activation function defined
in Eq. (1) is applied at each unit of the hidden layer. Then, in the output layer, the
output x' is obtained. The back propagation algorithm is used to propagate the
error back through the designed network and perform the weight updates [28].
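The objectives in Eqs. (4) and (5) can be checked numerically with a linear sketch (a pseudo-inverse stands in for Q⁻¹ when Q compresses the input; this is illustrative, not the trained autoencoder):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))                 # N = 10 inputs x(n), 5-D each

# Full-rank Q: Q^-1 Q = I, so the objective J of Eq. (5) is (numerically) zero.
Q = rng.normal(size=(5, 5))
Xhat = X @ Q.T @ np.linalg.inv(Q).T          # x_hat(n) = Q^-1 Q x(n)
J_full = float(np.sum((X - Xhat) ** 2))

# Compressing Q (5 -> 3 hidden units), with a pseudo-inverse in place of
# Q^-1: information is lost, so J becomes strictly positive.
Qc = rng.normal(size=(3, 5))
Xhat_c = X @ Qc.T @ np.linalg.pinv(Qc).T
J_comp = float(np.sum((X - Xhat_c) ** 2))
print(round(J_full, 6), J_comp > 0)  # 0.0 True
```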
The proposed algorithm is run on a Pentium D 2.8 GHz PC with 2 GB RAM,
Ubuntu operating system, and Python. We have tested with three hand gesture
image databases: HGR1 [29], Thomas Moeslund’s Gesture Recognition [30], and
NITG. The results obtained by the proposed scheme are compared with that of
234 B. N. Subudhi et al.
Vision-based interfaces are feasible and popular at this moment because the computer can communicate with the user through a webcam, without requiring any interfacing device between human and machine. Hence, users are able to perform human-machine interaction (HMI) with these more user-friendly features, preserving naturalness while achieving higher speed. Therefore, the computer vision algorithm should be reliable and fast. There should be no delay between the gestures being captured and the response time of the computer in recognizing the
Fig. 8 Confusion matrices with training of (a) 3%, (b) 5%, (c) 7%, (d) 10%, (e) 15%, (f) 20%,
(g) 30%, and (h) 40%
gesture. Also, vision-based interfaces are low cost compared to other gesture recognition techniques that make use of interfacing devices.
In this article, we have designed a system that aims at recognizing static hand gestures using a vision-based approach. The images of hand gestures taken from the camera are first preprocessed to extract the hand region from the image. This is followed by the extraction of a histogram of oriented gradients (HOG) feature vector. We have classified the hand gestures using a deep neural network.
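The core of the HOG idea can be sketched as a single magnitude-weighted histogram of gradient orientations over the whole image; the full descriptor used in the pipeline additionally divides the image into cells and applies block normalization:

```python
import numpy as np

def orientation_histogram(img, n_bins=9):
    # Gradient magnitude and unsigned orientation (0-180 degrees) per pixel
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    # Vote each pixel into an orientation bin, weighted by its magnitude
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-12)   # L1-normalized descriptor

img = np.tile(np.arange(16.0), (16, 1))  # horizontal intensity ramp (toy input)
h = orientation_histogram(img)           # energy concentrates in the 0-degree bin
```

For the toy ramp, all gradient energy points along the x axis, so the first orientation bin dominates, showing how the histogram summarizes edge directions.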
The proposed scheme is tested on three different hand gesture databases: HGR1, Moeslund, and NITG. The results obtained are compared against four state-of-the-art techniques. From the results, we can conclude that the accuracy of the deep neural
Fig. 9 Classification accuracy vs. training percentage plot for (a) average performance, (b) best performance, and (c) worst performance
network is much better. Also, with less training data, we have achieved higher accuracy. It may be observed from the results that, on average, a 14% improvement in recognition accuracy is obtained by the proposed scheme over the competing state-of-the-art recognition techniques. The proposed method also helps in overcoming the effects of illumination and camera depth.
In the future, this work will be extended to real-world applications by designing a smartphone app. Using such an app, people with hearing and speech impairments can interact with others for their daily needs.
References
1. Pavlovic, V., Sharma, R., & Huang, T. (1997). Visual interpretation of hand gestures for
human-computer interaction: A review. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 19(7), 677–695.
Deep Learning in Autoencoder Framework and Shape Prior for Hand Gesture. . . 241
2. Stenger, B., Thayananthan, A., Torr, P., & Cipolla, R. (2006). Model-based hand tracking using
a hierarchical Bayesian filter. IEEE Transactions on Pattern Analysis and Machine Intelligence,
28(9), 1372–1384.
3. Krueger, M. (1991). Artificial reality. Addison-Wesley Professional.
4. Oka, K., Sato, Y., & Koike, H. (2002). Real-time fingertip tracking and gesture recognition.
University of Tokyo, Report.
5. Mitra, S., & Acharya, T. (2007). Gesture recognition: A survey. IEEE Transactions on Systems,
Man, and Cybernetics—Part C: Applications and Reviews, 37(3), 311.
6. Freeman, W. T., & Roth, M. (1995). Orientation histograms for hand gesture recognition. In
IEEE International Workshop on Automatic Face and Gesture Recognition.
7. McConnell, R. K. (1986). Method of and apparatus for pattern recognition. U. S. Patent No.
4,567,610.
8. Stergiopoulou, E., & Papamarkos, N. (2009). Hand gesture recognition using a neural network
shape fitting technique. Engineering Applications of Artificial Intelligence, 22(8), 1141–1158.
9. Atsalakis, A., Papamarkos, N., & Andreadis, I. (2005). Image dominant colors estimation and
color reduction via a new self-growing and self-organized neural network. In Proceedings
of the 10th Iberoamerican Congress conference on Progress in Pattern Recognition, Image
Analysis and Applications
10. Chen, F. S., Fu, C. M., & Huang, C. L. (2003). Hand gesture recognition using a real-time
tracking method and hidden Markov models. Image Vision Computing, 21(8), 745–758.
11. Symeonidis, K. (2000, August 23). Hand gesture recognition using neural networks. School of
Electronic and Electrical Engineering, Report.
12. Malik, S. (2003, December 18). Real-time hand tracking and finger tracking for interaction.
CSC2503F Project Report.
13. Hasan, M. M. (2010). HSV brightness factor matching for gesture recognition system.
International Journal of Image Processing (IJIP), 4(5), 456–467.
14. Tang, A., Lu, K., Wang, Y., Huang, J., & Li, H. (2013). A real-time hand posture recognition
system using deep neural networks. ACM Transactions on Intelligent Systems and Technology,
9(4), 1–23.
15. Birk, H., Moeslund, T. B., & Madsen, C. B. (1997). Real-time recognition of hand alphabet gestures using principal component analysis. In 10th Scandinavian Conference on Image Analysis (pp. 261–268).
16. Kawulok, M. (2013). Fast propagation-based skin regions segmentation in color images.
In 10th IEEE International Conference and Workshops on Automatic Face and Gesture
Recognition (FG), Shanghai (pp. 1–7).
17. Kawulok, M., Kawulok, J., & Nalepa, J. (2014). Spatial-based skin detection using discrimi-
native skin-presence features. Pattern Recognition Letters, 41, 3–13.
18. Yao, Y., & Fu, Y. (2014). Contour model-based hand-gesture recognition using the kinect
sensor. IEEE Transactions on Circuits and Systems for Video Technology, 24(11), 1935–1944.
19. Wang, C., Liu, Z., & Chan, S. C. (2015). Superpixel-based hand gesture recognition with kinect
depth camera. IEEE Transactions on Multimedia, 17(1), 29–39.
20. Chen, D., Li, G., Sun, Y., Kong, J., Jiang, G., Tang, H., Ju, Z., Yu, H., & Liu, H. (2017). An
interactive image segmentation method in hand gesture recognition. Sensors, 7(2), 253–270.
21. Kapur, J. N., Sahoo, P. K., & Wong, A. K. C. (1985). A new method for gray-level picture
thresholding using the entropy of the histogram. Computer Vision, Graphics, and Image
Processing, 29(3), 273–285.
22. Kato, T., Relator, R., Ngouv, H., Hirohashi, Y., Takaki, O., Kakimoto, T., & Okada, K. (2011).
Segmental HOG: New descriptor for glomerulus detection in kidney microscopy image. BMC
Bioinformatics, 16(1), 1–16.
23. Wang, B., Liang, W., Wang, Y., & Liang, Y. (2013). Head pose estimation with combined
2D SIFT and 3D HOG features. In Seventh International Conference on Image and Graphics,
Qingdao (pp. 650–655).
24. Jung, H., Tan, J. K., Ishikawa, S., & Morie, T. (2011). Applying HOG feature to the detection
and tracking of a human on a bicycle. In 11th International Conference on Control, Automation
and Systems, Gyeonggi-do (pp. 1740–1743).
242 B. N. Subudhi et al.
25. Subudhi, B. N., Veerakumar, T., Yadav, D., Suryavanshi, A. P., & Disha, S. N. (2017). Video
skimming for lecture video sequences using histogram based low level features. In IEEE 7th
International Advance Computing Conference (pp. 684–689).
26. Ahmed, T. (2012). A neural network based real time hand gesture recognition system.
International Journal of Computer Applications, 59(4), 17.
27. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P. (2010). Stacked denoising
autoencoders: Learning useful representations in a deep network with a local denoising
criterion. The Journal of Machine Learning Research, 11, 3371–3408.
28. Xue, G., Liu, S., & Ma, Y. (2020). A hybrid deep learning-based fruit classification using attention model and convolution autoencoder. Complex & Intelligent Systems.
29. Database for hand gesture recognition [Online]. Available: http://sun.aei.polsl.pl/~mkawulok/
gestures/. Accessed April 24, 2020.
30. Moeslund Gesture [Online]. Available: http://www-prima.inrialpes.fr/FGnet/data/12-
MoeslundGesture/database.html. Accessed June 16, 2020.
Hierarchical-Based Semantic
Segmentation of 3D Point Cloud Using
Deep Learning
1 Introduction
Deep learning and convolutional neural networks have shown promising results on various computer vision problems such as image classification, image segmentation, face detection, and image inpainting. These results were made possible by enormous image datasets and the great emphasis laid on annotation. Similar techniques have been ideated and implemented on 3D datasets such as voxelized 3D shapes, triangulated 3D meshes, and 3D point cloud data [7, 12].
In this work, we are interested in performing analysis on 3D point cloud data, which is inherently unordered and unstructured. Point clouds also have a variable number of points in each scene, which makes them even more challenging to deal with. Existing approaches for image datasets fail on unstructured data, as advanced CNN-based techniques usually require structured grids of data as input. A naive approach for dealing with raw, unstructured point cloud data would be to voxelize the data, which involves a lot of preprocessing and loss of data. Recently, techniques like RandLA-Net [1] and recurrent slice networks [2] have been used for efficient semantic segmentation of large-scale point clouds.
Some works make the input point clouds order independent by building permutation-invariant representations of the data points, such as the kd-tree in [13]. But these approaches come with their own limitations: only point clouds whose size is of the form 2^n can be given as input.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 243
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in
Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_11
244 J. Narasimhamurthy et al.
Some recent techniques like [3] propose a new method of extracting order-invariant features using global max pooling. PointNet was the first paper to propose this architecture, which does not involve any domain transformation of the point clouds. It achieved state-of-the-art results in shape classification, segmentation, and retrieval tasks.
A very recent approach [9] proposes a similar technique to address the issue of order invariance. It computes Fisher vectors as derivatives of the Gaussian likelihood of all the points. As the Gaussian likelihood involves a summation, the sum makes the Fisher vectors order independent. They have shown state-of-the-art results in shape classification and segmentation.
The main contributions of our work include:
(a) The ability to perform semantic segmentation on point clouds with a variable number of points, using global max pooling.
(b) Redefining the traditional segmentation problem as an independent per-point classification problem aided by a two-step hierarchy of local and global features.
(c) Using classification as an auxiliary task for extracting higher-level features.
2 Related Work
Deep learning on 3D data broadly falls into one of the following three categories: (a) voxel based, (b) multi-view based, and (c) graph based. Extending successful convolutional neural network architectures to 3D shape analysis involves an exponential computation cost, which limits usage to lower-resolution voxelized grids and leads to poor performance. Moreover, performing convolution inside the object is not particularly helpful. Some previous works like [10] address this issue by computing convolutions effectively on the surface voxels, defining a new convolution operation that adapts to the shape of the surface voxels instead of using a conventional rectangular kernel.
Multi-view-based architectures leverage the advantages of image-based CNNs in that they produce similar data for deep processing. Usually in these approaches, the 3D data is projected onto different planes, and the resulting images are analyzed using deep networks; [6, 11], etc., fall under this category. Graph-based methods perform spectral convolutions on graphs constructed from 3D meshes.
With the development of 3D point cloud sensors like LiDAR, Kinect, etc., the availability of 3D point cloud data has massively increased. Point cloud data helps in indoor and outdoor navigation of autonomous vehicles, nondestructive machine vision inspection, and production audits. Better benchmarks for 3D point cloud analysis are extremely important to leverage this data. 3D point clouds obtained from these sensors are sparse, unordered, and extremely noisy. Point cloud analysis includes, but is not limited to, segmentation, reconstruction, etc.
This network takes in points with x, y, and z coordinates and returns the segmentation label of each point (refer to Fig. 1).
Octree-based methods have been applied to the analysis of 3D voxel grids, as in [8]. In our work, we construct an octree to efficiently query the neighbors of all the points in a scene. The faster approach mentioned in [14] has been used to construct the octree for each point cloud scene.
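The query the octree accelerates is a radius neighbor search; written naively, it is a brute-force comparison against every point in the scene (the octree of [14] returns the same index set without scanning all points):

```python
import numpy as np

def radius_neighbors(points, query, r):
    # Brute force: compare the query against every point in the scene;
    # an octree prunes most of these comparisons spatially
    d = np.linalg.norm(points - query, axis=1)
    return np.flatnonzero(d <= r)

rng = np.random.default_rng(2)
pts = rng.random((1000, 3))                 # a hypothetical point cloud scene
idx = radius_neighbors(pts, pts[0], r=0.1)  # indices of points near pts[0]
```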
As said earlier, we also employ the global max pooling operator as an order-invariant transformation. The max pooling operation is performed at the end of the convolution layers.
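The order-invariance of global max pooling can be checked directly (feature sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
feats = rng.standard_normal((500, 64))  # per-point features: N points, 64 channels

pooled = feats.max(axis=0)              # global max pool over the point axis
perm = rng.permutation(feats.shape[0])  # reorder the points arbitrarily
pooled_perm = feats[perm].max(axis=0)   # identical pooled descriptor
# Because the reduction runs over the point axis, the result depends neither
# on the ordering of the points nor on the number of points N
```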
Since we perform a nearest-neighbor box search as opposed to KNN, we end up with a variable number of input points for the local feature extractor, but the network architecture is designed in such a way that it does not fail for any input shape.
4 Experiments
We performed several experiments, each with a slight modification from the others. The first network was trained on the segmentation branch of the above network using neighborhood data extracted from octrees, and the network then predicts the
segmentation labels of each point. A cross-entropy loss is computed on these predicted labels, and its gradients are back-propagated to train the network. It was found that there was a huge class imbalance, since the network was trained on segmentation labels directly. Post scaling is applied as a method to correct the predicted softmax probabilities.
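One simple form of post scaling (an assumption on our part; the exact rule is not spelled out here) divides the softmax outputs by the training class priors and renormalizes, so classes that dominated the training labels are down-weighted:

```python
import numpy as np

def post_scale(probs, priors):
    # Divide predicted probabilities by the training class priors and
    # renormalize, counteracting the learned bias toward frequent classes
    scaled = probs / priors
    return scaled / scaled.sum(axis=-1, keepdims=True)

priors = np.array([0.7, 0.2, 0.1])   # imbalanced training label frequencies
probs = np.array([0.5, 0.3, 0.2])    # raw softmax output for one point
corrected = post_scale(probs, priors)  # dominant class is down-weighted
```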
We then understood that, since we are breaking the segmentation problem into a neighborhood-wise classification problem (which can be considered analogous to patch-wise segmentation in images), the predicted labels do not semantically capture the object. So, to include the ability to semantically segment a particular neighborhood, we posed the classification problem as an auxiliary task to extract the global features of an object and use these features in predicting the segmentation labels of a point from its neighborhood. This gives the network the ability to semantically segment a point cloud scene.
Fig. 3 (a) Training loss for model without feature hierarchy. (b) Validation loss for model without
feature hierarchy. (c) Training and validation loss for model with feature hierarchy
incorporate segmentation of all the other object categories too. The qualitative results for car part segmentation from the ShapeNet dataset and a Porsche car model are provided in Figs. 4 and 5 and are quite promising.
Our method can produce effective segmentation even for shapes different from the ones used during training. While training, the network is observed to perform
Acknowledgments The first author was a student at IIT Madras supported by a TCS research internship at TCS Research and Innovation, Tata Consultancy Services Limited. The authors thank the reviewers for their comments and suggestions for improvement.
References
1. Hu, Q., Yang, B., Xie, L., Rosa, S., Guo, Y., Wang, Z., Trigoni, N., & Markham, A. (2020).
Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the
IEEE conference on computer vision and pattern recognition.
2. Huang, Q., Wang, W., & Neumann, U. (2018). Recurrent slice networks for 3d segmentation
of point clouds. In Proceedings of the IEEE conference on computer vision and pattern
recognition.
3. Qi, C., Su, H., Mo, K., & Guibas, L. (2017). PointNet: Deep learning on point sets for 3D
classification and segmentation. In International conference on computer vision and pattern
recognition.
4. Qi, C., Su, H., Mo, K., & Guibas, L. (2017). PointNet++: Deep hierarchical feature learning
on point sets in a metric space, In Conference on neural information processing systems.
5. Yi, L., Su, H., Shao, L., & Savva, M. (2017). Large scale 3D shape reconstruction and segmentation from ShapeNet Core55. In International conference on computer vision.
6. Su, H., Maji, S., Kalogerakis, E., & Miller, E. (2015). Multi-view convolutional neural networks for 3D shape recognition. In International conference on computer vision.
7. Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser, T., & Niessner, M. (2017). ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In International conference on computer vision and pattern recognition.
8. Riegler, G., Ulusoy, A., & Geiger, A. (2017). OctNet: Learning deep 3D representations at high
resolutions. In International conference on computer vision and pattern recognition.
9. Shabat, Y., Lindenbaum, M., & Fischer, A. (2017). 3D point cloud classification and segmen-
tation using 3D modified fisher vector representation for convolutional neural networks, arXiv.
https://arxiv.org/abs/1711.08241
10. Li, Y., Pirk, S., Su, H., Qi, C., & Guibas, L. (2016). FPNN: Field probing neural networks for
3D data. In Conference on neural information processing systems.
11. Shi, B., Bai, S., Zhou, Z., & Bai, X. (2015). DeepPano: Deep panoramic representation for 3-D shape recognition. IEEE Signal Processing Letters.
12. http://modelnet.cs.princeton.edu/
13. Klokov, R., & Lempitsky, V. (2017). Escape from cells: Deep kd networks for the recognition
of 3D point cloud models. In International conference on computer vision.
14. Behley, J., Steinhage, V., & Cremers, A. B. (2015). Efficient radius neighbor search in three
dimensional point clouds. In IEEE International conference on robotics and automation.
Convolution Neural Network
and Auto-encoder Hybrid Scheme for
Automatic Colorization of Grayscale
Images
1 Introduction
Converting grayscale images into colored images with the help of computers and some human intervention was carried out in early research. Generally, two broad approaches are pursued in computer vision for image colorization: user-oriented and automatic colorization. The user-oriented approach relies on human intervention to give the rules to be followed in coloring the pixels of a given grayscale image. But in this approach, there is a constant need for human intervention, and the output produced tends to be dull pictures with shading issues. Automation was initially carried out using statistical techniques, but the drawbacks of these techniques paved the way for new ones. The latest methodology utilizes machine learning, using neural networks to perform automatic colorization by making the system learn the process.
Varga and Szirányi [1] suggested a scheme of automatic colorization for animation images, as they are different from natural images. But it has the limitation of much higher color uncertainty with poor image shading. Salve et al. [2] proposed image colorization using a Google image classifier and ResNet V2 but failed to produce an optimal implementation due to a poor computation process. Putri et al. [3]
A. Anitha ()
School of Information Technology and Engineering, Vellore Institute of Technology, Vellore,
Tamil Nadu, India
e-mail: [email protected]
P. Shivakumara
B-2-18, Block B, Department of Computer System and Technology, Faculty of Computer Science
and Information Technology, University of Malaya (UM), Kuala Lumpur, Malaysia
S. Jain · V. Agarwal
School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil
Nadu, India
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 253
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in
Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_12
used a sketch inversion model to convert plain sketches into colorful images. This approach can handle various geometric transformations but failed to work well since it was trained on a limited dataset. Existing efforts in image colorization research can be divided into scribble-based, example-based, and learning-based approaches, as proposed by Varga and Szirányi [1]. Scribble-based colorization can be applied to static images and to continuous frames of images. Levin et al. [4] introduced a scribble-oriented colorization technique where the user indicates the region of interest (ROI) to be colored in the picture by placing scribbles. This algorithm was enhanced by Huang et al. [5] to decrease the color blending at the edges of images. Yatziv and Sapiro [6] proposed a model that determines the RGB value of a pixel by combining multiple scribbles, where the distance between a scribble and the pixel is calculated by a distance metric. Example-oriented colorization was suggested by Reinhard et al. [7] for transferring color from one image to another using statistical analysis. Likewise, Welsh et al. [8] proposed a method of finding images with similar pixel concentrations; in addition, they proposed transferring color information to the matched pixels of the grayscale picture based on neighborhood statistics. Irony et al. [9] integrated example-oriented and scribble-oriented colorization to learn picture colorization. Charpiat et al. [10] classified and forecasted the anticipated difference of RGB pixel intensity at each pixel by defining a variable spatial coherency standard. Learning-based colorization models study the data of image colorization modeling to avoid human intervention. Bugeau and Ta [11] introduced a model trained for color prediction by taking square patches of the image around each pixel. Cheng et al. [12] proposed a three-layered deep neural network for extracting characteristics of raw grayscale values and sophisticated semantic features. Ryan Dahl [13] concatenated such semantic features to train a deep neural network using a convolution neural network as a feature extractor; an encoder was then trained and employed to predict the color channels. The predicted colors are coherent for the most part, although the scheme is strongly prone to desaturating the images or to illumination invariance.
In order to perform automatic colorization, semantic features play a vital role. To color an image efficiently and effectively, the machine should be fed with information about the semantic composition of each image and its localization. For instance, the shade of a chameleon changes according to its environment. Similarly, the ocean looks mostly blue during daytime but totally dark at night. So, to make the machine learn this, we propose a CNN model combined with an auto-encoder. The model is used for significant feature identification, to train on the given dataset of different categories, and to predict the colors of new grayscale images.
The chapter is organized as follows: Section 1 presented the introduction. Section 2 covers the basics of the convolution neural network and the auto-encoder and decoder model. Section 3 confers about the proposed methodology, and Sect. 4 presents an empirical study on automatic colorization using the CNN model. Section 5 demonstrates the experimental results and analysis. The chapter ends with the conclusion in Sect. 6.
This is the initial layer in a CNN, used for feature extraction from the input images. A filter of a particular size R × R, where R is a small integer (R > 2), is convolved with the input image and slid across the whole image to perform dot products. The produced matrix is called the feature map, which provides information on every corner and edge of the image obtained by applying the filters to the input image. The purpose of the feature map is to gain knowledge about the input image; those features are fed into the following layers so that the whole input image is learned better.
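The sliding dot product described above can be sketched as follows (the filter values here, a Sobel-style edge detector, are illustrative):

```python
import numpy as np

def conv2d_valid(img, kernel):
    # Slide the R x R filter over the image with stride 1 and no padding,
    # taking a dot product at each location to build the feature map
    R = kernel.shape[0]
    H, W = img.shape
    out = np.empty((H - R + 1, W - R + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + R, j:j + R] * kernel)
    return out

img = np.zeros((8, 8))
img[:, 4:] = 1.0                                        # a vertical edge
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
fmap = conv2d_valid(img, sobel_x)                        # responds at the edge
```

The feature map is nonzero only where the filter window straddles the edge, which is exactly the "corner and edge" information described above.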
Generally, a pooling layer is placed after a convolution layer. With this layer, the computation cost is reduced, since it shrinks the feature map produced by the convolution layer. This is achieved by reducing the connections between the layers, and the operation is applied separately to each feature map. Various pooling operations are used, such as max pooling, average pooling, and global pooling.
In max pooling, the largest value is taken from each region of the feature map. Average pooling computes the mean of the values in the considered image section, while sum pooling computes their sum. The pooling layer generally serves as a bridge between the convolution layer and the fully connected layer.
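A sketch of non-overlapping 2 × 2 max and average pooling on a small feature map:

```python
import numpy as np

def pool2d(fmap, size=2, mode="max"):
    # Reduce each non-overlapping size x size window of the feature map
    # to a single value: its maximum or its average
    H, W = fmap.shape
    blocks = fmap[:H - H % size, :W - W % size] \
        .reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
print(pool2d(fmap, 2, "max"))   # [[ 5.  7.] [13. 15.]]
print(pool2d(fmap, 2, "mean"))  # [[ 2.5  4.5] [10.5 12.5]]
```

Either way, the 4 × 4 map shrinks to 2 × 2, which is the computation-cost reduction described above.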
The fully connected layer is a typical neural network structure used to connect the pooling layer values to the output layer. The layer consists of weights and biases, with neurons forming hidden layers between the two. The image is flattened in the previous layer and fed to the fully connected layers. The flattened vector passes through a few more fully connected layers to compute the mathematical functions and produce the classified images.
Generally, the basic learning process suffers from overfitting. Overfitting is identified when the training data produces a negative impact on the performance of the model. To overcome this issue, a dropout layer can be utilized. This is achieved by dropping neurons whose weights are negligible, or by random selection, in order to decrease the dimension of the model during the training procedure.
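A sketch of the random-selection variant of dropout (the 1/(1 − p) rescaling, known as inverted dropout, is an implementation convention assumed here, not stated in the text):

```python
import numpy as np

def dropout(a, p=0.5, training=True, rng=None):
    # During training, zero each unit with probability p and rescale the
    # survivors by 1/(1-p) so the expected activation is unchanged;
    # at inference, pass the activations through untouched
    if not training:
        return a
    rng = rng or np.random.default_rng()
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

a = np.ones(1000)
dropped = dropout(a, p=0.5, rng=np.random.default_rng(0))
# roughly half the units are zeroed, yet the mean activation stays near 1
```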
Lastly, a major component of the CNN model is the activation function. Activation functions are used to learn and estimate any kind of continuous and complex association among the variables of the network. To be straightforward, they determine which information of the model should be moved in the forward direction and which should be removed from the network. Numerous regularly utilized activation functions, such as ReLU, Softmax, tanh, and Sigmoid, have been proposed in various research models. These activation functions are used for different purposes: the Sigmoid activation function is preferred for binary classification, whereas the Softmax activation function is preferred for multi-class classification.
To make the system learn, it is advisable to feed a huge dataset as input for training. In image processing, data collected from various sources can be fed into the system for automatic colorization. As the amount of data increases, the increased execution time may reduce the performance of the learning process. To feed the input with huge amounts of data without data loss, the same data can be compressed before being fed into the system. To implement this process, the concept of the auto-encoder comes into existence. An auto-encoder is an unsupervised artificial neural network (ANN) that is trained to compress and encode the data efficiently and then learn how to rebuild the data from the compressed encoded representation, so that the result is almost identical to the original input. An auto-encoder comprises four main components, namely encoder, bottleneck, decoder, and reconstruction loss, as represented in Fig. 2. In Fig. 2, x1, x2, x3, ..., xn represent the input, and the hidden layers are the convolution layer, flatten layer, and dense layer; similarly, the reconstruction reshapes through a dense layer, flatten layer, and convolution layer to recover an output with the same dimension as the original image. The encoder converts the input dimension into an encoded representation. The bottleneck contains the reduced representation of the input data to be stored. The decoder learns how to unwrap the data from the reduced form so that it is close to the original input data. The reconstruction loss helps in checking the performance of the decoder. The whole process of the auto-encoder is illustrated in Fig. 3.
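The encode–compress–decode–reconstruct loop can be sketched with a deliberately tiny model: a linear encoder with a 4-unit bottleneck and a tied decoder, trained by gradient descent on the reconstruction loss (the sizes and the linear form are illustrative; the model in this chapter uses convolutional layers):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated toy data standing in for flattened images
X = (rng.standard_normal((200, 16)) @ rng.standard_normal((16, 16))) * 0.1
W = rng.standard_normal((16, 4)) * 0.1       # encoder; decoder is tied (W.T)

def recon_loss(W):
    # Mean squared reconstruction error between input and decoded output
    return np.mean((X @ W @ W.T - X) ** 2)

loss_before = recon_loss(W)
lr, N, d = 0.01, X.shape[0], X.shape[1]
for _ in range(500):
    E = X @ W @ W.T - X                      # reconstruction error
    grad = (X.T @ E @ W + E.T @ X @ W) * (2.0 / (N * d))
    W -= lr * grad                           # gradient step on the loss
loss_after = recon_loss(W)                   # training shrinks the loss
```

The bottleneck (4 units for 16-dimensional inputs) is what forces the network to learn a compressed representation rather than copying the input directly.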
The chapter aims to perform automatic colorization using machines. To achieve this aim, a convolution neural network with an auto-encoder model is used for the proposed system. In the latest Internet era, convolutional neural networks (CNNs) have emerged as one of the de facto standards for solving various problems relating to image classification. They have come into the limelight and become popular because of the low error rates (less than 4%) achieved in the ImageNet challenge. The main reason behind their success is their capability to discover and discern colors, shapes, and patterns within images and to relate them to object classes. This is one of the main reasons for the well-defined coloring carried out by CNNs, as object class, prototype, and silhouette are generally correlated with color options. Apart from CNNs, auto-encoders are used for image classification. As discussed earlier, auto-encoders are a self-learning unsupervised technique in which neural networks are employed for the task of representation learning. They use back propagation, fixing the target values to be equal to the input values, i.e., y(i) = x(i), where i indexes the samples, x(i) denotes the input, and y(i) denotes the target values. Most of the automatic
Fig. 2 Structure of an auto-encoder: the inputs x1–x6 pass through the encoder to a latent space representation and the decoder produces the reconstructed outputs x̂1–x̂6
Fig. 4 Proposed architecture: stride-2 convolution layers Con1–Con3 take the 32×32×1 input through 16×16×64, 8×8×128, and 4×4×256 feature maps to a flattened, fully connected bottleneck h (4096 → 256), followed by stride-2 deconvolution layers DeCon3 and DeCon2 reconstructing the 32×32×1 output
motivates us to propose the method. The architecture of the proposed research method is depicted in Fig. 4.
In this chapter, a statistical learning-driven approach is used to solve the problem and attain the output values. In the initial phase, a convolution neural network (CNN) was designed and built that accepts a grayscale image as its input and generates a colorized version of the input image as its output. To accomplish the aim of the proposed work, the neural network is trained on thousands of colored images, and the output it generates is based solely on the images it has learned from. This also removes human intervention in generating the desired image. If the neural network is well trained, then the output should be an image that the user would be looking for. The main aim of the chapter is to efficiently and accurately convert a given grayscale image to its corresponding colored image with the help of various deep learning models along with auto-encoders. The proposed model was trained to learn on its own with the help of the various datasets available and without any human intervention. Another aim is to provide a direct function to convert a grayscale image to a colored image, which can be embedded in various software in the near future.
As real-time data involves various colors and perceptions, it is not enough to make the system learn only to fill in color. Since every image can be placed at various (x, y) coordinates, it may be scaled, rotated, or transposed in the given coordinates. So, beyond simply making the model learn automatic colorization, it is necessary to make the model learn the semantic features of the image. To adopt the proposed idea, the convolution neural network has been designed accordingly. Generally, 32-bit precision is used in training a neural network. Since the input data are stored and used in training, it is better to convert the input values to float, and it is preferred to scale the inputs between 0.0 and 1.0; so each input is divided by 255 to keep the learning rate reasonable. In general, the Canadian Institute for Advanced Research (CIFAR-10) dataset is considered one of the best image datasets for neural networks. It consists of 60,000 color images of dimension 32 × 32 in 10 classes, with every class containing 6000 images. For example, the cat class contains
6000 images of cats in various positions and colors. For the proposed model, the dataset is divided at random into 50,000 training images and 10,000 testing images. For better performance measurement, the data are divided into batches of 10,000 images each: five training batches and one test batch. The test batch contains exactly 1000 randomly selected images from each class, and the training batches hold the remaining images in random order. An individual training batch may contain more images from one class than another, but between them the training batches contain exactly 5000 images from each class. Since the dataset consists of color images rather than uniform grayscale ones, it must first be converted to black and white before being used in the training process. The sample dataset depicted in Fig. 5 shows random images from each class.
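This preprocessing (grayscale conversion followed by division by 255) can be sketched as follows. The code is a minimal illustration in NumPy with a synthetic stand-in for a CIFAR-10 batch; the luminance weights are an assumption for illustration, not necessarily the chapter's exact conversion.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a CIFAR-10 batch: 8 RGB images of 32x32 uint8 pixels.
batch = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)

# Convert to grayscale; the luminance weights here are an assumption --
# any reasonable channel mixing would serve as training input.
gray = batch @ np.array([0.299, 0.587, 0.114])

# Scale each pixel to a float between 0.0 and 1.0 by dividing by 255.
gray = (gray / 255.0).astype(np.float32)
```

After this step, every image is a single-channel float array in [0.0, 1.0], ready to feed to the encoder.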
The initial stage of training involves preprocessing, such as converting the images to grayscale and batch normalization, after which the data are fed to the encoder model for compression and to the decoder for reconstruction of the compressed data. The reconstructed data are fed into the convolutional neural network. Normalization is performed by bringing the input values between 0.0 and 1.0: each pixel is divided by 255. During training, high pixel values are considered significant and low pixel values less significant, so the features are extracted based on the values obtained. This normalization provides uniformity among the images in the dataset.
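The compress-and-reconstruct step can be illustrated with a deliberately minimal linear auto-encoder in NumPy. This is a toy sketch of the principle (the training target is the input itself, and the loss is the reconstruction error), not the convolutional model used in this chapter; all sizes and the learning rate are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 64)).astype(np.float32)   # 200 flattened toy "images"

d, h = X.shape[1], 16                          # input dim, latent dim
We = rng.normal(0.0, 0.1, size=(d, h))         # encoder weights
Wd = rng.normal(0.0, 0.1, size=(h, d))         # decoder weights

def forward(X):
    Z = X @ We                 # latent space representation (compression)
    return Z, Z @ Wd           # reconstruction x_hat (decompression)

_, Xhat0 = forward(X)
loss_before = np.mean((X - Xhat0) ** 2)        # target is the input itself

lr = 0.1
for _ in range(500):                           # plain gradient descent
    Z, Xhat = forward(X)
    G = 2.0 * (Xhat - X) / X.size              # dLoss / dXhat
    gWd = Z.T @ G                              # gradient w.r.t. decoder
    gWe = X.T @ (G @ Wd.T)                     # gradient w.r.t. encoder
    Wd -= lr * gWd
    We -= lr * gWe

_, Xhat1 = forward(X)
loss_after = np.mean((X - Xhat1) ** 2)         # reconstruction error drops
```

The reconstruction error decreases as training proceeds, which is exactly the behavior the encoder–decoder pair in the chapter relies on before the CNN stage.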
Convolution Neural Network and Auto-encoder Hybrid Scheme for Automatic. . . 261
The proposed method was implemented in Google Colab, a free online cloud-based Jupyter notebook environment. Google Colab provides a Python development environment that runs in the browser on Google Cloud, and the matplotlib library package is used to plot the images. The CIFAR-10 dataset is fed into the three-layered CNN model, with three-channel RGB images of the given size entering the convolution layer. The input layer is a grayscale image. Each neuron receives input from every element of the previous layer. The output layer produces a multi-class label.
[Figure: the activation function — y = x for positive inputs and y = 0 otherwise, plotted over the interval −3 to 3.]
This section explores the classification and prediction process for the considered dataset. The proposed model was trained using the CNN, with the data compressed by the auto-encoder process. The classification process was carried out using various optimizers: Adam, RMSProp, and stochastic gradient descent (SGD). Adam is an optimizer that applies stochastic gradient descent with adaptive moment estimation to train deep learning models; it also handles noise reduction, as it performs optimization by integrating the AdaGrad and RMSProp algorithms. To check the fitness of the model, validation is performed with customized epochs and varying batch sizes. Generally, the batch size determines the number of samples considered per batch.
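To make the Adam update rule concrete, the following sketch applies it to a one-dimensional toy objective f(w) = (w − 3)². The β1, β2, and ε values are the usual defaults; the learning rate is an arbitrary toy choice, not the one used in these experiments.

```python
import math

w, lr = 0.0, 0.1
b1, b2, eps = 0.9, 0.999, 1e-8          # the usual Adam defaults
m = v = 0.0
for t in range(1, 201):
    g = 2.0 * (w - 3.0)                 # gradient of f(w) = (w - 3)^2
    m = b1 * m + (1.0 - b1) * g         # first moment (mean of gradients)
    v = b2 * v + (1.0 - b2) * g * g     # second moment (uncentered variance)
    m_hat = m / (1.0 - b1 ** t)         # bias-corrected estimates
    v_hat = v / (1.0 - b2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)   # adaptive step
```

The adaptive denominator scales each step by the running gradient magnitude, which is why Adam behaves like SGD with a per-parameter, self-tuning step size.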
The dataset contains 50,000 training inputs; the batch size is fixed at 32, and training starts with 5 epochs. Once the model is trained to the best-fitting classification accuracy, it can be made to predict for unobserved images. Classification was carried out with epoch = 5, epoch = 10, epoch = 30, and epoch = 50, and batch sizes of 32 and 64 were used to train the model. The computed classification accuracy for batch size = 32 for SGD, RMSProp, and Adam is provided in Table 1.
From Table 1, it is clearly seen that for epoch 50 RMSProp attains a classification accuracy of 64.89%, SGD attains 65.02%, and the Adam optimizer attains 66.9%, which makes Adam the best among those computed. The classification accuracy at every epoch is depicted in Fig. 7; the exponential trend line for SGD and the linear trend line for Adam indicate the increase in accuracy computed at 5, 10, 15, 20, 25, 30, 35, 40, 45, and 50 epochs. Similarly, the classification accuracy is computed for batch size = 64 with the SGD, RMSProp, and Adam optimizers, for epochs from 5, 10, 15,
[Fig. 7: classification accuracy (%) against the number of epochs (5–50) for batch size = 32, for the SGD, RMSProp, and Adam optimizers with trend lines; plotted values include 52.79, 55.35, 59.23, 61.73, 63.36, 63.5, 63.88, 63.98, 64.19, and 65.02.]
20, 25, 30, 35, 40, 45, to 50. Table 2 presents the classification accuracy for batch size = 64, obtained with ReLU as the activation function for the SGD, RMSProp, and Adam optimizers. SGD reaches an accuracy of 61.42%, RMSProp 60.34%, and Adam 62.98%. Even though the batch size increases, the proposed method gives lower accuracy than it did for batch size 32: the accuracy for batch size = 32 is greater by 3.92% than that for batch size = 64, so it is sufficient to train the proposed model with a batch size of 32. The classification accuracy obtained for batch size 64 is shown in Fig. 8. Along with classification accuracy, the proposed model was evaluated for validation accuracy and validation loss. Tables 1 and 2 contain the validation accuracy for batch sizes 32 and 64, respectively, and Figs. 9 and 10 portray the validation accuracy for batch sizes 32 and 64, respectively.
[Fig. 8: classification accuracy (%) against the number of epochs (5–50) for batch size = 64, for the SGD, RMSProp, and Adam optimizers; the SGD series reads 45.34, 46.1, 47.67, 49.34, 52.01, 53.79, 55.76, 57.99, 59.1, 61.42.]
[Fig. 9: validation accuracy (%) against the number of epochs (5–50) for batch size = 32 —
SGD: 52.35, 54.3, 50.49, 51.38, 51.74, 51.02, 50.9, 50.97, 51.34, 51.95;
RMSProp: 44.99, 52.45, 50.03, 50.63, 52.44, 52.16, 49.16, 51.84, 52.89, 51.76;
ADAM: 50.79, 51.76, 51.89, 51.9, 51.39, 51.11, 51.1, 52.01, 52.46, 52.78.]
[Fig. 10: validation accuracy (%) against the number of epochs (5–50) for batch size = 64 —
SGD: 50.46, 51.67, 52.16, 53.09, 53.98, 54.24, 55.1, 55.89, 56.57, 57.34;
RMSProp: 46.24, 46.98, 47.89, 48.92, 50.01, 51.42, 51.98, 52.41, 53.03, 53.98;
ADAM: 48.16, 48.2, 52.6, 51.9, 53.18, 54.56, 55.38, 56.27, 57.16, 57.98.]
[Fig. 11: validation loss against the number of epochs for batch size = 32.]
Validation loss is an indicator to the research model about the data partition: if the validation loss is equal to or less than the training loss, we can conclude that the training and testing data partition is good for the proposed model. The validation loss obtained from the RMSProp, SGD, and Adam optimizers is depicted in Table 3 and Figs. 11 and 12. Figure 11 gives a clear idea of the validation loss obtained from the various optimizers for batch size = 32, and Fig. 12 shows the validation loss for batch size = 64. From Figs. 11 and 12, even though the validation losses are almost the same by epoch = 50, the validation loss varies considerably early in training.
During the training process, there is the problem of overfitting, which concerns how long training should be performed with respect to the accuracy obtained. Figure 13 clearly shows the overfitting caused during training from epoch 1 to epoch 50. To exhibit the figure in a clear
[Fig. 12: validation loss against the number of epochs (5–50) for batch size = 64 —
SGD: 0.0103, 0.0097, 0.0086, 0.0083, 0.0083, 0.0081, 0.0081, 0.0079, 0.0079, 0.0079;
RMSProp: 0.0091, 0.0086, 0.0081, 0.0079, 0.0079, 0.0078, 0.0075, 0.0077, 0.0077, 0.0076;
ADAM: 0.0101, 0.0095, 0.0087, 0.0081, 0.0081, 0.0079, 0.0079, 0.0077, 0.0078, 0.0078.]
[Fig. 13: accuracy (%) against epochs 40–60 for batch size = 32 with the Adam optimizer — 65.36, 65.47, 65.58, 65.73, 65.89, 65.99, 66.01, 66.34, 66.78, 66.87, 66.9, 66.01, 65.45, 64.98, 64.01, 63.34, 64.1, 63.43, 62.98, 62.01, 61.78.]
Fig. 13 Overfitting
manner, the epochs from 40 to 60 are displayed in Fig. 13, for batch size = 32 and the Adam optimizer. From Fig. 13, we can clearly see that at epoch = 50 the accuracy achieved is 66.9%, whereas over epochs 51–60 the accuracy decreases. This shows unambiguously that training up to epoch = 50 is sufficient for a better classification process.
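This stopping rule amounts to monitoring the per-epoch accuracy and keeping the checkpoint with the peak value. A sketch using the accuracies read off Fig. 13:

```python
# Accuracy (%) per epoch from Fig. 13 (batch size = 32, Adam), epochs 40-60.
acc = {
    40: 65.36, 41: 65.47, 42: 65.58, 43: 65.73, 44: 65.89,
    45: 65.99, 46: 66.01, 47: 66.34, 48: 66.78, 49: 66.87,
    50: 66.90, 51: 66.01, 52: 65.45, 53: 64.98, 54: 64.01,
    55: 63.34, 56: 64.10, 57: 63.43, 58: 62.98, 59: 62.01,
    60: 61.78,
}
best_epoch = max(acc, key=acc.get)   # epoch whose checkpoint to keep
print(best_epoch, acc[best_epoch])   # -> 50 66.9
```

Training past this point only degrades the validation-side accuracy, which is the overfitting visible in Fig. 13.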
The proposed model's accuracy and overall loss for epoch = 30 are displayed using the matplotlib library, as shown in Fig. 14. With the proposed model trained to 66.90% accuracy, it can now be used to predict the automatic colorization of grayscale images.
[Plots: model accuracy (0.48–0.56) and loss (0.004–0.010) against epochs 0–30.]
Fig. 14 Model accuracy and loss for the proposed model for epoch = 30
5.2 Prediction
6 Conclusion
Deep Learning-Based Open Set Domain
Hyperspectral Image Classification Using
Dimension-Reduced Spectral Features
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 273
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in
Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_13
274 C. S. Krishnendu et al.
to the domain adaptation [8]. In this technique, the distribution of the known data in the target (testing) set differs from that of the data in the source (training) set. For some applications, training data is scarce owing to the significant expense of obtaining annotated data [2]. Known samples are samples shared by both source and target, and unknown samples are samples absent from the source; the issue here is that we have no indication of which samples are unknown. In [8], a GAN is utilized for data generation, and the classifier additionally helps to identify known and unknown samples: the generator creates data with which to train for the unknown samples. The same idea of adversarial learning is used here to create samples and to distinguish known from unknown data samples. Sometimes, the large number of components in a hyperspectral dataset brings computational complexity. One technique to address this issue is feature reduction of the data, the process of projecting the data into a lower-dimensional feature space, which helps in reducing redundant features. In machine learning, various feature extraction methods exist for dimension reduction, such as Principal Component Analysis (PCA), diffusion maps, Independent Component Analysis (ICA), and Local Linear Embedding (LLE). In [9, 10], dynamic mode decomposition (DMD) is used as the dimension reduction technique for hyperspectral image classification and is shown to be more efficient than existing conventional methods.
Two separate dimensionality reduction methods are utilized here, and their impact in decreasing the redundancy of features is investigated. The work incorporates two stages. In the first stage, dynamic mode decomposition (DMD) is utilized for dimension reduction [11], and the viability of the technique is investigated through accuracy and computation time. In the subsequent stage, we investigate a novel Chebyshev polynomial-based strategy for the feature reduction of HSI. Chebfun is not much used in image processing tasks, but it has been applied to solving several problems in engineering [12, 13]. The method is developed upon the theories of approximation, and it can perform numerical computing with functions [14]. In [15], Chebyshev approximation is used to improve epoch extraction from telephonic speech, where it outperforms the conventional methods, and in [16] it is found to be very effective in estimating power system frequency together with variational mode decomposition. So this approach has been applied in many areas and shown its significance, but it has not been explored for image classification tasks. Here, the technique is used to approximate the spectral features of hyperspectral images. A function f (x), defined on the interval [−1, 1], can be represented by a unique polynomial interpolant PN(x) at the Chebyshev points. A good approximation of a function f (x) with reduced coefficients is obtained by evaluating the Chebyshev polynomial series at the Chebyshev points rather than at uniformly spaced points [15]. Chebfun uses an adaptive procedure to find the right number of points to represent the function to approximately machine precision; the resulting expansion weights are called the Chebyshev coefficients [14]. Thus, the major contribution of the present work is dimension-reduced, Chebyshev-approximated, spectral-feature-based open set domain adaptation for hyperspectral image classification. The structure of the paper is as follows. The third section explains the related
Deep Learning-Based Open Set Domain Hyperspectral Image Classification. . . 275
2 Methodology
The principal objective of this work is dimension reduction for open set domain HSI classification. At first, the hyperspectral data of each dataset, without feature reduction, is utilized to train the model. Both training and test data have separate class dimension probabilities. Since there is no data for training on unknown samples, a GAN model is utilized for sample generation, both to train on and to characterize known and unknown samples. The GAN model incorporates a generator and a classifier. Here, the classifier has been trained to give as output P(y = K + 1 | xt) = t, where xt is the target sample, y is the label, and t lies between 0 and 1; we take t as 0.5. If the output value is under 0.5, the sample is recognized as known, and if it is greater than 0.5, it is identified as unknown [10]. Before dimension reduction, the 3D HSI data is converted to mn × b format, where mn represents the number of pixels and b represents the number of bands. Figure 1 shows the flow of the proposed work, which includes two phases. In the primary phase, we utilize DMD as the strategy for feature reduction. DMD assists in acquiring the dynamic modes (eigenvalues and eigenvectors) of a nonlinear system. The HSI data is converted to a 2D matrix structure; the final column is then duplicated and appended to the matrix. The same matrix is then split into two matrices, x1 and x2. The singular value decomposition of x1 is computed, and a low-rank approximation matrix is obtained [10]. After performing QR decomposition of the eigenvectors of S, the matrix P, known as the permutation matrix, is obtained.
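A rough NumPy sketch of the DMD steps just described (append the duplicated final column, split into snapshot matrices, form the SVD-based low-rank operator, eigendecompose it, and take the pivoted QR of the eigenvectors). The matrix sizes and the truncation rank r are toy assumptions, not the values used in the experiments.

```python
import numpy as np
from scipy.linalg import qr   # pivoted QR (NumPy's qr has no pivoting)

rng = np.random.default_rng(0)
mn, b, r = 500, 20, 10        # pixels, bands, assumed truncation rank
X = rng.random((mn, b))       # HSI cube already reshaped to mn x b

# Duplicate the final column and append it, then split into the two
# shifted snapshot matrices x1 and x2.
Xa = np.hstack([X, X[:, -1:]])
x1, x2 = Xa[:, :-1], Xa[:, 1:]

# SVD of x1 and the rank-r low-rank operator S.
U, s, Vh = np.linalg.svd(x1, full_matrices=False)
Ur, sr, Vr = U[:, :r], s[:r], Vh[:r].conj().T
S = Ur.conj().T @ x2 @ Vr / sr          # columns divided by singular values

# Dynamic modes: eigenvalues and eigenvectors of S.
eigvals, W = np.linalg.eig(S)

# QR with column pivoting on the eigenvector matrix; the pivot order p
# plays the role of the permutation ranking components by significance.
_, _, p = qr(W.conj().T, pivoting=True)
```

The permutation `p` is what the text calls the permutation matrix P, read off as a pivot ordering rather than stored as a dense matrix.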
This P matrix contains the ordering of the eigenvectors. The spectral information is organized in the matrix in such a way that the bottom contains the least significant bands and the top the most significant. The accuracy of the model is noted for each percentage reduction of bands: half of the features (bands) are removed at first, and the reduction is repeated percentage-wise until the highest conceivable reduction of features is accomplished with no loss of information. The learning rate that corresponds to the highest accuracy is considered the ideal setting for the best outcome. For each feature reduction, the computation time and the number of learnable parameters are likewise noted. In the second phase, the Chebyshev-based approximation method is used for dimension reduction of the HSI. Using DMD, the maximum possible reduction for each dataset with the highest classification accuracy is first established; the novelty of the present work lies in then applying Chebyshev approximation to analyze the performance of approximated features on the image datasets. Suppose 30% is the maximum possible reduction obtained with comparable classification accuracy for one dataset using DMD; Chebyshev is
then used to check whether the spectral features can be truncated further, to 20% or 10%, with better classification accuracy. Here also, the data should be in the form of a 2D matrix before applying the Chebyshev approximation; each row of the matrix represents the pixels corresponding to one class. Chebyshev helps to truncate the data to a minimum number of coefficients, and these coefficients are capable of reconstructing the data to almost machine precision. The model has been trained with these truncated coefficients, and the performance is analyzed in terms of classification accuracy, PSNR (peak signal-to-noise ratio), and computation time. Each time, the data is truncated to a smaller number of coefficients to obtain the highest achievable reduction, keeping all parameters such as the number of epochs and the learnable parameters the same. The variation in the data can be seen by comparing the plots of the spectral signature of a pixel belonging to each class before and after truncation. In this stage, three hyperspectral datasets, named Salinas, Salinas A, and Pavia U, are used for the experiment.
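The truncation step can be sketched with NumPy's Chebyshev routines on a synthetic spectral signature; the signature itself, the band count, and the coefficient counts below are toy assumptions for illustration:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

bands = 224
x = np.linspace(-1.0, 1.0, bands)                        # band axis on [-1, 1]
signature = np.exp(-3.0 * x**2) + 0.3 * np.sin(4.0 * x)  # toy spectral signature

# Fit a Chebyshev series, then keep only the first k coefficients.
coeffs = C.chebfit(x, signature, deg=40)
k = 22
recon = C.chebval(x, coeffs[:k])          # reconstruct from k coefficients only

# PSNR between the original and the reconstructed signature.
mse = np.mean((signature - recon) ** 2)
peak = signature.max() - signature.min()
psnr = 10.0 * np.log10(peak**2 / mse)
```

For a smooth signature like this, the Chebyshev coefficients decay rapidly, so a small `k` reconstructs the curve almost exactly — the same property the chapter exploits to truncate the spectral features.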
2.1 Dataset
This section gives a brief description of the HSI datasets. There are a total of three datasets that we have utilized for the experiments: Salinas, Salinas A, and Pavia University.
2.2 Salinas
This dataset was gathered over Salinas Valley, California, using the AVIRIS sensor [17], and it has a resolution of 3.7 meters per pixel. Its dimension is 512 × 217 × 224, and the total number of classes is 16. Table 1 gives the details of the samples and classes. The hyperspectral data covers bare soils, vegetables, and vineyard fields. Of the 16 classes in total, 6 classes are taken as source (training data) and 8 classes of the Salinas dataset are taken as target (testing data). The train/test split is 80%/20%. In addition, around 5702 samples are utilized only for testing (target); these 5702 samples are acquired by joining two classes, namely fallow and broccoli green weeds 2.
2.3 Salinas A
2.4 Pavia U
Pavia U is one of two scenes acquired by the ROSIS sensor over Pavia, Northern Italy [17]. The Pavia University image comprises 610 × 610 pixels, and the number of
Table 3 Dataset description for Pavia U with five classes (source) and Pavia U with eight classes (target) [17]

Class   Class labels                                             Source (Pavia U with five classes)   Target (Pavia U with eight classes)
0       Asphalt                                                  5305                                 1325
1       Gravel                                                   1679                                 420
2       Trees                                                    2451                                 613
3       Painted metal sheets                                     1076                                 269
4       Bare soil                                                4023                                 1006
5       Unknown class (meadows, self-blocking bricks, shadows)   0                                    23,278
        Total samples                                            14,534                               26,911
bands is 103. It contains nine classes in total, and the resolution is 1.3 meters. Table 3 shows the dataset description with the five-class Pavia U as source and the eight-class Pavia U as target. Five classes of Pavia U are shared by both source and target. Three classes with around 23,278 samples are utilized only for testing (target); these 23,278 samples are acquired by consolidating three classes, namely shadows, self-blocking bricks, and meadows.
3 Experiment Results
The proposed work is implemented on three datasets and assessed through classification accuracy; the same model is utilized for each dimension-reduced dataset. The optimal values utilized on the raw datasets for getting good outcomes are a batch size of 128 and 500 epochs. A learning rate of 0.0001 is utilized for the datasets, and Adam optimization with cosine ramp-down is utilized for training the network.
The hardware configuration utilized for the experiment is an Intel Core i5-8250U with a clock speed of 1.6 GHz. The GPU driver version is NVIDIA-SMI 391.25, and CUDA 10.0 is likewise utilized for computation.
For the Salinas dataset, Salinas with six classes is taken as source, and the target is Salinas with eight classes. Accuracies are acquired for the Salinas dataset for various learning rates. The accuracy for 20% of the bands (44 bands) is the same as that for the raw data at a learning rate of 0.0001. The highest classification accuracy of 98.42% is accomplished at a learning rate of 0.0001 for the raw data comprising 224 bands; the accuracy for the unknown class is 98.40% and for the known classes 98.00%, and hence the learning rate is set to 0.0001. Likewise, for 44 bands, the highest classification accuracy of 97.66% is achieved at a learning rate of 0.0001; the accuracy for the unknown class is 98.45% and for the known classes 96.88%, so the rate is again fixed at 0.0001. For the Salinas dataset, 20% of the spectral information is thus sufficient to acquire practically identical accuracy. Table 4 shows the comparison of classification accuracies for the dimension-reduced data, and it also shows the time taken for network training.
To verify the efficiency of the approach, we shuffled the dataset in such a way that every class appears among both the known and the unknown classes. For the Salinas dataset, shuffling was done three times, and we analyze the performance of the model on these new sets of data in terms of overall accuracy (OA) and computation time. Classification results for the shuffled Salinas dataset are shown in Table 5.
In the following step, the Salinas A dataset is utilized for network training. Five classes are utilized as source, and six classes are utilized as target. Classification accuracies are practically the same for 10% of the bands (22 bands) at a learning rate of 0.0001. The highest classification accuracy of 99.80% is accomplished at a learning rate of 0.0001 for the raw dataset comprising 224 bands; the accuracy for the unknown class is 99.67% and for the known classes 99.89%, so the learning rate is set to 0.0001. Likewise, for 22 bands (10%), the highest accuracy of 98.53% is accomplished at a learning rate of 0.0001, so it is again set to 0.0001; 98.32% is the unknown-class accuracy and 99.01% the known-class accuracy. For the Salinas A dataset, 10% of the bands are sufficient to acquire comparable accuracy with less computation time. Table 6 shows the accuracies for the feature-reduced data together with the time taken for network training; it is clear from the table that the computation time is also decreased after feature reduction.
To verify the efficiency of the approach, we shuffled the dataset in such a way that every class appears among both the known and the unknown classes. For the Salinas A dataset, shuffling was done five times, and we analyze the performance of the model on these new sets of data in terms of overall accuracy (OA) and computation time. Classification results for the shuffled Salinas A dataset are shown in Table 7.
As our third analysis, the Pavia U dataset is utilized for network training. Five classes are utilized as source, and eight classes are utilized as target. Classification accuracies are practically the same for 30% of the bands (31 bands) at a learning rate of 0.0001. The highest classification accuracy of 83.10% is accomplished at a learning rate of 0.0001 for the raw data comprising 103 bands; the unknown-class accuracy is 86.22% and the known-class accuracy 60.42%, and hence the learning rate is set to 0.0001. Similarly, for 31 bands (30% of the bands), the highest classification accuracy of 82.3% is accomplished at a learning rate of 0.0001, so it is again set to 0.0001; the unknown-class accuracy is 85.74% and the known-class accuracy 59.42%. For the Pavia U dataset, 30% of the bands are sufficient to get practically identical accuracy with less computation time. Table 8 shows the accuracies for the feature-reduced data, and it additionally shows the time taken for
Table 7 Classification results for shuffled Salinas A dataset

             Dataset 1           Dataset 2           Dataset 3           Dataset 4           Dataset 5
             100% of   10% of    100% of   10% of    100% of   10% of    100% of   10% of    100% of   10% of
             bands     bands     bands     bands     bands     bands     bands     bands     bands     bands
OA (%)       99.83     98.75     98.78     97.32     98.98     97.65     99.90     98.98     99.65     98.71
Time (min)   35        19        32        13        38        15        28        12        32        13
network training. It is clear from the table that the computational time is likewise
diminished after feature reduction.
To verify the efficiency of the approach, we shuffled the dataset in such a way that every class appears among both the known and the unknown classes. For the Pavia U dataset, shuffling was done three times, and we analyze the performance of the model on these new sets of data in terms of overall accuracy (OA) and computation time. Classification results for the shuffled Pavia U dataset are shown in Table 9.
From the previous experimental results using DMD, it is evident that, though there is dimension reduction, the model is more or less able to achieve almost the same classification accuracy as on the raw HSI data. The experimental results also show that the maximum possible reductions in feature dimension that yield comparable classification accuracy are 20% of the total available bands (44 bands) for the Salinas dataset, 10% (22 bands) for the Salinas A dataset, and 30% (31 bands) for the Pavia U dataset. Shuffling of classes has been done for every dataset to analyze the accuracy and computation time at each percentage-wise dimension reduction. As an extension of the experiment, the spectral features are approximated using Chebyshev coefficients, to check whether it is possible to reduce the spectral features further with better classification accuracy. Chebyshev approximation helps to truncate the data to a minimum number of coefficients. The model is trained with these truncated coefficients, and the
Table 10 Classification results for Salinas dataset after dimension reduction using Chebfun

                         44 coefficients
                         (20% of the bands)   34 coefficients   24 coefficients   14 coefficients
OA (%)                   99.30                98.74             95.43             90.94
Computation time (min)   46                   38                29                11
PSNR                     25.07                25.02             23.24             19.42
Fig. 2 (i) Spectral signature plot corresponding to one sample from class 0 (broccoli green weeds
1) for Salinas original dataset. (ii) Spectral signature plot corresponding to one sample from class
1 (corn senesced green weeds) for Salinas original dataset. (a) Plot between no. of bands and
reflectance value. (b) Plot after using Chebyshev without truncation. (c) Plot after using Chebyshev
approximation with truncated 34 and 24 coefficients
number of epochs and learning rate that were employed in the case of DMD are used here to run the model. From the table, it is obvious that there is a significant increase in accuracy, to 99.64%, for the classification using coefficients from the Chebyshev polynomial approximation; it was 98.53% for 10% of the bands (22 bands) when using DMD. The reduction process is then repeated until we get the minimum number of coefficients with comparable classification accuracy. The PSNR is computed between the original dataset and the dimensionally reduced dataset, and the computation time is reduced by the dimension reduction. From the
Fig. 3 Classification maps obtained before and after dimension reduction for Salinas dataset.
(a) Without dimension reduction. (b) After using DMD (44 bands). (c) After using Chebyshev
approximation (34 coefficients)
Table 12 Classification results for Salinas A dataset after dimension reduction using Chebyshev

                         22 coefficients (10% of the bands)   15 coefficients   10 coefficients
OA (%)                   99.64                                99.50             97.84
Computation time (min)   13                                   10                10
PSNR                     21.78                                20.08             17.54
table, it is evident that the overall accuracy and PSNR remain comparable to the
results for 22 coefficients.
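The PSNR reported in the tables compares the original cube with the one reconstructed from the truncated representation; a minimal computation (on illustrative arrays, not the Salinas data) might look like:

```python
import numpy as np

def psnr(original, reconstructed, peak=None):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    if peak is None:
        peak = np.max(np.abs(original))        # use the data's own peak value
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
cube = rng.uniform(-1, 1, size=(50, 50, 204))          # toy hyperspectral cube
noisy = cube + rng.normal(0, 0.01, size=cube.shape)    # stand-in for reconstruction error
print(f"PSNR: {psnr(cube, noisy):.2f} dB")
```

With a reconstruction error of standard deviation 0.01 against data peaking near 1, the PSNR comes out around 40 dB.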
As the next step, the efficiency of the approach was verified by shuffling the
dataset in such a way that every class appears under both the known and unknown
classes. In the case of the Salinas A dataset, we shuffled five times and analyzed
the performance of the model on these new datasets in terms of overall accuracy
(OA) and computation time. The results show that this approach is efficient in
reducing spectral features without further information loss. Classification results for
the shuffled Salinas A dataset are shown in Table 13. The table compares the
classification accuracies obtained for the maximum possible reduction of bands
in the case of DMD and the maximum possible reduction of coefficients in the case
of Chebyshev approximation. To visualize the variation in the pattern of the data,
we have plotted the spectral signature of one pixel from one class of the Salinas
A dataset before and after truncation using Chebyshev approximation. Figure 4(i)
compares the plot of one pixel from class 5 (lettuce romaine 7 wk) before and after
truncation with 15 and with 10 coefficients. The data has been normalized to the
range [-1, 1].
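The fit-and-truncate step can be sketched with NumPy's Chebyshev routines; the spectral signature below is synthetic, and only the coefficient count (15) follows the setting above:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Synthetic spectral signature over 204 bands, normalized to [-1, 1]
bands = np.linspace(-1.0, 1.0, 204)          # band index mapped to Chebyshev domain
signature = np.tanh(2 * bands) + 0.1 * np.cos(12 * bands)

# Fit a high-degree Chebyshev series, then keep only the first k coefficients
coeffs_full = C.chebfit(bands, signature, deg=60)
k = 15
coeffs_trunc = coeffs_full[:k]               # truncated representation (the reduced features)

# Reconstruct the signature from the truncated series
recon = C.chebval(bands, coeffs_trunc)
err = np.max(np.abs(recon - signature))
print(f"{k} coefficients, max reconstruction error: {err:.4f}")
```

Because Chebyshev coefficients of smooth signals decay rapidly, a short prefix of the series already reconstructs the signature closely, which is why the truncated plots track the originals.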
It is evident from Fig. 4(i) that the plot of the pixel after truncation is almost
similar to the plot of the original pixel before truncation in the case of using 15
coefficients.
Table 13 Classification results for shuffled Salinas A dataset (DMD: with 22 bands; Chebyshev: with 15 coefficients)

             Dataset 1        Dataset 2        Dataset 3        Dataset 4        Dataset 5
             DMD    Cheb.     DMD    Cheb.     DMD    Cheb.     DMD    Cheb.     DMD    Cheb.
OA (%)       98.75  99.70     97.32  98.67     97.65  97.34     97.09  99.50     98.71  99.43
Time (min)   19     11        13     10        15     10        12     12        13     12
Deep Learning-Based Open Set Domain Hyperspectral Image Classification. . .
287
Fig. 4 (i) Spectral signature plot corresponding to one sample from class 5 (lettuce romaine 7
wk) for Salinas A original dataset. (ii) Spectral signature plot corresponding to one sample from
class 4 (lettuce romaine 6 wk) for Salinas A original dataset. (a) Plot between no. of bands and
reflectance value. (b) Plot after using Chebyshev without truncation. (c) Plot after using Chebyshev
approximation with truncated 15 and 10 coefficients
Fig. 5 Classification maps obtained before and after dimension reduction for Salinas A dataset.
(a) Without dimension reduction. (b) After using DMD (22 bands). (c) After using Chebyshev
approximation (15 coefficients)
Table 14 Classification results for Pavia U dataset after dimension reduction using Chebfun

                         31 coefficients (20% of the bands)   21 coefficients   11 coefficients
OA (%)                   82.84                                92.24             90.94
Computation time (min)   116                                  104               98
PSNR                     40.77                                41.13             41.09
The table also compares the classification accuracies obtained for the maximum
possible reduction of bands in the case of DMD and the maximum possible reduction
of coefficients in the case of Chebyshev approximation.
To visualize the variation in the pattern of the data, we have plotted the spectral
signature of one pixel from one class of the Pavia U dataset before and after
truncation using Chebyshev approximation. Figure 6(i) compares the plot of one
pixel from class 2 (trees) before and after truncation with 21 and with 11
coefficients. The data has been normalized to the range [-1, 1].
It is evident from Fig. 6(i) that the plot of the pixel after truncation with 21
coefficients is almost identical to the plot of the original pixel. With 11
coefficients, however, the signal is smoothed and more information is lost, hence
the lower accuracy. The spectral signature of one pixel from another class is also
plotted: Fig. 6(ii) compares the plot of one pixel from class 0 (asphalt) before
and after truncation with 21 and with 11 coefficients. Again, the plot after
truncation with 21 coefficients is almost identical to the original, whereas with
11 coefficients the signal is smoothed and more information is lost, hence the
lower accuracy.
Figure 7 compares the classification maps obtained for the Pavia U dataset under
the different dimension reductions, for both techniques. It is evident that the map
obtained after Chebyshev polynomial approximation shows better clarity in the scene.
Fig. 6 (i) Spectral signature plot corresponding to one sample from class 2 (trees) for Pavia U
original dataset. (ii) Spectral signature plot corresponding to one sample from class 0 (asphalt). (a)
Plot between no. of bands and reflectance value. (b) Plot after using Chebyshev without truncation.
(c) Plot after using Chebyshev approximation truncated to 21 and 11 coefficients
4 Conclusion
In this work, open set domain adaptation with a GAN model has been applied to
HSI classification, and the same model has also been applied to the dimension-reduced
datasets. In the first stage, dynamic mode decomposition (DMD) is utilized as
the feature reduction procedure, and the outcomes show that this method is very
effective in removing the redundancy of bands without much data loss. It performed
well on open set domain HSI classification. The fundamental aim of the model
is to identify the classes that were absent during training as unknown. From the
Legend: asphalt, gravel, trees, metal sheets, bare soil, unknown class (meadows, bricks, shadows)
Fig. 7 Classification maps obtained before and after dimension reduction for Pavia U dataset.
(a) Without dimension reduction. (b) After using DMD (31 bands). (c) After using Chebyshev
approximation (21 coefficients)
results for the three datasets, it is evident that the model is able to accomplish
practically identical classification accuracy to that of the raw dataset even after
dimension reduction. The computation time is likewise decreased after dimension
reduction for all three datasets. Additionally, the results show that 20% of the
bands of the Salinas dataset, 10% of the bands for the Salinas A dataset, and 30%
of the bands for the Pavia U dataset are the maximum feasible feature reductions
that result in practically the same accuracies. In the next phase, we explored a
novel Chebyshev approximation-based dimensionality reduction technique for HSI
classification, to check whether it is possible to reduce the dimension of each
dataset further with comparable classification accuracy. The performance of the
model is analyzed in terms of overall accuracy, computation time, and PSNR.
The results show that the Chebyshev polynomial approximation is a very effective
approach for approximating the spectral features and yields good classification
accuracy. The computation time for the model with reduced data is much lower than
with raw data for each dataset. The experimental results also show that only 15
coefficients are needed for Salinas A, about 34 coefficients for the Salinas dataset,
and 21 coefficients for the Pavia U dataset, which results in better classification
accuracy.
An Effective Diabetic Retinopathy
Detection Using Hybrid Convolutional
Neural Network Models
1 Introduction
Diabetic retinopathy is an eye disease caused by damage to specific blood vessels
[11], namely the arteries and veins of the photosensitive tissue at the rear of the
eye (the retina), and it usually affects both eyes. Initially there are no symptoms,
or only mild vision problems, but the disease eventually leads to blindness. It is
therefore necessary to detect and categorize diabetic retinopathy at an early stage
for effective treatment and prevention of vision loss [10]. Diabetic retinopathy
identification is a long, manual process in which a qualified physician analyzes
and measures digital color images of the retina. According to previous studies
on DR treatment and the Wisconsin Epidemiologic Study of Diabetic Retinopathy,
a popularly used classification of the five stages of DR was initially proposed
by Wilkinson in 2003. A brief description is given in Table 1. The stages are:
no apparent retinopathy (I), mild non-proliferative diabetic retinopathy (NPDR)
(II), moderate NPDR (III), severe NPDR (IV), and proliferative diabetic retinopathy
(V) [4]. The classification also provides insight into the intensity of the DR
stages according to observations made through dilated pupils during an
ophthalmoscopy examination, also referred to as fundoscopy.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 295
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in
Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_14
296 N. Kumar et al.
2 Related Work
over 60 million. The integration of the rectified linear unit (ReLU) activation
function, which introduces nonlinearity into the neural network architecture, is an
important feature of the AlexNet implementation. Apart from ReLU, there are other
nonlinear functions, such as the hyperbolic tangent (tanh) and the sigmoid, but
these are saturating in nature. The reason for opting for ReLU as the activation
function is that it reaches a comparable training error rate about six times faster
than tanh. AlexNet also features dropout, which prevents the neural network model
from overfitting. Since the distribution of the dataset over the five categories of
DR was not even, the method could only produce an average accuracy of 71.06%.
Like the previously discussed AlexNet model, a number of neural network models
have evolved over the course of technological development. One such instance is
the VGG16 model [15], comprising 16 weighted layers and about 138 million
parameters. Its simplicity and standardized implementation are among the prime
reasons to adopt this architecture: all the convolutional hidden layers use 3 × 3
filters, 2 × 2 kernels are used in the pooling layers, and the number of filters
grows from 64 up to 512, in powers of 2. At the time, the performance of this
architecture was considered satisfactory, as it provided about 51% average accuracy.
3 Methodology
Hybrid CNN models combine standard CNN models with other machine learning
techniques. Here we focus on integrating the CNN model with support vector
machine [3] and random forest classifiers. We found that the proposed deep
learning architecture performs quite well, with an average accuracy of 75%.
Feature selection is the process by which the number of input variables presented
to a classifier is reduced, which improves the performance of the model. Feature
selection [6] can be used to identify and remove variables that are not relevant
and do not increase the accuracy of the classifier. We have chosen two features:
blood vessel area and exudate area. Small, delicate blood vessels can break beneath
the tissue and cover the white of the eye, resulting in eye redness. Eye redness
indicates a hemorrhage, which is a sign of diabetic risk. The normal blood vessel
area is 36,230.56 [17], and this value decreases as contraction occurs in diabetic
retinopathy; in the retina, blood vessels become damaged when diabetes occurs.
The yellow patches are called hard exudates. These are an informative feature,
so we consider them in the feature selection process.
For the classification of the retinal images, we deploy a convolutional neural
network model. To achieve better performance and accuracy, the neural network
model is coupled with different classifiers, namely random forest and support
vector machine. The models (Fig. 1) are described below.
An Effective Diabetic Retinopathy Detection Using Hybrid Convolutional. . . 299
The convolutional neural network (CNN) is effective for image recognition and
classification. When input images are fed into a CNN, the main aim is to extract
features from them. Preprocessing, such as blurring, sharpening, and edge detection,
is applied to the input images. The rectified linear unit (ReLU) is used after every
convolutional layer of the CNN; it is a nonlinear operation that replaces all
negative values in the feature map by zero. While keeping most of the important
features, the dimensionality of each feature map is reduced using pooling and
subsampling; pooling takes the sum, average, or max of each sub-region of the
feature map. After these layers, a softmax activation is applied to the output to
complete the classification.
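As a concrete instance, non-overlapping 2 × 2 max pooling can be written with a NumPy reshape:

```python
import numpy as np

def max_pool_2x2(fm):
    """Non-overlapping 2x2 max pooling over a (H, W) feature map (H, W even)."""
    h, w = fm.shape
    # Split into 2x2 blocks, then take the max of each block
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 2, 0, 1],
               [4, 3, 2, 2],
               [0, 1, 5, 6],
               [1, 0, 7, 8]], dtype=float)
print(max_pool_2x2(fm))   # [[4. 2.] [1. 8.]]
```

Each 2 × 2 block collapses to its maximum, halving both spatial dimensions while preserving the strongest responses.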
For classification of the DR stages, we implemented our proposed convolutional
neural network model. Below, we discuss the proposed model and explain the
configurations used in these CNN implementations.
Figure 2 represents the hybrid CNN with SVM classifier model. The proposed
model consists of a CNN for feature extraction from the images, and the extracted
features are used by an SVM for classification.
The CNN consists of 5 layers, in which the first layer is the input layer and
the last is the output layer. The normalized images fed into the input layer are
144 × 256 raw pixel images. As shown in Fig. 2, C1 is a convolution layer with
32 5 × 5 filters, C2 is a convolution layer with 64 3 × 3 filter maps, and C3 is
a convolution layer with 128 3 × 3 filter maps. The output layer is a dense layer
with 128 units of feature maps. Between the convolutional layers, max-pooling
filters of size 2 × 2 are arranged consecutively. Before the dense output layer,
the features are flattened. The intermediate representation generated by the dense
layer is fed into the support vector machine classifier for segmentation and
classification.
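The feature-map sizes through such a stack can be sanity-checked with simple arithmetic; 'valid' (unpadded) convolutions and non-overlapping 2 × 2 pooling are assumptions here, since the text does not state the padding scheme:

```python
def conv_out(h, w, k):
    """Output size of an unpadded (valid) k x k convolution."""
    return h - k + 1, w - k + 1

def pool_out(h, w, p=2):
    """Output size of non-overlapping p x p max pooling."""
    return h // p, w // p

h, w = 144, 256                      # normalized input image size from the text
h, w = conv_out(h, w, 5)             # C1: 32 filters, 5 x 5
h, w = pool_out(h, w)
h, w = conv_out(h, w, 3)             # C2: 64 filters, 3 x 3
h, w = pool_out(h, w)
h, w = conv_out(h, w, 3)             # C3: 128 filters, 3 x 3
h, w = pool_out(h, w)
print(h, w, "-> flattened length:", h * w * 128)
```

Under these assumptions, the flattened vector has 16 × 30 × 128 = 61,440 values before the 128-unit dense layer; with 'same' padding the numbers would differ.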
As shown in Fig. 2, after flattening, the output of the dense layer is fed into the
SVM classifier. The support vector machine (SVM) [13] is a classifier that performs
very well when there is a clear margin of separation between classes, and it
remains effective in high-dimensional spaces. Since diabetic retinopathy detection
is a multiclass classification problem, the SVM must classify multiclass labels.
The complexity of SVM increases when there are more than two classes; one way to
handle this is the one-against-all support vector machine (OAASVM): for an N-class
problem (N > 2), N binary SVM classifiers are built [7].
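A minimal one-against-all scheme can be sketched with NumPy; the least-squares linear scorers below are stand-ins for the N binary SVMs, and the data and all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_features = 3, 2

# Toy data: three well-separated 2-D clusters, 30 samples each
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(0.0, 0.3, size=(30, n_features)) for c in centers])
y = np.repeat(np.arange(n_classes), 30)

# One-against-all: one binary scorer per class (+1 for the class, -1 for the rest)
Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
W = np.empty((n_classes, n_features + 1))
for k in range(n_classes):
    target = np.where(y == k, 1.0, -1.0)       # class k against all others
    W[k] = np.linalg.lstsq(Xb, target, rcond=None)[0]

# Predict: assign each sample to the class whose scorer responds most strongly
pred = np.argmax(Xb @ W.T, axis=1)
print("training accuracy:", np.mean(pred == y))
```

Replacing each least-squares scorer with a trained binary SVM yields the OAASVM decision rule described above.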
Figure 3 represents the hybrid CNN with RF classifier model. The proposed model
consists of a CNN for feature extraction from the images, and the extracted
features are used by a random forest for classification.
The CNN feature extraction is the same as that explained for the CNN with SVM
classifier; the main difference is the coupled classifier, which in this case is
the random forest (RF). A random forest consists of many individual decision
trees that operate as an ensemble [12]: each decision tree predicts a class, and
the class with the maximum votes becomes the prediction of the model.
The CNN model is trained to extract features from the input image, and the
fully connected layer of the CNN is replaced by a random forest classifier to
classify the image pattern. The output of the dense layer of the CNN produces a
feature vector representing the image pattern, consisting of 645 values. The
random forest classifier is trained using the features produced by the CNN model
and uses features extracted by the CNN to make decisions on the test images. In
the experiment, the random forest contains 50 individual decision trees, with
other parameters kept at their default values.
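The ensemble voting step can be shown directly; the per-tree votes below are made up for illustration:

```python
from collections import Counter

# Hypothetical predictions from 50 decision trees for one test image
# (DR stages I-V encoded as 0-4); the vote counts are illustrative only.
tree_votes = [2] * 28 + [1] * 12 + [3] * 10

# Majority vote: the class predicted by the most trees wins
winner, count = Counter(tree_votes).most_common(1)[0]
print(f"predicted stage: {winner} ({count}/{len(tree_votes)} votes)")
```

Here stage 2 wins with 28 of 50 votes, which is exactly the max-vote rule the random forest applies per test image.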
some amount of noise. From the Kaggle dataset, 8407 representative, high-quality
images, constituting about 8 GB of data, were selected to build the dataset used
for training and testing the proposed models reported in this chapter. Of these
8407 images, 6112 are used for training the model (Table 2). Finally, for testing,
almost 10% of the images, i.e., 768 images, are employed. The images are chosen
in such a way that each stage in the reorganized dataset has a reasonably balanced
collection. From Table 3, we can observe that the models using the LeakyReLU
activation function improve more on applying preprocessing. This is mostly due to
the loss of information in the ReLU activation function when the output value
becomes negative; in LeakyReLU, this negative output value is not discarded, and
a parametric scaling is applied instead.
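The difference is easy to see numerically; the slope 0.01 below is a common default, not necessarily the value used in these experiments:

```python
import numpy as np

def relu(x):
    # Negative activations are zeroed out entirely
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Negative activations survive, scaled down by alpha
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # negative inputs are discarded
print(leaky_relu(x))  # negative inputs are preserved, scaled by alpha
```

The nonzero gradient on the negative side is what lets LeakyReLU retain the information that plain ReLU discards.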
The accuracy metric is adopted to measure and compare the performance of the
different CNN-based classifiers. Equation (1) gives the formal definition of
accuracy, where χ_i = 1 if the predicted class is the true class for image i,
and χ_i = 0 otherwise:

Accuracy = (1/m) Σ_{i=1}^{m} χ_i    (1)
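Equation (1) is simply the mean of the per-image indicator; in code, with made-up labels over the five DR stages:

```python
# Accuracy as in Eq. (1): mean of the indicator chi_i over m test images.
def accuracy(y_true, y_pred):
    chi = [1 if t == p else 0 for t, p in zip(y_true, y_pred)]
    return sum(chi) / len(chi)

# Illustrative predictions over the five DR stages (labels are made up)
y_true = [0, 1, 2, 3, 4, 2, 1, 0]
y_pred = [0, 1, 2, 4, 4, 2, 0, 0]
print(accuracy(y_true, y_pred))   # 6 of 8 correct -> 0.75
```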
The test accuracy obtained by the models is in the range of 70–75%, using just
24% of the images available in the DR dataset. Our experimental results indicate
the importance of CNN and machine learning techniques for detecting the different
diabetic retinopathy stages. Even on such small training data, the accuracy of the
models is reasonable, indicating that there is room for further improvement of the
models in the future. The models can be employed with user-friendly user
Acknowledgments We would like to express our gratitude toward the Information Technology
Department of NITK, Surathkal for its kind cooperation and encouragement that helped us in the
completion of this project entitled “An Effective Diabetic Retinopathy Detection using Hybrid
Convolutional Neural Network Models.” We would like to thank the department for providing
the necessary cluster and GPU technology to implement the project in a preferable environment.
We are grateful for the guidance and constant supervision as well as for providing necessary
information regarding the project and also for its support in completing the project.
References
1. Bhatia, K., Arora, S., & Tomar, R. (2016). Diagnosis of diabetic retinopathy using machine
learning classification algorithm. In 2016 2nd International Conference on Next Genera-
tion Computing Technologies (NGCT) (pp. 347–351). https://doi.org/10.1109/NGCT.2016.
7877439
2. Boral, Y. S., & Thorat, S. S. (2021). Classification of diabetic retinopathy based on hybrid
neural network. In 2021 5th International Conference on Computing Methodologies and Com-
munication (ICCMC) (pp. 1354–1358). https://doi.org/10.1109/ICCMC51019.2021.9418224
3. Carrera, E. V., González, A., & Carrera, R. (2017). Automated detection of diabetic retinopathy
using SVM. In 2017 IEEE XXIV International Conference on Electronics, Electrical Engi-
neering and Computing (INTERCON) (pp. 1–4). https://doi.org/10.1109/INTERCON.2017.
8079692
4. Cuadros, J., & Bresnick, G. (2009). EyePACS: An adaptable telemedicine system for diabetic
retinopathy screening. Journal of Diabetes Science and Technology, 3, 509–516.
5. Harun, N. H., Yusof, Y., Hassan, F., & Embong, Z. (2019). Classification of fundus images for
diabetic retinopathy using artificial neural network. In 2019 IEEE Jordan International Joint
Conference on Electrical Engineering and Information Technology (JEEIT) (pp. 498–501).
https://doi.org/10.1109/JEEIT.2019.8717479
6. Herliana, A., Arifin, T., Susanti, S., & Hikmah, A. B. (2018). Feature selection of diabetic
retinopathy disease using particle swarm optimization and neural network. In: 2018 6th
International Conference on Cyber and IT Service Management (CITSM) (pp. 1–4). https://
doi.org/10.1109/CITSM.2018.8674295
7. Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multiclass support vector
machines. IEEE Transactions on Neural Networks, 13(2), 415–425. https://doi.org/10.1109/
72.991427
8. Jayakumari, C., Lavanya, V., & Sumesh, E. P. (2020). Automated diabetic retinopathy detection
and classification using ImageNet convolution neural network using fundus images. In: 2020
International Conference on Smart Electronics and Communication (ICOSEC) (pp. 577–582).
https://doi.org/10.1109/ICOSEC49089.2020.9215270
9. Jiang, H., Xu, J., Shi, R., Yang, K., Zhang, D., Gao, M., Ma, H., & Qian, W. (2020). A multi-
label deep learning model with interpretable Grad-CAM for diabetic retinopathy classification.
In: 2020 42nd Annual International Conference of the IEEE Engineering in Medicine Biology
Society (EMBC) (pp. 1560–1563). https://doi.org/10.1109/EMBC44109.2020.9175884
10. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep
convolutional neural networks. In: Proceedings of the 25th International Conference on Neural
Information Processing Systems (Vol. 1, pp. 1097–1105). NIPS’12, Red Hook, NY, USA:
Curran Associates.
11. Kumar, S., & Kumar, B. (2018). Diabetic retinopathy detection by extracting area and number
of microaneurysm from colour fundus image. In: 2018 5th International Conference on
Signal Processing and Integrated Networks (SPIN) (pp. 359–364). https://doi.org/10.1109/
SPIN.2018.8474264
12. Ramani, R. G., Shanthamalar, J. J., & Lakshmi, B. (2017). Automatic diabetic retinopathy
detection through ensemble classification techniques automated diabetic retinopathy classifi-
cation. In: 2017 IEEE International Conference on Computational Intelligence and Computing
Research (ICCIC) (pp. 1–4). https://doi.org/10.1109/ICCIC.2017.8524342
13. Roy, A., Dutta, D., Bhattacharya, P., & Choudhury, S. (2017). Filter and fuzzy C means based
feature extraction and classification of diabetic retinopathy using support vector machines. In:
2017 International Conference on Communication and Signal Processing (ICCSP) (pp. 1844–
1848). https://doi.org/10.1109/ICCSP.2017.8286715
14. Roychowdhury, S., Koozekanani, D. D., & Parhi, K. K. (2016). Automated detection of
neovascularization for proliferative diabetic retinopathy screening. In: 2016 38th Annual
International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)
(pp. 1300–1303). https://doi.org/10.1109/EMBC.2016.7590945
15. Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale
image recognition. CoRR, abs/1409.1556.
16. Sodhu, P. S., & Khatkar, K. (2014). A hybrid approach for diabetic retinopathy analysis.
International Journal of Computer Application and Technology, 1(7), 41–48.
17. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke,
V., & Rabinovich, A. (2015). Going deeper with convolutions. In: 2015 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) (pp. 1–9). https://doi.org/10.1109/CVPR.
2015.7298594
Modified Discrete Differential Evolution
with Neighborhood Approach for
Grayscale Image Enhancement
1 Introduction
Evolutionary Algorithms (EAs) are potential optimization tools for a wide range
of benchmark and real-world optimization problems. The most prominent
algorithms under EAs are Differential Evolution (DE), Genetic Algorithm (GA),
Genetic Programming (GP), Evolutionary Programming (EP), and Evolutionary
Strategies (ES). Though the algorithmic structure of these algorithms is similar,
their performance varies based on different factors, viz., population representation,
variation operations, selection operations, and the nature of the problem to be
solved.
Among these algorithms, DE is simpler and is applicable to complex real-valued
parameter optimization problems. However, the differential mutation operation of
DE makes it not directly applicable to discrete parameter optimization problems.
Considering the unique advantages of DE, extending its applicability to discrete
optimization problems is an active area of research. In computer vision,
good-contrast images play a vital role in many image processing applications, and
over the past few decades extensive research has been carried out on metaheuristic
approaches for automatic image enhancement.
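For background, classical DE mutation acts on real vectors: v = x_r1 + F·(x_r2 − x_r3). One common generic way to map such real vectors onto discrete tours is random-key decoding, sketched below; this illustrates the gap the chapter's proposed mapping addresses, not the proposed technique itself:

```python
import numpy as np

rng = np.random.default_rng(42)
dim, pop_size, F = 8, 10, 0.5          # 8-city TSP, population of 10, scale factor F

# Real-valued population; each individual encodes a tour via "random keys"
pop = rng.uniform(0, 1, size=(pop_size, dim))

def mutate(pop, i):
    """DE/rand/1 differential mutation for target index i."""
    r1, r2, r3 = rng.choice([j for j in range(len(pop)) if j != i], 3, replace=False)
    return pop[r1] + F * (pop[r2] - pop[r3])

def to_tour(vec):
    """Random-key decoding: the sort order of the components gives a permutation."""
    return np.argsort(vec)

v = mutate(pop, 0)                      # mutant is real-valued, not a tour itself
tour = to_tour(v)                       # decoding always yields a valid permutation
print(tour)
```

The mutant vector itself is meaningless as a tour; only through a mapping such as argsort does it become a valid permutation, which is why the choice of mapping is central to discrete DE.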
The objective of the study presented in this chapter is to propose an algorithmic
change to DE, adding a new mapping technique to make it suitable for discrete
optimization problems. The performance of DE with the proposed mapping technique
was tested on benchmark travelling salesperson problems (TSP) and an image
enhancement problem. The algorithmic structure, design of experiments, results,
and discussion are presented in this chapter.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 307
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in
Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_15
308 A. Radhakrishnan and G. Jeyakumar
2 Related Works
sensing. Regions with low contrast appear dark, and high-contrast regions appear
unnaturally illuminated; the outcome of both is a loss of pertinent information.
Thus, optimal enhancement of image contrast that represents the relevant
information of the input image is a difficult task [16, 17]. There is no generic
approach to image enhancement; approaches are image dependent. Histogram
Equalization (HE) and its variants are effectively applied to enhance image
contrast and are widely used in several image processing applications [18–20]. The
major drawback of this approach is that, for darker and lighter images, it does not
produce a quality enhanced image, due to noise and the loss of relevant information.
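For reference, the plain HE baseline discussed above can be sketched in NumPy for an 8-bit grayscale image (the low-contrast test image is synthetic):

```python
import numpy as np

def equalize_hist(img):
    """Histogram equalization for an 8-bit grayscale image (NumPy uint8 array)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(np.float64)
    cdf_min = cdf[np.nonzero(cdf)][0]                 # first nonzero CDF value
    # Classic HE mapping: stretch the CDF to cover the full [0, 255] range
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[img]

rng = np.random.default_rng(1)
low_contrast = rng.integers(100, 140, size=(64, 64), dtype=np.uint8)  # narrow gray range
enhanced = equalize_hist(low_contrast)
print(low_contrast.min(), low_contrast.max(), "->", enhanced.min(), enhanced.max())
```

HE spreads the narrow 100–139 range across the full 0–255 scale; the drawbacks noted above (noise amplification, mean shift) are properties of exactly this global remapping.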
In recent years, several bioinspired algorithmic approaches have been used for
image contrast enhancement [21]. These algorithms search for the optimal mapping
of the gray levels of the input image to new gray levels with enhanced contrast.
Automatic image enhancement requires a well-defined evaluation criterion that is
valid across a wide range of datasets. An approach of tuning the parameters of a
transformation function can be adopted, with the transformation function evaluated
by the objective function; bioinspired algorithms search for the optimal
combination of transformation parameters stochastically. Embedding a
population-based approach in image enhancement has gained wide popularity in
recent years, as it helps to explore and exploit such complex problems and to
search the solution space for the optimal parameter setting [22, 23]. A large body
of literature reports the application of metaheuristic algorithms to image contrast
enhancement.
Pal et al. used a Genetic Algorithm (GA) for automating operator selection for
image enhancement [24]. Saitoh used a Genetic Algorithm for modeling the intensity
mapping of the transformation function; this approach produced better results with
respect to execution time [25]. Braik et al. examined Particle Swarm Optimization
(PSO) for enhancement by increasing the entropy and edge details [26]. Dos Santos
Coelho et al. [27] modeled three DE variants by adopting an objective function
similar to that proposed in [26]. The advantage of this approach was faster
convergence, but it could not provide suitable statistical evidence for the quality
of the enhanced images. Shanmugavadivu and Balasubramanian proposed a new method
that avoids the mean shift that occurs during the equalization process [28]; this
method could preserve the brightness of enhanced images. Hybridization approaches
have also been used to improve the quality of enhanced images. Mahapatra et al.
proposed a hybrid approach in which PSO is combined with a Negative Selection
Algorithm (NSA) [29]; this method could preserve the number of edge pixels. Suresh
and Lal [30] investigated a Modified Differential Evolution (MDE), which could
avoid premature convergence. It also enhanced the exploitation capability of DE by
adapting the Lévy flight from Cuckoo search. In MDE, the mean intensity of the
enhanced image is preserved.
A comparative study of five traditional image enhancement algorithms
(Histogram Equalization, Local Histogram Equalization, Contrast Limited Adaptive
Histogram Equalization, Gamma Correction, and Linear Contrast Stretch) was
presented in [31]. A study on the effect of image enhancement was carried out in
[32], using weighted thresholding enhancement techniques.
On understanding the interesting research attempts in making DE suitable for
solving discrete-valued parameter optimization problems, and the importance of
3 Differential Evolution
The classical DE has two phases: population initialization (the first phase),
followed by evolution (the second phase). Mutation and crossover are performed
during the evolution phase. Selection of the candidate happens thereafter; the
selected candidate replaces the current candidate in the population, thereby
generating the population for the next generation. This is iterated until the
termination criterion is met.
(a) Population Initialization – In this phase, the candidate set is generated in a
uniformly distributed fashion. The set of candidate solutions is
C^g = {C_k^g : k = 1, 2, 3, ..., n}, where g denotes the generation and n denotes
the size of the population. C_k^g denotes a d-dimensional vector,
C_k^g = (c_{1,k}^g, c_{2,k}^g, c_{3,k}^g, ..., c_{d,k}^g), and is generated using
a random uniform distribution, as mentioned in Eq. (1):

C_k^g = C_L + (C_U − C_L) ∗ rand(0, 1)   (1)

where C_L and C_U represent the lower bound and upper bound of the search space.
(b) Evolution – In this phase, the mutation operation, a crucial step in DE, is
performed. Three random vectors are selected to generate the mutant vector: a
weighted difference of two of them is added to the base vector. The mutant vector
v_k^g for every target vector C_k^g is generated using Eq. (2):

v_k^g = c_r1^g + F · (c_r2^g − c_r3^g)   (2)
Modified Discrete Differential Evolution with Neighborhood Approach for. . . 311
where r1, r2, and r3 are the indices of mutually distinct random vectors in the
population (r1 ≠ r2 ≠ r3), and F is the mutation factor, with a value in the
range [0, 1].
Once the mutant vector is generated, the crossover operation is performed between
the mutant vector and the parent. The crossover operation between the mutant
vector V_k^g = (v_{1,k}^g, v_{2,k}^g, v_{3,k}^g, ..., v_{D,k}^g) and the target
vector C_k^g = (c_{1,k}^g, c_{2,k}^g, c_{3,k}^g, ..., c_{D,k}^g), with a crossover
probability Cr ∈ [0, 1], generates a trial vector
U_k^g = (u_{1,k}^g, u_{2,k}^g, u_{3,k}^g, ..., u_{D,k}^g). Each trial component is
produced as u_{i,k}^g = v_{i,k}^g if rand_k ≤ Cr, and c_{i,k}^g otherwise, where
i ∈ {1, 2, 3, ..., D}. Selection is performed after crossover, and the individual
with the better fitness value (the trial vector or the target vector) moves to the
next generation. These operations (mutation, crossover, and selection) repeat
until the termination criterion is met.
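The three operations above can be sketched in Python. This is a minimal illustration of classical DE (the DE/rand/1/bin variant) for a continuous minimization problem, not the BNDE variant proposed in the chapter; the parameter values and the sphere test function are illustrative assumptions.

```python
import numpy as np

def de_rand_1_bin(fitness, d, n=50, F=0.8, Cr=0.9, CL=-5.0, CU=5.0,
                  generations=200, seed=0):
    """Minimal sketch of classical DE (DE/rand/1/bin) for continuous
    minimization; parameter values here are illustrative defaults."""
    rng = np.random.default_rng(seed)
    # Population initialization, Eq. (1): C = CL + (CU - CL) * rand(0, 1)
    pop = CL + (CU - CL) * rng.random((n, d))
    fit = np.array([fitness(c) for c in pop])
    for _ in range(generations):
        for k in range(n):
            # Three mutually distinct random indices, all different from k
            r1, r2, r3 = rng.choice([i for i in range(n) if i != k],
                                    size=3, replace=False)
            # Mutation, Eq. (2): v = c_r1 + F * (c_r2 - c_r3)
            v = pop[r1] + F * (pop[r2] - pop[r3])
            # Binomial crossover with probability Cr; one forced position
            # guarantees the trial differs from the target
            mask = rng.random(d) <= Cr
            mask[rng.integers(d)] = True
            trial = np.where(mask, v, pop[k])
            # Selection: the better of trial and target survives
            f_trial = fitness(trial)
            if f_trial <= fit[k]:
                pop[k], fit[k] = trial, f_trial
    best = int(np.argmin(fit))
    return pop[best], fit[best]

# Example: minimize the 5-dimensional sphere function
best_x, best_f = de_rand_1_bin(lambda x: float(np.sum(x ** 2)), d=5)
```

With these settings the sketch converges close to the global optimum of the sphere function; real applications replace the fitness with the problem-specific objective.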
4 Proposed Approach
v_k^g = c_BEST^g + F · (c_r4^g − c_r3^g) + F · (c_r1^g − c_r2^g)   (3)
application. The design of experiments, results, and discussion for the Phase I and
Phase II experiments are presented next.
The parameter settings of DE were determined by trial and error, and the crucial
parameters were set as follows. The population size n was set to 100. The mutation
scale factor (F) was taken from the range [0.6, 1.5], following [1, 34]: F > 1
solves many problems, while F < 1 shows good convergence. The optimal value of F
lies between F_min = 0.6 and F_max = 1.5, and it is calculated using Eq. (4) below
(as given in [1]):

F = ((F_min − F_max) / MaxFit) × cfe + F_max   (4)

where cfe is the number of times the objective function has been evaluated, and
MaxFit is the maximum number of fitness evaluations. The crossover rate (Cr) was
set to 0.9, the number of generations (g) to 2000, and the number of runs (r) to 30.
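Eq. (4) decreases F linearly from F_max at the first function evaluation to F_min at the last one. A short sketch (the function name and the MaxFit value are ours):

```python
def scale_factor(cfe, max_fit, f_min=0.6, f_max=1.5):
    """Eq. (4): F = ((F_min - F_max) / MaxFit) * cfe + F_max."""
    return (f_min - f_max) / max_fit * cfe + f_max

F_start = scale_factor(0, 200000)        # F begins at F_max = 1.5
F_end = scale_factor(200000, 200000)     # and decreases linearly to F_min = 0.6
```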
All the DE mapping approaches were implemented on a computer system with
8 GB RAM and an i7 processor running the Windows 7 operating system, using the
Python 3.6 programming language. The performance analysis of BNDE was carried out
with the travelling salesman problem (TSP) as the benchmark problem.
TSP is an NP-hard problem. The solution of a TSP instance is the shortest path for
the salesman to visit all the cities (nodes) in the city map, with the constraint
of visiting each city only once. Six instances of TSP from the TSPLIB dataset were
considered for the experiment. Each candidate in the population is a possible path
for the salesman, and the Euclidean distance was used to evaluate the fitness of
the path. The objective function defined to measure the distance (D) is given in
Eq. (5):

D = D_{1,k} + Σ_{j=1}^{k−1} D_{j,j+1}   (5)

where D_{j,j+1} denotes the distance between node j and its neighboring node j + 1
in the path, D_{1,k} closes the tour, and k denotes the total number of nodes in
the city map.
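The tour-length objective of Eq. (5) can be sketched as follows; the coordinate list and city indices below are an illustrative toy map, not TSPLIB data.

```python
import math

def tour_length(path, coords):
    """Total Euclidean length of a closed tour (Eq. (5)): the distance
    between consecutive cities plus the edge back to the start."""
    k = len(path)
    total = 0.0
    for j in range(k):
        x1, y1 = coords[path[j]]
        x2, y2 = coords[path[(j + 1) % k]]  # wraps around to close the tour
        total += math.hypot(x2 - x1, y2 - y1)
    return total

# Toy map: four cities on a unit square; the optimal tour has length 4
coords = [(0, 0), (0, 1), (1, 1), (1, 0)]
perimeter = tour_length([0, 1, 2, 3], coords)
```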
The performance metrics used for the comparative study were the best of the
optimal values (Aov) and the error rate (Er). Aov is the best of the optimal
values attained by a variant over all its runs. The error rate is calculated
using Eq. (6):

Er = ((Aov − AS) / AS) × 100   (6)

where Aov is the obtained solution, and AS is the actual (known optimal) solution.
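Eq. (6) is not reproduced legibly in the source; assuming it is the usual percentage deviation of the obtained value from the known optimum (which reproduces the Table 4 entries), the error rate can be computed as:

```python
def error_rate(a_ov, a_s):
    """Assumed form of Eq. (6): Er = (Aov - AS) / AS * 100."""
    return (a_ov - a_s) / a_s * 100

# att48 with the TP1 best value: (95,309 - 33,523) / 33,523 * 100,
# matching the first Er entry of Table 4 up to rounding
er_att48_tp1 = error_rate(95309, 33523)
```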
Table 1 Comparison of BNDE with existing mapping approaches, with best value obtained

Dataset | Optimal solution | TP1 best | TP2 best | TP3 best | RB best | LRV best | BMV best | BNDE best
att48 | 33,523 | 95,309 | 98,565 | 97,298 | 97,309 | 90,293 | 58,679 | 75,031
eil51 | 429 | 1112 | 1118 | 1106 | 1124 | 1078 | 766 | 863
berlin52 | 7544.37 | 20585.53 | 18,768 | 19,670 | 20,166 | 19,977 | 13,292 | 15,758
st70 | 675 | 2550 | 2506 | 2625 | 2559 | 2574 | 1597 | 1902
pr76 | 108,159 | 230,691 | 236,060 | 416,316 | 318,696 | 397,748 | 235,442 | 311,432
eil76 | 545.39 | 1825 | 1809 | 1860 | 1890 | 1865 | 1184 | 1394
Table 2 Comparison of BNDE with existing mapping approaches, with average value obtained

Dataset | Optimal solution | TP1 avg | TP2 avg | TP3 avg | RB avg | LRV avg | BMV avg | BNDE avg
att48 | 33,523 | 102364.83 | 104100.56 | 102361.56 | 101795.83 | 102043.30 | 70946.70 | 83238.40
eil51 | 429 | 1157.23 | 1159.233 | 1156 | 1174.20 | 1180.46 | 832.93 | 959.83
berlin52 | 7544.37 | 21,431 | 20308.90 | 21107.33 | 21080.80 | 20925.66 | 14814.80 | 17185.26
st70 | 675 | 2677.47 | 2654.07 | 2727.33 | 2711.43 | 2716.83 | 1876.03 | 2116.43
pr76 | 108,159 | 254327.63 | 252948.50 | 431907.76 | 344088.63 | 415970.40 | 269644.20 | 333313.46
eil76 | 545.39 | 1922.86 | 1920.07 | 1932.63 | 1937.03 | 1935.17 | 1364.03 | 1516.40
Table 3 Comparison of BNDE with existing mapping approaches, with worst value obtained

Dataset | Optimal solution | TP1 worst | TP2 worst | TP3 worst | RB worst | LRV worst | BMV worst | BNDE worst
att48 | 33,523 | 106,700 | 98,565 | 107,536 | 106,878 | 105,805 | 82,000 | 91,649
eil51 | 429 | 1216 | 1118 | 1206 | 1218 | 1233 | 898 | 1046
berlin52 | 7544.37 | 19,559 | 21,266 | 21,914 | 21,660 | 21,911 | 16,579 | 18,519
st70 | 675 | 2807 | 2779 | 2810 | 2798 | 2779 | 2155 | 2270
pr76 | 108,159 | 276,195 | 267,409 | 440,280 | 426,851 | 437,337 | 302,095 | 351,145
eil76 | 545.39 | 1998 | 1973 | 1985 | 1982 | 1994 | 1577 | 1613
Table 4 Comparison of error rate with existing mapping approaches, with best value obtained

Dataset | Optimal solution | TP1 Er | TP2 Er | TP3 Er | RB Er | LRV Er | BNDE Er | BMV Er
att48 | 33,523 | 184.30 | 194.02 | 190.24 | 190.27 | 169.34 | 123.81 | 75.04
eil51 | 429 | 159.20 | 160.60 | 157.80 | 162.00 | 151.28 | 101.16 | 78.55
berlin52 | 7544.37 | 172.85 | 148.76 | 160.72 | 167.29 | 164.79 | 108.87 | 76.18
st70 | 675 | 277.77 | 271.25 | 288.88 | 279.11 | 281.33 | 181.77 | 136.59
pr76 | 108,159 | 113.28 | 118.25 | 284.91 | 194.65 | 267.74 | 187.93 | 117.68
eil76 | 545.39 | 234.62 | 231.68 | 241.04 | 246.54 | 241.95 | 155.59 | 117.09
For the other two cases, comparing the approaches by average values and by worst
values, BNDE likewise failed to outperform BMV. Table 4 presents the error rate
(Er) calculated with the best values. The comparison shows that BNDE could
outperform TP1, TP2, TP3, RB, and LRV (except for pr76 with TP1 and TP2). Overall,
it is observed that BNDE could outperform the existing mapping approaches, except
BMV; however, the performance of BNDE was comparable with BMV. Further tuning of
BNDE to make it better than BMV is left as future work.
To validate these findings, statistical significance analysis was performed using
the two-tailed Wilcoxon Signed Ranks Test for paired samples. The optimal value
and error rate were measured for each independent run of the algorithm. The
parameters used for this test were the number of samples n, the test statistic
(T), the critical value of T (T Crit), the level of significance (α), the z-score,
the p-value, and the significance. The observations are summarized in Table 5. A
'+' indicates that the performance difference between BNDE and the counterpart
approach is statistically significant, and a '-' indicates that the difference is
not statistically significant. For the BNDE-TP1, BNDE-TP2, and BNDE-TP3 pairs, the
outperformance of BNDE was statistically significant for all the datasets except
att48. For the BNDE-RB and BNDE-LRV pairs, the outperformance of BNDE was
statistically significant for all the datasets. Interestingly, for the BNDE-BMV
pair, though BMV empirically outperformed BNDE, the performance differences were
not statistically significant.
From the statistical analysis summarized in Table 5, we can see that BNDE
significantly outperforms the existing algorithms (except BMV). For the few
exceptional cases of TP1, TP2, and TP3 on the att48 dataset, statistical
significance of the improved performance of BNDE was not shown.
In the second phase of the experiment, the BNDE mapping approach was applied to
find the optimal parameter combination of the transformation function for image
contrast enhancement. Since tuning the parameters for image contrast enhancement
is a combinatorial optimization problem, this mapping approach can explore and
exploit the space of parameter combinations.
Transformation Function The local enhancement method [38] was applied using the
contrast improvement given in Eq. (9):

h(x, y) = (k · M / (σ(x, y) + b)) · [f(x, y) − c · m(x, y)] + m(x, y)^a   (9)

where f(x, y) is the input intensity at pixel (x, y), m(x, y) and σ(x, y) are the
local mean and standard deviation of its neighborhood, M is the global mean of the
image, and a, b, c, and k are the transformation parameters to be optimized.
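A possible NumPy sketch of this transformation follows; the window size, border handling, and sample parameter values are assumptions of this illustration, not taken from the chapter.

```python
import numpy as np

def local_enhance(f, a=1.0, b=1.0, c=0.8, k=0.5, win=3):
    """Sketch of the Eq. (9) transform:
        h = k*M / (sigma + b) * (f - c*m) + m**a
    where m and sigma are the mean and std over a win x win neighborhood
    and M is the global mean of the image."""
    f = f.astype(float)
    M = f.mean()
    h = np.empty_like(f)
    r = win // 2
    rows, cols = f.shape
    for x in range(rows):
        for y in range(cols):
            # Neighborhood clipped at the image border
            patch = f[max(0, x - r):x + r + 1, max(0, y - r):y + r + 1]
            m, sigma = patch.mean(), patch.std()
            h[x, y] = k * M / (sigma + b) * (f[x, y] - c * m) + m ** a
    return np.clip(h, 0, 255)  # keep the result in the 8-bit intensity range

toy = np.arange(64, dtype=float).reshape(8, 8) * 4  # toy gradient "image"
enhanced = local_enhance(toy)
```

A real implementation would vectorize the local statistics (e.g. with box filters) instead of looping per pixel.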
H(Ti) = −Σ_{i=0}^{255} p_i · log2(p_i)   (11)

where p_i denotes the probability of the ith intensity value of the image, with
intensities in the range [0, 255].

MSE(Ti) = (1 / (h · w)) · Σ_{i=0}^{h−1} Σ_{j=0}^{w−1} [Ti(i, j) − T0(i, j)]^2   (13)
where μ Ti denotes the mean value of the transformed image pixel value, μ T0
denotes the mean value of the pixel values for original image, σ Ti is the transformed
image variance, σ T0 is the variance of original image, and h denotes height and w
denotes width of the image.
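The two fitness components in Eqs. (11) and (13) can be sketched as:

```python
import numpy as np

def entropy(img):
    """Eq. (11): H = -sum_i p_i * log2(p_i) over the 256 gray levels
    (terms with p_i = 0 contribute nothing)."""
    hist = np.bincount(img.astype(np.uint8).ravel(), minlength=256)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mse(t_i, t_0):
    """Eq. (13): mean squared error between transformed and original image."""
    return float(((t_i.astype(float) - t_0.astype(float)) ** 2).mean())

flat = np.zeros((4, 4), dtype=np.uint8)                    # one gray level: H = 0
checker = np.array([[0, 255], [255, 0]], dtype=np.uint8)   # two equiprobable levels: H = 1
```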
enhancement was applied, and the fitness of each transformed image was evaluated.
The DE/rand/1/bin variant was used: three random vectors from the population were
selected for the rand/1 mutation. The genes of the mutant vector were replaced
with the best gene after comparison with the best, average, and worst candidates.
Since the proposed mapping approach was applied to an image processing
application, the BNDE approach was modified accordingly: replacement of the best
gene was carried out for the selected gene with probability greater than 0.25.
The best combination of the parameters a, b, c, and k was selected based on the
finest fitness value, and the image transformed with these parameters was
considered for the evaluation. The original image was compared with the enhanced
image, and their histograms were analyzed. The results are shown in Table 6
(Tables 6.a and 6.b). The first column from the left shows the original image and
its edges detected using the Canny edge detector; the second column shows the
histogram of the original image; the third column shows the enhanced image and its
detected edges; and the fourth column shows the histogram of the enhanced image.
From the analysis of the edges detected in the enhanced image, it is observed that
more edge pixels are detected. By comparing the histograms of the original and
enhanced images, it is observed that the contrast of the enhanced images is improved.
A comparison of the BNDE approach with existing algorithms was also performed. Two
existing algorithms, Histogram Equalization and CLAHE, were considered for the
analysis. Table 7 (Tables 7.a, 7.b and 7.c) shows the results obtained. Though the
BNDE approach could outperform Histogram Equalization, the image enhanced by the
CLAHE algorithm allowed more edges to be detected. Histogram Equalization produced
noisy enhanced images.
The performance against state-of-the-art algorithms was assessed using the metrics
Peak Signal-to-Noise Ratio (PSNR), Mean Squared Error (MSE), entropy, and
Structural Similarity Index (SSIM). The comparison of BNDE with Histogram
Equalization and CLAHE is summarized in Table 8. From the analysis of the metric
values, the PSNR obtained by BNDE is better than that of the other algorithms.
Since MSE is inversely related to PSNR, the MSE obtained by BNDE is lower than
that of Histogram Equalization and CLAHE. Entropy, a measure of randomness in the
image, is another metric used for assessing image quality: BNDE obtained a lower
entropy than the state-of-the-art algorithms, and this reduced entropy value shows
that the enhanced image is more homogeneous than the input image. SSIM measures
the similarity of the enhanced image to the input image, and the SSIM obtained by
BNDE outperforms the other methods: the BNDE approach enhances the image while
still ensuring similarity to the input image.
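PSNR is derived directly from the MSE of Eq. (13); a small sketch, assuming 8-bit images with peak value 255:

```python
import numpy as np

def psnr(t_i, t_0, peak=255.0):
    """PSNR in dB: PSNR = 10 * log10(peak^2 / MSE); identical images
    have zero error and, by convention here, infinite PSNR."""
    diff = np.asarray(t_i, dtype=float) - np.asarray(t_0, dtype=float)
    err = (diff ** 2).mean()
    return float('inf') if err == 0 else float(10 * np.log10(peak ** 2 / err))

base = np.zeros((4, 4))
value = psnr(base + 5, base)  # MSE = 25 -> 10*log10(255**2 / 25) ~ 34.15 dB
```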
Table 6.a Comparison of original image and the enhanced images (1–6)
Original image Original image Histogram Enhanced image (BNDE approach) BNDE approach Histogram
Table 6.b Comparison of original image and the enhanced images (7–12)
Original image Original image Histogram Enhanced image (BNDE approach) BNDE approach Histogram
7 Conclusions
This chapter presented a study in two phases. Phase I of the study proposed an
improved Differential Evolution algorithm, named BNDE, which augments DE with a
proposed mapping approach to improve its exploration and exploitation behavior.
The proposed mapping technique makes DE suitable for solving discrete optimization
problems. The performance of the proposed algorithm was evaluated on benchmark TSP
instances and compared with six similar state-of-the-art algorithms. The empirical
studies revealed that the proposed algorithm works better than all the approaches
except one (the BMV approach, for which the performance difference was not
statistically significant); though BNDE could not outperform the BMV approach,
their performance was comparable. The statistical studies highlighted that, in the
cases where BNDE outperformed, the difference was statistically significant. In
Phase II, the proposed algorithm was tested on an image enhancement application
and compared with two classical image enhancement algorithms. The results obtained
for the performance metrics used in the experiment reiterated the quality of the
proposed algorithm: it outperformed the classical algorithms on the PSNR, MSE, and
SSIM values.
This approach was validated on grayscale images. For validation on RGB images, the
mapping approach can be applied separately on the red, green, and blue channels.
Table 8 Comparison of BNDE with existing algorithms (each cell shows PSNR / Entropy / MSE / SSIM)

Dataset | HISTO | CLAHE | BNDE
7_19_M16 | 11.21 / 7.91 / 105.47 / 0.1547 | 28.06 / 5.45 / 31.37 / 0.8711 | 69.74 / 4.311 / 0.00022135 / 0.9999985
7_19_2ME2 | 11.343 / 7.90 / 105.62 / 0.1549 | 26.07 / 5.72 / 38.72 / 0.8700 | 66.10 / 4.40 / 0.00011719 / 0.99999817
7_19_2ME4 | 11.28 / 7.90 / 107.06 / 0.1471 | 26.92 / 5.51 / 31.96 / 0.8738 | 65.27 / 4.241 / 0.00013346 / 0.99999771
7_19_2ME5 | 11.29 / 7.89 / 103.63 / 0.1399 | 27.01 / 5.49 / 32.86 / 0.8728 | 64.115 / 4.20 / 0.00020833 / 0.99999754
7_19_2ME6 | 11.29 / 7.93 / 102.05 / 0.1775 | 27.43 / 5.50 / 31.97 / 0.8803 | 64.11 / 4.44 / 0.00020833 / 0.99999754
7_19_2ME8 | 11.39 / 7.91 / 104.19 / 0.1848 | 27.30 / 5.53 / 30.465 / 0.8808 | 62.17 / 4.49 / 0.00022135 / 0.99999736
7_19_2ME10 | 11.185 / 7.913 / 103.002 / 0.17365 | 27.98 / 5.35 / 27.11 / 0.8859 | 62.17 / 4.37 / 0.00022135 / 0.99999734
7_19_2ME13 | 11.294 / 7.92 / 102.04 / 0.17923 | 27.55 / 5.55 / 33.14 / 0.8794 | 68.85 / 4.41 / 0.00013346 / 0.99999845
7_19_2ME15 | 11.576 / 9.97 / 104.567 / 0.2567 | 28.44 / 5.80 / 33.04 / 0.8356 | 63.82 / 4.93 / 0.00028971 / 0.99999757
7_19_1ME1 | 11.23 / 7.95 / 105.30 / 0.1299 | 27.44 / 5.66 / 38.56 / 0.8176 | 63.45 / 4.43 / 0.00021159 / 0.99999744
7_19_1ME9 | 11.292 / 7.94 / 105.650 / 0.129 | 27.45 / 5.67 / 37.43 / 0.8223 | 62.66 / 4.447 / 0.00020833 / 0.99999748
7_19_1ME10 | 11.269 / 7.94 / 105.25 / 0.1255 | 27.50 / 5.66 / 37.96 / 0.8210 | 64.41 / 4.38 / 0.00018555 / 0.99999768
The future scope of this work is to extend the mapping approach to color images.
This study will also be extended to hybridize other optimization techniques with
DE, to make BNDE stand out among all the state-of-the-art mapping techniques.
References
1. Ali, I. M., Essam, D., & Kasmarik, K. (2019). A novel differential evolution mapping technique
for generic combinatorial optimization problems. Applied Soft Computing, 80, 297–309.
2. Santucci, V., Baioletti, M., Di Bari, G., & Milani, A. (2019). A binary algebraic differential
evolution for the multidimensional two-way number partitioning problem. European Confer-
ence on Evolutionary Computation in Combinatorial Optimization, 11451, 17–22.
3. Ming, Z., Zhao Linglin, S., Xiaohong, M. P., & Yanhang, Z. (2017). Improved discrete mapping
differential evolution for multi-unmanned aerial vehicles cooperative multi-targets assignment
under unified model. International Journal of Machine Learning and Cybernetics, 8(3), 765–
780.
4. Goudos, S. (2017). Antenna design using binary differential evolution: Application to discrete-
valued design problems. IEEE antennas and propagation magazine, 59(1), 74–93.
5. Cuevas, E., Zaldivar, D., Perez Cisneros, M. A., & Ramirez-Ortegon, M. A. (2011). Circle
detection using discrete differential evolution optimization. Pattern Analysis and Applications,
14(1), 93–107.
6. Davendra, D., & Onwubolu, G. (2009). Forward backward transformation. In Differential
evolution: A handbook for global permutation-based combinatorial optimization (pp. 35–80).
Springer.
7. Wang, L., Pan, Q.-K., Suganthan, P. N., & Wang, W. (2010). A novel hybrid discrete differential
evolution algorithm for blocking flow shop scheduling problems. Computers & Operations
Research, 37(3), 509–520.
8. Viale Jacopo, B., ThiemoKrink, S. M., & Paterlini, S. (2009). Differential evolution and
combinatorial search for constrained index-tracking. Annals of Operations Research, 172(1),
39–59.
9. Wagdy, A. (2016). A new modified binary differential evolution algorithm and its applications.
Applied Mathematics & Information Sciences, 10(5), 1965–1969.
10. Sauer, J. G., & Coelho, L. (2008). Discrete differential evolution with local search to solve
the traveling salesman problem: Fundamentals and case studies. In Proceedings of 7th IEEE
international conference on conference: cybernetic intelligent systems.
11. Uher, V., Gajdo, P., Radecky, M., & Snasel, V. (2016). Utilization of the discrete differential
evolution for optimization in multidimensional point clouds. Computational Intelligence and
Neuroscience, 13(1–14).
12. Lingjuan, H. O. U., & Zhijiang, H. O. U. (2013). A novel discrete differential evolution
algorithm. Indonesian Journal of Electrical Engineering, 11(4).
13. Rubini, N., Prashanthi, C. V., Subanidha, S., & Jeyakumar, G. (2017). An optimization
framework for solving RFID reader placement problem using differential evolution algorithm.
In Proceedings of ICCSP-2017 – International conference on communication and signal
proceedings.
14. Abraham, K. T., Ashwin, M., Sundar, D., Ashoor, T., & Jeyakumar, G. (2017). Empirical
comparison of different key frame extraction approaches with differential evolution based
algorithms. In Proceedings of ISTA-2017 – 3rd International Symposium on
Intelligent System Technologies and Applications.
15. Shinde, S. S., Devika, K., Thangavelu, S., & Jeyakumar, G. Multi-objective evolutionary
algorithm based approach for solving RFID reader placement problem using weight-vector
approach with opposition-based learning method. International Journal of Recent Technology
and Engineering (IJRTE), 7(5), 177–184.
16. Lu, X., Wang, Y., & Yuan, Y. (2013). Graph-regularized low-rank representation for destriping
of hyperspectral images. IEEE Transactions on Geoscience and Remote Sensing, 51(7), 4009–
4018.
17. Lu, X., & Li, X. (2014). Multiresolution imaging. IEEE Transactions on Cybernetics, 44(1),
149–160.
18. Sujee, R., & Padmavathi, S. (2017). Image enhancement through pyramid histogram matching.
International Conference on Computer Communication and Informatics (ICCCI), 2017, 1–5.
https://doi.org/10.1109/ICCCI.2017.8117748
19. Zhu, H., Chan, F. H., & Lam, F. K. (1999). Image contrast enhancement by constrained
local histogram equalization. Computer Vision and Image Understanding, 73, 281–290. https:/
/doi.org/10.1006/cviu.1998.0723
20. Chithirala, N., et al. (2016). Weighted mean filter for removal of high density salt and pepper
noise. In 2016 3rd international conference on advanced computing and communication
systems (ICACCS) (Vol. 1). IEEE.
21. Radhakrishnan, A., & Jeyakumar, G. (2021). Evolutionary algorithm for solving combinatorial
optimization—A review. In H. S. Saini, R. Sayal, A. Govardhan, & R. Buyya (Eds.),
Innovations in computer science and engineering (Lecture notes in networks and systems)
(Vol. 171). Springer.
22. Gorai, A., & Ghosh, A. (2009). Gray-level image enhancement by particle swarm optimization.
Proc IEEE World Cong Nature Biol Inspired Comput, 72–77.
23. Munteanu, C., & Rosa, A. Gray-scale image enhancement as an automatic process driven by
evolution. IEEE Transactions on Systems, Man, and Cybernetics: Systems.
24. Pal, S. K., Bhandari, D., & Kundu, M. K. (1994). Genetic algorithms for optimal image
enhancement. Pattern Recognition Letters, 15(3), 261–271.
25. Saitoh, F. (1999). Image contrast enhancement using genetic algorithm. Proceedings of IEEE
International Conference on Systems, Man and Cybernetics, 4, 899–904.
26. Braik, M., Sheta, A., & Ayesh, A. (2007). Particle swarm optimisation enhancement approach
for improving image quality. International Journal of Innovative Computing and Applications,
1(2), 138–145.
27. dos Santos Coelho, L., Sauer, J. G., & Rudek, M. (2009). Differential evolution optimization
combined with chaotic sequences for image contrast enhancement. Chaos, Solitons & Fractals,
42(1), 522–529.
28. Shanmugavadivu, P., & Balasubramanian, K. (2014). Particle swarm optimized multi-objective
histogram equalization for image enhancement. Optics & Laser Technology, 57, 243–251.
29. Mahapatra, P. K., Ganguli, S., & Kumar, A. (2015). A hybrid particle swarm optimization
and artificial immune system algorithm for image enhancement. Soft Computing, 19(8), 2101–
2109.
30. Suresh, S., & Lal, S. (2017). Modified differential evolution algorithm for contrast and
brightness enhancement of satellite images. Applied Soft Computing, 61, 622–641.
31. Harichandana, M., Sowmya, V., Sajithvariyar, V. V., & Sivanpillai, R. (2020). Comparison of
image enhancement techniques for rapid processing of post flood images. The International
Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Xliv-M-2-
2020, 45–50.
32. Sony, O., Palanisamy, T., & Paramanathan, P. (2021). A study on the effect of thresholding
enhancement for the classification of texture images. Journal of The Institution of Engineers
(India): Series B, 103, 29. https://doi.org/10.1007/s40031-021-00610-9
33. Storn, R., & Price, K. (1997). Differential evolution–a simple and efficient heuristic for global
optimization over continuous spaces. Journal of Global Optimization, 11(4), 341–359.
34. Rönkkönen, J., Kukkonen, S., & Price, K. V. (2005). Real-parameter optimization with
differential evolution. Congress on Evolutionary Computation, 506–513.
35. Li, H., & Zhang, L. (2014). A discrete hybrid differential evolution algorithm for solving
integer programming problems. Engineering Optimization, 46(9), 1238–1268.
36. Liu, B., Wang, L., & Jin, Y.-H. (2007). An effective pso-based memetic algorithm for flow
shop scheduling. IEEE Transactions on Systems Man and Cybernetics Part B, 37(1), 18–27.
37. Li, X., & Yin, M. (2013). A hybrid cuckoo search via lévy flights for the permutation flow shop
scheduling problem. International Journal of Production Research, 51(16), 4732–4754.
38. Keerthanaa, K., & Radhakrishnan, A. (2020). Performance enhancement of adaptive image
contrast approach by using artificial bee colony algorithm. 2020 Fourth International Confer-
ence on Computing Methodologies and Communication (ICCMC), 255–260.
Swarm-Based Methods Applied to
Computer Vision
María-Luisa Pérez-Delgado
Abbreviations
AA Artificial ants
ABC Artificial bee colony
ALO Ant lion optimizer
BA Bat algorithm
BFO Bacterial foraging optimization
CRS Crow search
CSO Cat swarm optimization
CT Computed tomography
CUS Cuckoo search
FA Firefly algorithm
FPA Flower pollination algorithm
FSA Fish swarm algorithm
GWO Gray wolf optimization
MR Magnetic resonance
PSO Particle swarm optimization
RGB Red, Green, Blue
WO Whale optimization
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 331
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in
Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_16
332 M.-L. Pérez-Delgado
1 Introduction
Nowadays, computer vision has become a very important element in many sectors,
such as the development of autonomous vehicles, the surveillance and supervision
systems, the manufacturing industry, or the health care sector [1]. It involves
the application of different image processing operations to analyze the data and
extract relevant information. The dimensionality of the data makes many of these
operations have a high computational cost. This requires applying methods with
reasonable execution time to generate solutions. Among such methods, swarm-
based algorithms have been successfully applied in various image processing
operations.
This chapter shows the application of this type of solution to various image
processing tasks related to computer vision. The objective is not to include an
exhaustive list of articles, since the length of the chapter does not allow it. Rather,
the chapter focuses on the most recent and interesting proposals where swarm-based
solutions have been successfully applied.
4: for t = 1 to T_MAX do
5:    Compute v_i(t), x_i(t) and b_i(t), for i = 1, . . . , P, according to Eqs. (2), (3) and (4), respectively

b_i(t) = x_i(t) if fitness(x_i(t)) > fitness(b_i(t − 1)), and b_i(t − 1) otherwise   (4)

Fig. 1 PSO determines the new position of particle i (x_i(t)), taking into account
its previous position (x_i(t − 1)), the best position found by the particle
(b_i(t − 1)), the best position found by the swarm (g(t − 1)), and the current
velocity of the particle (v_i(t))
Computer vision systems must handle noisy, complex, and dynamic images. For such a
system to be useful, it must interpret the image data accurately and quickly. Many
operations related to computer vision can be formulated as optimization problems
(segmentation, classification, tracking, etc.). These problems are difficult to
solve for several reasons: the high dimensionality of the data, the large volume
of data to be processed, the noise in the data, and the size and characteristics
of the solution space. The resulting problems are often high-dimensional
optimization problems with complex search spaces that can include complex
constraints; finding their optimal solution is very difficult, and the operations
require a lot of execution time. These characteristics make classical optimization
techniques unsuitable for their resolution. For this reason, various optimization
techniques have been proposed in recent years to avoid the problems of classical
techniques, and they have been applied to solve optimization problems for which
classical techniques do not work satisfactorily.
Swarm-based methods have been successfully applied to solve several computer
vision tasks, providing a good solution since they avoid getting stuck in local optima.
Swarm-based algorithms were developed to solve optimization problems, and they
have been successfully applied to many problems in different areas [2, 3],
including complex non-linear, high-dimensional, and multimodal optimization
problems. Furthermore, these methods require little a priori knowledge of the
problem and have low computational cost.
The characteristics of swarm-based methods give them several advantages over
classical optimization algorithms:
• Individuals are very simple, which facilitates their implementation.
• Individuals do their work independently, so swarm-based algorithms are highly
parallelizable.
• The system is flexible, as the swarm can respond to internal disturbances
and changes in the environment. In addition, the swarm can adapt to both
predetermined stimuli and new stimuli.
• The system is scalable because it can include anywhere from a few individuals
to a very large number of them.
• The control of the swarm is distributed among its members, and this allows
the swarm to give a rapid local response to a change. This operation is quick
because it is not necessary to communicate with a central control or with all the
individuals in the swarm.
• The system is robust. Since there is no central control in the swarm, the system
can obtain a solution even if several individuals fail.
• Individuals interact locally with other individuals and also with the environment.
This behavior is useful for problems where there is no global knowledge of the
environment. In this case, the individuals exchange locally available information,
and this allows obtaining a good global solution to the problem. In addition, the
system can adapt to changes in its environment.
• In order to apply some classical methods, it is necessary to make assumptions
about the characteristics of the problem to be solved or the data to be processed.
In contrast, swarm-based methods make few such assumptions and can be applied
to a wide range of problems.
• These methods can explore a larger region of the search space and avoid
premature convergence. Because the swarm evaluates several feasible solutions
in parallel, the system is less likely to become trapped in local minima: even
if some individuals fall into a local optimum, others may still find a promising
solution.
• These algorithms include few parameters, and these do not need to be finely
tuned for the algorithm to work.
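As a concrete illustration of the swarm loop that these properties describe, the following minimal particle swarm optimization (PSO) sketch was written for this chapter; the parameter values are illustrative and not taken from any cited work. Each particle keeps its personal best, the swarm shares a global best, and a simple velocity update combines inertia with both attractors.

```python
import random

def pso(fitness, dim, n_particles=20, iters=100, bounds=(-5.0, 5.0),
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimization (minimization)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Random initial positions; zero initial velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                   # personal best positions
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + cognitive + social terms.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Usage: minimize the sphere function; the swarm should approach the origin.
best, best_val = pso(lambda x: sum(v * v for v in x), dim=3)
```

The same loop structure underlies most of the swarm variants cited in this chapter; only the encodings and update rules change.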
336 M.-L. Pérez-Delgado
Many of the image processing operations discussed below are closely related and are
often applied sequentially to an image. However, the operations have been separated
into several sections, each citing swarm-based solutions that focus on the specific
operation.
Feature extraction is a preliminary task for other image processing operations,
since it reduces the dimensionality of the data those operations must handle. This
operation obtains the most relevant information from the image and represents it
in a lower dimensional space. The set of features obtained by this operation can be
used as input information to apply other processing to the image.
Once a feature set has been extracted from an image, feature selection chooses
a subset from the entire set of candidate features. This is a complex task, and
swarm-based solutions have been proposed to reduce its
computational cost. In general, the relevant feature subset depends on the image
processing that will later be applied to those features. For this reason, feature
selection is usually a preliminary step for a more general operation that determines
which features should be selected (Fig. 2). For example, this occurs when
selecting features for image classification. Several swarm-based methods have been
used for feature selection to classify images, such as AA [22], PSO [23, 24], or
ABC [25].
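The feature selection task can be sketched with a binary PSO, where each particle is a bit mask over the candidate features. The sigmoid transfer rule and the toy fitness below are illustrative assumptions made for this sketch; the cited methods score real classifiers on the selected subsets.

```python
import math
import random

def binary_pso_feature_selection(fitness, n_features, n_particles=15,
                                 iters=60, seed=1):
    """Binary PSO (maximization): each particle is a bit mask over the
    candidate features; a sigmoid of the velocity gives the probability
    of setting each bit to 1."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                # Sigmoid transfer: velocity -> probability that bit = 1.
                pos[i][d] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-vel[i][d])) else 0
            val = fitness(pos[i])
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val > gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Hypothetical fitness: features 0-4 are "informative"; the rest are noise
# whose selection is penalized. A real system would evaluate a classifier.
def toy_fitness(mask):
    return sum(mask[:5]) - 0.3 * sum(mask[5:])

mask, score = binary_pso_feature_selection(toy_fitness, n_features=12)
```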
Feature selection is an important aspect of hyperspectral image processing, as
it identifies the relevant bands of the image in order to reduce the dimensionality.
PSO was used in [26] to select features before applying a convolutional
neural network to classify hyperspectral images. PSO was also applied in [27], but
combining two swarms: one of them estimates the optimal number of bands and the
other selects the bands. The proposal of [28] combines PSO with genetic algorithms
for feature selection. PSO operations are applied to update the particles, and then
a new population is generated by applying the operators of the genetic algorithm.
The method automatically determines the number of features to select. Other
researchers have applied various swarm algorithms to address the same problem,
including GWO [29], ALO [30], CUS [31], ABC [32], or FA [33].

Fig. 2 Feature extraction and feature selection are two initial steps for other image processing
operations
Feature selection is also important for image steganalysis, which is the process of
detecting hidden messages in an image. It has been performed by ABC [34], PSO
[35, 36], or GWO [37]. In addition, other articles that apply swarms to this problem
are described in [38].
Detecting an object or region of interest within an image is highly dependent on
the image features being analyzed. The objective of feature detection is to identify
features, such as edges, shapes, or specific points. Swarm-based solutions can
reduce the time required to perform this operation.
Several articles describe the use of artificial ants for edge detection. The proposal
presented in [39] uses the algorithm called ant system, while the methods described
in [40] and [41] use the ant colony system algorithm. In all these cases, ants are
used to obtain the edge information. On the other hand, the method described in
[42] applies artificial ants as a second operation to improve the edge information
obtained by other conventional edge detection algorithms (the Sobel and Canny
edge detection approaches). Other proposals for the application of artificial ants for
edge detection are described in [43] and [44]. Other swarm-based methods that
have been applied for edge detection are PSO [45] and ABC [46].
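The general idea behind these ant-based edge detectors can be sketched as follows: ants walk over the pixel grid, preferring moves toward pixels with high local intensity variation, and the pheromone they deposit marks likely edges. The neighbourhood heuristic and the parameter values below are illustrative assumptions, not the exact schemes of the cited papers.

```python
import random

def aco_edges(img, n_ants=50, iters=20, evap=0.1, seed=2):
    """Ants roam the pixel grid; moves are weighted by pheromone times a
    local-variation heuristic, and deposits accumulate on likely edges."""
    h, w = len(img), len(img[0])
    rng = random.Random(seed)

    def variation(r, c):
        # Heuristic: max absolute intensity difference to the 4-neighbourhood.
        v = 0
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w:
                v = max(v, abs(img[r][c] - img[rr][cc]))
        return v

    eta = [[variation(r, c) for c in range(w)] for r in range(h)]
    tau = [[0.1] * w for _ in range(h)]  # initial pheromone
    ants = [(rng.randrange(h), rng.randrange(w)) for _ in range(n_ants)]
    for _ in range(iters):
        moved = []
        for (r, c) in ants:
            # Candidate moves in the 8-neighbourhood, weighted by
            # pheromone * heuristic (small offset keeps weights positive).
            cand = [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr or dc) and 0 <= r + dr < h and 0 <= c + dc < w]
            wts = [tau[rr][cc] * (eta[rr][cc] + 0.01) for rr, cc in cand]
            x, acc = rng.random() * sum(wts), 0.0
            for (rr, cc), wgt in zip(cand, wts):
                acc += wgt
                if acc >= x:
                    r, c = rr, cc
                    break
            tau[r][c] += eta[r][c]  # deposit proportional to variation
            moved.append((r, c))
        ants = moved
        # Evaporation keeps pheromone from saturating.
        tau = [[(1 - evap) * t for t in row] for row in tau]
    return tau

# Usage: a tiny image split into a dark and a bright half; pheromone should
# accumulate along the vertical boundary between the two halves.
img = [[0] * 4 + [9] * 4 for _ in range(8)]
tau = aco_edges(img)
```

Thresholding the final pheromone matrix would yield the binary edge map.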
There are also articles that describe the application of swarms for shape detection.
PSO and genetic algorithms were combined in [47] to define a method that detects
circles. ABC was applied in [48] to detect circular shapes, while BFO was applied
in [49].
Fig. 3 Example of segmentation process applied to extract the objects from the background
Fig. 4 Example of a classification system that can distinguish ripe and unripe tomatoes from an
image
PSO was applied in [89, 90], and [91] to define the optimal architecture of a convolutional neural
network applied to classify images. A system to classify fruit images was proposed
in [92] that applies a variant of ABC to train the neural network that performs
the classification. The solution proposed in [93] to classify remote-sensing images
uses a Naïve Bayes classifier and applies CUS to define the classifier weights. The
system for identifying and classifying plant leaf diseases described in [94] uses
BFO to define the weights of a radial basis function neural network.
Swarm algorithms have also been applied to define classification methods that
allow classifying parts of an image.
Omran et al. described two applications of PSO for this type of image classification,
where each particle represents the means (centroids) of all the clusters. In the first case,
the fitness function tries to minimize the intra-cluster distance and to maximize the
inter-cluster distance [95]. In the second case, the function includes a third element
to minimize the quantization error [96].
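A hedged sketch of this kind of clustering fitness is shown below; it only illustrates the compactness/separation trade-off, and the exact terms and weights in [95, 96] differ (for instance, the quantization-error term of the second formulation is omitted here). The pixel values and centroid placements are illustrative.

```python
def clustering_fitness(centroids, pixels, w1=0.5, w2=0.5):
    """Reward compact clusters (small max average intra-cluster distance)
    and well-separated centroids (large min inter-centroid distance).
    Lower values are better, so a PSO particle encoding the centroids
    would minimize this function."""
    def dist(a, b):
        return abs(a - b)  # grey-level pixels: 1-D intensity distance

    # Assign each pixel to its nearest centroid.
    clusters = {i: [] for i in range(len(centroids))}
    for p in pixels:
        i = min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
        clusters[i].append(p)
    # Compactness term: worst average intra-cluster distance (to minimize).
    d_max = max(sum(dist(p, centroids[i]) for p in pts) / len(pts)
                for i, pts in clusters.items() if pts)
    # Separation term: smallest inter-centroid distance (to maximize).
    d_min = min(dist(centroids[i], centroids[j])
                for i in range(len(centroids))
                for j in range(i + 1, len(centroids)))
    return w1 * d_max - w2 * d_min

# Usage: for a bimodal pixel set, well-placed centroids score better
# (lower) than poorly placed ones.
pixels = [10, 12, 11, 200, 205, 198]
good = clustering_fitness([11, 201], pixels)
bad = clustering_fitness([100, 110], pixels)
```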
The method described in [97] uses artificial ants to classify remote-sensing
images, so that different land uses are identified in the image. The same problem
was solved in [98], but applying PSO. The method described in [99] classifies a
high-resolution urban satellite image to identify 5 land cover classes: vegetation,
water, urban, shadow, and road. The article proposes two classification methods,
which apply artificial ants and PSO, respectively. Another crop classification system
was defined in [100]. This system uses PSO to train a neural network that can
differentiate 13 types of crops in a radar image.
Object detection consists of finding a specific object within another, more
general image or in a video sequence (Fig. 5). Automatic object detection is a very
important operation in computer vision, but it is difficult due to many factors such
as rotation, scale, occlusion, or viewpoint changes. The practical applications of this
Model-based methods include graph models, which break the object into parts
and represent each one by a graph vertex. This approach is considered in [108],
which applies artificial ants for road extraction from very high-resolution satellite
images. First, the image is segmented to generate image objects. These objects are
then used to define the nodes of the graph that the ants will traverse to define a binary
roadmap. At the end of the process, the binary roadmap is vectorized to obtain the
center lines of the road network.
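The graph-traversal step can be illustrated with a generic ant-system path search; this is not the exact algorithm of [108], and the graph, edge weights, and parameters below are illustrative.

```python
import random

def aco_path(adj, start, goal, n_ants=20, iters=30, evap=0.3, seed=3):
    """Ants repeatedly construct start->goal walks over a weighted graph,
    reinforcing short paths with pheromone (ant-system style)."""
    rng = random.Random(seed)
    tau = {(u, v): 1.0 for u in adj for v in adj[u]}
    best_path, best_len = None, float("inf")
    for _ in range(iters):
        for _ in range(n_ants):
            node, path, length = start, [start], 0.0
            while node != goal and len(path) <= len(adj):
                cand = [v for v in adj[node] if v not in path]
                if not cand:
                    break  # dead end: abandon this walk
                # Choose the next node by pheromone * (1 / edge weight).
                wts = [tau[(node, v)] / adj[node][v] for v in cand]
                x, acc = rng.random() * sum(wts), 0.0
                for v, wt in zip(cand, wts):
                    acc += wt
                    if acc >= x:
                        length += adj[node][v]
                        node = v
                        path.append(v)
                        break
            if node == goal:
                if length < best_len:
                    best_path, best_len = path, length
                # Deposit pheromone inversely proportional to path length.
                for u, v in zip(path, path[1:]):
                    tau[(u, v)] += 1.0 / length
        tau = {e: (1 - evap) * t for e, t in tau.items()}
    return best_path, best_len

# Usage: tiny graph; the A->B->D route (cost 3) beats A->C->D (cost 5).
adj = {"A": {"B": 1, "C": 4}, "B": {"D": 2}, "C": {"D": 1}, "D": {}}
path, length = aco_path(adj, "A", "D")
```

In [108] the graph nodes come from segmented image objects rather than an abstract node set, but the pheromone-guided traversal follows the same pattern.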
A cuckoo search-based method was applied in [109] to detect vessels in a
synthetic aperture radar image.
The proposals described in [110] and [111] define two template-matching
methods that apply ABC. Template-matching methods try to find a sub-image,
called the template, within another image. The objective function proposed in [110]
computes the difference between the RGB level histograms corresponding to the
target object and the template object. On the other hand, the absolute sum of
differences between pixels of the target image and the template image was used
in [111] to define the fitness function.
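The fitness of [111] can be illustrated with a sum-of-absolute-differences (SAD) function; in the sketch below, an exhaustive scan over placements stands in for the ABC search, whose food sources would encode candidate (row, column) placements. The image and template are synthetic examples.

```python
def sad_fitness(image, template, top, left):
    """Sum of absolute differences between a template and the image
    window anchored at (top, left); the quantity minimized in a
    SAD-based template-matching fitness."""
    th, tw = len(template), len(template[0])
    return sum(abs(image[top + r][left + c] - template[r][c])
               for r in range(th) for c in range(tw))

# Usage: the template is an exact crop of the image at (1, 2), so the SAD
# there is 0 and every other placement scores worse.
image = [[r * 10 + c for c in range(6)] for r in range(5)]
template = [[image[1 + r][2 + c] for c in range(2)] for r in range(2)]
best = min((sad_fitness(image, template, t, l), (t, l))
           for t in range(4) for l in range(5))
```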
A method for visual target recognition for low-altitude aircraft was described
in [112]. It is a shape-matching method that uses ABC to optimize the matching
parameters.
Before concluding this section, it should be noted that object recognition is
a necessary operation for object tracking. Several applications of PSO for object
tracking appear in [113, 114], or [115]. Other swarm-based solutions considered
are CUS [116, 117], BA [118], and FA [119].
Humans show many emotions through facial expressions (happiness, sadness, anger,
etc.). The recognition of these expressions is useful for the analysis of customer
satisfaction, video games, or virtual reality, among other applications. Several
swarm-based methods have been proposed for the automatic recognition of facial
expressions, based on FA [136], PSO [137], CSO [138], or GWO [139]. In
addition, the method described in [140] proposes a three-dimensional facial
expression recognition model that uses artificial ants and PSO.
Face recognition and facial expression recognition are related to head pose
estimation, which is itself a difficult problem in computer vision. This
problem was addressed in [141] by a method that uses images from a depth camera
and applies the PSO algorithm to solve the problem as an optimization problem.
The method presented in [142] is a PSO-based solution for three-dimensional head
pose estimation that uses a commodity depth camera. A variant of ABC was used in
[143] for face pose estimation from a single image.
Human motion recognition requires detecting changes in the position of a
human posture or gesture across a video sequence. The starting point of
this process is the identification of the initial position. Tracking human motion from
References
1. Szeliski, R. (2010). Computer vision: Algorithms and applications, Springer Science &
Business Media.
2. Panigrahi, B. K., Shi, Y., & Lim, M. H. (2011). Handbook of swarm intelligence: Concepts,
principles and applications (Vol. 8). Springer Science & Business Media.
3. Yang, X. S., Cui, Z., Xiao, R., Gandomi, A. H., & Karamanoglu, M. (2013). Swarm
intelligence and bio-inspired computation: theory and applications. Newnes.
4. Abraham, A., Guo, H., & Liu, H. (2006). Swarm intelligence: foundations, perspectives and
applications. In Swarm intelligent systems (pp. 3–25). Springer.
5. Abdulrahman, S. M. (2017). Using swarm intelligence for solving NP-hard problems.
Academic Journal of Nawroz University, 6(3), 46–50.
6. Hassanien, A. E., & Emary, E. (2018). Swarm intelligence: Principles, advances, and
applications. CRC Press.
7. Slowik, A. (2021). Swarm intelligence algorithms: Modifications and applications. CRC
Press.
8. Karaboga, D., & Basturk, B. (2007). A powerful and efficient algorithm for numerical func-
tion optimization: Artificial bee colony (ABC) algorithm. Journal of Global Optimization,
39(3), 459–471.
9. Dorigo, M., & Stützle, T. (2019). Ant colony optimization: overview and recent advances. In
Handbook of metaheuristics (pp. 311–351).
10. Mirjalili, S. (2015). The ant lion optimizer. Advances in Engineering Software, 83, 80–98.
11. Yang, X. S. (2010). A new metaheuristic bat-inspired algorithm. In González, J., Pelta, D.,
Cruz, C., Terrazas, G., & Krasnogor, N. (Eds.), Nature Inspired Cooperative Strategies for
Optimization (NICSO 2010) (pp. 65–74). Springer. 10.1007/978-3-642-12538-6_6
12. Passino, K. M. (2002). Biomimicry of bacterial foraging for distributed optimization and
control. IEEE Control Systems Magazine, 22(3), 52–67.
13. Yang, X. S., & Deb, S. (2009). Cuckoo search via Lévy flights. In 2009 World
Congress on Nature & Biologically Inspired Computing (NaBIC) (pp. 210–214). IEEE.
10.1109/NABIC.2009.5393690
14. Chu, S. C., & Tsai, P. W. (2007). Computational intelligence based on the behavior of cats.
International Journal of Innovative Computing, Information and Control, 3(1), 163–173.
15. Askarzadeh, A. (2016). A novel metaheuristic method for solving constrained engineering
optimization problems: Crow search algorithm. Computers & Structures, 169, 1–12.
16. Yang, X. S., Karamanoglu, M., & He, X. (2014). Flower pollination algorithm: a novel
approach for multiobjective optimization. Engineering Optimization, 46(9), 1222–1237.
17. Yang, X. S., & He, X. (2013). Firefly algorithm: recent advances and applications. Interna-
tional Journal of Swarm Intelligence, 1(1), 36–50.
18. Li, X. L., Shao, Z. J., & Qian, J. X. (2002). An optimizing method based on autonomous
animats: Fish-swarm algorithm. Systems Engineering - Theory and Practice, 22(11), 32–38.
19. Mirjalili, S., Mirjalili, S. M., & Lewis, A. (2014). Grey wolf optimizer. Advances in
Engineering Software, 69, 46–61.
20. Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. In Proceedings of
ICNN’95-International Conference on Neural Networks (Vol. 4, pp. 1942–1948). IEEE.
10.1109/ICNN.1995.488968
21. Mirjalili, S., & Lewis, A. (2016). The whale optimization algorithm. Advances in Engineering
Software, 95, 51–67.
22. Chen, B., Chen, L., & Chen, Y. (2013). Efficient ant colony optimization for image feature
selection. Signal Processing, 93(6), 1566–1576.
23. Kumar, A., Patidar, V., Khazanchi, D., & Saini, P. (2016). Optimizing feature selection using
particle swarm optimization and utilizing ventral sides of leaves for plant leaf classification.
Procedia Computer Science, 89, 324–332.
24. Naeini, A. A., Babadi, M., Mirzadeh, S. M. J., & Amini, S. (2018). Particle swarm
optimization for object-based feature selection of VHSR satellite images. IEEE Geoscience
and Remote Sensing Letters, 15(3), 379–383.
25. Andrushia, A. D., & Patricia, A. T. (2020). Artificial bee colony optimization (ABC) for grape
leaves disease detection. Evolving Systems, 11(1), 105–117.
26. Ghamisi, P., Chen, Y., & Zhu, X. X. (2016). A self-improving convolution neural network
for the classification of hyperspectral data. IEEE Geoscience and Remote Sensing Letters,
13(10), 1537–1541.
27. Su, H., Du, Q., Chen, G., & Du, P. (2014). Optimized hyperspectral band selection using
particle swarm optimization. IEEE Journal of Selected Topics in Applied Earth Observations
and Remote Sensing, 7(6), 2659–2670.
28. Ghamisi, P., & Benediktsson, J. A. (2014). Feature selection based on hybridization of genetic
algorithm and particle swarm optimization. IEEE Geoscience and Remote Sensing Letters,
12(2), 309–313.
29. Medjahed, S. A., Saadi, T. A., Benyettou, A., & Ouali, M. (2016). Gray wolf optimizer for
hyperspectral band selection. Applied Soft Computing, 40, 178–186.
30. Wang, M., Wu, C., Wang, L., Xiang, D., & Huang, X. (2019). A feature selection approach for
hyperspectral image based on modified ant lion optimizer. Knowledge-Based Systems, 168,
39–48.
31. Medjahed, S. A., Saadi, T. A., Benyettou, A., & Ouali, M. (2015). Binary cuckoo search
algorithm for band selection in hyperspectral image classification. IAENG International
Journal of Computer Science, 42(3), 183–191.
32. Xie, F., Li, F., Lei, C., Yang, J., & Zhang, Y. (2019). Unsupervised band selection based on
artificial bee colony algorithm for hyperspectral image classification. Applied Soft Computing,
75, 428–440.
33. Su, H., Cai, Y., & Du, Q. (2016). Firefly-algorithm-inspired framework with band selection
and extreme learning machine for hyperspectral image classification. IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing, 10(1), 309–320.
34. Mohammadi, F. G., & Abadeh, M. S. (2014). Image steganalysis using a bee colony based
feature selection algorithm. Engineering Applications of Artificial Intelligence, 31, 35–43.
35. Chhikara, R. R., Sharma, P., & Singh, L. (2016). A hybrid feature selection approach based on
improved PSO and filter approaches for image steganalysis. International Journal of Machine
Learning and Cybernetics, 7(6), 1195–1206.
36. Adeli, A., & Broumandnia, A. (2018). Image steganalysis using improved particle swarm
optimization based feature selection. Applied Intelligence, 48(6), 1609–1622.
37. Pathak, Y., Arya, K., & Tiwari, S. (2019). Feature selection for image steganalysis using Levy
flight-based grey wolf optimization. Multimedia Tools and Applications, 78(2), 1473–1494.
38. Zebari, D. A., Zeebaree, D. Q., Saeed, J. N., Zebari, N. A., & Adel, A. Z. (2020). Image
steganography based on swarm intelligence algorithms: A survey. Test Engineering and
Management, 7(8), 22257–22269.
39. Nezamabadi-Pour, H., Saryazdi, S., & Rashedi, E. (2006). Edge detection using ant algo-
rithms. Soft Computing, 10(7), 623–628.
40. Tian, J., Yu, W., & Xie, S. (2008). An ant colony optimization algorithm for image edge
detection. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on
Computational Intelligence) (pp. 751–756). IEEE. 10.1109/CEC.2008.4630880
41. Baterina, A. V., & Oppus, C. (2010). Image edge detection using ant colony optimization.
WSEAS Transactions on Signal Processing, 6(2), 58–67.
42. Lu, D. S., & Chen, C. C. (2008). Edge detection improvement by ant colony optimization.
Pattern Recognition Letters, 29(4), 416–425.
43. Verma, O. P., Hanmandlu, M., & Sultania, A. K. (2010). A novel fuzzy ant system for edge
detection. In 2010 IEEE/ACIS 9th International Conference on Computer and Information
Science (pp. 228–233). IEEE. 10.1109/ICIS.2010.145
44. Etemad, S. A., & White, T. (2011). An ant-inspired algorithm for detection of image edge
features. Applied Soft Computing, 11(8), 4883–4893.
45. Setayesh, M., Zhang, M., & Johnston, M. (2009). A new homogeneity-based approach to edge
detection using PSO. In 2009 24th International Conference Image and Vision Computing
New Zealand (pp. 231–236). IEEE. 10.1109/IVCNZ.2009.5378404
46. Yigitbasi, E. D., & Baykan, N. A. (2013). Edge detection using artificial bee colony algorithm
(ABC). International Journal of Information and Electronics Engineering, 3(6), 634–638.
47. Dong, N., Wu, C. H., Ip, W. H., Chen, Z. Q., Chan, C. Y., & Yung, K. L. (2012). An
opposition-based chaotic GA/PSO hybrid algorithm and its application in circle detection.
Computers & Mathematics with Applications, 64(6), 1886–1902.
48. Cuevas, E., Sención-Echauri, F., Zaldivar, D., & Pérez-Cisneros, M. (2012). Multi-circle
detection on images using artificial bee colony (ABC) optimization. Soft Computing, 16(2),
281–296.
49. Dasgupta, S., Das, S., Biswas, A., & Abraham, A. (2010). Automatic circle detection on
digital images with an adaptive bacterial foraging algorithm. Soft Computing, 14(11), 1151–
1164.
50. Li, H., He, H., & Wen, Y. (2015). Dynamic particle swarm optimization and k-means
clustering algorithm for image segmentation. Optik, 126(24), 4817–4822.
51. Omran, M.G., Salman, A., & Engelbrecht, A. P. (2006). Dynamic clustering using particle
swarm optimization with application in image segmentation. Pattern Analysis and Applica-
tions, 8(4), 332–344.
52. Chu, X., Zhu, Y., Shi, J., & Song, J. (2010). Method of image segmentation based on
fuzzy c-means clustering algorithm and artificial fish swarm algorithm. In 2010 International
Conference on Intelligent Computing and Integrated Systems (pp. 254–257). IEEE.
53. Malisia, A. R., & Tizhoosh, H. R. (2006). Image thresholding using ant colony optimization.
In The 3rd Canadian Conference on Computer and Robot Vision (CRV’06) (pp. 26–26). IEEE.
10.1109/CRV.2006.42
54. Han, Y., & Shi, P. (2007). An improved ant colony algorithm for fuzzy clustering in image
segmentation. Neurocomputing, 70(4–6), 665–671.
55. Yang, X., Zhao, W., Chen, Y., & Fang, X. (2008). Image segmentation with a fuzzy clustering
algorithm based on ant-tree. Signal Processing, 88(10), 2453–2462.
56. Ye, Z., Hu, Z., Wang, H., & Chen, H. (2011). Automatic threshold selection based on
artificial bee colony algorithm. In 2011 3rd International Workshop on Intelligent Systems
and Applications (pp. 1–4). IEEE. 10.1109/ISA.2011.5873357
57. Horng, M. H. (2010). A multilevel image thresholding using the honey bee mating optimiza-
tion. Applied Mathematics and Computation, 215(9), 3302–3310.
58. Zhang, Y., & Wu, L. (2011). Optimal multi-level thresholding based on maximum Tsallis
entropy via an artificial bee colony approach. Entropy, 13(4), 841–859.
59. Akay, B. (2013). A study on particle swarm optimization and artificial bee colony algorithms
for multilevel thresholding. Applied Soft Computing, 13(6), 3066–3091.
60. Bhandari, A. K., Kumar, A., & Singh, G. K. (2015). Modified artificial bee colony based
computationally efficient multilevel thresholding for satellite image segmentation using
Kapur’s, Otsu and Tsallis functions. Expert Systems with Applications, 42(3), 1573–1601.
61. Sri Madhava Raja, N., Rajinikanth, V., & Latha, K. (2014). Otsu based optimal multilevel
image thresholding using firefly algorithm. Modelling and Simulation in Engineering, 2014.
10.1155/2014/794574
62. Brajevic, I., & Tuba, M. (2014). Cuckoo search and firefly algorithm applied to multilevel
image thresholding. In Yang, X. (Ed.), Cuckoo search and firefly algorithm. Studies in
Computational Intelligence (pp. 115–139). Springer.
63. Manic, K. S., Priya, R. K., & Rajinikanth, V. (2016). Image multithresholding based on
Kapur/Tsallis entropy and firefly algorithm. Indian Journal of Science and Technology, 9(12),
1–6. 10.17485/ijst/2016/v9i12/89949
64. He, L., & Huang, S. (2017). Modified firefly algorithm based multilevel thresholding for color
image segmentation. Neurocomputing, 240, 152–174.
65. Pare, S., Bhandari, A. K., Kumar, A., & Singh, G. K. (2018). A new technique for multilevel
color image thresholding based on modified fuzzy entropy and Lévy flight firefly algorithm.
Computers & Electrical Engineering, 70, 476–495.
66. Horng, M. H., & Liou, R. J. (2011). Multilevel minimum cross entropy threshold selection
based on the firefly algorithm. Expert Systems with Applications, 38(12), 14805–14811.
67. Bhandari, A. K., Singh, V. K., Kumar, A., & Singh, G. K. (2014). Cuckoo search algorithm
and wind driven optimization based study of satellite image segmentation for multilevel
thresholding using Kapur’s entropy. Expert Systems with Applications, 41(7), 3538–3560.
68. Agrawal, S., Panda, R., Bhuyan, S., & Panigrahi, B. K. (2013). Tsallis entropy based
optimal multilevel thresholding using cuckoo search algorithm. Swarm and Evolutionary
Computation, 11, 16–30.
69. Pare, S., Kumar, A., Bajaj, V., & Singh, G. K. (2017). An efficient method for multilevel color
image thresholding using cuckoo search algorithm based on minimum cross entropy. Applied
Soft Computing, 61, 570–592.
70. Suresh, S., & Lal, S. (2016). An efficient cuckoo search algorithm based multilevel threshold-
ing for segmentation of satellite images using different objective functions. Expert Systems
with Applications, 58, 184–209.
71. Gao, H., Xu, W., Sun, J., & Tang, Y. (2009). Multilevel thresholding for image segmentation
through an improved quantum-behaved particle swarm algorithm. IEEE Transactions on
Instrumentation and Measurement, 59(4), 934–946.
72. Liu, Y., Mu, C., Kou, W., & Liu, J. (2015). Modified particle swarm optimization-based
multilevel thresholding for image segmentation. Soft Computing, 19(5), 1311–1327.
73. Ghamisi, P., Couceiro, M. S., Martins, F. M., & Benediktsson, J. A. (2013). Multilevel
image segmentation based on fractional-order Darwinian particle swarm optimization. IEEE
Transactions on Geoscience and Remote Sensing, 52(5), 2382–2394.
74. Maitra, M., & Chatterjee, A. (2008). A hybrid cooperative–comprehensive learning based
PSO algorithm for image segmentation using multilevel thresholding. Expert Systems with
Applications, 34(2), 1341–1350.
75. Duraisamy, S. P., & Kayalvizhi, R. (2010). A new multilevel thresholding method using
swarm intelligence algorithm for image segmentation. Journal of Intelligent Learning Systems
and Applications, 2(03), 126–138.
76. Yin, P. Y. (2007). Multilevel minimum cross entropy threshold selection based on particle
swarm optimization. Applied Mathematics and Computation, 184(2), 503–513.
77. Li, L., Sun, L., Guo, J., Qi, J., Xu, B., & Li, S. (2017). Modified discrete grey wolf optimizer
algorithm for multilevel image thresholding. Computational Intelligence and Neuroscience,
2017. 10.1155/2017/3295769
78. Khairuzzaman, A. K. M., & Chaudhury, S. (2017). Multilevel thresholding using grey wolf
optimizer for image segmentation. Expert Systems with Applications, 86, 64–76.
79. Satapathy, S. C., Raja, N. S. M., Rajinikanth, V., Ashour, A. S., & Dey, N. (2018). Multi-
level image thresholding using Otsu and chaotic bat algorithm. Neural Computing and
Applications, 29(12), 1285–1307.
80. Alihodzic, A., & Tuba, M. (2014). Improved bat algorithm applied to multilevel image
thresholding. The Scientific World Journal, 2014. 10.1155/2014/176718
81. Liang, Y. C., Chen, A. H. L., & Chyu, C. C. (2006). Application of a hybrid ant colony
optimization for the multilevel thresholding in image processing. In King, I., Wang, J., Chan,
L., & Wang, D. (Eds.), International Conference on Neural Information Processing. Lecture
Notes in Computer Science (Vol. 4233, pp. 1183–1192). Springer.
82. Abd El Aziz, M., Ewees, A. A., Hassanien, A. E., Mudhsh, M., & Xiong, S. (2018). Multi-
objective whale optimization algorithm for multilevel thresholding segmentation. In Advances
in Soft Computing and Machine Learning in Image Processing (pp. 23–39). Springer.
83. Upadhyay, P., & Chhabra, J. K. (2020). Kapur’s entropy based optimal multi-
level image segmentation using crow search algorithm. Applied Soft Computing, 97.
10.1016/j.asoc.2019.105522
84. Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE Transactions
on Systems, Man, and Cybernetics, 9(1), 62–66.
85. Kapur, J. N., Sahoo, P. K., & Wong, A. K. (1985). A new method for gray-level picture
thresholding using the entropy of the histogram. Computer Vision, Graphics, and Image
Processing, 29(3), 273–285.
86. Tsallis, C. (1988). Possible generalization of Boltzmann-Gibbs statistics. Journal of Statisti-
cal Physics, 52(1), 479–487.
87. Li, C. H., & Lee, C. (1993). Minimum cross entropy thresholding. Pattern Recognition, 26(4),
617–625.
88. Chandramouli, K., & Izquierdo, E. (2006). Image classification using chaotic particle swarm
optimization. In 2006 International Conference on Image Processing (pp. 3001–3004). IEEE.
10.1109/ICIP.2006.312968
89. Wang, B., Sun, Y., Xue, B., & Zhang, M. (2018). Evolving deep convolutional neural net-
works by variable-length particle swarm optimization for image classification. In 2018 IEEE
Congress on Evolutionary Computation (CEC) (pp. 1–8). IEEE. 10.1109/CEC.2018.8477735
90. Fielding, B., & Zhang, L. (2018). Evolving image classification architectures with enhanced
particle swarm optimisation. IEEE Access, 6, 68560–68575.
91. Junior, F. E. F., & Yen, G. G. (2019). Particle swarm optimization of deep neural networks
architectures for image classification. Swarm and Evolutionary Computation, 49, 62–74.
92. Wang, S., Zhang, Y., Ji, G., Yang, J., Wu, J., & Wei, L. (2015). Fruit classification by
wavelet-entropy and feedforward neural network trained by fitness-scaled chaotic ABC and
biogeography-based optimization. Entropy, 17(8), 5711–5728.
93. Yang, J., Ye, Z., Zhang, X., Liu, W., & Jin, H. (2017). Attribute weighted Naive Bayes for
remote sensing image classification based on cuckoo search algorithm. In 2017 International
Conference on Security, Pattern Analysis, and Cybernetics (SPAC) (pp. 169–174). IEEE.
10.1109/SPAC.2017.8304270
94. Chouhan, S. S., Kaul, A., Singh, U. P., & Jain, S. (2018). Bacterial foraging optimization
based radial basis function neural network (BRBFNN) for identification and classification of
plant leaf diseases: An automatic approach towards plant pathology. IEEE Access, 6, 8852–
8863.
95. Omran, M. G., Engelbrecht, A. P., & Salman, A. (2004). Image classification using
particle swarm optimization. In K. Tan, M. Lim, X. Yao, & L. Wang (Eds.),
Recent Advances in Simulated Evolution and Learning (pp. 347–365). World Scientific.
10.1142/9789812561794_0019
96. Omran, M., Engelbrecht, A. P., & Salman, A. (2005). Particle swarm optimization method
for image clustering. International Journal of Pattern Recognition and Artificial Intelligence,
19(03), 297–321.
97. Liu, X., Li, X., Liu, L., He, J., & Ai, B. (2008). An innovative method to classify remote-
sensing images using ant colony optimization. IEEE Transactions on Geoscience and Remote
Sensing, 46(12), 4198–4208.
98. Liu, X., Li, X., Peng, X., Li, H., & He, J. (2008). Swarm intelligence for classification of
remote sensing data. Science in China Series D: Earth Sciences, 51(1), 79–87.
99. Omkar, S., Kumar, M. M., Mudigere, D., & Muley, D. (2007). Urban satellite image
classification using biologically inspired techniques. In 2007 IEEE International Symposium
on Industrial Electronics (pp. 1767–1772). IEEE. 10.1109/ISIE.2007.4374873
100. Zhang, Y., & Wu, L. (2011). Crop classification by forward neural network with adaptive
chaotic particle swarm optimization. Sensors, 11(5), 4721–4743.
101. Owechko, Y., & Medasani, S. (2005). Cognitive swarms for rapid detection of objects and
associations in visual imagery. In Proceedings of 2005 IEEE Swarm Intelligence Symposium,
2005. SIS 2005. (pp. 420–423). IEEE.
102. Singh, N., Arya, R., & Agrawal, R. (2014). A novel approach to combine features for salient
object detection using constrained particle swarm optimization. Pattern Recognition, 47(4),
1731–1739.
103. Ugolotti, R., Nashed, Y. S., Mesejo, P., Ivekovič, Š., Mussi, L., & Cagnoni, S. (2013). Particle
swarm optimization and differential evolution for model-based object detection. Applied Soft
Computing, 13(6), 3092–3105.
104. Mussi, L., Cagnoni, S., & Daolio, F. (2009). GPU-based road sign detection using particle
swarm optimization. In 2009 Ninth International Conference on Intelligent Systems Design
and Applications (pp. 152–157). IEEE.
105. Maldonado, S., Acevedo, J., Lafuente, S., Fernández, A., & López-Ferreras, F. (2010). An
optimization on pictogram identification for the road-sign recognition task using SVMs.
Computer Vision and Image Understanding, 114(3), 373–383.
106. Tseng, C. C., Hsieh, J. G., & Jeng, J. H. (2009). Active contour model via multi-population
particle swarm optimization. Expert Systems with Applications, 36(3), 5348–5352.
107. Horng, M. H., Liou, R. J., & Wu, J. (2010). Parametric active contour model by using the
honey bee mating optimization. Expert Systems with Applications, 37(10), 7015–7025.
108. Maboudi, M., Amini, J., Hahn, M., & Saati, M. (2017). Object-based road extraction from
satellite images using ant colony optimization. International Journal of Remote Sensing,
38(1), 179–198.
109. Iwin, S., Sasikala, J., & Juliet, D. S. (2019). Optimized vessel detection in marine environment
using hybrid adaptive cuckoo search algorithm. Computers & Electrical Engineering, 78,
482–492.
110. Banharnsakun, A., & Tanathong, S. (2014). Object detection based on template matching
through use of best-so-far ABC. Computational Intelligence and Neuroscience, 2014.
10.1155/2014/919406
111. Chidambaram, C., & Lopes, H. S. (2009). A new approach for template matching in digital
images using an artificial bee colony algorithm. In 2009 World Congress on Nature & Bio-
logically Inspired Computing (NaBIC) (pp. 146–151). IEEE. 10.1109/NABIC.2009.5393631
112. Xu, C., & Duan, H. (2010). Artificial bee colony (ABC) optimized edge potential function
(EPF) approach to target recognition for low-altitude aircraft. Pattern Recognition Letters,
31(13), 1759–1772.
113. Zhang, X., Hu, W., Qu, W., & Maybank, S. (2010). Multiple object tracking via species-
based particle swarm optimization. IEEE Transactions on Circuits and Systems for Video
Technology, 20(11), 1590–1602.
114. Kobayashi, T., Nakagawa, K., Imae, J., & Zhai, G. (2007). Real time object tracking on
video image sequence using particle swarm optimization. In 2007 International Conference
on Control, Automation and Systems (pp. 1773–1778). IEEE. 10.1109/ICCAS.2007.4406632
115. Ramakoti, N., Vinay, A., & Jatoth, R. K. (2009). Particle swarm optimization aided Kalman
filter for object tracking. In 2009 International Conference on Advances in Computing,
Control, and Telecommunication Technologies (pp. 531–533). IEEE. 10.1109/ACT.2009.135
116. Walia, G. S., & Kapoor, R. (2014). Intelligent video target tracking using an evolutionary
particle filter based upon improved cuckoo search. Expert Systems with Applications, 41(14),
6315–6326.
117. Ljouad, T., Amine, A., & Rziza, M. (2014). A hybrid mobile object tracker based on the
modified cuckoo search algorithm and the Kalman filter. Pattern Recognition, 47(11), 3597–
3613.
118. Gao, M. L., Shen, J., Yin, L. J., Liu, W., Zou, G. F., Li, H. T., & Fu, G. X. (2016). A novel
visual tracking method using bat algorithm. Neurocomputing, 177, 612–619.
119. Gao, M. L., He, X. H., Luo, D. S., Jiang, J., & Teng, Q. Z. (2013). Object tracking using
firefly algorithm. IET Computer Vision, 7(4), 227–237. 10.1049/iet-cvi.2012.0207
120. Kanan, H. R., & Faez, K. (2008). An improved feature selection method based on ant
colony optimization (ACO) evaluated on face recognition system. Applied Mathematics and
Computation, 205(2), 716–725.
121. Kotia, J., Bharti, R., Kotwal, A., & Mangrulkar, R. (2020). Application of firefly algorithm
for face recognition. In N. Dey (Ed.), Applications of firefly algorithm and its variants (pp.
147–171). Springer.
Swarm-Based Methods Applied to Computer Vision 353
122. Ramadan, R. M., & Abdel-Kader, R. F. (2009). Face recognition using particle swarm
optimization-based selected features. International Journal of Signal Processing, Image
Processing and Pattern Recognition, 2(2), 51–65.
123. Krisshna, N. A., Deepak, V. K., Manikantan, K., & Ramachandran, S. (2014). Face recogni-
tion using transform domain feature extraction and PSO-based feature selection. Applied Soft
Computing, 22, 141–161.
124. Tiwari, V. (2012). Face recognition based on cuckoo search algorithm. Indian Journal of
Computer Science and Engineering, 3(3), 401–405.
125. Jakhar, R., Kaur, N., & Singh, R. (2011). Face recognition using bacteria foraging
optimization-based selected features. International Journal of Advanced Computer Science
and Applications, 1(3), 106–111.
126. Kumar, D. (2017). Feature selection for face recognition using DCT-PCA and bat algorithm.
International Journal of Information Technology, 9(4), 411–423.
127. Raghavendra, R., Dorizzi, B., Rao, A., & Kumar, G. H. (2011). Particle swarm optimization
based fusion of near infrared and visible images for improved face verification. Pattern
Recognition, 44(2), 401–411.
128. Yadav, D., Vatsa, M., Singh, R., & Tistarelli, M. (2013). Bacteria foraging fusion for face
recognition across age progression. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition Workshops (pp. 173–179). 10.1109/CVPRW.2013.33
129. Wei, J., Jian-Qi, Z., & Xiang, Z. (2011). Face recognition method based on support vector
machine and particle swarm optimization. Expert Systems with Applications, 38(4), 4390–
4393.
130. Panda, R., Naik, M. K., & Panigrahi, B. K. (2011). Face recognition using bacterial foraging
strategy. Swarm and Evolutionary Computation, 1(3), 138–146.
131. Chakrabarty, A., Jain, H., & Chatterjee, A. (2013). Volterra kernel based face recognition
using artificial bee colony optimization. Engineering Applications of Artificial Intelligence,
26(3), 1107–1114.
132. Lu, Y., Zeng, N., Liu, Y., & Zhang, N. (2015). A hybrid wavelet neural network and switching
particle swarm optimization algorithm for face direction recognition. Neurocomputing, 155,
219–224.
133. Naik, M. K., & Panda, R. (2016). A novel adaptive cuckoo search algorithm for intrinsic
discriminant analysis based face recognition. Applied Soft Computing, 38, 661–675.
134. Nebti, S., & Boukerram, A. (2017). Swarm intelligence inspired classifiers for facial
recognition. Swarm and Evolutionary Computation, 32, 150–166.
135. Sánchez, D., Melin, P., & Castillo, O. (2017). Optimization of modular granular neural
networks using a firefly algorithm for human recognition. Engineering Applications of
Artificial Intelligence, 64, 172–186.
136. Zhang, L., Mistry, K., Neoh, S. C., & Lim, C. P. (2016). Intelligent facial emotion recognition
using moth-firefly optimization. Knowledge-Based Systems, 111, 248–267.
137. Mistry, K., Zhang, L., Neoh, S. C., Lim, C. P., & Fielding, B. (2016). A micro-GA embedded
PSO feature selection approach to intelligent facial emotion recognition. IEEE Transactions
on Cybernetics, 47(6), 1496–1509.
138. Sikkandar, H., & Thiyagarajan, R. (2021). Deep learning based facial expression recognition
using improved cat swarm optimization. Journal of Ambient Intelligence and Humanized
Computing, 12(2), 3037–3053.
139. Sreedharan, N. P. N., Ganesan, B., Raveendran, R., Sarala, P., & Dennis, B. (2018). Grey
wolf optimisation-based feature selection and classification for facial emotion recognition.
IET Biometrics, 7(5), 490–499.
140. Mpiperis, I., Malassiotis, S., Petridis, V., & Strintzis, M. G. (2008). 3D facial expression
recognition using swarm intelligence. In 2008 IEEE International Conference on Acoustics,
Speech and Signal Processing (pp. 2133–2136). IEEE. 10.1109/ICASSP.2008.4518064
141. Padeleris, P., Zabulis, X., & Argyros, A. A. (2012). Head pose estimation on depth data based
on particle swarm optimization. In 2012 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition Workshops (pp. 42–49). IEEE.
142. Meyer, G. P., Gupta, S., Frosio, I., Reddy, D., & Kautz, J. (2015). Robust model-based 3D
head pose estimation. In Proceedings of the IEEE International Conference on Computer
Vision (pp. 3649–3657).
143. Zhang, Y., & Wu, L. (2011). Face pose estimation by chaotic artificial bee colony. Interna-
tional Journal of Digital Content Technology and its Applications, 5(2), 55–63.
144. Oikonomidis, I., Kyriazis, N., & Argyros, A. A. (2010). Markerless and efficient 26-DOF
hand pose recovery. In Asian Conference on Computer Vision (pp. 744–757). Springer.
145. Ye, Q., Yuan, S., & Kim, T. K. (2016). Spatial attention deep net with partial PSO
for hierarchical hybrid hand pose estimation. In B. Leibe, J. Matas, N. Sebe, & M.
Welling (Eds.), European Conference on Computer Vision (pp. 346–361). Springer.
10.1007/978-3-319-46484-8_21
146. Oikonomidis, I., Kyriazis, N., & Argyros, A. A. (2011). Full DOF tracking of a hand inter-
acting with an object by modeling occlusions and physical constraints. In 2011 International
Conference on Computer Vision (pp. 2088–2095). IEEE. 10.1109/ICCV.2011.6126483
147. Oikonomidis, I., Kyriazis, N., & Argyros, A. A. (2011). Efficient model-based 3D tracking of
hand articulations using Kinect. In J. Hoey, S. McKenna, & E. Trucco (Eds.), British Machine
Vision Conference (Vol. 1, pp. 2088–2095). 10.5244/C.25.101
148. Ivekovič, Š., Trucco, E., & Petillot, Y. R. (2008). Human body pose estimation with particle
swarm optimisation. Evolutionary Computation, 16(4), 509–528.
149. Akhtar, S., Ahmad, A., & Abdel-Rahman, E. M. (2012). A metaheuristic bat-inspired
algorithm for full body human pose estimation. In 2012 Ninth Conference on Computer and
Robot Vision (pp. 369–375). IEEE. 10.1109/CRV.2012.55
150. Robertson, C., & Trucco, E. (2006). Human body posture via hierarchical evolutionary
optimization. In British Machine Vision Conference (Vol. 6, pp. 111–118). 10.5244/C.20.102
151. Balaji, S., Karthikeyan, S., & Manikandan, R. (2021). Object detection using metaheuristic
algorithm for volley ball sports application. Journal of Ambient Intelligence and Humanized
Computing, 12(1), 375–385.
152. John, V., Trucco, E., & Ivekovic, S. (2010). Markerless human articulated tracking using
hierarchical particle swarm optimisation. Image and Vision Computing, 28(11), 1530–1547.
153. Zhang, X., Hu, W., Wang, X., Kong, Y., Xie, N., Wang, H., Ling, H., & Maybank, S. (2010). A
swarm intelligence based searching strategy for articulated 3D human body tracking. In 2010
IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops
(pp. 45–50). IEEE.
154. Thida, M., Eng, H. L., Monekosso, D. N., & Remagnino, P. (2013). A particle swarm
optimisation algorithm with interactive swarms for tracking multiple targets. Applied Soft
Computing, 13(6), 3106–3117.
155. Hancer, E., Ozturk, C., & Karaboga, D. (2013). Extraction of brain tumors from MRI
images with artificial bee colony based segmentation methodology. In 2013 8th International
Conference on Electrical and Electronics Engineering (ELECO) (pp. 516–520). IEEE.
10.1109/ELECO.2013.6713896
156. Taherdangkoo, M., Yazdi, M., & Rezvani, M. (2010). Segmentation of MR brain images using
FCM improved by artificial bee colony (ABC) algorithm. In Proceedings of the 10th IEEE
International Conference on Information Technology and Applications in Biomedicine (pp.
1–5). IEEE. 10.1109/ITAB.2010.5687803
157. Menon, N., & Ramakrishnan, R. (2015). Brain tumor segmentation in MRI images
using unsupervised artificial bee colony algorithm and FCM clustering. In 2015 Interna-
tional Conference on Communications and Signal Processing (ICCSP) (pp. 6–9). IEEE.
10.1109/ICCSP.2015.7322635
158. Mostafa, A., Fouad, A., Abd Elfattah, M., Hassanien, A. E., Hefny, H., Zhu, S. Y., & Schaefer,
G. (2015). CT liver segmentation using artificial bee colony optimisation. Procedia Computer
Science, 60, 1622–1630.
159. Pereira, C., Gonçalves, L., & Ferreira, M. (2015). Exudate segmentation in fundus images
using an ant colony optimization approach. Information Sciences, 296, 14–24.
160. Huang, P., Cao, H., & Luo, S. (2008). An artificial ant colonies approach to medical image
segmentation. Computer Methods and Programs in Biomedicine, 92(3), 267–273.
161. Lee, M. E., Kim, S. H., Cho, W. H., Park, S. Y., & Lim, J. S. (2009). Segmentation of brain MR
images using an ant colony optimization algorithm. In 2009 Ninth IEEE International Con-
ference on Bioinformatics and Bioengineering (pp. 366–369). IEEE. 10.1109/BIBE.2009.58
162. Karnan, M., & Logheshwari, T. (2010). Improved implementation of brain MRI image seg-
mentation using ant colony system. In 2010 IEEE International Conference on Computational
Intelligence and Computing Research (pp. 1–4). IEEE. 10.1109/ICCIC.2010.5705897
163. Chakraborty, S., Chatterjee, S., Dey, N., Ashour, A. S., Ashour, A. S., Shi, F., & Mali, K.
(2017). Modified cuckoo search algorithm in microscopic image segmentation of hippocam-
pus. Microscopy Research and Technique, 80(10), 1051–1072.
164. Ilunga-Mbuyamba, E., Cruz-Duarte, J. M., Avina-Cervantes, J. G., Correa-Cely, C. R.,
Lindner, D., & Chalopin, C. (2016). Active contours driven by cuckoo search strategy for
brain tumour images segmentation. Expert Systems with Applications, 56, 59–68.
165. Li, Y., Jiao, L., Shang, R., & Stolkin, R. (2015). Dynamic-context cooperative quantum-
behaved particle swarm optimization based on multilevel thresholding applied to medical
image segmentation. Information Sciences, 294, 408–422.
166. Mekhmoukh, A., & Mokrani, K. (2015). Improved fuzzy C-means based particle swarm
optimization (PSO) initialization and outlier rejection with level set methods for MR brain
image segmentation. Computer Methods and Programs in Biomedicine, 122(2), 266–281.
167. Kavitha, P., & Prabakaran, S. (2019). A novel hybrid segmentation method with particle
swarm optimization and fuzzy c-mean based on partitioning the image for detecting lung
cancer. International Journal of Engineering and Advanced Technology, 8(5), 1223–1227.
168. Mandal, D., Chatterjee, A., & Maitra, M. (2014). Robust medical image segmentation
using particle swarm optimization aided level set based global fitting energy active contour
approach. Engineering Applications of Artificial Intelligence, 35, 199–214.
169. Wen, L., Wang, X., Wu, Z., Zhou, M., & Jin, J. S. (2015). A novel statistical cerebrovascular
segmentation algorithm with particle swarm optimization. Neurocomputing, 148, 569–577.
170. Parsian, A., Ramezani, M., & Ghadimi, N. (2017). A hybrid neural network-gray wolf
optimization algorithm for melanoma detection. Biomedical Research, 28(8), 3408–3411.
171. Wang, R., Zhou, Y., Zhao, C., & Wu, H. (2015). A hybrid flower pollination algorithm based
modified randomized location for multi-threshold medical image segmentation. Bio-medical
Materials and Engineering, 26(s1), S1345–S1351. 10.3233/BME-151432
172. Alagarsamy, S., Kamatchi, K., Govindaraj, V., Zhang, Y. D., & Thiyagarajan, A. (2019).
Multi-channeled MR brain image segmentation: A new automated approach combining bat
and clustering technique for better identification of heterogeneous tumors. Biocybernetics and
Biomedical Engineering, 39(4), 1005–1035.
173. Rajinikanth, V., Raja, N. S. M., & Kamalanand, K. (2017). Firefly algorithm assisted
segmentation of tumor from brain MRI using Tsallis function and Markov random field.
Journal of Control Engineering and Applied Informatics, 19(3), 97–106.
174. Agrawal, V., & Chandra, S. (2015). Feature selection using artificial bee colony algorithm
for medical image classification. In 2015 Eighth International Conference on Contemporary
Computing (IC3) (pp. 171–176). IEEE. 10.1109/IC3.2015.7346674
175. Ahmed, H. M., Youssef, B. A., Elkorany, A. S., Saleeb, A. A., & Abd El-Samie, F. (2018).
Hybrid gray wolf optimizer–artificial neural network classification approach for magnetic
resonance brain images. Applied Optics, 57(7), B25–B31.
176. Zhang, Y., Wang, S., Ji, G., & Dong, Z. (2013). An MR brain images classifier system via
particle swarm optimization and kernel support vector machine. The Scientific World Journal,
2013. 10.1155/2013/130134
177. Dheeba, J., Singh, N. A., & Selvi, S. T. (2014). Computer-aided detection of breast cancer on
mammograms: A swarm intelligence optimized wavelet neural network approach. Journal of
Biomedical Informatics, 49, 45–52.
178. Senapati, M. R., & Dash, P. K. (2013). Local linear wavelet neural network based breast tumor
classification using firefly algorithm. Neural Computing and Applications, 22(7), 1591–1598.
179. Tan, T. Y., Zhang, L., Neoh, S. C., & Lim, C. P. (2018). Intelligent skin cancer detection using
enhanced particle swarm optimization. Knowledge-based Systems, 158, 118–135.
180. Jothi, G., & Hannah Inbarani, H. (2016). Hybrid tolerance rough set–firefly based supervised
feature selection for MRI brain tumor image classification. Applied Soft Computing, 46, 639–
651.
181. Santhi, S., & Bhaskaran, V. (2014). Modified artificial bee colony based feature selection: A
new method in the application of mammogram image classification. International Journal of
Scientific and Technology Research, 3(6), 1664–1667.
182. Shankar, K., Lakshmanaprabu, S., Khanna, A., Tanwar, S., Rodrigues, J. J., & Roy, N.
R. (2019). Alzheimer detection using group grey wolf optimization based features with
convolutional classifier. Computers & Electrical Engineering, 77, 230–243.
183. Sahoo, A., & Chandra, S. (2017). Multi-objective grey wolf optimizer for improved cervix
lesion classification. Applied Soft Computing, 52, 64–80.
184. Kaur, T., Saini, B. S., & Gupta, S. (2018). A novel feature selection method for brain tumor
MR image classification based on the Fisher criterion and parameter-free bat optimization.
Neural Computing and Applications, 29(8), 193–206.
185. Sudha, M., & Selvarajan, S. (2016). Feature selection based on enhanced cuckoo search for
breast cancer classification in mammogram image. Circuits and Systems, 7(4), 327–338.
186. Kavitha, C., & Chellamuthu, C. (2014). Medical image fusion based on hybrid intelligence.
Applied Soft Computing, 20, 83–94.
187. Wachowiak, M. P., Smolíková, R., Zheng, Y., Zurada, J. M., & Elmaghraby, A. S. (2004). An
approach to multimodal biomedical image registration utilizing particle swarm optimization.
IEEE Transactions on Evolutionary Computation, 8(3), 289–301.
188. Talbi, H., & Batouche, M. (2004). Hybrid particle swarm with differential evolution for mul-
timodal image registration. In 2004 IEEE International Conference on Industrial Technology,
2004. IEEE ICIT’04. (Vol. 3, pp. 1567–1572). IEEE. 10.1109/ICIT.2004.1490800
189. Rundo, L., Tangherloni, A., Militello, C., Gilardi, M. C., & Mauri, G. (2016). Mul-
timodal medical image registration using particle swarm optimization: A review. In
2016 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1–8). IEEE.
10.1109/SSCI.2016.7850261
190. Daniel, E., Anitha, J., Kamaleshwaran, K., & Rani, I. (2017). Optimum spectrum mask
based medical image fusion using gray wolf optimization. Biomedical Signal Processing and
Control, 34, 36–43.
191. Parvathy, V. S., & Pothiraj, S. (2020). Multi-modality medical image fusion using hybridiza-
tion of binary crow search optimization. Health Care Management Science, 23(4), 661–669.
Index

A
Affective computing, v, 127–148
Audio, 4–8, 14–23, 26, 27, 127–134, 136–147
Auto-encoder, 253–271
Automatic colorization, vi, 253–271

B
Bio inspired CNN, vi

C
Capsule network, 203–230
Chebyshev polynomial approximation, 282–285, 289, 290, 292
Chest X-ray images, vi, 182–187, 195, 199, 205, 211, 219
Classification, 5, 37, 81, 103, 136, 153, 182, 207, 226, 243, 273, 295, 334
Combinatorial optimization, 311, 318
Computer vision, 8, 9, 29, 35, 61–78, 81–99, 186, 188, 223, 230, 237, 243, 253, 303, 307, 308, 310, 331–346
Confusion matrix (CM), 12, 118, 123, 213, 239
Content based image retrieval, 151–177
Convolutional neural network (CNN), vi, 8, 10, 13–19, 22, 23, 27, 28, 61, 63, 69, 97, 98, 141, 144–146, 152–154, 161, 181–199, 204, 209–211, 214–217, 243–257, 259, 261, 263, 295–304
COVID-19, vi, 98, 181–199, 203–220
Cricket video summarization, 7, 14, 19, 22
Cuckoo search approach, 334, 342

D
Dataset, 1, 68, 104, 128, 153, 184, 204, 231, 243, 254, 274, 296, 309
Deep features for CBIR, 151–177
Deep learning, v, vi, 1, 5, 10, 11, 13, 61, 63, 64, 67, 69–70, 72, 76, 78, 98, 127–148, 152, 153, 175, 181–183, 185–191, 203–207, 210, 214, 223–240, 243–250, 255, 259, 263, 273–292, 296, 297, 303
Deep neural network (DNN), 23, 134, 138, 139, 141, 146, 158, 182, 184, 185, 189, 199, 224, 226, 231, 232, 234, 237, 239, 245, 254
Diabetic, vi, 295–304
Differential evolution, vi, 307–328
Digital image processing, 181
Dimensionality reduction, 37, 274, 278–282, 292, 299
3D point cloud processing, vi, 243–250
Dynamic mode decomposition (DMD), vi, 274, 275, 279, 280, 282–287, 289–292

E
Emotions, 4, 104, 127–132, 134–138, 141–148, 343
Entropy, 11, 37–42, 44, 45, 47–52, 55, 57, 58, 104, 105, 110, 215, 217, 247, 309, 319, 321, 327, 338, 339, 345

F
Feature, 4, 36, 61, 81, 103, 128, 151, 182, 204, 224, 244, 254, 273, 296, 308, 336
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 357
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in
Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5
M
Machine learning (ML), v, 1–29, 36, 61–78, 97, 103–123, 127–148, 183, 184, 188, 189, 205, 210, 231, 253, 271, 274, 297, 302
Multimodal, v, 14, 15, 23, 26, 127–148, 335

N
Nearest neighborhood search, vi, 226, 246

O
Object recognition, v, 81, 83–85, 90, 95, 342
Octree, 245–246, 248
OpenSet Domain adaptation, 274, 291

T
Text, 6, 7, 15, 18, 19, 23, 25, 26, 55, 72, 83, 104, 130–132, 135, 136, 138, 140–143, 145–147, 151
Texture image, v, 103–123
Travelling salesman (TSP) problem, 307, 308, 310–312, 314, 315, 320

V
Video, 1, 35, 62, 104, 127, 230, 340
Video summarization, v, 1–29

X
X Ray, vi, 181–199, 203–208, 211, 213, 215, 218, 219, 344