0% found this document useful (0 votes)
59 views17 pages

Development of An End-To-End Deep Learning Framework For Sign Language Recognition Translation and Video Generation

The document describes a deep learning framework for sign language recognition, translation, and video generation. The framework uses a hybrid CNN+Bi-LSTM model for sign language recognition and a hybrid NMT+MediaPipe+GAN model for generating sign language videos from text. The proposed model achieves over 95% classification accuracy for recognition and produces high quality videos. Evaluation metrics show improvements over existing approaches, including a 38.06 BLEU score and scores above 0.9 for SSIM, PSNR, and other metrics measuring recognition accuracy and video quality.

Uploaded by

Dharshini B
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
59 views17 pages

Development of An End-To-End Deep Learning Framework For Sign Language Recognition Translation and Video Generation

The document describes a deep learning framework for sign language recognition, translation, and video generation. The framework uses a hybrid CNN+Bi-LSTM model for sign language recognition and a hybrid NMT+MediaPipe+GAN model for generating sign language videos from text. The proposed model achieves over 95% classification accuracy for recognition and produces high quality videos. Evaluation metrics show improvements over existing approaches, including a 38.06 BLEU score and scores above 0.9 for SSIM, PSNR, and other metrics measuring recognition accuracy and video quality.

Uploaded by

Dharshini B
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

Received 12 September 2022, accepted 25 September 2022, date of publication 28 September 2022,

date of current version 7 October 2022.


Digital Object Identifier 10.1109/ACCESS.2022.3210543

Development of an End-to-End Deep Learning


Framework for Sign Language Recognition,
Translation, and Video Generation
B. NATARAJAN1 , E. RAJALAKSHMI1 , R. ELAKKIYA 1 , KETAN KOTECHA 2 ,
AJITH ABRAHAM 3 , (Senior Member, IEEE), LUBNA ABDELKAREIM GABRALLA 4,

AND V. SUBRAMANIYASWAMY 1
1 Schoolof Computing, SASTRA Deemed University, Thanjavur 613401, India
2 Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Pune 412115, India
3 Machine Intelligence Research Laboratories (MIR Labs), Auburn, WA 98071, USA
4 Department of Computer Science and Information Technology, College of Applied, Princess Nourah Bint Abdul Rahman University, Riyadh 11671, Saudi Arabia

Corresponding authors: R. Elakkiya ([email protected]) and V. Subramaniyaswamy ([email protected])


The financial support for the publication is done by the Princess Nourah bint Abdulrahman University Researchers Supporting Project
number (PNURSP2022R178), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. and the research project was
sanctioned by the Science and Engineering Research Board (SERB), India under the Start-up Research Grant (SRG/2019/001338).

ABSTRACT The recent developments in deep learning techniques evolved to new heights in various
domains and applications. The recognition, translation, and video generation of Sign Language (SL) still
face huge challenges from the development perspective. Although numerous advancements have been made
in earlier approaches, the model performance still lacks recognition accuracy and visual quality. In this
paper, we introduce novel approaches for developing the complete framework for handling SL recognition,
translation, and production tasks in real-time cases. To achieve higher recognition accuracy, we use the
MediaPipe library and a hybrid Convolutional Neural Network + Bi-directional Long Short Term Memory
(CNN + Bi-LSTM) model for pose details extraction and text generation. On the other hand, the production
of sign gesture videos for given spoken sentences is implemented using a hybrid Neural Machine Translation
(NMT) + MediaPipe + Dynamic Generative Adversarial Network (GAN) model. The proposed model
addresses the various complexities present in the existing approaches and achieves above 95% classification
accuracy. In addition to that, the model performance is tested in various phases of development, and the
evaluation metrics show noticeable improvements in our model. The model has been experimented with using
different multilingual benchmark sign corpus and produces greater results in terms of recognition accuracy
and visual quality. The proposed model has secured a 38.06 average Bilingual Evaluation Understudy
(BLEU) score, remarkable human evaluation scores, 3.46 average Fréchet Inception Distance to videos
(FID2vid) score, 0.921 average Structural Similarity Index Measure (SSIM) values, 8.4 average Inception
Score, 29.73 average Peak Signal-to-Noise Ratio (PSNR) score, 14.06 average Fréchet Inception Distance
(FID) score, and an average 0.715 Temporal Consistency Metric (TCM) Score which is evidence of the
proposed work.

INDEX TERMS Deep learning, generative adversarial networks, sign language recognition, sign language
translation, video generation.

I. INTRODUCTION and unique style of communication in sign language across


Communication is essential for all human lives to explore different countries. The sign languages are obviously visual
their requirements and interactions with other people. Based cues and co-ordinate the human manual and non-manual
on recent studies, various researchers found an interesting components dramatically. It greatly supports the hard-of-
hearing and speech-impaired society in getting education,
The associate editor coordinating the review of this manuscript and jobs, and societal rights. The governments of various nations
approving it for publication was Agostino Forestiero . amended the multiple acts to standardize the sign language to

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
104358 VOLUME 10, 2022
B. Natarajan et al.: Development of an End-to-End Deep Learning Framework

benefit the hard-of-hearing and speech-impaired community.


Since, the sign language performs important role in hard-
of-hearing and speech-impaired communication, the under-
standing and responding by the normal people requires addi-
tional training and knowledge. This creates a communication
gap between ordinary people and the impaired community.
The recent advancements in deep learning techniques handle
such task efficiently by encompassing numerous mechanisms
and mathematical approaches. The development of such sys-
tems incurs huge complexities in various phases of devel-
opment, such as misclassification, self-occlusion, movement
epenthesis, ambiguity, noise, and blurred output. We investi-
gated all these challenges in a novel way to provide a better
solution and aimed to build a powerful architecture to provide
greater performance.
The emergence of deep learning techniques entered all the
fields to exhibit their strength towards robust model devel-
opment. The deep learning techniques produces impressive
results in areas such as agriculture [1], anomaly detection
[2], activity recognition [3], business analysis [4], [5], crop
selection [6], defect monitoring [7], DNA systems [8], earth
analysis [9], fraud detection [10], genomic prediction [11],
human activity recognition [12], image classification [13],
FIGURE 1. Overview of the proposed H-DNA framework (a) SL recognition
job matching [14], kinematic analysis [15], location predic- (b) SL gesture based video generation.
tion [16], medical systems [17], [18], [19], [20], network
traffic analysis [21], number plate recognition [22], object
detection [23], predictive maintenance [24], quality control the functions of the human brain to explore tremendous
[25], robotics [26], stock prediction [27], time series data performance over wider tasks and diverse domain applica-
analysis [28], and text generation [29], unmanned vehicle tions. The research studies on such implementations pro-
path findings [30], vehicle monitoring [31], weather fore- vide detailed information about the layer information, hyper
casting [32], x-ray imaging [33], YouTube video analysis parameters and advancements.
[34], zone segmentation [35]. These developments highly In this paper, we introduce a method to leverage the new
motivate us to pursue research in the deep learning area. advancements in deep neural networks to produce plausible
Deep learning models are highly powerful and have produced results in translation and video generation tasks. In fact,
intelligible achievements in a wider range of applications. the proposed ideology further extends to developing user
However, due to the complex structures and higher number of interface based applications for handling real-time cases
layers, the model training process and producing the greater and potentially addressing the various challenges of exist-
accuracy performances create additional challenges during ing approaches. Our contributions comprise the creation of
the model development. These reasons cause their applicabil- Indian Sign Language (ISL) related sentence level video
ity to produce powerful models for handling complex tasks. datasets using multiple signers without involving any spe-
We propose a Hybrid Deep Neural Architecture (H-DNA) cialized components like color gloves and sensors. We used a
which integrates the sign language recognition, translation, digital SLR camera and web camera devices for recording the
and video generation tasks into a single application as shown gesture videos. The proposed H-DNA systems are capable of
in Figure 1. We proposed a Hybrid- Deep Neural Architecture producing high quality videos given spoken sentence input,
(H-DNA) which is designed to learn the different modalities processing the sign gestures and translating them into spoken
of sign gestures in a signer-independent environment. This text. This two way mechanism is found to be superior to exist-
enhances the model to understand the underlying complex ing developments and comparably produces greater results
relationship between the input and output. The experimental in recognition and translation tasks. The experimental results
results explore the effectiveness of the proposed work in have been plotted to showcase the performance of the pro-
terms of recognition accuracy and visual quality. In order posed model for handling different sign corpus. We enlisted
to achieve greater flexibility and simultaneous processing of 50 student and staff volunteers to evaluate the model’s per-
gesture sequences, we use attention mechanisms and mathe- formance and tabulated the scores of their evaluation by con-
matical approaches to enhance the performance of the deep sidering different parameters. Overall, the proposed H-DNA
model. To justify these factors, we have shown the sample systems are designed and implemented to handle the various
output screens and outcomes of the proposed methods in nuances of traditional approaches and yield better results in
section 4. The main goal of deep neural networks is to mimic Sign Language Recognition and Translation (SLRT) tasks.

VOLUME 10, 2022 104359


B. Natarajan et al.: Development of an End-to-End Deep Learning Framework

Although the numerous developments have made for Chinese Language Recognition (SLR) and Sign Language Translation
sign Language (CSL), American Sign Language (ASL) and (SLT). Using Neural Machine Translation (NMT), MediaPipe
German Sign Language (GSL), still the performance of the library and Dynamic GAN the proposed H-DNA will be
model lacks in continuous cases and fails to handle the real developed for generating the high resolution videos. The
time inputs. proposed work simplifies the translation of spoken text to
The proposed H-DNA systems facilitate the real-time and subunit signs and then defines the mapping between glosses
accurate recognition of multimodal and multilingual sign and sign gesture images using the open pose library. Further
gestures. It allows an opportunity for developing the robust the SL videos are produced using and DynamicGAN model.
applications to handle various countries based sign languages On the other hand, using CNN, LSTM and MediaPipe library,
and provides solutions for communication gap exists between the proposed H-DNA recognizes the multilingual datasets
normal and impaired community. The proposed work has which comprises of isolated signs and continuous sign sen-
been developed as a User Interface (UI) application for han- tences by considering multimodal features. The H-DNA was
dling multilingual inputs, recognizing the multimodal sign developed and implemented on GPU-powered workstations.
gestures, generating the sign videos, and providing accurate The collection of benchmark datasets and the recording of
results over the translation and recognition tasks. To achieve own datasets are carried out as the first steps in implemen-
the expectations, the model development underwent various tation. To evaluate the performance of proposed H-DNA, the
stages of development to handle multimodal features and experimentation is performed to have three folds: The first
variations of multilingual sign corpuses. The proposed model fold deals with SL recognition, and the second fold focuses
has been trained using 40K videos for continuous recognition on SL video generation. The SL recognition model achieves
and 35K images for world level recognition tasks. The pro- an accuracy of not less than 98% and shows the improved
posed system explores solutions for the real time interactions performance of the proposed H-DNA. Criteria like robust-
of hard-of-hearing and speech-impaired people with normal ness, flexibility, and scalability are considered in the third
people. fold. We summarize the overall objectives of the proposed
The detailed investigation and various refinements of work as follows:
Convolutional Neural Network (CNN), Long Short Term • To create & integrate heterogeneous data sources and
Memory, Gated Recurrent Unit and Generative Adversarial to build a novel knowledge base consisting of multilingual
Networks (GAN) models yield better translation results and and multimodal sign sentences with minimal sign glosses and
generate high quality photorealistic videos. Sign Language skeletal level annotations by breaking down the signs into
plays vital role in the communication of hard-of-hearing and dedicated subunits.
speech-impaired community due to their inability towards • To augment and generate sign videos based on subunits
reading and writing the native language. Since the various from spoken language sentences to facilitate communication
studies dealing with SLRT research, the earlier developments between normal and impaired (hard-of-hearing and speech-
have their own limitations and are still unable to be used for impaired) communities.
continuous cases. Some of the research has been known to be • To track and recognize the signs consisting of isolated
successful for recognizing sign language, but it requires an words and continuous sign sentences including manual (one-
expensive set up and sensor devices to handle it. The tracking handed and two-handed signs) and non-manual gestures in
and recognition of specialized multimodal gesture signs is real-time scenarios.
very crucial, especially in recognizing signs of different lan- • To build a novel application with end-to-end video gen-
guages (multilingual). The research study on SL recognition eration and recognition capabilities by sharing the qualitative
focuses on the translation of sign gestures into English sen- and quantitative results of generated sign sequences without
tences and produces the text transcription for the sequence using animated avatars or sensors, and to ensure accuracy
of signs. This is due to the misconception that deaf people with minimal cost.
are comfortable with reading spoken language and therefore The further discussions about the proposed model are dis-
do not require translation into sign language. To facilitate cussed as follows. Section 2 investigates the earlier devel-
easy and clear communication between the hearing and the opments and provides the research gap in SLRT research
impaired community, it is vital to build robust systems that and seeks the advancements in various phases of develop-
can translate spoken languages into sign languages and vice ment. The proposed system details are wisely explained in
versa. This two way process can be facilitated using sign Section 3 and provide sufficient details about the model
language recognition, translation, and video generation. With development. The experimental outcome of the proposed
this motivation, the proposed approach intends to develop model is shown in section 4, and finally, the conclusion and
and build a novel H-DNA framework for SL recognition and future work part summarizes the entire information about the
translation systems as well as enhance the interactive com- proposed work.
munication between the normal and impaired community.
To the best of our knowledge, the proposed H-DNA II. RELATED WORK
is the first novel unified deep learning framework which Sign language communication explores the powerfulness of
addresses two different problem dimensions in SL: Sign human intelligence through hand actions and movements.

104360 VOLUME 10, 2022


B. Natarajan et al.: Development of an End-to-End Deep Learning Framework

TABLE 1. Comparison of existing SL recognition frameworks.

Despite relying on a single component (hand), it involves Xiao et al. [39] introduced continuous SL recognition using
numerous human upper body components such as head, NMT approaches. The author, Elakkiya et al. [40] proposed
mouth, and gaze movements to provide a real understanding an SL recognition framework using GAN+3D-CNN+LSTM
of gesture sequences in real time. Sign languages are made up Techniques. This approach utilizes the deep reinforcement
of visual actions and do not have a unique pattern to identify learning based evaluation strategy to produce highly accurate
their motion sequences. It greatly follows different styles results. The various details of the earlier literature are shown
based on its own country’s nature and culture. Understanding in Table 1 for exploring the new advancements with different
and processing such inputs is extremely difficult for tradi- sign languages such as American Sign Language (ASL),
tional machine learning approaches. It mainly supports the Chinese Sign Language (CSL) and German Sign Language
hard-of-hearing and speech-impaired society by getting those (GSL). The conventional sensor-based approaches demand
benefits such as education, employment, and engaging them extra equipment to be worn by the signer. The use of data
in societal activities. There have been numerous research gloves, color gloves, depth cameras, and leap motion con-
efforts made to produce better translation models. The real trollers creates additional overhead for the signer to commu-
time recognition and translation of sign languages requires nicate normally and poses huge limitations [41]. Although it
careful investigation of various features to produce plausible gives good prediction results, drastically loses the scope in
output without any misclassification and wrong sign output. real time applications. In addition to that, it creates discomfort
The progress Deep Learning approaches steps towards for the child and normal people during the conversation.
newer heights and produces fabulous results in computer The optimization of hyper parameter values and the impos-
vision and human action recognition applications. The ing of various constraints produces plausible outcomes and
introduction of hybrid models and ensembling techniques attracts the researchers. The primary version of the CNN
advances the capabilities of such models to handle tedious model is introduced by authors Chen and Koltun [43] pro-
tasks. The recent research works in CNN, LSTM, GRU duces images from semantic layouts. The model investi-
and GAN techniques has been investigated related to the gates the different loss functions and produces photographic
SL recognition, translation and video generation tasks and results. The model performance bottlenecks while handling
helps to introduce the novel contributions to build a powerful the large scale of images and adds the various intrinsic
framework. challenges. Similarly, the researchers in Oord et. al. [44]
The author, Barbhuiya et al. [36] proposed CNN+SVM discussed the development of gating mechanism based Pix-
based hand gesture recognition methods for static signs. elCNN models for image generation. The model has been
This approach mainly deals with alphabets and numerals. evaluated using the datasets CIFAR-10 and ImageNet. Since
The authors, Aly et al. [37] proposed a system for han- the model applies different conditions on embedding features
dling the words of Arabic SL using DeepLabv3+ gesture to produce quality image generation results, and extending
segmentation techniques and Bi-LSTM. The ASL recog- the performance for videos creates additional overheads.
nition system for 26 alphabet level sign gesture recogni- Although numerous advancements were made in the research
tion tasks is proposed by the author Lee et al. [38] uses work [45], the production of sign gesture videos is blurred and
LSTM with KNN techniques to provide higher recogni- spatial details are incoherent.
tion results. This work deals with world-level sign lan- The development of ambient models such as FUNIT [46],
guage communications. In addition to that, the researchers StarGAN [47], StarGAN v2 [48], MoCoGAN [49], LPGAN

VOLUME 10, 2022 104361


B. Natarajan et al.: Development of an End-to-End Deep Learning Framework

[50], InfoGAN [51], pix2pix [52], and CycleGAN [53] deals are the major drawbacks in the existing systems. In case
with the image generation and video production tasks effi- of Sign Generation, we have considered the limitation to
ciently. Since SL communication involves the various man- small size vocabulary, model performance improvement, low
ual and non-manual cues of humans and their facial, eye, model complexity, proper alignment of the key points, signs
gaze, and mouth expressions, it demands some advancement in spatial domain etc.
in the earlier approaches. In addition to that, the ordering
of gesture sequences greatly varies from the English sen- III. THE PROPOSED H-DNA SYSTEM
tence order. In order to address these aforementioned chal- The proposed hybrid H-DNA framework model comprises
lenges, we introduce a novel approach for aligning the frame various phases of development, such as SL recognition, mul-
sequences and generating the intermediary frames between tilingual sentences into sign word conversion, pose estimation
the sign gesture images. The proposed model deals with the using MediaPipe, and SL video generation. The proposed
various nuances of SL gestures and its components and pro- H-DNA framework aims to integrate all these modules and
duces plausible outcomes. The GAN networks are found to be provide a real time solution to SLRT research challenges.
highly capable of producing plausible results across a wider Neural Machine Translation (NMT) is the process of translat-
range of diverse domain applications. The applications such ing sentences from one language into another. It uses artificial
as security [54], [55], baggage inspection [56], infected leaf neural networks to yield highly translatable results. The iden-
identification [57], covid-19 prediction [58], agriculture [59], tification of human poses in images or videos is performed
business process monitoring [60], Brain MRI synthesis [61], using the mediapipe library. It helps to predict the various
flood visualization [62], estimating the standards of gold [63], poses of humans in various environments. Pose estimation
ECG wave synthesis [64], Internet of Things (IoT) [65] and is based on a number of key points on the human body.It
Dengue Fever sampling [66]. uses the Parity Affinity Fields approach to implement it. The
The two major components of GAN networks are generator VGG-19 model is used for classifying the different gesture
and discriminator, and they play a vital role in image or styles. It uses different 3 × 3 filters in the convolution layers.
video generation. The discriminating capability of the dis- The convolution layers provide a feature map by scanning the
criminator helps to produce high quality videos in diverse image features. The role of pooling layers is to reduce the
domains and is further investigated in the proposed work information generated by the convolution layers. To vectorize
for qualitative production of sign gesture videos. The incor- the output as a single array, the fully connected layer is
poration of CNN models with conditional GAN networks used. The incorporation of dynamic GAN [86] provides high
produces drastic improvements in video generation quality quality video generation results by encompassing the various
and efficiently handles the various traits of details present in approaches such as frame generation and video completion
an image or video. Based on the discriminator classification, techniques. The LSTM network is used for predicting the text
the generator networks underwent the fine-tuned training equivalents of the sign gestures and further helps to produce
process to produce photorealistic results. The authors, Mirza the language sentences. The following subsections explore
and Osindero [67], introduced the conditional-based GAN the various technical details and summarize the powerfulness
network model by applying constraints on label information. of each technique.
We use this approach in our work to produce videos based This section explains the implementation details of the
on the conditioned labeling approach. The advent of Dynam- proposed H- DNA framework. In the first fold, we developed
icGAN models addresses the existing challenges by using the SL recognition model using the MediaPipe library and the
strided mechanisms in convolution operations to produce VGG-19 model. Furthermore, we incorporate the Bi-LSTM
improved results. The video generation process using GAN network for text generation. In SLR, the input of continuous
networks encompasses the additional approaches to produce gesture sequences is processed by the MediaPipe library to
photo-realistic videos and keeps the coherent spatial details capture the pose sequences, angle between fingers, hand
clear. movements and locations, orientations, mouth expressions,
Although there are enormous research going on in the field and facial actions. Based on these key points, the VGG-19
of Sign Language Recognition, translation and generation model estimates the class of gestures. The incorporation of
systems the existing systems still face a lots of challenges. CNN and LSTM networks in such a hybrid way produces
The primary challenge with the Sign Language recognition higher recognition accuracy and noticeable performance. The
and generation system is the lack of availability of large-scale temporal details are analyzed sequentially to predict the trans-
open-source Indian Sign Language Dataset with natural con- lation text without any misclassification.
ditions. To overcome this issue we have developed a multi- We trained our model using 40,000 videos for 320 classes
signer, multi-model Sign Language dataset and have provided to provide wider support over multilingual sign corpus com-
it as open-source resource for further research purposes. For prises of multimodal features. The sample gesture images
Sign Language recognition systems, we have built the recog- of our own created ISL-CSLTR dataset [42] are shown in
nition model in such a way that it detects the signs irre- Figure 2 and greatly support the ISL-related SLRT research.
spective of the complex backgrounds, multi-modality, signer In general, the SL video generation process is treated as a
skin tone, signer clothing constraints, sign speed etc., which highly intensive task due to the production of sign gesture

104362 VOLUME 10, 2022


B. Natarajan et al.: Development of an End-to-End Deep Learning Framework

aids the better assistance over classification and prediction


tasks. The incorporation of the CNN based pre-trained model
VGG-19 helps to automate the SL recognition tasks in a
better way. The various basic operations, such as convolution
operations and max pooling, are applied repeatedly to learn
the finer details of the images. The VGG-19 model pro-
vides better classification results than the multilingual sign
language datasets. The results are passed to the Bi-LSTM
networks to predict the target sentences matches with the
video sequences. The intermediary feature map results of
the proposed hybrid CNN+Bi-LSTM model are shown in
Figure 3.
FIGURE 2. Sample word level sign gesture images of ISL-CSLTR dataset.

videos from English sentences. The qualitative production of


SL videos for the new input sentences poses various levels
of difficulties by considering the manual and non-manual
cues of the signers. Such a translation process demands more
attention at each step to produce high quality results. The
emergence of various deep generative models has advanced
and secured new milestones in photorealistic image genera- FIGURE 3. Visualization of the intermediate feature map results of
VGG-19 network.
tion and video production.
The VGG-19 model produces the vector representation of
A. SL RECOGNITION images and classification results. Based on such input, the
In the first phase, the development of SL recognition using LSTM layers process the information and generate the tex-
hybrid CNN+Bi-LSTM techniques is carried out. The main tual descriptions. In this context, the textual descriptions are
objective of this hybrid approach is to sequence predict language sentences that match with gesture sequences. The
in SL videos. The CNN layers are used for gesture class entire CNN model is handled by the time distribution layers
identification and LSTM networks for predicting the class to handle multiple inputs for different time steps. The LSTM
sequence. The combination of these two networks processes units apply back propagation to tune the hyper parameters
the spatio-temporal details of SL input videos and produces such as learning rate, batch sizes. The weight and bias values
the text output. The first segment uses CNN layers and is are also updated to build a powerful framework. We set the
further utilized by the Bi-LSTM networks with dense layers learning rate value as 0.01 and the batch size as 64. The
to yield plausible results. We used the VGG-19 model [68], LSTM networks [69] are found to be powerful components
which consists of 16 convolution layers and 3 fully connected in text generation, image captioning, and machine translation
layers. The CNN network processes images of size 254 × tasks. The LSTM network has three gates: (i) input gate,
254 and the first and second layers are convolutional layers. (ii) output gate, and (iii) forget gate. The separate memory
It uses 3 × 3 filters with stride level 1. The max pooling cell is added and handles a higher number of layers than
operation is performed using stride level 2 and a window size GRU. The forget gate decides the kind of information to be
of 2 × 2. After this process, the dimensions of pixels are discarded from memory and uses sigmoid activation func-
reduced to 112 × 112 × 64. Further, the convolution layer tions to squish the values between zero and one. Due to
of varying filter size 128, 56, 28 is applied and reduces the this functionality, the values multiplied by 0 become zero
size of the image as well as focuses the important features. and can be easily removed. The input gate updates the cell
The fully connected layer summarizes all classes of inputs state for processing the new inputs. The memory cells remain
and produces the probability of prediction values using the the amount of information for time stamp t. The output gate
softmax layer. The network is trained to handle 35K images of finalizes the information to be output from the model. The
192 classes representing different gesture poses based images forget gate functions are represented using the following
for different words. After completion of preprocessing steps, Equation (1) The cell state is a key for the LSTM network,
the videos of high resolution to be 1920 × 1080 and con- passing through the entire chain link of LSTM modules and
verted into numpy arrays for easier processing using skvideo governed by the aforementioned three layers. The forget gate
packages. Each class of sign gestures is recorded with 50 decides the information to be thrown away from the memory.
repetitions to provide better learning and prediction perfor- The role of input layers is to provide the desired inputs and
mance of the model. The key points based pose information update the cell state values. The output layers produce the
is captured parallel to maintain the gesticulation details and text results. We use sigmoid and tanh activation functions to

VOLUME 10, 2022 104363


B. Natarajan et al.: Development of an End-to-End Deep Learning Framework

produce plausible outcomes. We use a bi-directional LSTM


approach to focus on the text generation tasks efficiently.
The proposed hybrid CNN + Bi-LSTM techniques based SL
recognition system architecture is shown in Figure 4.

FIGURE 4. SL recognition system.


FIGURE 5. Flow chart for hybrid CNN+Bi-LSTM technique.
The LSTM network is envisioned as a strong method to
handle sequential tasks. It provides a solution to the vanish-
ing gradient problem. Since it handles the longer sequen-
tial inputs, which are applied in domains such as image The LSTM network utilizes the memory cell component
captioning, text generation and time series based applica- explicitly, and the cell states regulate the kind of information
tions. The LSTM network was introduced by the authors, to be kept or discarded from memory. During each iteration
Cho et al. [70]. The forget gate (frt ) operations are repre- cycle process, the LSTM network processes the previous
sented using Equation 1. It decides the information to be hidden state values (ht-1), current input values (Int) and the
discarded from cell states by applying the sigmoid activation previous cell state values (ht). The parameters weight and bias
function. The value 1 represents keeping the information vectors are updated regularly during the back propagation
and 0 denotes its removal. The general equations describing process to produce accurate translation results. We use Adam
the various operations of the LSTM Network are stated in optimizer and drop out regularization techniques to obtain
Equation 1. greater results over the benchmark datasets.

frt = σ Wfr · [ht−1 , Int ]+bfr



(1) B. MULTILINGUAL SENTENCES INTO SIGN WORD
CONVERSION
The next step processes the sequence of inputs and decides
the next information to be fed into the cell state. The input This section explains the translation process of language
layer represented using Equation 2 denotes the next value to sentences into sign words using the NMT and attention mech-
be updated in the cell state. Next, Equation 3 represents the anism. The conventional NMT techniques have proven to
vector values of candidate results. have appreciable performance in language translation tasks.
We use a hybrid NMT + Attention mechanism for translat-
it = σ Wi · [ht−1 , Int ]+bi

(2) ing the multilingual sentences into sign words. The NMT
Cat = tanh WCa · [ht−1 , Int ]+bCa
0

(3) technique uses RNN and its variants to process the longer
sequences and produces better results in different domain
The update of new values (Cat ) by using multiplication oper- applications. We introduce the novel deep-stacked GRU
ations and the refinement of old cell state values takes place technique in machine translation tasks to achieve greater
using Equation 4. translation results over multilingual input sentences. The
translation process is carried out using the following steps:
Cat = frt ∗Cat−1 + it ∗Cat 0 (4) The first step deals with the text preprocessing of the spoken
The output gate operations are denoted using Equation 5 and sentences. The spoken sentences are cleaned by removing the
Equation 6. It decides the information to be passed as output. special characters, punctuation marks, and symbols. We add
the <START> token at the beginning of the sentences and
outt = σ Wout · [ht−1 , Int ]+bout <END> tokens at the end of the sentences. This approach

(5)
ht = outt ∗tanh(Cat ) (6) benefits the model learning process of where to start and
stop. The word embedding techniques are used to convert the
The proposed hybrid CNN + Bi-LSTM Technique is shown tokens into dense vectors and pass them to the next level. The
in Figure 5. The detailed steps of our implementation and pro- proposed deep-stacked GRU technique efficiently handles
vides the step-by-step procedures. It gives a detailed overview the translation tasks and produces accurate results. The GRU
of the execution of VGG-19 model training. The LSTM networks use two gates: (i) the update gate and (ii) the forget
network operations to produce the language sentence output gate. The update gate governs the information to be newly
are clearly elaborated in the rest of the sections. added and the forget gate regulates the information to be kept

104364 VOLUME 10, 2022


B. Natarajan et al.: Development of an End-to-End Deep Learning Framework

or thrown away. The following equations clearly explore the


various operations of GRU units.
The deep-stacked GRU units are chain-link based on differ-
ent modules which are executed iteratively in order to produce
the sequential outputs. The input value from the current step
is denoted as xt and the input of previous hidden layers is
represented as ht−1 . The operations of the update gate (Zt )
are represented using Equation 7.
 
Zt = σ W(z) xt + Ur ht−1 (7) FIGURE 6. Proposed deep stacked GRU system.

The current input value (xt ) and the weight (W) values are
multiplied in the first part, and the second part multiplies the
previous hidden state values (ht−1 ) and its weights (U) and
finally the values are summed up to provide the new values
to the update gate. The sigmoid (σ ) activation function is
applied over the resultant values to round up the prediction
results in the range of zero to one. The update gate concludes
the volume of information to be passed to the next state. The
reset gate decides the removal of information based on the
importance of particular vector towards the prediction of next
sequences. The executions of reset gate are demonstrated
using the Equation 8 as follows. FIGURE 7. Execution flow of GRU based encoder decoder system.
 
rt = σ W(r) xt + Ur ht−1 (8)
The dense vector values of each word are passed to a feed
The reset gate (rt ) combines the results of the multiplication forward neural network to learn the source representation.
operation performed on the input (xt ) and weight (W) values The proposed hybrid NMT model handles varying lengths of
as well as the previous hidden node values (ht−1 ) and its sentences and changes the translation results accordingly. The
weight values (U). The sigmoid activation is applied to the score values produced by the networks are further processed
results. The current values (hcur ) to be present in the memory by the softmax function and yield attention weights. The
unit are computed using Equation 9. context vectors are calculated by multiplying the attention
hcur = tanh (Wxt + rt Uht−1 ) (9) weight values and hidden state values.
We incorporated the attention mechanism proposed by the
The current and previous node values are multiplied with researcher Bahdanau et al. [71] to yield the accurate trans-
weight values. The Hadamard product, known as element- lation results. The attention vector is estimated by concate-
wise multiplication, is performed over the reset gate and pre- nating the context vectors and previous output. Finally, the
vious hidden states values. Finally, the non-linear activation decoder network produces the target sign gloss output. The
function tanh is computed on the final outcome. The last step proposed hybrid NMT + Attention model is evaluated using
results in being recorded in memory units (hf ) at time step t the three benchmark sign corpus datasets such as RWTH
is computed using Equation 10. PHOENIX Weather 2014T dataset [72], How2Sign Dataset
0 [73], and ISL-CSLTR Dataset [41] and the results are shown
hf = Zt ht−1 + (1 − Zt ) ht (10)
in section 4. The computation of attention weights is done
The deep stacked approach provides better results over a using Equation 11.
wider range of applications and reduces the computational
exp(score(ht , h̄s ))
complexity of the model drastically. The deep stacked GRU αts = PS (11)
has several units of GRU blocks and performs the model s0 =1 exp(score(ht , h̄s0 ))
training in parallel. The detailed structure of deep stacked The context vector is calculated by using Equation 12.
GRU units is depicted in Figure 6. X
Further, we incorporate the attention mechanism proposed ct = αts h̄s (12)
by Bahdanau et al. [71]. The attention mechanism focuses s

on the particular context in encoder unit matching with target The Bahdanau’s attention vector is calculated by using
translation to yield high quality results. The cyclic execution Equation 13.
of the deep stacked GRU units is shown clearly in Figure 7.
at = f (ct , ht ) = tanh (Wc [ct ;ht ]) (13)
The GRU units process the spoken sentences input using
encoder and decoder based approach. The encoder network The proposed Deep stacked GRU algorithm uses stacked
of GRU processes the source format of input sentences. layers of GRU to effectively process the sequential inputs and

VOLUME 10, 2022 104365


B. Natarajan et al.: Development of an End-to-End Deep Learning Framework

translate them into target form. We apply the Bahdanau et al. generator network to produce the photo-realistic high quality
[71] attention mechanism to compute distinct context vector sign gesture videos. The integrated architecture for translat-
values and get good results. The recursive nature of GRU ing the multilingual sentences to sign video generation is
processes the entire source sentences and translates them into shown in Figure 9.
target sentences. We use beam size 10 and tanh and sigmoid
activation functions. The proposed model totally processes
40k sentences by combining multilingual sign corpus col-
lected from different sources.

C. POSE ESTIMATION USING MEDIAPIPE


The MediaPipe library was developed to provide human pose
estimation results over image and video files. This framework
is stated as an impressive one to track the details of human
activity in public environments, sign gesture pose recogni-
tion, fraud monitoring, and yoga pose analysis. We use the
MediaPipe library to estimate the poses of different signers
and key points, which are used for generating the new poses FIGURE 9. SL gesture video generation using NMT+openpose+GAN
techniques.
using the deep generative networks. The sample results of the
MediaPipe library are shown in Figure 8.
The GAN network consists of two units known as the
generator unit and the discriminator unit. The generator unit
produces the new images or videos from the noise distribution
of real data. The latent space provides various details of real
data based on which, it produces the new images or videos.
The conditional GAN model uses conditioned labels to pro-
duce the sharp images. The generated results are verified by
the other unit known as the discriminator. The discriminator
unit classifies the real and fake samples as shown in Figure 10.

FIGURE 8. Sample human pose estimation results and 3D plots using


MediaPipe library.

FIGURE 10. Discriminator network classification results.


D. SL VIDEO GENERATION
The sign gesture video generation tasks are performed using Depending on the classification results, the generator
deep generative models. We introduce the novel Dynamic- networks fine tune their performance to produce plausible
GAN network for producing plausible, photo-realistic high images and videos. We use a U-Net-like framework [74]
quality videos. The video generation involves a series of step- for learning the structure of real data distribution. From
by-step approaches to produce high-quality results. We care- which, the target pose images are generated quantitatively.
fully investigated the various mathematical models and deep The encoder network performs the convolution function,
generative frameworks to develop the novel framework. The batch normalization and activation function for Leaky ReLU.
advancements of GAN networks have found them proficient The decoder network utilizes the transposed convolution
in generating high quality images and videos. The GAN function, batch normalization techniques, dropout regular-
networks synthesis the medical images efficiently as well. izer, and finally, ReLU activation functions. The loss value
We incorporate the conditional GAN model [67] as the basic for the generator network is computed using the sigmoid
framework for our proposed DynamicGAN model. Further- cross-entropy loss. Further, the L1 loss calculates the mean
more, we use the VGG-19 pre-trained CNN network for sign absolute error between the real and generated results and
gesture classification. The techniques such as intermediary aids in producing high quality results. The Discriminator unit
frame generation, deblurring and image alignment, pixel nor- incorporates the PatchGAN [52] classification techniques to
malization, video completion are added additionally with the discriminate the real and fake samples. The Convolution

104366 VOLUME 10, 2022


B. Natarajan et al.: Development of an End-to-End Deep Learning Framework

function, Batch normalization, and the Leaky ReLU activa- and validated to produce better results. We inputted 25k
tion function are applied sequentially to produce plausible images for training and 5k images for validation purposes.
outcomes. The discriminator network estimates the realness The model performance is shown in Figure 11. The proposed
of the generated results. It uses sigmoid cross-entropy loss model achieves significant improvements in classification
function to measure the quality of generated results com- accuracy and recognition performance. Furthermore, we plot
pared with real ones. The proposed Dynamic GAN model the confusion matrix for obtaining the classification perfor-
is implemented in high end GPU based environment. The mance. The confusion matrix results are shown in Figure 12.
Dell Precision 7820 Tower workstation is used to accomplish This demonstrates the improved performance of the proposed
the entire development process. It comprises pairs of Intel hybrid CNN-LSTM model.
Xeon Silver 4210 2.2. GHz processors and 10 cores. The
Nvidia Quadro RTX40000 provides GPU support for model
training. We use batch normalization and Adam optimization
techniques with the values α = 2e-4, β = 0.5 and β = 0.999.
We set the batch size value as 128, dropout is 0.01 and initial
learning rate as 0.01. The Leaky ReLu value is set as 0.1
and ReLu activation functions are further applied. The mini
batch size is set as 100 and the momentum is 0.05. The pro-
posed DynamicGAN framework is experimented using the
multilingual sign corpus such as RWTH-PHOENIX-Weather
2014T dataset, ISL-CSLTR dataset, and How2Sign dataset.
The results are shown in section 4. FIGURE 11. Accuracy and loss evaluation of hybrid CNN-LSTM model.
We use the Mean Squared Error (MSE) Metric to evaluate
the loss values in the generator network outcomes stated in
Equation 14.
LMSE (xt, gn) = `MSE (G (Xxt ) , gn) = kG (xt) −gnk2
(14)
The Sigmoid Cross-Entropy loss combines the activation
function sigmoid as well as the Cross-Entropy loss function.
Due to the independent execution of these loss functions,
it does not affect the results of one on another.

E. DATASET
RWTH-PHOENIX-Weather 2014T dataset: This dataset
deals with the SLRT research for German sign language [72].
It consists of 40k videos for sentence level. The videos are
recorded using 9 native signers. FIGURE 12. Confusion matrix results of the hybrid CNN-LSTM Model.

ISL-CSLTR dataset: The ISL-CSLTR dataset was pub-


lished by the researchers [41] to conduct the SLRT research The classification performance of the proposed hybrid
in Indian sign language. It consists of 700 videos for 100 sen- CNN-LSTM model is evaluated using the following metrics.
tences each. The videos are recorded using seven differ- The accuracy of the proposed model is compared with the
ent signers. How2Sign dataset: The How2Sign dataset [73] existing work and the comparison results are tabulated in
contains the SLRT research for American Sign Language. Table 2.
It consists of 2,456 videos for sentences. The videos were We further investigated the proposed hybrid CNN+Bi-
recorded using 11 different signers. LSTM model performance using the following equations.
The precision is computed using the Equation 15, the Recall
IV. EXPERIMENTAL RESULTS is calculated using Equation 16, F1 Score is calculated using
This section provides the experimental results of various the Equation 17 and the accuracy is computed using the
phases of development, which are performed and investigated Equation 18 where TP, TN, FP, FN denotes true positive, true
to build a complete framework for SLRT research challenges. negative, false positive and false negative values. The various
The proposed H-DNA framework functionalities are tested quality metrics are computed using the following Equations
in different stages of the development cycle. In addition to and the results are tabulated in Table 3.
that, we have shown the user interface screens of the final TP
application. During the first phase of development, the SL Precision = (15)
TP + FP
recognition model is implemented using hybrid CNN+Bi- TP
LSTM techniques. The proposed model has been trained Recall = (16)
TP + FN
VOLUME 10, 2022 104367
B. Natarajan et al.: Development of an End-to-End Deep Learning Framework

TABLE 2. Comparison of existing SL recognition models with hybrid TABLE 3. Hybrid CNN-LSTM model performance comparison using
CNN-LSTM model. existing frameworks.

Recall ∗ Precision
F1 Score = 2 ∗ (17)
Recall + Precision
TP + TN
Accuracy = (18)
TP + FP + TN + FN
The performance of the hybrid NMT + Attention model is
evaluated using the BLEU metrics depicted in Figure 13.
It shows the performance of the proposed hybrid NMT +
Attention model compared with existing work. Further, the
performance of the hybrid NMT + Attention model is
analyzed using the attention plots depicted in Figure 14.
The attention plot shows the real translation performance
of the model by comparing the source and target sentences.
The blocks are highlighted in white color representing the
role of attention mechanism in the context of particular word
translation.

FIGURE 14. (a), (b) Attention plot results for ISL-CSLTR dataset,
(c) How2Sign dataset, (d) RWTH-PHOENIX-Weather 2014T dataset.

We compared the proposed Dynamic GAN model perfor-


mance in terms of quality and quantity by experimenting with
multilingual sign language datasets and the results are shown
below. Figure 15 depicts the video generation results of
RWTH-PHOENIX-Weather 2014T dataset, Figure 16 shows
the video generation results of the ISL-CSLTR dataset and
Figure 17 depicts the video generation results of RWTH-
FIGURE 13. Hybrid NMT+Attention model evaluation results using BLEU PHOENIX-Weather 2014T dataset. We further compared the
score metric by considering the various word length. proposed Dynamic GAN model with existing deep generative

104368 VOLUME 10, 2022


B. Natarajan et al.: Development of an End-to-End Deep Learning Framework

TABLE 4. Comparison of DynamicGAN model performance with existing


works by human evaluation metrics.

FIGURE 15. Video generation results of DynamicGAN model for


RWTH-PHOENIX-Weather 2014T dataset.

TABLE 5. Performance Comparison of DynamicGAN model for FID2vid


metrics.

FIGURE 16. Video generation results of DynamicGAN model for ISL-CSLTR


dataset.

TABLE 6. Cmparison of Structural Similarity Index Measure (SSIM) metric.

FIGURE 17. Video generation results of DynamicGAN model for


How2Sign dataset.

models. The quantitative evaluation is carried out using the


benchmark sign corpus. The results show the improved per-
formance of our approach compared with existing models.
Table 4 depicts the performance of the proposed model com-
pared with existing models in terms of realism, relevance, and
coherence using human evaluators. We validated the gener-
ated frame quality and temporal coherence using FID2Vid
scores shown in Table 5.
The Structural Similarity Index Measure (SSIM) metric
represented using Equation 19 is used for assessing the image
quality. We use the SSIM metric for comparing the model’s
performance with existing approaches. This metric assesses
the model’s performance over multiple domains and the gen-
the structural information degradation of generated video
eration capability of the generator. The computation of IS is
frames and the results are shown in Table 6.
performed using the following Equation 20.
2µx µy + C1
l (x, y) = 2 (19)
IS(G) = exp Ex∼pg DKL (p (y | x) kp(y) )

µx + µ2y + C1 (20)
The proposed DynamicGAN model performance has experi- Let x denotes the generated images of the generator network
mented with inception score metrics. The high score denotes G, let p (y | x) denotes the class distribution of generated

VOLUME 10, 2022 104369


B. Natarajan et al.: Development of an End-to-End Deep Learning Framework

TABLE 7. Comparison of inception score of dynamic GAN model. TABLE 9. Fréchet inception distance (FID) metric evaluation for
multilingual sign corpus.

TABLE 8. PSNR Score evaluation for multilingual sign corpus.

TABLE 10. Temporal consistency metric (TCM) metric evaluation for


multilingual sign corpus.

samples and the marginal probability function, denoted as


p (y) . The Inception score results are depicted in Table 7.
The PSNR metric provides a comparison result between the generated results with real data distribution. The results
real and generated results. The high PSNR indicates the of the FID metric are shown in Table 9.
improved quality of the generated results. The PSNR metric
is compared between different sign corpus for analyzing the d2 ((mr , 6r ) , (mf , 6f )) = kmr − mfk22
  
proposed DynamicGAN Model performance. The results of +Tr 6r 6f − 2 (6r 6f )1/2
the PSNR metric are shown in Table 8 and calculated using
Equation 21 and Equation 22. (23)
!
2552 We further evaluated our model performance using Tem-
PSNR (gt, ge) = 10log10 (21) poral Consistency Metric (TCM) metric to provide real
MSE (gt, ge)
score for videos related to consistency in the temporal
M N
1 XX 2 sequences to produce high quality videos rather than com-
MSE (gt, ge) = gtij − geij (22) paring with single frame level. The table 10 list the evaluation
MN
i=1 j=1
scores for TCM Metric and compares with other benchmark
The Frechet Inception Distance (FID) metric is used to assess datasets.
the quality of generated video frames and is computed using The user interface based H-DNA implementations are
Equation 23. The quality of pixels and temporal consistency shown in Figures 18 and 19. It shows the sample SL recog-
are measured. The lowest FID scores indicate better results. nition and SL video generation results.
The mean and covariance values are computed to compare Figure 18 and Figure 19.

We further evaluated the model using the Temporal Consistency Metric (TCM), which scores videos on the consistency of their temporal sequences and therefore rewards high-quality videos rather than comparing at the single-frame level. Table 10 lists the evaluation scores for the TCM metric and compares them across the benchmark datasets.

TABLE 10. Temporal consistency metric (TCM) evaluation for multilingual sign corpus.

The user-interface-based H-DNA implementations are shown in Figure 18 and Figure 19, which present sample SL recognition and SL video generation results.

FIGURE 18. Sample UI based sign gesture recognition results using the H-DNA framework.

FIGURE 19. Sample UI application results for video generation using the H-DNA framework.
V. CONCLUSION
This paper contributes to the development of a deep learning framework for end-to-end sign language recognition, translation, and generation. We addressed the challenges that persist in earlier SL recognition and video generation approaches using the proposed H-DNA framework. The model was evaluated quantitatively and qualitatively on the RWTH-PHOENIX-Weather 2014T, How2Sign, and ISL-CSLTR datasets, and the H-DNA framework was further assessed using various quality metrics. The generated video frames demonstrate the quality of the outcome of our work, and we achieved a comparatively higher recognition rate and generation performance than earlier approaches. The proposed model achieved above 95% classification accuracy for SL recognition, a 38.56 average BLEU score, remarkable human evaluation scores, a 3.46 average FID2Vid score, 0.921 average SSIM, an 8.4 average Inception Score, a 29.73 average PSNR score, a 14.06 average FID score, and an average TCM score of 0.715. These scores are notably higher than those of earlier models. The evaluation of realism, relevance, and coherence was carried out by employing human evaluators and produced good results in real-time scenarios.

FUNDING SUPPORT
This work was financially supported by the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R178), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
ACKNOWLEDGMENT
The research project was sanctioned by the Science and Engineering Research Board (SERB), India, under the Start-up Research Grant (SRG/2019/001338). The authors thank SASTRA Deemed University for providing infrastructural support to conduct the research. They thank all the students for their contribution to collecting the sign videos and the successful completion of the ISL-CSLTR Corpus. They would also like to thank Navajeevan, Residential School for the Deaf, College of Spl. D.Ed. and B.Ed., Vocational Centre, and Child Care and Learning Centre, Ayyalurimetta, Nandyal, Andhra Pradesh, India, for their support and contribution.
B. NATARAJAN received the Bachelor of Engineering degree, in 2011, and the Master of Engineering degree, in 2015. He is currently pursuing the Ph.D. degree with the School of Computing, SASTRA University, Thanjavur. He has seven years of teaching experience and has published many articles in leading international journals. His research interests include computer vision, machine learning, deep learning, and sign language development.

E. RAJALAKSHMI received the B.E. degree in information technology from the Cummins College of Engineering, Pune, in 2018, and the M.Tech. degree in computer science and engineering from SASTRA Deemed University, Thanjavur, in 2020, where she is currently pursuing the Ph.D. degree. She is currently working as a Project Associate with SASTRA Deemed University. Her research interests include sign language recognition, music emotion recognition, deep neural networks, image processing, and computer vision. She has contributed various articles and chapters to many high-quality Scopus and SCI/SCIE indexed journals, conferences, and books. She is a Lifetime Member of the International Association of Engineers and a member of the Association for Computing Machinery.

R. ELAKKIYA received the Doctor of Philosophy degree from Anna University, Chennai, in 2018. She is currently working as an Assistant Professor with the Department of Computer Science and Engineering, School of Computing, SASTRA University, Thanjavur. She holds three patents. She has published more than 20 research papers in leading journals, conference proceedings, and books, including IEEE, Elsevier, and Springer. She is currently an Editor of the Information Engineering and Applied Computing journal and a Lifetime Member of the International Association of Engineers.

KETAN KOTECHA is currently an Administrator and a Teacher of deep learning with the Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Pune. He has expertise and experience in cutting-edge research and projects in A.I. and deep learning for the last 25 years. He has published more than 100 articles in several excellent peer-reviewed journals on various topics ranging from cutting-edge A.I., education policies, and teaching-learning practices to A.I. for all. He has published three patents and delivered keynote speeches at various national and international forums, including the Machine Intelligence Laboratory, USA, IIT Bombay under the World Bank Project, the International Indian Science Festival organized by the Department of Science and Technology, Government of India, and many more. His research interests include artificial intelligence, computer algorithms, machine learning, and deep learning. He was a recipient of two SPARC Projects worth INR 166 lakhs from MHRD, Government of India, in A.I., in collaboration with Arizona State University, USA, and The University of Queensland, Australia. He was also a recipient of numerous prestigious awards, such as the Erasmus+ Faculty Mobility Grant to Poland, the DUO-India Professors Fellowship for research in responsible A.I. in collaboration with Brunel University, U.K., the LEAP Grant at Cambridge University, U.K., the UKIERI Grant with Aston University, U.K., and a Grant from the Royal Academy of Engineering, U.K., under the Newton Bhabha Fund. He is an Associate Editor of the IEEE ACCESS journal.
AJITH ABRAHAM (Senior Member, IEEE) received the Master of Science degree from Nanyang Technological University, Singapore, in 1998, and the Ph.D. degree in computer science from Monash University, Melbourne, Australia, in 2001. He is currently the Director of the Machine Intelligence Research Laboratories (MIR Laboratories), a not-for-profit scientific network for innovation and research excellence connecting industry and academia. The network, with headquarters in Seattle, USA, currently has more than 1,500 scientific members from over 105 countries. As an investigator/co-investigator, he has won research grants worth more than U.S. $100 million. He currently holds two university professorial appointments: he works as a Professor in artificial intelligence at Innopolis University, Russia, and holds the Yayasan Tun Ismail Mohamed Ali Professorial Chair in Artificial Intelligence at UCSI, Malaysia. He works in a multi-disciplinary environment and has authored/coauthored more than 1,400 research publications, of which more than 100 are books covering various aspects of computer science. One of his books was translated into Japanese, and a few other articles were translated into Russian and Chinese. He has more than 46,000 academic citations (H-index of more than 102 as per Google Scholar). He has given more than 150 plenary lectures and conference tutorials in more than 20 countries. He was the Chair of the IEEE Systems, Man, and Cybernetics Society Technical Committee on Soft Computing (which has over 200 members), from 2008 to 2021, and served as a Distinguished Lecturer of the IEEE Computer Society representing Europe (2011-2013). He was the Editor-in-Chief of Engineering Applications of Artificial Intelligence (EAAI), from 2016 to 2021, and serves/served on the editorial boards of over 15 international journals indexed by Thomson ISI.

LUBNA ABDELKAREIM GABRALLA received the B.Sc. and M.Sc. degrees in computer science from the University of Khartoum, and the Ph.D. degree in computer science from the Sudan University of Science and Technology, Khartoum, Sudan. She is currently an Associate Professor with the Department of Computer Science and Information Technology, Princess Nourah Bint Abdulrahman University, Saudi Arabia. Her current research interests include soft computing, machine learning, and deep learning. She became a Senior Fellow (SFHEA) in 2021.

V. SUBRAMANIYASWAMY received the B.E. degree in computer science and engineering and the M.Tech. degree in information technology from Bharathidasan University, India, and Sathyabama University, India, and the Ph.D. degree from Anna University, India. He continued the extension work with Department of Science and Technology support as a Young Scientist Award holder. He is currently working as a Professor with SASTRA Deemed University, Thanjavur, India. In total, he has 18 years of experience in academia. He has contributed more than 160 papers and chapters to many high-quality Scopus and SCI/SCIE indexed journals and books. He is on the reviewer board of several international journals and has been a program committee member for several international/national conferences and workshops. He also serves as a guest editor for various special issues of reputed international journals. He is serving as a research supervisor and a visiting expert to various universities in India. His technical competencies lie in recommender systems, social networks, the Internet of Things, information security, and big data analytics.