Development of An End-To-End Deep Learning Framework For Sign Language Recognition, Translation, and Video Generation
AND V. SUBRAMANIYASWAMY 1
1 School of Computing, SASTRA Deemed University, Thanjavur 613401, India
2 Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Pune 412115, India
3 Machine Intelligence Research Laboratories (MIR Labs), Auburn, WA 98071, USA
4 Department of Computer Science and Information Technology, College of Applied, Princess Nourah Bint Abdul Rahman University, Riyadh 11671, Saudi Arabia
ABSTRACT Recent developments in deep learning techniques have reached new heights in various domains and applications. However, the recognition, translation, and video generation of Sign Language (SL) still face major challenges from a development perspective. Although numerous advancements have been made by earlier approaches, model performance is still limited in recognition accuracy and visual quality. In this paper, we introduce novel approaches for developing a complete framework that handles SL recognition, translation, and production tasks in real-time settings. To achieve higher recognition accuracy, we use the MediaPipe library and a hybrid Convolutional Neural Network + Bi-directional Long Short Term Memory (CNN + Bi-LSTM) model for pose detail extraction and text generation. On the other hand, the production of sign gesture videos for given spoken sentences is implemented using a hybrid Neural Machine Translation (NMT) + MediaPipe + Dynamic Generative Adversarial Network (GAN) model. The proposed model addresses the various complexities present in existing approaches and achieves above 95% classification accuracy. In addition, the model performance is tested in various phases of development, and the evaluation metrics show noticeable improvements for our model. The model has been evaluated on different multilingual benchmark sign corpora and produces better results in terms of recognition accuracy and visual quality. The proposed model has secured a 38.06 average Bilingual Evaluation Understudy (BLEU) score, remarkable human evaluation scores, a 3.46 average Fréchet Inception Distance to videos (FID2vid) score, 0.921 average Structural Similarity Index Measure (SSIM) values, an 8.4 average Inception Score, a 29.73 average Peak Signal-to-Noise Ratio (PSNR) score, a 14.06 average Fréchet Inception Distance (FID) score, and an average 0.715 Temporal Consistency Metric (TCM) score, which together evidence the effectiveness of the proposed work.
INDEX TERMS Deep learning, generative adversarial networks, sign language recognition, sign language
translation, video generation.
Although numerous developments have been made for Chinese Sign Language (CSL), American Sign Language (ASL), and German Sign Language (GSL), the performance of existing models still lags in continuous cases and fails to handle real-time inputs.

The proposed H-DNA system facilitates real-time and accurate recognition of multimodal and multilingual sign gestures. It opens an opportunity for developing robust applications that handle the sign languages of various countries and provides solutions for the communication gap that exists between the hearing and the impaired communities. The proposed work has been developed as a User Interface (UI) application for handling multilingual inputs, recognizing multimodal sign gestures, generating sign videos, and providing accurate results on both translation and recognition tasks. To meet these expectations, the model underwent various stages of development to handle multimodal features and the variations of multilingual sign corpora. The proposed model has been trained using 40K videos for continuous recognition and 35K images for word-level recognition tasks. The proposed system explores solutions for real-time interaction between hard-of-hearing or speech-impaired people and hearing people.

The detailed investigation and refinement of Convolutional Neural Network (CNN), Long Short Term Memory (LSTM), Gated Recurrent Unit (GRU), and Generative Adversarial Network (GAN) models yield better translation results and generate high-quality photorealistic videos. Sign language plays a vital role in the communication of the hard-of-hearing and speech-impaired community because of the barriers its members face in reading and writing the native spoken language. Although various studies have dealt with SLRT research, the earlier developments have their own limitations and still cannot be used for continuous cases. Some of the research has been successful at recognizing sign language, but it requires expensive setups and sensor devices. The tracking and recognition of specialized multimodal gesture signs is crucial, especially when recognizing signs of different languages (multilingual settings). Most research on SL recognition focuses on translating sign gestures into English sentences and producing text transcriptions for sequences of signs. This is due to the misconception that deaf people are comfortable with reading spoken language and therefore do not require translation into sign language. To facilitate easy and clear communication between the hearing and the impaired communities, it is vital to build robust systems that can translate spoken languages into sign languages and vice versa. This two-way process can be facilitated using sign language recognition, translation, and video generation. With this motivation, the proposed approach intends to develop a novel H-DNA framework for SL recognition and translation as well as to enhance interactive communication between the hearing and the impaired communities.

To the best of our knowledge, the proposed H-DNA is the first unified deep learning framework that addresses two different problem dimensions in SL: Sign Language Recognition (SLR) and Sign Language Translation (SLT). Using Neural Machine Translation (NMT), the MediaPipe library, and a Dynamic GAN, the proposed H-DNA generates high-resolution videos. The proposed work simplifies the translation of spoken text into subunit signs and then defines the mapping between glosses and sign gesture images using the OpenPose library. Further, the SL videos are produced using a DynamicGAN model. On the other hand, using CNN, LSTM, and the MediaPipe library, the proposed H-DNA recognizes multilingual datasets comprising isolated signs and continuous sign sentences by considering multimodal features. The H-DNA was developed and implemented on GPU-powered workstations. The collection of benchmark datasets and the recording of our own datasets are carried out as the first steps of the implementation. To evaluate the performance of the proposed H-DNA, the experimentation is performed in three folds: the first fold deals with SL recognition, and the second fold focuses on SL video generation. The SL recognition model achieves an accuracy of not less than 98% and shows the improved performance of the proposed H-DNA. Criteria such as robustness, flexibility, and scalability are considered in the third fold. We summarize the overall objectives of the proposed work as follows:
• To create and integrate heterogeneous data sources and to build a novel knowledge base consisting of multilingual and multimodal sign sentences with minimal sign glosses and skeletal-level annotations by breaking down the signs into dedicated subunits.
• To augment and generate sign videos based on subunits from spoken language sentences to facilitate communication between the hearing and the impaired (hard-of-hearing and speech-impaired) communities.
• To track and recognize signs consisting of isolated words and continuous sign sentences, including manual (one-handed and two-handed) and non-manual gestures, in real-time scenarios.
• To build a novel application with end-to-end video generation and recognition capabilities by sharing the qualitative and quantitative results of generated sign sequences, without using animated avatars or sensors, and to ensure accuracy at minimal cost.
The remainder of the paper is organized as follows. Section 2 investigates earlier developments, identifies the research gaps in SLRT research, and reviews the advancements in the various phases of development. The proposed system is explained in Section 3 with sufficient details about the model development. The experimental outcomes of the proposed model are shown in Section 4, and finally, the conclusion and future work section summarizes the proposed work.

II. RELATED WORK
Sign language communication explores the power of human intelligence through hand actions and movements.
Despite relying on a single primary component (the hand), it involves numerous human upper-body components such as head, mouth, and gaze movements to provide a real understanding of gesture sequences in real time. Sign languages are made up of visual actions and do not have a unique pattern that identifies their motion sequences; they follow different styles based on each country's nature and culture. Understanding and processing such inputs is extremely difficult for traditional machine learning approaches. Sign language mainly supports the hard-of-hearing and speech-impaired community in obtaining benefits such as education and employment and in engaging in societal activities. There have been numerous research efforts to produce better translation models. The real-time recognition and translation of sign languages requires careful investigation of various features to produce plausible output without misclassification or wrong sign output.

Deep learning approaches have progressed to new heights and produce impressive results in computer vision and human action recognition applications. The introduction of hybrid models and ensembling techniques advances the capabilities of such models to handle tedious tasks. Recent research on CNN, LSTM, GRU, and GAN techniques has been investigated in relation to SL recognition, translation, and video generation tasks and helps to introduce the novel contributions needed to build a powerful framework.

Barbhuiya et al. [36] proposed CNN+SVM-based hand gesture recognition methods for static signs; this approach mainly deals with alphabets and numerals. Aly et al. [37] proposed a system for handling the words of Arabic SL using DeepLabv3+ gesture segmentation techniques and Bi-LSTM. Lee et al. [38] proposed an ASL recognition system for 26 alphabet-level sign gestures that uses LSTM with KNN techniques to provide higher recognition results; this work deals with word-level sign language communication. In addition, Xiao et al. [39] introduced continuous SL recognition using NMT approaches. Elakkiya et al. [40] proposed an SL recognition framework using GAN + 3D-CNN + LSTM techniques; this approach utilizes a deep reinforcement learning based evaluation strategy to produce highly accurate results. The details of the earlier literature are summarized in Table 1 to explore the new advancements in different sign languages such as American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (GSL). The conventional sensor-based approaches demand extra equipment to be worn by the signer. The use of data gloves, color gloves, depth cameras, and leap motion controllers creates additional overhead for the signer to communicate normally and poses huge limitations [41]. Although such setups give good prediction results, they drastically lose their applicability in real-time scenarios. In addition, they create discomfort for children and hearing interlocutors during a conversation.

The optimization of hyperparameter values and the imposition of various constraints produce plausible outcomes and attract researchers. An early CNN-based model introduced by Chen and Koltun [43] produces images from semantic layouts. The model investigates different loss functions and produces photographic results; however, its performance bottlenecks when handling large-scale images and it adds various intrinsic challenges. Similarly, Oord et al. [44] discussed the development of gating-mechanism-based PixelCNN models for image generation, evaluated using the CIFAR-10 and ImageNet datasets. Since the model applies different conditions on embedding features to produce quality image generation results, extending its performance to videos creates additional overheads. Although numerous advancements were made in the research work [45], the produced sign gesture videos are blurred and their spatial details are incoherent.

Models such as FUNIT [46], StarGAN [47], StarGAN v2 [48], MoCoGAN [49], LPGAN
[50], InfoGAN [51], pix2pix [52], and CycleGAN [53] handle image generation and video production tasks efficiently. Since SL communication involves various manual and non-manual cues of humans and their facial, eye, gaze, and mouth expressions, it demands advancements over the earlier approaches. In addition, the ordering of gesture sequences differs greatly from the English sentence order. To address these aforementioned challenges, we introduce a novel approach for aligning the frame sequences and generating the intermediary frames between the sign gesture images. The proposed model deals with the various nuances of SL gestures and their components and produces plausible outcomes. GAN networks are found to be highly capable of producing plausible results across a wide range of domains, with applications such as security [54], [55], baggage inspection [56], infected leaf identification [57], COVID-19 prediction [58], agriculture [59], business process monitoring [60], brain MRI synthesis [61], flood visualization [62], estimating the standards of gold [63], ECG wave synthesis [64], the Internet of Things (IoT) [65], and dengue fever sampling [66].

The two major components of GAN networks are the generator and the discriminator, and they play a vital role in image and video generation. The discriminating capability of the discriminator helps to produce high-quality videos in diverse domains and is further investigated in the proposed work for the qualitative production of sign gesture videos. The incorporation of CNN models into conditional GAN networks produces drastic improvements in video generation quality and efficiently handles the various traits of detail present in an image or video. Based on the discriminator's classification, the generator network undergoes fine-tuned training to produce photorealistic results. Mirza and Osindero [67] introduced the conditional GAN network model by applying constraints on label information; we use this approach in our work to produce videos based on the conditioned labeling approach. The advent of DynamicGAN models addresses the existing challenges by using strided convolution operations to produce improved results. The video generation process using GAN networks encompasses additional approaches to produce photo-realistic videos and keeps the coherent spatial details clear.
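To make the conditioning idea concrete, the sketch below shows a minimal conditional generator and discriminator pair in the spirit of Mirza and Osindero [67], with the strided (transposed) convolutions mentioned above. The layer widths, the 32 × 32 output resolution, and the class count are illustrative assumptions rather than the published H-DNA configuration.

```python
# Minimal conditional GAN sketch (PyTorch). Layer widths, 32x32 output size,
# and the number of gloss classes are illustrative assumptions only.
import torch
import torch.nn as nn

NUM_CLASSES = 320   # assumed number of sign gloss classes
LATENT_DIM = 100

class CondGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(NUM_CLASSES, NUM_CLASSES)
        self.net = nn.Sequential(
            # project (noise + label embedding) to a 4x4 map, then upsample
            # with strided transposed convolutions
            nn.ConvTranspose2d(LATENT_DIM + NUM_CLASSES, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),   # 32x32 RGB frame
        )

    def forward(self, z, labels):
        cond = self.label_emb(labels)                        # (B, NUM_CLASSES)
        x = torch.cat([z, cond], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(x)

class CondDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(NUM_CLASSES, 32 * 32)
        self.net = nn.Sequential(
            # strided convolutions downsample; Leaky ReLU as described in the text
            nn.Conv2d(3 + 1, 64, 4, 2, 1), nn.LeakyReLU(0.1, True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.1, True),
            nn.Conv2d(128, 1, 8, 1, 0),                      # realness logit
        )

    def forward(self, img, labels):
        cond = self.label_emb(labels).view(-1, 1, 32, 32)    # label as an extra channel
        return self.net(torch.cat([img, cond], dim=1)).view(-1)

if __name__ == "__main__":
    z = torch.randn(4, LATENT_DIM)
    y = torch.randint(0, NUM_CLASSES, (4,))
    fake = CondGenerator()(z, y)            # (4, 3, 32, 32)
    logit = CondDiscriminator()(fake, y)    # (4,)
    print(fake.shape, logit.shape)
```

The key design point is that both networks see the gloss label, so the generator is pushed toward frames that are not only realistic but also consistent with the conditioned sign class.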
Although there is enormous research going on in the field of sign language recognition, translation, and generation, the existing systems still face a lot of challenges. The primary challenge for sign language recognition and generation systems is the lack of a large-scale, open-source Indian Sign Language dataset with natural conditions. To overcome this issue, we have developed a multi-signer, multi-modal sign language dataset and have provided it as an open-source resource for further research. For sign language recognition, we have built the recognition model in such a way that it detects signs irrespective of complex backgrounds, multi-modality, signer skin tone, signer clothing constraints, sign speed, etc., which are the major drawbacks of the existing systems. In the case of sign generation, we have considered the limitations of small-size vocabularies, model performance improvement, low model complexity, proper alignment of the key points, signs in the spatial domain, etc.

III. THE PROPOSED H-DNA SYSTEM
The proposed hybrid H-DNA framework comprises various phases of development, such as SL recognition, conversion of multilingual sentences into sign words, pose estimation using MediaPipe, and SL video generation. The proposed H-DNA framework aims to integrate all these modules and provide a real-time solution to SLRT research challenges. Neural Machine Translation (NMT) is the process of translating sentences from one language into another; it uses artificial neural networks to yield highly accurate translations. The identification of human poses in images or videos is performed using the MediaPipe library, which helps to predict human poses in various environments. Pose estimation is based on a number of key points on the human body and uses the Part Affinity Fields approach. The VGG-19 model is used for classifying the different gesture styles. It uses 3 × 3 filters in its convolution layers. The convolution layers provide feature maps by scanning the image features, and the role of the pooling layers is to reduce the information generated by the convolution layers. To vectorize the output as a single array, a fully connected layer is used. The incorporation of the Dynamic GAN [86] provides high-quality video generation results by encompassing approaches such as frame generation and video completion. The LSTM network is used for predicting the text equivalents of the sign gestures and further helps to produce the language sentences. The following subsections explore the various technical details and summarize the strength of each technique.

This section explains the implementation details of the proposed H-DNA framework. In the first fold, we developed the SL recognition model using the MediaPipe library and the VGG-19 model. Furthermore, we incorporate the Bi-LSTM network for text generation. In SLR, the input of continuous gesture sequences is processed by the MediaPipe library to capture the pose sequences, angles between fingers, hand movements, locations, orientations, mouth expressions, and facial actions. Based on these key points, the VGG-19 model estimates the class of gestures. The incorporation of CNN and LSTM networks in such a hybrid way produces higher recognition accuracy and noticeable performance. The temporal details are analyzed sequentially to predict the translation text without misclassification.
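As a rough illustration of this first fold, the sketch below extracts holistic keypoints per frame with the MediaPipe library and feeds the keypoint sequence to a small Bi-LSTM classifier. The sequence length, feature layout, and class count are assumptions for illustration; the actual H-DNA pipeline additionally uses a VGG-19/CNN branch as described above.

```python
# Sketch of the recognition fold: MediaPipe keypoints -> Bi-LSTM classifier.
# Sequence length, feature size, and class count are illustrative assumptions.
import cv2
import numpy as np
import mediapipe as mp
import tensorflow as tf

SEQ_LEN, NUM_CLASSES = 30, 320          # assumed values
mp_holistic = mp.solutions.holistic

def keypoints_from_video(path):
    """Return a (SEQ_LEN, 1662) array of pose/face/hand keypoints per frame."""
    cap, frames = cv2.VideoCapture(path), []
    with mp_holistic.Holistic(static_image_mode=False) as holistic:
        while len(frames) < SEQ_LEN:
            ok, frame = cap.read()
            if not ok:
                break
            res = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            pose = (np.array([[p.x, p.y, p.z, p.visibility] for p in res.pose_landmarks.landmark]).flatten()
                    if res.pose_landmarks else np.zeros(33 * 4))
            face = (np.array([[p.x, p.y, p.z] for p in res.face_landmarks.landmark]).flatten()
                    if res.face_landmarks else np.zeros(468 * 3))
            lh = (np.array([[p.x, p.y, p.z] for p in res.left_hand_landmarks.landmark]).flatten()
                  if res.left_hand_landmarks else np.zeros(21 * 3))
            rh = (np.array([[p.x, p.y, p.z] for p in res.right_hand_landmarks.landmark]).flatten()
                  if res.right_hand_landmarks else np.zeros(21 * 3))
            frames.append(np.concatenate([pose, face, lh, rh]))
    cap.release()
    # pad short clips with zeros so every sample has SEQ_LEN steps
    while len(frames) < SEQ_LEN:
        frames.append(np.zeros(1662))
    return np.stack(frames)

def build_bilstm_classifier():
    """Bi-LSTM over the keypoint sequence, softmax over gloss classes."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(SEQ_LEN, 1662)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
```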
We trained our model using 40,000 videos covering 320 classes to provide wide support for multilingual sign corpora comprising multimodal features. Sample gesture images from our own ISL-CSLTR dataset [42] are shown in Figure 2; the dataset strongly supports ISL-related SLRT research. In general, the SL video generation process is treated as a highly intensive task due to the production of sign gesture
The current input value (x_t) and the weight (W) values are multiplied in the first part, and the second part multiplies the previous hidden state values (h_{t−1}) by their weights (U); finally, the values are summed to provide the new values for the update gate. The sigmoid (σ) activation function is applied over the resultant values to squash them into the range of zero to one. The update gate decides the volume of information to be passed to the next state. The reset gate decides the removal of information based on the importance of a particular vector towards the prediction of the next sequences. The execution of the reset gate is demonstrated in Equation 8 as follows.

r_t = σ(W^(r) x_t + U^(r) h_{t−1})    (8)

The reset gate (r_t) combines the results of the multiplication operation performed on the input (x_t) and weight (W^(r)) values as well as the previous hidden node values (h_{t−1}) and their weight values (U^(r)). The sigmoid activation is applied to the results. The current values (h_cur) to be stored in the memory unit are computed using Equation 9.

h_cur = tanh(W x_t + r_t ⊙ U h_{t−1})    (9)

The current and previous node values are multiplied with weight values. The Hadamard product (⊙), known as element-wise multiplication, is performed between the reset gate and the previous hidden state values. Finally, the non-linear activation function tanh is computed on the outcome. The final state recorded in the memory unit (h_f) at time step t is computed using Equation 10.

h_f = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h_cur    (10)
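For concreteness, the following minimal sketch implements the gate computations of Equations 8–10 for a single GRU step. The hidden size and the random weight initialization are placeholder assumptions; a production model would use a library GRU cell instead.

```python
# Minimal single-step GRU cell following Equations (8)-(10).
# Dimensions and random initialization are illustrative only.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class GRUCell:
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # update gate (z), reset gate (r), and candidate-state weights
        self.W_z, self.U_z = rng.normal(size=(hidden_dim, input_dim)), rng.normal(size=(hidden_dim, hidden_dim))
        self.W_r, self.U_r = rng.normal(size=(hidden_dim, input_dim)), rng.normal(size=(hidden_dim, hidden_dim))
        self.W_h, self.U_h = rng.normal(size=(hidden_dim, input_dim)), rng.normal(size=(hidden_dim, hidden_dim))

    def step(self, x_t, h_prev):
        z_t = sigmoid(self.W_z @ x_t + self.U_z @ h_prev)              # update gate
        r_t = sigmoid(self.W_r @ x_t + self.U_r @ h_prev)              # Eq. (8): reset gate
        h_cur = np.tanh(self.W_h @ x_t + self.U_h @ (r_t * h_prev))    # Eq. (9): candidate state
        h_f = z_t * h_prev + (1.0 - z_t) * h_cur                       # Eq. (10): final state
        return h_f

# one step over a toy input
cell = GRUCell(input_dim=8, hidden_dim=4)
h = cell.step(np.ones(8), np.zeros(4))
```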
The deep stacked approach provides better results over a wider range of applications and drastically reduces the computational complexity of the model. The deep stacked GRU has several GRU blocks and performs the model training in parallel. The detailed structure of the deep stacked GRU units is depicted in Figure 6.

Further, we incorporate the attention mechanism proposed by Bahdanau et al. [71]. The attention mechanism focuses on the particular encoder context that matches the target translation to yield high-quality results. The cyclic execution of the deep stacked GRU units is shown in Figure 7.

The GRU units process the spoken sentence inputs using an encoder–decoder approach. The encoder network of the GRU processes the source format of the input sentences, and the decoder translates them into the target form. We apply the Bahdanau et al. [71] attention mechanism to compute distinct context vector values and obtain good results. The recursive nature of the GRU processes the entire source sentence and translates it into the target sentence. We use a beam size of 10 and the tanh and sigmoid activation functions. The proposed model processes 40k sentences in total, combining multilingual sign corpora collected from different sources.

FIGURE 7. Execution flow of the GRU-based encoder–decoder system.

The dense vector values of each word are passed to a feed-forward neural network to learn the source representation. The proposed hybrid NMT model handles varying sentence lengths and adapts the translation results accordingly. The score values produced by the networks are further processed by the softmax function to yield attention weights. The context vectors are calculated by multiplying the attention weight values and the hidden state values.

We incorporated the attention mechanism proposed by Bahdanau et al. [71] to yield accurate translation results. The attention vector is estimated by concatenating the context vector and the previous output. Finally, the decoder network produces the target sign gloss output. The proposed hybrid NMT + Attention model is evaluated using three benchmark sign corpus datasets, namely the RWTH-PHOENIX-Weather 2014T dataset [72], the How2Sign dataset [73], and the ISL-CSLTR dataset [42], and the results are shown in Section 4. The computation of attention weights is done using Equation 11.

α_{ts} = exp(score(h_t, h̄_s)) / Σ_{s'=1}^{S} exp(score(h_t, h̄_{s'}))    (11)

The context vector is calculated using Equation 12.

c_t = Σ_s α_{ts} h̄_s    (12)

Bahdanau's attention vector is calculated using Equation 13.

a_t = f(c_t, h_t) = tanh(W_c [c_t; h_t])    (13)

The proposed deep stacked GRU algorithm uses stacked layers of GRUs to effectively process the sequential inputs, together with a generator network to produce the photo-realistic, high-quality sign gesture videos. The integrated architecture for translating the multilingual sentences into sign video generation is shown in Figure 9.
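The sketch below works through Equations 11–13 for one decoder step. The dimensions, the dot-product score function, and the random weights are assumptions for illustration; Bahdanau et al. [71] use an additive score in the full model.

```python
# One decoder step of attention following Equations (11)-(13).
# Dot-product score and random weights are illustrative assumptions.
import numpy as np

def attention_step(h_t, h_bar, W_c):
    """h_t: decoder state (d,); h_bar: encoder states (S, d); W_c: (d, 2d)."""
    scores = h_bar @ h_t                                # score(h_t, h̄_s) for every source position s
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                         # Eq. (11): attention weights α_ts (softmax)
    c_t = alpha @ h_bar                                 # Eq. (12): context vector c_t
    a_t = np.tanh(W_c @ np.concatenate([c_t, h_t]))     # Eq. (13): attention vector a_t
    return alpha, c_t, a_t

d, S = 4, 6
rng = np.random.default_rng(0)
alpha, c_t, a_t = attention_step(rng.normal(size=d),
                                 rng.normal(size=(S, d)),
                                 rng.normal(size=(d, 2 * d)))
```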
Batch normalization and the Leaky ReLU activation function are applied sequentially to produce plausible outcomes. The discriminator network estimates the realness of the generated results. It uses a sigmoid cross-entropy loss function to measure the quality of the generated results compared with the real ones. The proposed Dynamic GAN model is implemented in a high-end GPU-based environment. A Dell Precision 7820 Tower workstation is used to accomplish the entire development process; it comprises a pair of Intel Xeon Silver 4210 2.2 GHz processors with 10 cores each. An Nvidia Quadro RTX 4000 provides GPU support for model training. We use batch normalization and Adam optimization with α = 2e-4, β1 = 0.5, and β2 = 0.999. We set the batch size to 128, the dropout to 0.01, and the initial learning rate to 0.01. The Leaky ReLU slope is set to 0.1, and ReLU activation functions are applied further. The mini-batch size is set to 100 and the momentum to 0.05. The proposed DynamicGAN framework is evaluated using multilingual sign corpora such as the RWTH-PHOENIX-Weather 2014T dataset, the ISL-CSLTR dataset, and the How2Sign dataset. The results are shown in Section 4.

The model is trained and validated to produce better results. We inputted 25k images for training and 5k images for validation. The model performance is shown in Figure 11. The proposed model achieves significant improvements in classification accuracy and recognition performance. Furthermore, we plot the confusion matrix to assess the classification performance; the confusion matrix results are shown in Figure 12. This demonstrates the improved performance of the proposed hybrid CNN-LSTM model.

FIGURE 11. Accuracy and loss evaluation of the hybrid CNN-LSTM model.
We use the Mean Squared Error (MSE) metric to evaluate the loss values of the generator network outcomes, as stated in Equation 14.

L_MSE(x_t, g_n) = ℓ_MSE(G(x_t), g_n) = ‖G(x_t) − g_n‖²    (14)

The sigmoid cross-entropy loss combines the sigmoid activation function with the cross-entropy loss function. Because these loss terms are computed independently, the result of one does not affect the other.
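A minimal sketch of the two losses described here is given below, assuming a PyTorch setup: an MSE reconstruction term for the generator (Equation 14) and a sigmoid cross-entropy (binary cross-entropy with logits) term for the adversarial part. The `netG`/`netD` models, tensor names, and the weighting between the terms are illustrative assumptions.

```python
# Generator MSE loss (Eq. 14) plus sigmoid cross-entropy adversarial losses.
# Tensor shapes and the 'netG'/'netD' models are placeholders.
import torch
import torch.nn.functional as F

def generator_loss(netG, netD, pose_input, real_frame, adv_weight=1.0):
    fake_frame = netG(pose_input)
    mse = F.mse_loss(fake_frame, real_frame)                       # Eq. (14): ||G(x_t) - g_n||^2
    logits_fake = netD(fake_frame)
    adv = F.binary_cross_entropy_with_logits(                      # sigmoid cross-entropy;
        logits_fake, torch.ones_like(logits_fake))                 # generator wants "real" labels
    return mse + adv_weight * adv

def discriminator_loss(netD, real_frame, fake_frame):
    logits_real = netD(real_frame)
    logits_fake = netD(fake_frame.detach())                        # do not backprop into G here
    loss_real = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
    loss_fake = F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    return loss_real + loss_fake
```

With the Adam settings quoted above (α = 2e-4, β1 = 0.5, β2 = 0.999), these two losses would be minimized alternately in the usual GAN training loop.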
FIGURE 12. Confusion matrix results of the hybrid CNN-LSTM model.

E. DATASET
RWTH-PHOENIX-Weather 2014T dataset: This dataset supports SLRT research for German Sign Language [72]. It consists of 40k sentence-level videos recorded by 9 native signers.
TABLE 2. Comparison of existing SL recognition models with the hybrid CNN-LSTM model.
TABLE 3. Hybrid CNN-LSTM model performance comparison using existing frameworks.
F1 Score = 2 × (Recall × Precision) / (Recall + Precision)    (17)

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (18)
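As a quick worked example of Equations 17 and 18 (with precision and recall computed in the usual way), the snippet below uses hypothetical confusion-matrix counts rather than values from our experiments.

```python
# Worked example of Equations (17)-(18) with hypothetical counts.
TP, FP, TN, FN = 950, 20, 940, 30            # assumed confusion-matrix entries

precision = TP / (TP + FP)                                    # ~0.979
recall = TP / (TP + FN)                                       # ~0.969
f1_score = 2 * (recall * precision) / (recall + precision)    # Eq. (17)
accuracy = (TP + TN) / (TP + FP + TN + FN)                    # Eq. (18)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1_score:.3f} accuracy={accuracy:.3f}")
```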
The performance of the hybrid NMT + Attention model is evaluated using the BLEU metric, as depicted in Figure 13, which compares the proposed hybrid NMT + Attention model with existing work. Further, the performance of the hybrid NMT + Attention model is analyzed using the attention plots depicted in Figure 14. The attention plots show the actual translation behavior of the model by comparing the source and target sentences. The blocks highlighted in white represent the contribution of the attention mechanism in the context of a particular word translation.
FIGURE 14. (a), (b) Attention plot results for ISL-CSLTR dataset,
(c) How2Sign dataset, (d) RWTH-PHOENIX-Weather 2014T dataset.
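As an aside on how such a score can be reproduced, the snippet below computes a corpus-level BLEU score with NLTK for a toy pair of reference and hypothesis gloss sequences; the sentences are invented placeholders, not taken from the evaluated corpora.

```python
# Corpus-level BLEU for predicted gloss sequences (toy placeholder data).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["weather", "tomorrow", "rain", "heavy"]]]    # one list of references per sentence
hypotheses = [["weather", "tomorrow", "rain", "strong"]]     # model output tokens

bleu = corpus_bleu(references, hypotheses,
                   smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {100 * bleu:.2f}")
```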
TABLE 7. Comparison of the Inception Score of the Dynamic GAN model.
TABLE 9. Fréchet Inception Distance (FID) metric evaluation for the multilingual sign corpus.
FIGURE 18. Sample UI-based sign gesture recognition results using the H-DNA framework.

FIGURE 19. Sample UI application results for video generation using the H-DNA framework.

V. CONCLUSION
This paper contributes to the development of a deep learning framework for end-to-end sign language recognition, translation, and generation. We addressed the challenges that persist in earlier SL recognition and video generation approaches using the proposed H-DNA framework. We evaluated the model performance quantitatively and qualitatively using the RWTH-PHOENIX-Weather 2014T dataset, the How2Sign dataset, and the ISL-CSLTR dataset. The proposed H-DNA framework is also evaluated qualitatively using various quality metrics, and the generated video frames show the quality of the outcome of our work. We achieved a comparatively higher recognition rate and generation performance than earlier approaches. The proposed model has achieved above 95% classification accuracy for SL recognition, a 38.56 average BLEU score, remarkable human evaluation scores, a 3.46 average FID2vid score, 0.921 average SSIM values, an 8.4 average Inception Score, a 29.73 average PSNR score, a 14.06 average FID score, and an average 0.715 TCM score. These scores are notably higher than those of earlier models. The evaluation of realism, relevance, and coherence factors is carried out by employing human evaluators and produces good results in real-time scenarios.

FUNDING SUPPORT
This work was financially supported by the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R178), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

ACKNOWLEDGMENT
The research project was sanctioned by the Science and Engineering Research Board (SERB), India, under the Start-up Research Grant (SRG/2019/001338). The authors thank SASTRA Deemed University for providing infrastructural support to conduct the research. They thank all the students for their contribution in collecting the sign videos and the successful completion of the ISL-CSLTR corpus. They would also like to thank Navajeevan, Residential School for the Deaf, College of Spl. D.Ed. and B.Ed., Vocational Centre.

REFERENCES
[1] G. Delnevo, R. Girau, C. Ceccarini, and C. Prandi, "A deep learning and social IoT approach for plants disease prediction toward a sustainable agriculture," IEEE Internet Things J., vol. 9, no. 10, pp. 7243–7250, May 2021.
[2] I. Siniosoglou, P. Radoglou-Grammatikis, G. Efstathopoulos, P. Fouliras, and P. Sarigiannidis, "A unified deep learning anomaly detection and classification approach for smart grid environments," IEEE Trans. Netw. Service Manage., vol. 18, no. 2, pp. 1137–1151, Jun. 2021.
[3] G. Vallathan, A. John, C. Thirumalai, S. Mohan, G. Srivastava, and J. C.-W. Lin, "Suspicious activity detection using deep learning in secure assisted living IoT environments," J. Supercomput., vol. 77, no. 4, pp. 3242–3260, Apr. 2021.
[4] S. Ramaswamy and N. DeClerck, "Customer perception analysis using deep learning and NLP," Proc. Comput. Sci., vol. 140, pp. 170–178, Jan. 2018.
[5] V. Pasquadibisceglie, A. Appice, G. Castellano, and D. Malerba, "A multi-view deep learning approach for predictive business process monitoring," IEEE Trans. Services Comput., vol. 15, no. 4, pp. 2382–2395, Jul. 2021.
[6] T. Islam, T. A. Chisty, and A. Chakrabarty, "A deep neural network approach for crop selection and yield prediction in Bangladesh," in Proc. IEEE Region 10th Hum. Technol. Conf. (R-HTC), Dec. 2018, pp. 1–6.
[7] J. Williams, P. Dryburgh, A. Clare, P. Rao, and A. Samal, "Defect detection and monitoring in metal additive manufactured parts through deep learning of spatially resolved acoustic spectroscopy signals," Smart Sustain. Manuf. Syst., vol. 2, no. 1, Nov. 2018, Art. no. 20180035.
[8] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnol., vol. 33, no. 8, pp. 831–838, Aug. 2015.
[9] M. Reichstein, G. Camps-Valls, B. Stevens, M. Jung, J. Denzler, N. Carvalhais, and Prabhat, "Deep learning and process understanding for data-driven earth system science," Nature, vol. 566, no. 7743, pp. 195–204, Feb. 2019.
[10] A. Roy, J. Sun, R. Mahoney, L. Alonzi, S. Adams, and P. Beling, "Deep learning detecting fraud in credit card transactions," in Proc. Syst. Inf. Eng. Design Symp. (SIEDS), Apr. 2018, pp. 129–134.
[11] P. Bellot, G. de los Campos, and M. Pérez-Enciso, "Can deep learning improve genomic prediction of complex human traits?" Genetics, vol. 210, no. 3, pp. 809–819, Nov. 2018.
[12] D. Liciotti, M. Bernardini, L. Romeo, and E. Frontoni, "A sequential deep learning application for recognising human activities in smart homes," Neurocomputing, vol. 396, pp. 501–513, Jul. 2020.
[13] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, "PCANet: A simple deep learning baseline for image classification?" IEEE Trans. Image Process., vol. 24, no. 12, pp. 5017–5032, Dec. 2015.
[14] Y. Lin, H. Lei, P. Clement Addo, and X. Li, "Machine learned resume-job matching solution," 2016, arXiv:1607.07657.
[15] N. J. Cronin, T. Rantalainen, J. P. Ahtiainen, E. Hynynen, and B. Waller, "Markerless 2D kinematic analysis of underwater running: A deep learning approach," J. Biomech., vol. 87, pp. 75–82, Apr. 2019.
[16] W. Zhang, L. Sun, X. Wang, Z. Huang, and B. Li, "SEABIG: A deep learning-based method for location prediction in pedestrian semantic trajectories," IEEE Access, vol. 7, pp. 109054–109062, 2019.
[17] R. Elakkiya, "Machine learning based intelligent automated neonatal epileptic seizure detection," J. Intell. Fuzzy Syst., vol. 40, no. 5, pp. 1–9, 2021.
[18] R. Elakkiya, K. S. S. Teja, L. J. Deborah, C. Bisogni, and C. Medaglia, "Imaging based cervical cancer diagnostics using small object detection-generative adversarial networks," Multimedia Tools Appl., vol. 81, pp. 1–17, Jan. 2021.
[19] R. Elakkiya, P. Vijayakumar, and M. Karuppiah, "COVID_SCREENET: COVID-19 screening in chest radiography images using deep transfer stacking," Inf. Syst. Frontiers, vol. 23, no. 6, pp. 1–15, 2021.
[20] G. Padmapriya, R. Elakkiya, and M. Prakash, "Deep learning based Parkinson's disease prediction system," in Machine Learning and IoT for Intelligent Systems and Smart Applications. Boca Raton, FL, USA: CRC Press, 2021, pp. 97–111.
[21] R. Vinayakumar, K. P. Soman, and P. Poornachandran, "Applying deep learning approaches for network traffic prediction," in Proc. Int. Conf. Adv. Comput., Commun. Informat. (ICACCI), Sep. 2017, pp. 2353–2358.
[22] R. N. Babu, V. Sowmya, and K. P. Soman, "Indian car number plate recognition using deep learning," in Proc. 2nd Int. Conf. Intell. Comput., Instrum. Control Technol. (ICICICT), Jul. 2019, pp. 1269–1272.
[23] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11, pp. 3212–3232, Nov. 2019.
[24] K. T. P. Nguyen and K. Medjaher, "A new dynamic predictive maintenance framework using deep learning for failure prognostics," Rel. Eng. Syst. Saf., vol. 188, pp. 251–262, Aug. 2019.
[25] K. Orita, K. Sawada, R. Koyama, and Y. Ikegaya, "Deep learning-based quality control of cultured human-induced pluripotent stem cell-derived cardiomyocytes," J. Pharmacol. Sci., vol. 140, no. 4, pp. 313–316, Aug. 2019.
[26] N. Sünderhauf, O. Brock, W. Scheirer, R. Hadsell, D. Fox, J. Leitner, B. Upcroft, P. Abbeel, W. Burgard, M. Milford, and P. Corke, "The limits and potentials of deep learning for robotics," Int. J. Robot. Res., vol. 37, nos. 4–5, pp. 405–420, Apr. 2018.
[27] R. Singh and S. Srivastava, "Stock prediction using deep learning," Multimedia Tools Appl., vol. 76, no. 18, pp. 18569–18584, Sep. 2017.
[28] B. Lim and S. Zohren, "Time-series forecasting with deep learning: A survey," Phil. Trans. Roy. Soc. A, vol. 379, Apr. 2021, Art. no. 20200209.
[29] T. Iqbal and S. Qureshi, "The survey: Text generation models in deep learning," J. King Saud Univ. Comput. Inf. Sci., vol. 34, no. 6, Apr. 2020.
[30] D. Wang, W. Li, X. Liu, N. Li, and C. Zhang, "UAV environmental perception and autonomous obstacle avoidance: A deep learning and depth camera combined solution," Comput. Electron. Agricult., vol. 175, Aug. 2020, Art. no. 105523.
[31] M. Gjoreski, M. Ž. Gams, M. Luštrek, P. Genc, J.-U. Garbas, and T. Hassan, "Machine learning and end-to-end deep learning for monitoring driver distractions from physiological and visual signals," IEEE Access, vol. 8, pp. 70590–70603, 2020.
[32] P. Hewage, M. Trovati, E. Pereira, and A. Behera, "Deep learning-based effective fine-grained weather forecasting model," Pattern Anal. Appl., vol. 24, no. 1, pp. 343–366, Feb. 2021.
[33] M. Toǧaçar, B. Ergen, and Z. Cömert, "COVID-19 detection using deep learning models to exploit social mimic optimization and structured chest X-ray images using fuzzy color and stacking approaches," Comput. Biol. Med., vol. 121, Jun. 2020, Art. no. 103805.
[34] S. Reddy, N. Srikanth, and G. S. Sharvani, "Development of kid-friendly YouTube access model using deep learning," in Data Science and Security. Singapore: Springer, 2021, pp. 243–250.
[35] O. Zavala-Romero, A. L. Breto, I. R. Xu, Y. C. C. Chang, N. Gautney, P. A. Dal, and R. Stoyanova, "Segmentation of prostate and prostate zones using deep learning," Strahlentherapie und Onkologie, vol. 196, no. 10, pp. 932–942, Oct. 2020.
[36] A. A. Barbhuiya, R. K. Karsh, and R. Jain, "CNN based feature extraction and classification for sign language," Multimedia Tools Appl., vol. 80, no. 2, pp. 3051–3069, Jan. 2021.
[37] S. Aly and W. Aly, "DeepArSLR: A novel signer-independent deep learning framework for isolated Arabic sign language gestures recognition," IEEE Access, vol. 8, pp. 83199–83212, 2020.
[38] C. K. Lee, K. K. Ng, C. H. Chen, H. C. Lau, S. Y. Chung, and T. Tsoi, "American sign language recognition and training method with recurrent neural network," Expert Syst. Appl., vol. 167, Apr. 2021, Art. no. 114403.
[39] Q. Xiao, X. Chang, X. Zhang, and X. Liu, "Multi-information spatial–temporal LSTM fusion continuous sign language neural machine translation," IEEE Access, vol. 8, pp. 216718–216728, 2020.
[40] R. Elakkiya, P. Vijayakumar, and N. Kumar, "An optimized generative adversarial network based continuous sign language classification," Expert Syst. Appl., vol. 182, Nov. 2021, Art. no. 115276.
[41] R. Elakkiya, "Machine learning based sign language recognition: A review and its research frontier," J. Ambient Intell. Hum. Comput., vol. 12, no. 7, pp. 1–20, 2020.
[42] R. Elakkiya and B. Natarajan, "ISL-CSLTR: Indian sign language dataset for continuous sign language translation and recognition," Mendeley Data Repository, V1, 2021, doi: 10.17632/kcmpdxky7p.1.
[43] Q. Chen and V. Koltun, "Photographic image synthesis with cascaded refinement networks," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 1511–1520.
[44] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, "Conditional image generation with PixelCNN decoders," 2016, arXiv:1606.05328.
[45] S. Stoll, N. C. Camgoz, S. Hadfield, and R. Bowden, "Text2Sign: Towards sign language production using neural machine translation and generative adversarial networks," Int. J. Comput. Vis., vol. 128, no. 4, pp. 891–908, Apr. 2020.
[46] M.-Y. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz, "Few-shot unsupervised image-to-image translation," in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 10551–10560.
[47] Y. Choi, M. Choi, M. Kim, J. W. Ha, S. Kim, and J. Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8789–8797.
[48] Y. Choi, Y. Uh, J. Yoo, and J. W. Ha, "StarGAN v2: Diverse image synthesis for multiple domains," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 8188–8197.
[49] S. Tulyakov, M. Y. Liu, X. Yang, and J. Kautz, "MoCoGAN: Decomposing motion and content for video generation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1526–1535.
[50] E. Denton, S. Chintala, A. Szlam, and R. Fergus, "Deep generative image models using a Laplacian pyramid of adversarial networks," 2015, arXiv:1506.05751.
[51] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Proc. 30th Int. Conf. Neural Inf. Process. Syst., Dec. 2016, pp. 2180–2188.
[52] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 1125–1134.
[53] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2223–2232.
[54] I. K. Dutta, B. Ghosh, A. Carlson, M. Totaro, and M. Bayoumi, "Generative adversarial networks in security: A survey," in Proc. 11th IEEE Annu. Ubiquitous Comput., Electron. Mobile Commun. Conf. (UEMCON), Oct. 2020, pp. 0399–0405.
[55] R. H. Randhawa, N. Aslam, M. Alauthman, H. Rafiq, and F. Comeau, "Security hardening of botnet detectors using generative adversarial networks," IEEE Access, vol. 9, pp. 78276–78292, 2021.
[56] Y. Zhu, Y. Zhang, H. Zhang, J. Yang, and Z. Zhao, "Data augmentation of X-ray images in baggage inspection based on generative adversarial networks," IEEE Access, vol. 8, pp. 86536–86544, 2020.
[57] R. Sujatha, J. M. Chatterjee, N. Z. Jhanjhi, and S. N. Brohi, "Performance of deep learning vs machine learning in plant leaf disease detection," Microprocessors Microsyst., vol. 80, Feb. 2021, Art. no. 103615.
[58] N. Eldeen M. Khalifa, M. Hamed N. Taha, A. E. Hassanien, and S. Elghamrawy, "Detection of coronavirus (COVID-19) associated pneumonia based on generative adversarial networks and a fine-tuned deep transfer learning model using chest X-ray dataset," 2020, arXiv:2004.01184.
[59] B. Espejo-Garcia, N. Mylonas, L. Athanasakos, E. Vali, and S. Fountas, "Combining generative adversarial networks and agricultural transfer learning for weeds identification," Biosystems Eng., vol. 204, pp. 79–89, Apr. 2021.
[60] F. Taymouri, R. M. La, S. Erfani, Z. D. Bozorgi, and I. Verenich, "Predictive business process monitoring via generative adversarial nets: The case of next event prediction," in Proc. Int. Conf. Bus. Process Manage. Cham, Switzerland: Springer, 2020, pp. 237–256.
[61] Y. Gu, Y. Peng, and H. Li, "AIDS brain MRIs synthesis via generative adversarial networks based on attention-encoder," in Proc. IEEE 6th Int. Conf. Comput. Commun. (ICCC), Dec. 2020, pp. 629–633.
[62] B. Lütjens, B. Leshchinskiy, C. Requena-Mesa, F. Chishtie, N. Díaz-Rodríguez, O. Boulais, A. Sankaranarayanan, A. Piña, Y. Gal, C. Raïssi, A. Lavin, and D. Newman, "Physically-consistent generative adversarial networks for coastal flood visualization," 2021, arXiv:2104.04785.
[63] S. Liu, B. Zhang, Y. Liu, A. Han, H. Shi, T. Guan, and Y. He, "Unpaired stain transfer using pathology-consistent constrained generative adversarial networks," IEEE Trans. Med. Imag., vol. 40, no. 8, pp. 1977–1989, Aug. 2021.
[64] K. Vo, E. K. Naeini, A. Naderi, D. Jilani, A. M. Rahmani, N. Dutt, and H. Cao, "P2E-WGAN: ECG waveform synthesis from PPG with conditional Wasserstein generative adversarial networks," in Proc. 36th Annu. ACM Symp. Appl. Comput., 2021, pp. 1030–1036.
[65] Y. Shi, X. Zhang, Q. Hu, and H. Cheng, "Data recovery algorithm based on generative adversarial networks in crowd sensing Internet of Things," Personal Ubiquitous Comput., vol. 2020, pp. 1–14, Jul. 2020.
[66] C. Davi and U. Braga-Neto, "A semi-supervised generative adversarial network for prediction of genetic disease outcomes," in Proc. IEEE 31st Int. Workshop Mach. Learn. Signal Process. (MLSP), Oct. 2021, pp. 1–6.
[67] M. Mirza and S. Osindero, "Conditional generative adversarial nets," 2014, arXiv:1411.1784.
[68] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[69] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[70] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," 2014, arXiv:1409.1259.
[71] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014, arXiv:1409.0473.
[72] N. C. Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden, "Neural sign language translation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7784–7793.
[73] A. Duarte, S. Palaskar, L. Ventura, D. Ghadiyaram, K. DeHaan, F. Metze, and X. Giro-i-Nieto, "How2Sign: A large-scale multimodal dataset for continuous American sign language," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2021, pp. 2735–2744.
[74] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool, "Pose guided person image generation," 2017, arXiv:1705.09368.
[75] A. Hao, Y. Min, and X. Chen, "Self-mutual distillation learning for continuous sign language recognition," in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2021, pp. 11303–11312.
[76] T. S. Dias, J. J. A. M. Júnior, and S. F. Pichorim, "An instrumented glove for recognition of Brazilian sign language alphabet," IEEE Sensors J., vol. 22, no. 3, pp. 2518–2529, 2021.
[77] R. Rastgoo, K. Kiani, and S. Escalera, "Hand sign language recognition using multi-view hand skeleton," Expert Syst. Appl., vol. 150, Jul. 2020, Art. no. 113336.
[78] D. Li, C. Rodriguez, X. Yu, and H. Li, "Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., Mar. 2020, pp. 1459–1469.
[79] J. Zhao, M. Mathieu, and Y. LeCun, "Energy-based generative adversarial network," 2016, arXiv:1609.03126.
[80] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 5907–5915.
[81] D. Berthelot, T. Schumm, and L. Metz, "BEGAN: Boundary equilibrium generative adversarial networks," 2017, arXiv:1703.10717.
[82] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," 2017, arXiv:1710.10196.
[83] A. Odena, "Semi-supervised learning with generative adversarial networks," 2016, arXiv:1606.01583.
[84] J. Donahue, P. Krähenbühl, and T. Darrell, "Adversarial feature learning," 2016, arXiv:1605.09782.
[85] B. Natarajan, R. Elakkiya, and M. L. Prasad, "Sentence2SignGesture: A hybrid neural machine translation network for sign language video generation," J. Ambient Intell. Hum. Comput., vol. 2022, pp. 1–15, Jan. 2022.
[86] B. Natarajan and R. Elakkiya, "Dynamic GAN for high-quality sign language video generation from skeletal poses using generative adversarial networks," Soft Comput., vol. 2022, pp. 1–23, Jun. 2022.

B. NATARAJAN received the Bachelor of Engineering degree, in 2011, and the Master of Engineering degree, in 2015. He is currently pursuing the Ph.D. degree with the School of Computing, SASTRA University, Thanjavur. He has seven years of teaching experience and has published many articles in leading international journals. His research interests include computer vision, machine learning, deep learning, and sign language development.

E. RAJALAKSHMI received the B.E. degree in information technology from the Cummins College of Engineering, Pune, in 2018, and the M.Tech. degree in computer science and engineering from SASTRA Deemed University, Thanjavur, in 2020, where she is currently pursuing the Ph.D. degree. She is currently working as a Project Associate with SASTRA Deemed University. Her research interests include sign language recognition, music emotion recognition, deep neural networks, image processing, and computer vision. She has contributed various articles and chapters to many high-quality Scopus and SCI/SCIE indexed journals, conferences, and books. She is a Lifetime Member of the International Association of Engineers and a member of the Association for Computing Machinery.

R. ELAKKIYA received the Doctor of Philosophy degree from Anna University, Chennai, in 2018. She is currently working as an Assistant Professor with the Department of Computer Science and Engineering, School of Computing, SASTRA University, Thanjavur. She holds three patents. She has published more than 20 research papers in leading journals, conference proceedings, and books, including IEEE, Elsevier, and Springer. She is currently an Editor of the Information Engineering and Applied Computing journal and also a Lifetime Member of the International Association of Engineers.

KETAN KOTECHA is currently an Administrator and a Teacher of deep learning with the Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Pune. He has expertise and experience in cutting-edge research and projects in A.I. and deep learning for the last 25 years. He has published more than 100 articles in several excellent peer-reviewed journals on various topics ranging from cutting-edge A.I., education policies, and teaching-learning practices, to A.I. for all. He has published three patents and delivered keynote speeches at various national and international forums, including at the Machine Intelligence Laboratory, USA, IIT Bombay under the World Bank Project, the International Indian Science Festival organized by the Department of Science and Technology, Government of India, and many more. His research interests include artificial intelligence, computer algorithms, machine learning, and deep learning. He was a recipient of two SPARC Projects worth INR 166 lakhs from MHRD, Government of India, in A.I. in collaboration with Arizona State University, USA, and The University of Queensland, Australia. He was also a recipient of numerous prestigious awards, such as the Erasmus+ Faculty Mobility Grant to Poland, the DUO-India Professors Fellowship for research in responsible A.I. in collaboration with Brunel University, U.K., the LEAP Grant at Cambridge University, U.K., the UKIERI Grant with Aston University, U.K., and a Grant from the Royal Academy of Engineering, U.K., under the Newton Bhabha Fund. He is an Associate Editor of the IEEE ACCESS journal.