A Survey On Sign Language Recognition Systems

Review

Keywords: Sign language recognition, Pose estimation, Deep learning, Computer Vision, Face recognition, Application

Sign language, as a distinct form of communication, is important to large groups of people in society. Each sign language contains different signs, with variability in the hand shape, motion profile, and position of the hand, face, and body parts contributing to each sign. Visual sign language recognition is therefore a complex research area in computer vision. Many models have been proposed by different researchers, with significant improvements brought by deep learning approaches in recent years. In this survey, we review vision-based models of sign language recognition that use deep learning approaches from the last five years. While the overall trend of the proposed models indicates a significant improvement in recognition accuracy, some challenges remain to be solved. We present a taxonomy to categorize the proposed models for isolated and continuous sign language recognition, discussing applications, datasets, hybrid models, complexity, and future lines of research in the field.
of the early efforts on hand and gesture recognition dates back to 1987, when a hand gesture interface was proposed by Zimmerman et al. (1987) to estimate the hand position and orientation using the magnetic flux sensors of a glove.
However, there are different challenges to address in visual sign language recognition (e.g., inter- and intra-subject variability, illumination conditions, partial occlusions, different points of view and resolutions, and background artifacts) that make it difficult to define a universal automatic model for sign language recognition. In addition to this list of challenges, there are some other critical challenges in sign language recognition. One of the main challenges is developing a sign language recognition system that translates signed words or sentences into text or voice to facilitate the communication between deaf people and the hearing majority in real-world situations. The current works in sign language recognition use datasets including videos of only one sign or images of only one character. We need to develop systems that can be applied in a real chat or conversation between a deaf person and a hearing one. To do this, the proposed systems need to be efficient and intelligent enough to split input videos containing several characters, words, or sentences into separate characters, words, or sentences. Another main challenge in this area is the large difference between the sign languages of different countries. We need a multi-lingual system capable of translating different sign languages into text or voice. Due to the importance of sign language recognition for the deaf and speech-impaired community, we present a comprehensive review of recent sign language recognition works in computer vision using deep learning, identifying future lines of research. In this work, we perform a comprehensive review of recent works on sign language recognition, defining a taxonomy to group existing works and providing an associated discussion on their pros and cons.

The remainder of this paper is organized as follows. Section 2 includes a brief review of deep learning algorithms. Section 3 presents a taxonomy of the sign language recognition area. Hand sign language, face sign language, and human sign language literature are reviewed in Sections 4, 5, and 6, respectively. Section 7 presents the recent models in continuous sign language recognition. Recent hybrid models are included in Section 8. Finally, we discuss the main challenges and conclude the work in Section 9.

2. Why deep learning?

In this section, we present a brief introduction to deep learning. Over recent years, deep learning methods have outperformed previous state-of-the-art machine learning techniques in different areas, especially in computer vision and natural language processing (Voulodimos, Doulamis, Doulamis, & Protopapadakis, 2018). Some of the most significant deep learning models used in computer vision problems are the Convolutional Neural Network (CNN) (Wu, 2019), Restricted Boltzmann Machine (RBM) (Fischer & Igel, 2012), Deep Belief Network (DBN) (Hinton, 2007), Auto-Encoder (AE) (Grosse, 2017), Variational Auto-Encoder (VAE) (Doersch, 2016), Generative Adversarial Network (GAN) (Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, & Bengio, 2014), and Recurrent Neural Network (RNN), including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) (Wang, 2016). One of the main goals of deep models is to avoid the need to manually build/extract features. Deep learning allows computational models of multiple processing layers to learn and represent data with multiple levels of abstraction, imitating the human brain mechanism and implicitly capturing the complex structures of large-scale data. The first effort on human brain simulation dates back to 1943, when McCulloch and Pitts (1943) tried to understand how the brain could produce highly complex patterns by using interconnected basic cells, called neurons. The trend of major contributions continued, and the DBN was one of the prominent breakthroughs in deep learning, introduced by Hinton, Osindero, and Teh (2006). Deep learning contains a wealthy group of methods, including neural networks, hierarchical probabilistic models, and a variety of unsupervised and supervised feature learning algorithms. One of the important factors that contributed to the huge boost of deep learning is the advent of large-scale, high-quality, and publicly available labeled datasets, along with the capability of parallel GPU computing. Other factors, such as the alleviation of the vanishing gradient, the proposal of new regularization techniques (such as dropout, batch normalization, and data augmentation), and the development of strong frameworks like TensorFlow (TensorFlow, 2020), Theano (Frederic, Lamblin, Pascanu, et al., 2012), and MXNET (MXNET, 2020), have had a significant role in the advancement of deep learning. In this survey, we focus only on deep learning-based models for sign language recognition in computer vision.

3. Taxonomy

In this section, we present a taxonomy that summarizes the main concepts related to deep learning in sign language recognition. In the rest of this section, we explain the different feature fusions used for sign language recognition, the input modalities, the datasets, and the applications of this area. Figs. 1 and 2 show the proposed taxonomy that we describe in this section.

3.1. Feature fusion

To improve the recognition accuracy of sign language, different features can be fused. We organize the fusion approaches into three categories: using only the hand pose features, using the hand and face pose features, and using the hand, face, and body pose features. Details of these categories are explained in the following sub-sections.

3.1.1. Hand pose features

Using the hand pose features has received more attention in recent years with the advent of accurate depth sensors (Chen, Wang, Guo, & Zhang, 2020; Dibra, Wolf, Oztireli, & Gross, 2017; Doosti, 2019; Wang, Chen, Liu, Qian, Lin, & Ma, 2018). In this category, only hand features are used for sign language recognition. After hand detection from the input data, the hand features are extracted using different deep learning architectures such as CNN, RBM, RNN, GAN, and so on (Cao et al., 2017; Cheok et al., 2017; Deng et al., 2017; Dibra et al., 2017; Escobedo-Cardenas & Camara-Chavez, 2020; Gomez-Donoso, Orts-Escolano, & Cazorla, 2019; Guo et al., 2017; Li, Xue, Wang, Ge, Ren, & Rodriguez, 2019; Oberweger et al., 2015; Rastgoo et al., 2018; Rastgoo, Kiani, & Escalera, 2020a, 2020b; Supancic et al., 2018; Tagliasacchi et al., 2015; Wang et al., 2018; Zheng et al., 2017). So, having an efficient hand detection and feature extraction model is challenging in this category. While the CNN has an impressive capability to cope with still images, it does not efficiently cover sequence information. So, the CNN is combined with another deep learning model, such as an RNN, LSTM, or GRU, to benefit from the capability of these models in sequence feature extraction from visual data.
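To make this CNN–recurrent combination concrete, the following is a minimal, hedged sketch (not any specific published model) of a per-frame CNN feature extractor followed by an LSTM that classifies an isolated sign clip; the ResNet-18 backbone, hidden size, class count, and frame count are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNLSTMSignClassifier(nn.Module):
    """Per-frame CNN features aggregated over time by an LSTM (illustrative sketch)."""
    def __init__(self, num_classes=100, hidden_size=256):
        super().__init__()
        backbone = models.resnet18()          # frame-level feature extractor
        backbone.fc = nn.Identity()           # keep the 512-d pooled features
        self.cnn = backbone
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                 # clips: (batch, frames, 3, H, W)
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.reshape(b * t, c, h, w))  # (b*t, 512) per-frame features
        feats = feats.reshape(b, t, -1)                   # restore the temporal dimension
        _, (h_n, _) = self.lstm(feats)                    # h_n: last hidden state per clip
        return self.classifier(h_n[-1])                   # sign-class logits

# Example: a batch of 2 clips, each with 16 RGB frames of 112x112 pixels
logits = CNNLSTMSignClassifier()(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 100])
```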
3.1.2. Hand and face pose features

Since the face pose includes some important grammatical and prosodic features, using these features as a complement to the hand features could improve the sign language recognition accuracy. This category focuses on the proposed models that use the combined features of hand and face pose. Few models have been proposed in this category due to some challenges in tracking human faces in videos, such as head tilting and side-to-side movements. Furthermore, the proposed models have to be able to track and recognize the facial features in natural settings, not in a constrained environment. So, it is essential to have a model with accurate recognition that can be applied in the real world, not only under controlled laboratory conditions. Challenges related to extreme head movements from side to side and frequent self-occlusions of the face by the signer's hands and hair also have to be considered in this area. Because of these complexities, some of the models work only on a part of the face, such as the eyes or lips (Koller, Ney, & Bowden, 2015).
Table 1
Sign language datasets including video samples. F: face, H: hand, h: head, W: word, S: sentence, C: Country, CN: Class Number, SubN: Subject Number, SampN: Sample Number,
LL: Language Level (word or sentence), A: Annotation.
Y Dataset C CN SubN SampN LL A
2011 Boston ASL LVD (Thangali, Nash, Sclaroff, & Neidle, 2011) USA 3300 6 9800 W H
2012 DGS Kinect 40 (Cooper, Ong, Pugeault, & Bowden, 2012) Germany 40 15 3000 W –
2012 RWTH-PHOENIX-Weather (Forster, Schmidt, Hoyoux, Koller, Zelle, Piater, & Ney, 2012) Germany 1200 9 45760 S F, H
2012 GSL 20 (Adaloglou, Chatzis, Papastratis, Stergioulas, Papadopoulos, Zacharopoulou, Xydopoulos, Atzakas, Greek 20 6 840 W –
Papazachariou, & Daras, 2019)
2013 PSL Kinect 30 (Oszust & Wysocki, 2013) Poland 30 1 300 W –
2013 PSL ToF 84 (Oszust & Wysocki, 2013) Poland 84 1 1680 W –
2014 DEVISIGN-G (Chai, Guang, Lin, Xu, Tang, Chen, & Zhou, 2013) China 36 8 432 W –
2014 DEVISIGN-D (Chai et al., 2013) China 500 8 6000 W –
2014 DEVISIGN-L (Chai et al., 2013) China 2000 8 24000 W –
2015 SIGNUM (Koller, Forster, & Hermann, 2015) Germany 450 25 33210 S –
2016 MSR (Chen, Zhang, Hou, Jiang, Liu, & Yang, 2017) USA 12 10 336 W –
2016 LSA64 (Ronchetti, Quiroga, Estrebou, L, & Rosete, 2016) Argentina 64 10 3200 W H, h
2016 TVC-hand gesture (Kim, Ban, & Lee, 2017) Korea 10 1 650 – –
2018 PHOENIX14T (Camgoz, Hadfield, Koller, Ney, & Bowden, 2018) Germany 1066 9 67781 S –
2020 RKS-PERSIANSIGN (Rastgoo et al., 2020a) Iran 100 10 10000 W H
3.1.3. Hand, face, and body pose features

Some models fuse the features of the hand, face, and other parts of the human body to benefit from the fused features and improve the recognition accuracy. Using these fused features, sign language recognition models can be made more robust to occlusions, severe deformations, and appearance variations (Kocabas, Karagoz, & Akbas, 2018; Newell, Yang, & Deng, 2016; Wei, Ramakrishna, Kanade, & Sheikh, 2016). In this category, the proposed models benefit from body features, as these features could improve the recognition accuracy in complex situations with hand or face occlusions.
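As a rough illustration of this fusion idea (not any specific published model), the sketch below concatenates hand, face, and body keypoint features extracted for a frame and feeds them to a small classifier; the keypoint counts and layer sizes are assumptions for the example.

```python
import torch
import torch.nn as nn

class FusedPoseClassifier(nn.Module):
    """Concatenate hand, face, and body keypoint features before classification (sketch)."""
    def __init__(self, num_classes=100):
        super().__init__()
        # Assumed keypoint counts: 2x21 hand joints, 68 face landmarks, 25 body joints (2D each)
        in_dim = (2 * 21 + 68 + 25) * 2
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, hands, face, body):
        # Each input: (batch, num_keypoints, 2); flatten and concatenate along the feature axis
        fused = torch.cat([hands.flatten(1), face.flatten(1), body.flatten(1)], dim=1)
        return self.mlp(fused)

model = FusedPoseClassifier()
logits = model(torch.randn(4, 42, 2), torch.randn(4, 68, 2), torch.randn(4, 25, 2))
print(logits.shape)  # torch.Size([4, 100])
```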
3.2. Input modality

Generally, vision-based and glove-based approaches are the two main categories considered for sign language recognition models. While the vision-based models use captured video data of the signers performing different signs, the glove-based models employ mechanical or optical sensors attached to a glove in order to use the electrical signals for hand pose detection. Vision-based models provide more natural and realistic systems based on the information that humans can sense from the surroundings (Zheng et al., 2017). Focusing on vision-based models, we explain the details of the input modalities in this subsection.

3.2.1. Visual modality

We present the details of the visual modality from two perspectives, as follows:

• Input data modality: Several visual input data modalities have been used in the sign language recognition area in recent years. RGB and depth data are the two most common types of input used in sign language recognition models. While RGB images or videos include high-resolution contents, depth inputs carry accurate information on the distance between the image plane and the corresponding object in the image. Some of the models use the advantages of both input modalities simultaneously; we also survey the models in this category in the next section. Thermal imagery is another modality, though it is not as common as RGB and depth. Infrared (IR) thermal sensors have the capability of imaging scenes and objects based on IR light reflectance or radiation emittance. Although many works have used the benefits of thermal information for face recognition and human body detection, few hand sign recognition models have focused on thermal information learning (Kim et al., 2017). Another type of input modality is the skeleton, as an encoded form of the joint sequences, which has been provided in some gesture and hand datasets. Flow information, as the motion features of every pixel in a sequence of video frames, is another input modality that has been widely used by some researchers in combination with the CNN features. Two types of flow information, Optical Flow (OF) and Scene Flow (SF), have been used in some models. While OF, as a displacement vector of pixel positions, is usually used for RGB image sequences, SF, as a dense or semi-dense 3D motion field of a scene with respect to the camera, is applied to depth frame sequences. A minimal optical-flow example is given after this list.

• Devices: There are different devices to record the input data modalities, and the camera is the most common of them. Different cameras support different formats and qualities of input data. Microsoft Kinect is widely used due to its ability to provide high-quality depth and RGB video streams simultaneously. Another device is the flex sensor planted inside a glove to acquire data of the palm and finger movements. The Leap Motion Controller (LMC) is another device to detect and track hands, fingers, and finger-like objects. The point is that we need to delineate what kind of applications are targeted, because the selection of the suitable device heavily depends on the application. For example, while the Kinect camera can work well for many real applications, it highly depends on some environmental conditions; the same goes for the other devices. Using these depth sensors and cameras, such as Kinect and RealSense, the proposed models can employ 3D information to decrease the ambiguity of 2D information recorded by traditional devices (Marín-Jiménez, Romero-Ramirez, Muñoz-Salinas, & Medina-Carnicer, 2018; Lifshitz, Fetaya, & Ullman, 2016).
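As a concrete illustration of the flow modality mentioned in the first bullet above, the snippet below computes dense optical flow between two consecutive frames with OpenCV's Farnebäck method; the synthetic frames stand in for real sign-video frames, and this is only one of several possible flow estimators.

```python
import cv2
import numpy as np

# Two consecutive grayscale frames (synthetic here; in practice, frames of a sign video)
prev = np.random.randint(0, 255, (240, 320), dtype=np.uint8)
curr = np.roll(prev, 3, axis=1)  # simulate a small horizontal hand/arm motion

# Dense optical flow: one 2D displacement vector (dx, dy) per pixel.
# Positional arguments: pyramid scale, levels, window size, iterations, poly_n, poly_sigma, flags
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

# The magnitude/angle form is often stacked with the RGB frames as extra input channels
mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print(flow.shape, float(mag.mean()))  # (240, 320, 2), average motion magnitude
```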
3.2.2. Static or dynamic

Two forms of input data, static or dynamic, are used in sign language recognition models to extract the necessary features. Many deep-based models have been proposed in recent years to use still or sequential inputs. While dynamic inputs include sequential information that could be useful to improve the sign language recognition accuracy, there are still some challenges in using this data, such as the computational complexity of the input sequences. Furthermore, dynamic inputs can be split into isolated dynamic inputs and continuous dynamic inputs. While the isolated dynamic inputs are used at the word level, the continuous dynamic inputs are employed at the sentence level. Additional challenges in continuous dynamic inputs include tokenization of the sentences into separate words, detection of the start and end of a sentence, and managing the abbreviations and synonyms in the sentence. In the next sections, we survey the sign language recognition models that have used these inputs.

3.3. Datasets and different sign languages

We list the most relevant datasets, including videos, for the sign language recognition area in Table 1. For each dataset in this
Table 2
Sign language datasets including images.
Year Dataset C CN Sub Samp.
2011 ASL Fingerspelling A (Pugeault & Bowden, 2011) USA 24 5 131000
2011 ASL Fingerspelling B (Pugeault & Bowden, 2011) USA 24 9 –
2016 LSA16 handshapes (Ronchetti, Quiroga, Estrebou, & Lanzarini, 2016) Argentina 16 10 800
2015 PSL Fingerspelling ToF (Kapuscinski, Oszust, Wysocki, & Warchol, 2015) Poland 16 3 960
table, we specify eight fields, including the Year (Y), Dataset name, Country (C), Class Number (CN), Subject Number (SubN), Sample Number (SampN), Language Level (LL), and Annotation (A). These datasets have different environments, qualities, constraints, and complexities.

While the sign language recognition models use different languages in the input data, American Sign Language (ASL) has attracted more attention due to its greater popularity and usage. Other languages, such as Indian, German, Dutch, Greek, Polish, Argentinian, Turkish, and Chinese sign languages, have also been used in the proposed datasets (Adaloglou et al., 2019; Andriluka, Pishchulin, Gehler, & Bernt, 2014; Baró, Gonzàlez, Fabian, Bautista, Oliu, Escalante, Guyon, & Escalera, 2015; Chai et al., 2013; Chen et al., 2017, 2017; Cooper et al., 2012; Escalera, Gonzàlez, Baró, Reyes, Lopés, Guyon, Athitsos, & Escalante, 2013; Forster et al., 2012; Ganapathi, Plagemann, Koller, & Thrun, 2012; Haque, Peng, Luo, Alahi, Yeung, & Fei-Fei, 2016; Kapuscinski et al., 2015; Koller, Forster et al., 2015; Liu & Shao, 2013; Matilainen, Sangi, Holappa, & Silven, 2016; Molchanov, Gupta, Kim, & Kautz, 2015; Oszust & Wysocki, 2013; Pugeault & Bowden, 2011; Ronchetti, Quiroga, Estrebou, & Lanzarini, 2016; Ronchetti, Quiroga, Estrebou, Lanzarini et al., 2016; Sapp & Taskar, 2013; Smedt, Wannous, & Vandeborre, 2016; Thangali et al., 2011; Tompson, Stein, Lecun, & Perlin, 2014; Wan, Zhao, Zhou, Guyon, Escalera, & Li, 2016; Yuan, Ye, Stenger, Jain, & Kim, 2017). The trend of sign and gesture datasets can be found in Fig. 3. Some of the most popular sign datasets, including the image modality, are listed in Tables 2 and 3. Also, some of the gesture datasets and some samples of sign and gesture datasets are presented in Table 3, Fig. 4, and Fig. 5, respectively.

As one can see in Tables 1–3, it would be better to increase the number of sign categories to obtain a more realistic method generalization for real applications. Furthermore, we have to note that most of these datasets are for sign classification, not detection/spotting. Only a few datasets, such as the Montalbano II dataset (Escalera et al., 2013), include a detection task.

3.4. Task complexity

Sign language, as a visual language for the deaf community, defines some grammatical rules to concatenate the movements of the hand, face, and body parts. While there are different parameters in hand signs, such as movement, shape, orientation, and place of articulation,
face signs use the eye gaze, eyebrows, mouth, and head orientation parameters. In addition to these parameters, sign language benefits from multi-channel information of the hand shape, motions, body pose, and even facial gestures. Given these parameters, the complexity level of signs depends on the sign modality. There are three levels of signs, which are the static sign, dynamic sign, and continuous dynamic sign. While the input modality of a static sign is an image, a video modality is employed for dynamic and continuous dynamic signs. However, a video
Table 3
Some of the gesture recognition datasets.
Year Dataset Modality Class num.
2011 ChaLearn Gesture (Escalera et al., 2013) RGB, Depth 15
2012 MSR-Gesture3D (Chen et al., 2017) RGB, Depth 12
2014 ChaLearn (Track 3) (Baró et al., 2015) RGB, Depth 20
2015 VIVA Hand Gesture (Molchanov et al., 2015) RGB 19
2016 ChaLearn conGD (Wan et al., 2016) RGB, Depth 249
2016 ChaLearn isoGD (Wan et al., 2016) RGB, Depth 249

sample in dynamic sign language recognition includes only one sign, while a video sample containing multiple signs is used in continuous dynamic sign language recognition. In this way, there is a rising trend in parameter complexity from static signs to dynamic and continuous dynamic signs.

3.5. Deep applications

Hand detection, hand tracking, hand pose estimation, hand gesture recognition, and hand pose recovery are some of the most important sub-areas of hand sign language recognition that are widely used in human–computer interaction applications (Newell et al., 2016). Nowadays, as computers have become an important part of our lives, they can be used to facilitate the communications of deaf and hearing-impaired people. For example, interpreting services, such as remote video human interpreting using high-speed internet connections, can be used, but these services have some major limitations that need to be resolved before usage. These applications are important not only from an engineering point of view but also for their impact on society. Further applications related to the sign language area, such as automatic indexing of signed videos, human behavior understanding, online human communication using hand tracking, multi-person pose estimation, and other human–computer interaction applications, can also be considered (Cao et al., 2017; Deng et al., 2017; Marín-Jiménez et al., 2018; Supancic et al., 2018; Tagliasacchi et al., 2015).

4. Hand sign language recognition

In this category, only hand features are used for sign language recognition. There are some important sub-areas for hand sign language recognition, which are: hand detection, hand pose estimation, real-time hand tracking, hand gesture recognition, and hand pose recovery. In this section, we present the models proposed over the last four years in these sub-areas, with an associated discussion on their pros and cons. Tables 4–7 show the details of these models. In these tables, we categorize the presented models based on the same datasets, same evaluation metrics, same features, and same input modalities used in the models. In these tables, we use some abbreviations for the goal field, which are: HP: Hand Pose, HT: Hand Tracking, HD: Hand Detection, HSR: Hand Sign Recognition, HG: Hand Gesture, RHR: Real-time Hand Recognition.

4.1. Hand detection

Hand detection currently has a momentous role in the sign language recognition area. Although much research has been conducted to improve hand detection models, this task still includes many challenges in terms of computation time and detection accuracy (Le, Jaw, Lin, Liu, & Huang, 2018). Some of the prominent object detection methods have been fine-tuned using a transfer learning approach in order to use their strong capabilities in the hand detection area. One of the commendable object detection methods is the Region-based Convolutional Neural Network (R-CNN) model provided by Girshick, Donahue, Darrell, and Malik (2015). R-CNN uses region proposals to detect the objects in the input images. It is slow because it performs a CNN forward pass for each object proposal in the model input, without sharing computation. To improve the R-CNN performance, Fast-RCNN was suggested to efficiently classify object proposals using deep convolutional networks (Girshick, 2015). In Fast-RCNN, the image is fed as input to the CNN in order to provide the convolutional feature map. The region proposals are detected in the feature maps and forwarded to the next step. To avoid using the selective search method employed in RCNN and Fast-RCNN to detect the region proposals, Shaoqing Ren et al. proposed a CNN to learn the region proposals in the model (Ren, He, Girshick, & Sun, 2015). The Faster-RCNN model has been fine-tuned and used by some of the proposed models for hand detection applications (Bambach, Lee, Crandall, & Yu, 2015; Yang, Li, Fermuller, & Aloimonos, 2015).

The region-based methods, RCNN, Fast-RCNN, and Faster-RCNN, do not consider the whole image and work on the regions of the input image to localize the objects. To solve this problem, another model, namely You Only Look Once (YOLO), has been proposed for object detection, using just one CNN to predict the bounding boxes and the class probabilities for these boxes. YOLO splits the input image into a grid, and a predefined number of bounding boxes is considered in each grid cell. The class probability of each bounding box is estimated using the CNN, and the detected objects are located at the bounding boxes with the maximum class probabilities. The limitation of YOLO is that it struggles with small objects in the input image (Redmon, Divvala, Girshick, & Farhadi, 2016). To improve the detection accuracy of YOLO, another object detection model, namely the Single Shot Multi-box Detector (SSD), has been proposed, which adds feature maps from different layers on top of the YOLO model. This makes the SSD more accurate and faster (Liu, Anguelov, Erhan, Szegedy, Reed, Fu, & Berg, 2016). In this section, we review the proposed models for hand detection using deep learning approaches.
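As an illustration of how such off-the-shelf detectors are typically adapted to hands, the following is a hedged sketch (not a specific published system) that fine-tunes torchvision's COCO-pretrained Faster R-CNN for a two-class problem (background plus hand); it assumes a recent torchvision release and a hypothetical `hand_loader` yielding images and box annotations in the standard torchvision detection format.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a detector pre-trained on COCO and replace its box head
# so that it predicts only two classes: background (0) and hand (1).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad],
                            lr=0.005, momentum=0.9, weight_decay=0.0005)

model.train()
# `hand_loader` is a hypothetical DataLoader yielding (images, targets), where each
# target is a dict with "boxes" (N x 4 tensor) and "labels" (N tensor of 1s for hands).
# for images, targets in hand_loader:
#     losses = model(list(images), list(targets))   # dict of classification/box losses
#     loss = sum(losses.values())
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```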
4.1.1. RGB-based hand detection methods

Simon et al. proposed a real-time convolutional method for hand detection using a multi-view camera system with RGB still images. They used a keypoint detector to provide noisy labels. After that, a multi-view geometry approach was used to convert and re-project the 2D detected keypoints into the 3D view. They generated some labeled data in this way and used them to train the proposed model for hand keypoint detection. The model was trained only on their own dataset, on which the results were reported. In the best case, they achieved an average error of 3.65 for Tip Touch, which corresponds to an improvement with a 1.66 margin in comparison with the state-of-the-art. However, their model needs to be robust enough to work with fewer cameras and in less controlled environments (e.g., with multiple cell phones) (Simon et al., 2017).

Yan et al. provided a multi-scale CNN for unconstrained hand detection in still images. They used a generic region proposal algorithm, followed by multi-scale information fusion from the VGG16 model, in their hand detection scheme. They integrated the features from multiple layers of a CNN model to obtain a multi-scale representation of hand objects. The evaluation results on the Oxford Hand Detection Dataset and the VIVA Hand Detection Challenge showed that the model achieved detection accuracies of 49.6% and 92.8% on these datasets, with relative state-of-the-art improvements of 1.6% and 2.1% (Yan, Xia, Smith, Lu, & Zhang, 2017).

4.1.2. Depth-based hand detection

Neverova et al. provided a CNN-based model for hand detection and segmentation using both unlabeled and synthetic data. They incorporated the structural information not into the model architecture but into the training objective, and benefited from the advantages of very fast test-time processing and the ability to parallelize. While the evaluation results on the synthetic data showed an improvement of the segmentation and detection accuracy, obtaining a detection accuracy of 82.0%, the model still needs to be adapted to real data (Neverova et al., 2014).
Table 4
Deep sign language recognition models, categorized based on the datasets used for evaluation.
Dataset Year Ref. Goal Model Modality Results
Own dataset 2014 (Neverova, Wolf, Taylor, & Nebout, 2014) HP CNN 2D, Depth 82.0 (Acc.)
2014 (Tompson et al., 2014) HP CNN 2D, Depth 33.0 (error)
2015 (Kang, Tripathi, & Nguyen, 2015) HD CNN Depth 99.0
2015 (Tang, Lu, Wang, Huang, & Li, 2015) HP CNN, DNN 2D, Depth, RGB 0.899 ms (Ave. time)
2016 (Han, Chen, Li, & Chang, 2016) HG CNN 2D, RGB 93.80
2017 (Simon, Joo, Matthews, & Sheikh, 2017) HD CNN RGB 4.15 mm
2018 (Rao, Syamala, Kishore1, & Sastry, 2018) HSR CNN 2D, RGB 92.88
2018 (Ye, Tian, Huenerfauth, & Liu, 2018) HSR CNN 3D, RGB 69.2
2019 (Gomez-Donoso et al., 2019) HP CNN static, RGB, Depth 5 mm
2020 (Wadhawan & Kumar, 2020) HSR CNN static, RGB 99.72
NYU 2015 (Oberweger et al., 2015) HP CNN 3D, Depth 20 mm
2017 (Deng et al., 2017) HP CNN 3D, Depth 17 mm
2017 (Guo et al., 2017) HP CNN 2D, Depth 66.00
2017 (Fang & Lei, 2017) HP CNN, AE 2D, Depth 17 mm
2017 (Yuan et al., 2017) HP CNN 2D, Depth 21.4 mm
2017 (Madadi, Escalera1, Baro, & Gonzalez, 2017) HP CNN 2D, Depth 15.6 mm
2018 (Chen et al., 2020) HP CNN 2D, Depth 11.811 mm
2018 (Rastgoo et al., 2018) HSR RBM 2D, RGB, Depth 90.01
2016 (Sinha, Choi, & Ramani, 2016) HP CNN static, Depth 9 mm
2017 (Ge, Liang, Yuan, & Thalmann, 2017) HP CNN static, Depth 9 mm
2017 (Dibra et al., 2017) HP CNN static, Depth 9 mm
2018 (Ge, Liang, Yuan, & Thalmann, 2018) HP CNN static, Depth 9.46 mm
2018 (Moon, Chang, & Lee, 2018) HP CNN static, Depth 8.42 mm
2018 (Baek, Kim, & Kim, 2018) HP GAN static, Depth 14.1 mm
2018 (Kazakos, Nikou, & Kakadiaris, 2018) HP CNN static, RGB, Depth 9 mm
2018 (Spurr, Song, Park, & Hilliges, 2018) HP VAE static, RGB, Depth 10.5 mm
2020 (Rastgoo et al., 2020a) HSR SSD, 2DCNN, 3DCNN, LSTM dynamic, RGB 4.64 mm
ICVL 2015 (Oberweger et al., 2015) HP CNN 3D, Depth 10 mm
2017 (Deng et al., 2017) HP CNN 3D, Depth 11 mm
2017 (Guo et al., 2017) HP CNN 2D, Depth 7.8 mm
2017 (Fang & Lei, 2017) HP CNN, AE 2D, Depth 9 mm
2017 (Yuan et al., 2017) HP CNN 2D, Depth 12.3 (error)
2017 (Dibra et al., 2017) HP CNN static, Depth 8 mm
2018 (Chen et al., 2020) HP CNN 2D, Depth 6.793 mm
2018 (Ge, Liang et al., 2018) HP CNN static, Depth 8 mm
2018 (Moon et al., 2018) HP CNN static, Depth 6.28 mm
2018 (Baek et al., 2018) HP GAN static, Depth 8.5 mm
2018 (Spurr et al., 2018) HP VAE static, RGB, Depth 19.5 mm
MSRA 2016 (Oberweger, Riegler, Wohlhart, & Lepetit, 2016) HP CNN 2D, 3D, Depth 5.58 mm (error)
2017 (Guo et al., 2017) HP CNN 2D, Depth 9.80 mm
2017 (Madadi et al., 2017) HP CNN 2D, Depth 18 mm
2017 (Yuan et al., 2017) HP CNN 2D, Depth 21.3 (error)
2017 (Ge et al., 2017) HP CNN static, Depth 6 mm
2018 (Chen et al., 2020) HP CNN 2D, Depth 8.649 mm
2018 (Ge, Liang et al., 2018) HP CNN static, Depth 8 mm
2018 (Moon et al., 2018) HP CNN static, Depth 7.49 mm
2018 (Baek et al., 2018) HP GAN static, Depth 12.5 mm
2018 (Spurr et al., 2018) HP VAE static, RGB, Depth 10 mm
FLIC 2014 (Toshev & Szegedy, 2014) HP DNN 2D, RGB 96.0
2016 (Newell et al., 2016) HP CNN 2D, RGB 99.0 (Elbow)
2016 (Wei et al., 2016) HP CNN 2D, RGB 97.59
LSP 2014 (Toshev & Szegedy, 2014) HP DNN 2D, RGB 78.0
2016 (Wei et al., 2016) HP CNN 2D, RGB 84.32
isoGD 2016 (Duan, Zhou, Wany, Guo, & Li, 2016) HG CNN 2D, Depth, RGB 67.19
2017 (Wang, Li, Liu, Gao, Tang, & Ogunbona, 2017) HG CNN 2D, Depth 55.57
2020 (Rastgoo et al., 2020b) HSR SSD, CNN, LSTM 2D, RGB 86.32
MPII 2016 (Newell et al., 2016) HP CNN 2D, RGB 90.90 (total)
2016 (Wei et al., 2016) HP CNN 2D, RGB 87.95
ITOP 2016 (Haque et al., 2016) HP CNN 2D, Depth 80.50
2017 (Guo et al., 2017) HP CNN 2D, Depth 84.90
2018 (Marı n Jimeneza et al., 2018) HP CNN 3D, Depth 97.5 (AUC)
RGBD-HuDaAct 2016 (Duan et al., 2016) HG CNN 2D, Depth, RGB 96.74
STB 2017 (Zimmermann & Brox, 2017) HP CNN 3D, RGB 94.0 (AUC)
2018 (Mueller, Bernard, Sotnychenko, Mehta, Sridhar, HT CNN static, RGB 96.5 (AUC)
Casas, & Theobalt, 2018)
2018 (Spurr et al., 2018) HP VAE static, RGB, Depth 98.3(AUC)
2019 (Li et al., 2019) HP CNN static, RGB 8.34 (err)
2019 (Gomez-Donoso et al., 2019) HP CNN static, RGB, Depth 5 mm
EVAL 2016 (Haque et al., 2016) HP CNN 2D, Depth 74.10
Dexter 2017 (Zimmermann & Brox, 2017) HP CNN 3D, RGB 49.0 (AUC)
Table 4 (continued).
Dataset Year Ref. Goal Model Modality Results
Real video samples 2015 (Tagliasacchi et al., 2015) HT CNN 2D, Depth 8 mm (MSE)
2019 (Ferreira et al., 2019) HS CNN static, RGB, Depth 93.17, 92.61
RWTH-PHOENIX-Weather 2015 (Koller, Ney et al., 2015) HSR CNN 2D, RGB 55.70 (Precision)
BigHand2.2M 2017 (Yuan et al., 2017) HP CNN 2D, Depth 17.1 (error)
2018 (Baek et al., 2018) HP GAN static, Depth 13.7 mm
Human3.6M 2018 (Wang et al., 2018) HP CNN 2D, Depth 62.8 mm
UBC3V 2018 (Marı n Jimeneza et al., 2018) HP CNN 3D, Depth 88.2 (AUC)
Massey 2012 2018 (Rastgoo et al., 2018) HR RBM 2D, RGB, Depth 99.31
SL Surrey 2018 (Rastgoo et al., 2018) HR RBM 2D, RGB, Depth 97.56
ASL Fingerspelling A 2018 (Rastgoo et al., 2018) HR RBM 2D, RGB, Depth 98.13
OUHANDS, 2018 (Dadashzadeh, Tavakoli Targhi, & Tahmasbi, 2018) HG CNN 2D, Depth 86.46
Egohands 2017 (Dibia, 2017) HT CNN static, RGB 96.86 (mAP)
Dexter 2018 (Mueller et al., 2018) HT CNN static, RGB 64.0 (AUC)
EgoDexter 2018 (Mueller et al., 2018) HT CNN static, RGB 54.0 (AUC)
RHD 2018 (Spurr et al., 2018) HP VAE static, RGB, Depth 84.9(AUC)
B2RGB-SH 2019 (Li et al., 2019) HP CNN static, RGB 7.18 (err)
DHG-14/28 Dataset 2019 (Chen, Zhao, Peng, Yuan, & Metaxas, 2019) HG CNN dynamic, RGB 91.9
SHREC’17 Track Dataset 2019 (Chen et al., 2019) HG CNN dynamic, RGB 94.4
RWTH-BOSTON-50 2019 (Lim et al., 2019) HS CNN dynamic, RGB 89.33
ASLLVD 2019 (Lim et al., 2019) HS CNN dynamic, RGB 31.50
EgoGesture 2019 (Kopuklu, Gunduz, Kose, & Rigoll, 2019) HG CNN dynamic, RGB 94.03
NVIDIA benchmarks 2019 (Kopuklu et al., 2019) HG CNN dynamic, RGB 83.83
isoGD 2020 (Elboushaki, Hannane, Afdel, & Koutti, 2020) HG CNN dynamic, RGB, Depth 72.53
SKIG 2020 (Elboushaki et al., 2020) HG CNN dynamic, RGB, Depth 99.72
NATOPS 2020 (Elboushaki et al., 2020) HG CNN dynamic, RGB, Depth 95.87
SBU 2020 (Elboushaki et al., 2020) HG CNN dynamic, RGB, Depth 97.51
RKS-PERSIANSIGN 2020 (Rastgoo et al., 2020a) HSR SSD, 2DCNN, 3DCNN, LSTM dynamic, RGB 99.80
Table 5
Deep sign language recognition models, categorized based on the evaluation metric.
Evaluation metric Year Ref. Goal Model Modality Dataset Results
Accuracy 2014 (Neverova et al., 2014) HP CNN 2D, Depth proposed dataset 82.0
2014 (Toshev & Szegedy, 2014) HP DNN 2D, RGB FLIC, LSP 96.0 , 78.0
2015 (Kang et al., 2015) HD CNN Depth proposed dataset 99.0
2016 (Han et al., 2016) HG CNN 2D, RGB proposed dataset 93.80
2016 (Duan et al., 2016) HG CNN 2D, Depth, RGB Chalearn IsoGD, RGBD-HuDaAct 67.19, 96.74
2016 (Newell et al., 2016) HP CNN 2D, RGB FLIC, MPII 99.0 (Elbow), 90.90 (total)
2016 (Wei et al., 2016) HP CNN 2D, RGB MPII, LSP, FLIC 87.95, 84.32, 97.59
2016 (Haque et al., 2016) HP CNN 2D, Depth EVAL, ITOP 74.10, 80.50
2017 (Wang et al., 2017) HG CNN 2D, Depth ChaLearn 55.57
2018 (Rao et al., 2018) HSR CNN 2D, RGB own dataset 92.88
2018 (Ye et al., 2018) HSR CNN 3D, RGB own dataset 69.2
2018 (Rastgoo et al., 2018) HR RBM 2D, RGB, Depth Massey 2012, SL Surrey, 99.31, 97.56, 90.01, 98.13
NYU, ASL Fingerspelling A
2018 (Dadashzadeh et al., 2018) HG CNN 2D, Depth OUHANDS, 86.46
2020 (Wadhawan & Kumar, 2020) HSR CNN static, RGB own dataset 99.72
2019 (Chen et al., 2019) HG CNN dynamic, RGB DHG-14/28 Dataset 91.9
2019 (Chen et al., 2019) HG CNN dynamic, RGB SHREC’17 Track Dataset 94.4
2019 (Ferreira et al., 2019) HS CNN static, RGB, Depth real-time frames 93.17, 92.61
2019 (Lim et al., 2019) HS CNN dynamic, RGB RWTH-BOSTON-50 89.33
2019 (Lim et al., 2019) HS CNN dynamic, RGB ASLLVD 31.50
2019 (Kopuklu et al., 2019) HG CNN dynamic, RGB EgoGesture 94.03
2019 (Kopuklu et al., 2019) HG CNN dynamic, RGB NVIDIA benchmarks 83.83
2020 (Elboushaki et al., 2020) HG CNN dynamic, RGB, Depth isoGD 72.53
2020 (Elboushaki et al., 2020) HG CNN dynamic, RGB, Depth SKIG 99.72
2020 (Elboushaki et al., 2020) HG CNN dynamic, RGB, Depth NATOPS 95.87
2020 (Elboushaki et al., 2020) HG CNN dynamic, RGB, Depth SBU 97.51
mAP 2017 (Dibia, 2017) HT CNN static, RGB Egohands 96.86
Error 2014 (Tompson et al., 2014) HP CNN 2D, Depth proposed dataset 0.33
2017 (Yuan et al., 2017) HP CNN 2D, Depth ICVL, NYU, MSRC, BigHand2.2M 12.3, 21.4, 21.3, 17.1
2019 (Li et al., 2019) HP CNN static, RGB STB 8.34
2019 (Li et al., 2019) HP CNN static, RGB B2RGB-SH 7.18
2015 (Tagliasacchi et al., 2015) HT CNN 2D, Depth real video samples 8 mm
2015 (Oberweger et al., 2015) HP CNN 3D, Depth ICVL, NYU 10 mm, 20 mm
2016 (Oberweger et al., 2016) HP CNN 2D, 3D, Depth MSRA 5.58 mm
2017 (Deng et al., 2017) HP CNN 3D, Depth ICVL, NYU 11 mm, 17 mm
2017 (Guo et al., 2017) HP CNN 2D, Depth ICVL, NYU, MSRA, ITOP 7.8 mm, 34.00, 9.80 mm, 15.10
2017 (Simon et al., 2017) HD CNN RGB own dataset 4.15 mm
2017 (Fang & Lei, 2017) HP CNN, AE 2D, Depth NYU, ICVL 17 mm, 9 mm
2017 (Madadi et al., 2017) HP CNN 2D, Depth NYU, MSRA 15.6 mm, 18 mm
2018 (Chen et al., 2020) HP CNN 2D, Depth ICVL, NYU, MSRA 6.793 mm, 11.811 mm, 8.649 mm
2018 (Wang et al., 2018) HP CNN 2D, Depth Human3.6M 62.8 mm
2016 (Sinha et al., 2016) HP CNN static, Depth Dexter1 16.35 mm
2016 (Sinha et al., 2016) HP CNN static, Depth NYU 9 mm
2017 (Ge et al., 2017) HP CNN static, Depth NYU 9 mm
2017 (Dibra et al., 2017) HP CNN static, Depth NYU 9 mm
2018 (Ge, Liang et al., 2018) HP CNN static, Depth NYU 9.46 mm
2018 (Moon et al., 2018) HP CNN static, Depth NYU 8.42 mm
2018 (Baek et al., 2018) HP GAN static, Depth NYU 14.1 mm
2018 (Kazakos et al., 2018) HP CNN static, RGB, Depth NYU 9 mm
2018 (Spurr et al., 2018) HP VAE static, RGB, Depth NYU 10.5 mm
2017 (Dibra et al., 2017) HP CNN static, Depth ICVL 8 mm
2018 (Ge, Liang et al., 2018) HP CNN static, Depth ICVL 8 mm
2018 (Moon et al., 2018) HP CNN static, Depth ICVL 6.28 mm
2018 (Baek et al., 2018) HP GAN static, Depth ICVL 8.5 mm
2018 (Spurr et al., 2018) HP VAE static, RGB, Depth ICVL 19.5 mm
2017 (Ge et al., 2017) HP CNN static, Depth MSRA 6 mm
2018 (Ge, Liang et al., 2018) HP CNN static, Depth MSRA 8 mm
2018 (Moon et al., 2018) HP CNN static, Depth MSRA 7.49 mm
2018 (Baek et al., 2018) HP GAN static, Depth MSRA 12.5 mm
2018 (Spurr et al., 2018) HP VAE static, RGB, Depth MSRA 10 mm
2019 (Gomez-Donoso et al., 2019) HP CNN static, RGB, Depth STB 5 mm
2019 (Gomez-Donoso et al., 2019) HP CNN static, RGB, Depth own dataset 5 mm
2018 (Baek et al., 2018) HP GAN static, Depth Big Hand 2.2M 13.7 mm
Run time 2015 (Tang et al., 2015) HP CNN, DNN 2D, Depth, RGB own dataset 0.899 ms
Precision 2015 (Koller, Ney et al., 2015) HSR CNN 2D, RGB RWTH-PHOENIX-Weather 55.70
AUC 2017 (Zimmermann & Brox, 2017) HP CNN 3D, RGB SPT and Dexter 94.0, 49.0
2018 (Marı n Jimeneza et al., 2018) HP CNN 3D, Depth UBC3V, ITOP 88.2, 97.5
2018 (Mueller et al., 2018) HT CNN static, RGB STB 96.5
2018 (Spurr et al., 2018) HP VAE static, RGB, Depth STB 98.3
2018 (Mueller et al., 2018) HT CNN static, RGB Dexter 64.0
2018 (Mueller et al., 2018) HT CNN static, RGB EgoDexter 54.0
2018 (Spurr et al., 2018) HP VAE static, RGB, Depth RHD 84.9
Table 6
Deep sign language recognition models, categorized based on the feature types.
Goal Year Ref. Model Modality Dataset Results
HP 2014 (Neverova et al., 2014) CNN 2D, Depth proposed dataset 82 (Acc.)
2014 (Toshev & Szegedy, 2014) DNN 2D, RGB FLIC, LSP 96, 78
2016 (Newell et al., 2016) CNN 2D, RGB FLIC, MPII 99 (Elbow), 90.90 (total)
2016 (Wei et al., 2016) CNN 2D, RGB MPII, LSP, FLIC 87.95, 84.32, 97.59
2016 (Haque et al., 2016) CNN 2D, Depth EVAL, ITOP 74.10, 80.50
2014 (Tompson et al., 2014) CNN 2D, Depth proposed dataset 33 (error)
2017 (Yuan et al., 2017) CNN 2D, Depth ICVL, NYU, MSRC, BigHand2.2M 12.3, 21.4, 21.3, 17.1 (error)
2015 (Oberweger et al., 2015) CNN 3D, Depth ICVL, NYU 10 mm, 20 mm
2016 (Oberweger et al., 2016) CNN 2D, 3D, Depth MSRA 5.58 mm (error)
2017 (Deng et al., 2017) CNN 3D, Depth ICVL, NYU 11 mm, 17 mm
2017 (Guo et al., 2017) CNN 2D, Depth ICVL, NYU, MSRA, ITOP 7.8 mm, 66.00, 9.80 mm, 84.90
2017 (Fang & Lei, 2017) CNN, AE 2D, Depth NYU, ICVL 17 mm, 9 mm
2017 (Madadi et al., 2017) CNN 2D, Depth NYU, MSRA 15.6 mm, 18 mm
2018 (Chen et al., 2020) CNN 2D, Depth ICVL, NYU, MSRA 6.793 mm, 11.811 mm, 8.649 mm
2018 (Wang et al., 2018) CNN 2D, Depth Human3.6M 62.8 mm
2015 (Tang et al., 2015) CNN, DNN 2D, Depth, RGB own dataset 0.899 ms(Ave. time)
2017 (Zimmermann & Brox, 2017) CNN 3D, RGB Stereo Hand Pose Tracking Benchmark and Dexter 94, 49.0 (AUC)
2018 (Marı n Jimeneza et al., 2018) CNN 3D, Depth UBC3V, ITOP 88.2, 97.5 (AUC)
2016 (Sinha et al., 2016) CNN static, Depth Dexter1 16.35 mm
2016 (Sinha et al., 2016) CNN static, Depth NYU 9 mm
2017 (Ge et al., 2017) CNN static, Depth MSRA 6 mm
2017 (Ge et al., 2017) CNN static, Depth NYU 9 mm
2017 (Dibra et al., 2017) CNN static, Depth ICVL 8 mm
2017 (Dibra et al., 2017) CNN static, Depth NYU 9 mm
2018 (Ge, Liang et al., 2018) CNN static, Depth NYU 9.46 mm
2018 (Ge, Liang et al., 2018) CNN static, Depth ICVL 8 mm
2018 (Ge, Liang et al., 2018) CNN static, Depth MSRA 8 mm
2018 (Moon et al., 2018) CNN static, Depth ICVL 6.28 mm
2018 (Moon et al., 2018) CNN static, Depth NYU 8.42 mm
2018 (Moon et al., 2018) CNN static, Depth MSRA 7.49 mm
2018 (Baek et al., 2018) GAN static, Depth ICVL 8.5 mm
2018 (Baek et al., 2018) GAN static, Depth MSRA 12.5 mm
2018 (Baek et al., 2018) GAN static, Depth NYU 14.1 mm
2018 (Baek et al., 2018) GAN static, Depth Big Hand 2.2M 13.7 mm
2018 (Kazakos et al., 2018) CNN static, RGB, Depth NYU 9 mm
2018 (Spurr et al., 2018) VAE static, RGB, Depth STB 98.3(AUC)
2018 (Spurr et al., 2018) VAE static, RGB, Depth RHD 84.9(AUC)
2018 (Spurr et al., 2018) VAE static, RGB, Depth ICVL 19.5 mm
2018 (Spurr et al., 2018) VAE static, RGB, Depth NYU 10.5 mm
2018 (Spurr et al., 2018) VAE static, RGB, Depth MSRA 10 mm
2019 (Gomez-Donoso et al., 2019) CNN static, RGB, Depth own dataset 5 mm
2019 (Gomez-Donoso et al., 2019) CNN static, RGB, Depth STB 5 mm
2019 (Li et al., 2019) CNN static, RGB STB 8.34 (err)
2019 (Li et al., 2019) CNN static, RGB B2RGB-SH 7.18 (err)
RHR 2018 (Rastgoo et al., 2018) RBM 2D, RGB, Depth Massey 2012, SL Surrey, 99.31, 97.56, 90.01, 98.13
NYU, ASL Fingerspelling A
HG 2016 (Han et al., 2016) CNN 2D, RGB proposed dataset 93.80
2016 (Duan et al., 2016) CNN 2D, Depth, RGB Chalearn IsoGD, RGBD-HuDaAct 67.19, 96.74
2017 (Wang et al., 2017) CNN 2D, Depth ChaLearn 55.57
2018 (Dadashzadeh et al., 2018) CNN 2D, Depth OUHANDS, 86.46
2019 (Kopuklu et al., 2019) CNN dynamic, RGB EgoGesture 94.03
2019 (Kopuklu et al., 2019) CNN dynamic, RGB NVIDIA benchmarks 83.83
2020 (Elboushaki et al., 2020) CNN dynamic, RGB, Depth isoGD 72.53
2020 (Elboushaki et al., 2020) CNN dynamic, RGB, Depth SKIG 99.72
2020 (Elboushaki et al., 2020) CNN dynamic, RGB, Depth NATOPS 95.87
2020 (Elboushaki et al., 2020) CNN dynamic, RGB, Depth SBU 97.51
2019 (Chen et al., 2019) CNN dynamic, RGB DHG-14/28 Dataset 91.9
2019 (Chen et al., 2019) CNN dynamic, RGB SHREC’17 Track Dataset 94.4
HSR 2018 (Rao et al., 2018) CNN 2D, RGB own dataset 92.88
2018 (Ye et al., 2018) CNN 3D, RGB own dataset 69.2
2015 (Koller, Ney et al., 2015) CNN 2D, RGB RWTH-PHOENIX-Weather 55.70 (Precision)
2020 (Wadhawan & Kumar, 2020) CNN static, RGB own dataset 99.72
2019 (Ferreira et al., 2019) CNN static, RGB, Depth real-time frames 93.17, 92.61
2019 (Lim et al., 2019) CNN dynamic, RGB RWTH-BOSTON-50 89.33
2019 (Lim et al., 2019) CNN dynamic, RGB ASLLVD 31.50
HD 2015 (Kang et al., 2015) CNN Depth proposed dataset 99
2017 (Simon et al., 2017) CNN RGB own dataset 4.15 mm
Table 6 (continued).
Goal Year Ref. Model Modality Dataset Results
HT 2015 (Tagliasacchi et al., 2015) CNN dynamic, Depth real video samples 8 mm (MSE)
2017 (Dibia, 2017) CNN static, RGB Egohands 96.86 (mAP)
2018 (Mueller et al., 2018) CNN static, RGB STB 96.5 (AUC)
2018 (Mueller et al., 2018) CNN static, RGB Dexter 64 (AUC)
2018 (Mueller et al., 2018) CNN static, RGB EgoDexter 54 (AUC)
Stereo Hand Pose Tracking Benchmark show that the model outperformed state-of-the-art alternatives in hand pose estimation, obtaining a Percentage of Correct Keypoints (PCK) of 99.85% with a 9.85% relative improvement (Gomez-Donoso et al., 2019). Spurr et al. designed a model to learn a statistical hand model represented by a trained cross-modal latent space via a VAE. They derive the variational lower bound that permits training of a single latent space using multiple modalities. In this space, similar input poses are embedded close to each other, independently of the input modality. While the proposed model is trained on RGB input images, it is tested on different combinations of modalities where the aim is to estimate 3D hand poses as output. In parallel, the VAE framework permits generating samples consistently in each modality. Reported results using the AUC metric show a 17.4 relative improvement in model performance in comparison with the state-of-the-art models in hand pose estimation on the RHD dataset (Spurr et al., 2018).

4.2.2. Depth-based hand pose models

Oberweger et al. have proposed several CNN architectures to predict the 3D hand joint locations from a depth map. They used a learned prior model as a bottleneck layer, with fewer neurons than the last layer, to improve the hand pose estimation accuracy. Furthermore, a refinement stage was provided using spatial pooling and sub-sampling of multiple input regions centered on the initial joint estimations. This model achieved estimation errors of 10 mm and 20 mm on the ICVL and NYU datasets, obtaining 1 mm and 3 mm relative improvements in comparison with state-of-the-art models. Performance analysis of this model shows that the location prediction of the joints is constrained by the learned hand model: if the missing regions of the prediction get too large, the accuracy gets worse (Oberweger et al., 2015).
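To make this kind of depth-based joint regression concrete, the following is a rough sketch (inspired by, but not reproducing, the bottleneck-prior idea above) of a CNN that maps a cropped depth patch to a low-dimensional pose embedding and then to the 3D coordinates of J joints; the patch size, embedding size, and joint count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DepthJointRegressor(nn.Module):
    """Depth patch -> low-dimensional pose embedding -> 3D joint coordinates (sketch)."""
    def __init__(self, num_joints=14, embedding_dim=30):
        super().__init__()
        self.num_joints = num_joints
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # A bottleneck with fewer units than the output acts as a learned pose prior
        self.bottleneck = nn.Linear(128, embedding_dim)
        self.regressor = nn.Linear(embedding_dim, num_joints * 3)

    def forward(self, depth_patch):  # depth_patch: (batch, 1, 96, 96), normalized depth
        emb = self.bottleneck(self.features(depth_patch))
        return self.regressor(emb).view(-1, self.num_joints, 3)

joints = DepthJointRegressor()(torch.randn(8, 1, 96, 96))
print(joints.shape)  # torch.Size([8, 14, 3])
```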
A 3D neural network architecture was provided by Deng et al. for 3D hand pose estimation from a single depth image. They converted the input depth map into a 3D volumetric representation and fed it into a 3D CNN without any ground-truth reference point for network initialization. Some synthetic depth images were rendered from existing real image datasets to increase the training data. They achieved estimation accuracies of 74% and 96%, with 9% and 3% relative improvements in comparison with state-of-the-art alternatives on the NYU and ICVL datasets (Deng et al., 2017). Guo et al. suggested a tree-structured Region Ensemble Network (REN) for hand pose estimation using a regression-based method. Partitioning the last convolution outputs of the CNN into several grid regions, the results from fully-connected (FC) regressors on each region are fused and input to another FC layer for the final estimation. Different training strategies were used to improve the performance of the joint localization. The results of the model evaluation on three public datasets, ICVL, NYU, and ITOP, showed that their model achieved state-of-the-art results for hand pose estimation with 1.05 mm, 3.7 mm, and 4.4% relative improvements (Guo et al., 2017). Oberweger et al. proposed an efficient and accurate method to label each frame of a hand depth video and estimate the 3D locations of the joints. They sample some frames of each video, namely reference frames, and users provide the initial 2D estimations of the visible joints. After that, some spatial, temporal, and appearance constraints are applied to these 2D joints in order to estimate the full 3D poses of the hand over the complete sequence. While the evaluation results on the MSRA dataset showed that the model outperformed the state-of-the-art model with a 4.38 mm relative improvement, their model was too complicated to use (Oberweger et al., 2016). Another deep learning approach, including a CNN with an embedded forward-kinematics-based layer for the intermediate representation in the network, was proposed by Zhou et al. for hand pose estimation. The proposed model uses prior geometric knowledge in the learning process. While the model has some problems with inaccurate joint annotations and small viewpoint changes on the ICVL dataset, it achieves an estimation accuracy of 16.9 mm on the NYU dataset. The result of this model is comparable with state-of-the-art alternatives on NYU, with an approximately 1 mm relative distance (Zhou, Wan, Zhang, Xue, & Wei, 2016). Ge et al. proposed a multi-view CNN-based model to project the query depth image onto three orthogonal planes and regress the 2D heat-maps of each plane in order to estimate the hand joint positions. The final 3D hand pose estimation with learned pose priors is achieved using the multi-view heat-maps. Experimental results on the NYU, ICVL, and MSRA datasets show that the proposed method outperforms state-of-the-art alternatives, achieving relative improvements of 1.1 mm, 0.5 mm, and 0.5 mm on these datasets, respectively (Ge, Ren, & Yuan, 2018). Fang and Lei suggested a CNN model with an embedded denoising auto-encoder in the bottom layer of the network for hand pose estimation. They used the auto-encoder as a nonlinear dimension converter in the model to learn the prior knowledge, and consolidated this embedding layer monotonically into the CNN to improve the hand pose estimation accuracy. They reported relative state-of-the-art improvements of approximately 1 mm and 2 mm on the ICVL and NYU datasets (Fang & Lei, 2017). Chen et al. suggested a Pose guided structured Region Ensemble Network (Pose-REN) to enhance the estimation accuracy of hand pose from a depth image. They partitioned the CNN feature maps into some regions and fused the estimated joints of these regions based on tree-structured fully connections. A refined estimation and an iterative cascaded method were applied to estimate the final hand pose. Evaluation results showed that the model obtained state-of-the-art results with relative improvements of 7.04%, 0.88 mm, and 2 mm on the ICVL, NYU, and MSRA datasets, respectively (Chen et al., 2020). Yuan et al. proposed a tracking system using magnetic sensors and inverse kinematics to automatically acquire the hand joint annotations from a depth map. They also provided a dataset using six magnetic sensors, one on each fingernail (five in total) and one on the back of the palm, to record the 6D measurements of the joints. After that, inverse kinematics with 31 degrees of freedom (dof) and kinematic constraints were applied to estimate the final joint locations. This model achieved relative state-of-the-art improvements of 2.6 mm, 1.2 mm, and 29.2 mm in estimation error on the ICVL, NYU, and Bighand datasets, respectively (Yuan et al., 2017). Supancic et al. provided an analysis of the state-of-the-art methods focusing on hand pose estimation from a single depth frame. They defined an evaluation metric and a simple nearest-neighbor baseline to compare the recognition accuracy of different models. Comparison results of some existing models on the NYU, ICL, and EGO datasets were presented and analyzed extensively using the proposed metric. Evaluation results confirm the effectiveness of the proposed metric for analyzing different methods and datasets (Supancic et al., 2018). Moon et al. designed a model for mapping 3D hand and human pose estimation parameters into a voxel-to-voxel space to estimate the likelihood of each keypoint for each voxel. This model benefits from a 3D CNN for real-time estimation of hand keypoints. Evaluation results on the three publicly available 3D hand and human pose estimation datasets, used in the HANDS 2017 challenge, show that the model outperforms the other models in hand and human pose estimation. This model achieved
Table 7
Deep sign language recognition models, categorized based on the input modality.
Modality Year Ref. Goal Model Dataset Results
2D, Depth 2014 (Neverova et al., 2014) HP CNN own dataset 82 (Acc.)
2014 (Tompson et al., 2014) HP CNN own dataset 33 (error)
2015 (Kang et al., 2015) HD CNN own dataset 99
2017 (Guo et al., 2017) HP CNN NYU 66
2017 (Fang & Lei, 2017) HP CNN, AE NYU 17 mm
2017 (Yuan et al., 2017) HP CNN NYU 21.4 (error)
2017 (Madadi et al., 2017) HP CNN NYU 15.6 mm
2018 (Chen et al., 2020) HP CNN NYU 11.811 mm
2016 (Sinha et al., 2016) HP CNN NYU 9 mm
2017 (Ge et al., 2017) HP CNN NYU 9 mm
2017 (Dibra et al., 2017) HP CNN NYU 9 mm
2018 (Ge, Liang et al., 2018) HP CNN NYU 9.46 mm
2018 (Moon et al., 2018) HP CNN NYU 8.42 mm
2018 (Baek et al., 2018) HP GAN NYU 14.1 mm
2017 (Guo et al., 2017) HP CNN ICVL 7.8 mm
2017 (Fang & Lei, 2017) HP CNN, AE ICVL 9 mm
2017 (Yuan et al., 2017) HP CNN ICVL 12.3 (error)
2017 (Dibra et al., 2017) HP CNN ICVL 8 mm
2018 (Chen et al., 2020) HP CNN ICVL 6.793 mm
2018 (Ge, Liang et al., 2018) HP CNN ICVL 8 mm
2018 (Moon et al., 2018) HP CNN ICVL 6.28 mm
2D, Depth 2018 (Baek et al., 2018) HP GAN ICVL 8.5 mm
2017 (Guo et al., 2017) HP CNN MSRA 9.80 mm
2017 (Madadi et al., 2017) HP CNN MSRA 18 mm
2017 (Yuan et al., 2017) HP CNN MSRA 21.3 (error)
2017 (Ge et al., 2017) HP CNN MSRA 6 mm
2018 (Chen et al., 2020) HP CNN MSRA 8.649 mm
2018 (Ge, Liang et al., 2018) HP CNN MSRA 8 mm
2018 (Moon et al., 2018) HP CNN MSRA 7.49 mm
2018 (Baek et al., 2018) HP GAN MSRA 12.5 mm
2017 (Wang et al., 2017) HG CNN ChaLearn IsoGD 55.57
2016 (Haque et al., 2016) HP CNN ITOP 80.50
2017 (Guo et al., 2017) HP CNN ITOP 84.90
2016 (Haque et al., 2016) HP CNN EVAL 74.10
2015 (Tagliasacchi et al., 2015) HT CNN real video samples 8 mm (MSE)
2017 (Yuan et al., 2017) HP CNN BigHand2.2M 17.1 (error)
2018 (Baek et al., 2018) HP GAN Big Hand 2.2M 13.7 mm
2018 (Wang et al., 2018) HP CNN Human3.6M 62.8 mm
2018 (Dadashzadeh et al., 2018) HG CNN OUHANDS, 86.46
3D, Depth 2015 (Oberweger et al., 2015) HP CNN NYU 20 mm
2017 (Deng et al., 2017) HP CNN NYU 17 mm
2015 (Oberweger et al., 2015) HP CNN ICVL 10 mm
2017 (Deng et al., 2017) HP CNN ICVL 11 mm
2018 (Marı n Jimeneza et al., 2018) HP CNN ITOP 97.5 (AUC)
2018 (Marı n Jimeneza et al., 2018) HP CNN UBC3V 88.2 (AUC)
2D, RGB 2016 (Han et al., 2016) HG CNN own dataset 93.80
2017 (Simon et al., 2017) HD CNN own dataset 4.15 mm
2018 (Rao et al., 2018) HSR CNN own dataset 92.88
2020 (Wadhawan & Kumar, 2020) HSR CNN own dataset 99.72
2014 (Toshev & Szegedy, 2014) HP DNN FLIC 96
2016 (Newell et al., 2016) HP CNN FLIC 99.0 (Elbow)
2016 (Wei et al., 2016) HP CNN FLIC 97.59
2014 (Toshev & Szegedy, 2014) HP DNN LSP 0.78
2016 (Wei et al., 2016) HP CNN LSP 84.32
2016 (Newell et al., 2016) HP CNN MPII 90.90 (total)
2016 (Wei et al., 2016) HP CNN MPII 87.95
2018 (Mueller et al., 2018) HT CNN STB 96.5 (AUC)
2019 (Li et al., 2019) HP CNN STB 8.34 (err)
2015 (Koller, Ney et al., 2015) HSR CNN RWTH-PHOENIX-Weather 55.70 (Precision)
2017 (Dibia, 2017) HT CNN Egohands 96.86 (mAP)
2018 (Mueller et al., 2018) HT CNN Dexter 64 (AUC)
2018 (Mueller et al., 2018) HT CNN EgoDexter 54 (AUC)
2019 (Li et al., 2019) HP CNN B2RGB-SH 7.18 (err)
3D, dynamic, RGB 2018 (Ye et al., 2018) HSR CNN own dataset 69.2
2017 (Zimmermann & Brox, 2017) HP CNN STB 94 (AUC)
2017 (Zimmermann & Brox, 2017) HP CNN Dexter 49.0 (AUC)
2019 (Lim et al., 2019) HS CNN RWTH-BOSTON-50 89.33
2019 (Lim et al., 2019) HS CNN ASLLVD 31.50
2019 (Chen et al., 2019) HG CNN DHG-14/28 Dataset 91.9
2019 (Chen et al., 2019) HG CNN SHREC’17 Track Dataset 94.4
Table 7 (continued).
Modality Year Ref. Goal Model Dataset Results
2D, RGB, Depth 2018 (Rastgoo et al., 2018) HR RBM Massey 2012 99.31
2018 (Rastgoo et al., 2018) HR RBM SL Surrey 97.56
2018 (Rastgoo et al., 2018) HR RBM ASL Fingerspelling A 98.13
2019 (Gomez-Donoso et al., 2019) HP CNN own dataset 5 mm
2018 (Rastgoo et al., 2018) RHR RBM NYU 90.01
2018 (Kazakos et al., 2018) HP CNN NYU 9 mm
2018 (Spurr et al., 2018) HP VAE NYU 10.5 mm
2018 (Spurr et al., 2018) HP VAE ICVL 19.5 mm
2018 (Spurr et al., 2018) HP VAE MSRA 10 mm
2016 (Duan et al., 2016) HG CNN Chalearn IsoGD 67.19
2016 (Duan et al., 2016) HG CNN RGBD-HuDaAct 96.74
2015 (Tang et al., 2015) HP CNN, DNN own dataset 0.899 ms (Ave. time)
2018 (Spurr et al., 2018) HP VAE RHD 84.9 (AUC)
2019 (Ferreira et al., 2019) HS CNN real-time frames 93.17, 92.61
2018 (Spurr et al., 2018) HP VAE STB 98.3(AUC)
2019 (Gomez-Donoso et al., 2019) HP CNN STB 5 mm
dynamic, RGB, Depth 2020 (Elboushaki et al., 2020) HG CNN isoGD 72.53
2020 (Elboushaki et al., 2020) HG CNN SKIG 99.72
2020 (Elboushaki et al., 2020) HG CNN NATOPS 95.87
2020 (Elboushaki et al., 2020) HG CNN SBU 97.51
2019 (Kopuklu et al., 2019) HG CNN EgoGesture 93.75, 94.03
2019 (Kopuklu et al., 2019) HG CNN NVIDIA benchmarks 78.63, 83.83
2D, 3D, Depth 2016 (Oberweger et al., 2016) HP CNN MSRA 5.58 mm (Ave. err.)
relative improvement of 0.51 mm, 3.39 mm, and 1.16 mm on the ICVL, NYU, and MSRA datasets, respectively (Moon et al., 2018). Dibra et al. used a simple CNN pre-trained only on synthetic depth images generated from a single 3D hand model. They fine-tuned the model on unlabeled depth images from a real user's hand. Experimental results on two public datasets showed that the model performance was comparable with state-of-the-art methods in hand pose estimation, obtaining estimation errors of 8 mm and 9 mm on the ICVL and NYU datasets (Dibra et al., 2017). Baek et al. applied a GAN to hand pose estimation by establishing a one-to-one relation between depth disparity maps and 3D hand pose models. This model refines the initial skeleton estimations for further accuracy improvement. Evaluation results on ICVL, MSRA, NYU, and BigHand2.2M demonstrate that the proposed model is on par with state-of-the-art models in hand pose estimation, achieving estimation errors of 8.5 mm, 12.5 mm, 14.1 mm, and 13.7 mm on these datasets, respectively (Baek et al., 2018). Ge et al. proposed a 3DCNN for real-time hand pose estimation from single depth images. A 3D volumetric representation of the hand depth image is fed to the 3DCNN to capture the 3D spatial structure of the input. A 3D data augmentation is performed on the training data to increase robustness to variations in hand size and global orientation. Results on MSRA show that the proposed model outperforms state-of-the-art methods in hand pose estimation with a relative improvement of 3 mm. Furthermore, the evaluation of the proportion of joints within different error thresholds on the NYU dataset confirms a relative state-of-the-art improvement of 10% (Ge et al., 2017). Sinha et al. proposed a CNN-based model for real-time 3D hand pose estimation using depth data. They provided a hierarchical pipeline for hand pose estimation that combines global pose orientation and finger articulations in a principled way. They used an efficient matrix completion method for joint angle parameter estimation based on the initialized pose matrix. Experimental results on Dexter1 confirm a relative state-of-the-art improvement of 3.25 mm. Indeed, this model achieved a relative state-of-the-art improvement of approximately 4% in estimation accuracy on the NYU dataset (Sinha et al., 2016).
4.2.3. Multi-modal hand pose estimation models
Tang et al. proposed a real-time hand pose estimation model using the combination of traditional and deep models. While they used a morphological transform over two modalities for hand detection, a Deep Belief Network (DBN) was applied to hand pose estimation in real-time. They evaluated the model on some provided videos and claimed that the proposed model is fast in all stages, achieving an average recognition time of 0.899 ms per video input (Tang et al., 2015). Kazakos et al. designed a CNN-based model using the fusion of RGB and depth information in a double-stream architecture for hand pose estimation. While the RGB and depth images are fed into two separate CNNs for feature extraction, the intermediate layers of the CNNs are fused for the final hand pose estimation. Evaluation results demonstrate that, while the depth of the network is crucial for hand pose estimation, the double-stream network performs very similarly to a network trained only on depth images. The proposed model obtained an estimation error comparable with state-of-the-art methods on the NYU hand pose dataset (Kazakos et al., 2018).
4.2.4. Discussion
With the advent of deep learning for hand sign language recognition in recent years, having a large amount of data has proven to be a crucial part of model learning. Using depth cameras has facilitated the creation of large-scale datasets with automatic annotations of keypoint locations, obtained using magnetic sensors attached to the hand. While these magnetic sensors provide accurate annotations for depth inputs, they are not effective for RGB inputs because they deform the hand appearance in the RGB images. So, most of the proposed models in the hand pose estimation area are based on depth inputs, and few models have been suggested for RGB inputs. A broad benchmark evaluation has shown that deep models appear particularly well-suited for pose estimation (Supancic et al., 2018). The impressive capability of CNNs to work with still images is not enough for video inputs, and it needs to be combined with another deep approach in order to better cover the sequence information. Based on the performance analysis of different CNN structures with regard to hand shape, joint visibility, viewpoint, and articulation distributions, 3D hand pose estimation using deep learning approaches in isolated scenes has a lower mean error in comparison with cluttered and complex scenes. Furthermore, it is possible to integrate the forward kinematic process of an articulated hand model into the deep learning framework for accurate hand pose estimation (Supancic et al., 2018). Using prior knowledge of a geometric hand model in the learning process has also led to significant results (Zhou et al., 2016).
While hand pose estimation has roughly been solved for scenes with isolated hands, the proposed models still struggle to analyze cluttered scenes where hands may be interacting with other objects and surfaces. Furthermore, self-occlusion (between fingers), the close similarity between fingers, and the high dimension of the hand kinematic parameters remain challenging. Also, when segmentation is hard due to active hands or clutter, many existing models fail to work properly. Using multi-modal inputs, such as image, video, skeleton, flow features, text, and so on, along with real and diverse training sets, deserves more consideration because the input data is as important as the choice of model architecture. The current datasets for hand pose estimation are not suitable enough to apply in real sign language communication because they are highly restricted not only in environmental conditions and background complexities but also in the number of signs in each video. In other words, we need a dataset that includes daily phrases or sentences similar to the real world, not a constrained environment with predefined slots. Articulated hand pose estimation is still an open problem and under intensive research in both academia and industry. We think that using complementary features, such as face and body, richer and more realistic datasets benefiting from the face and human datasets, along with new hardware and wearable device capabilities, could improve hand pose estimation accuracy in the future.
4.3. Real-time hand tracking
Hands are the most important objects in the inputs of sign language recognition models. Tracking the detected hands is one of the substantial challenges for video inputs due to heavy occlusions of hand fingers and joints. In this sub-section, we review the deep-based models suggested for hand tracking in the last four years.
4.3.4. Discussion
Real-time hand detection is now an attractive area in the research community as a next step toward a complete system for online human communication in desktop environments. Using deep learning approaches along with the recent improvements of GPUs has led to impressive improvements in accuracy and speed for real-time hand tracking. Some of the models use not only CNN capabilities, as a deep learning model, but also the advantages of a depth sensor for providing input data (Kang et al., 2015). Furthermore, the advent of fast CNN-based models such as Faster-RCNN (Ren et al., 2015), YOLO (Redmon et al., 2016), and SSD (Liu et al., 2016) makes deep-based models attractive candidates for real-time hand detection and tracking applications. Accurate and real-time hand tracking is a challenging problem in computer vision due to the highly articulated human hand. Fast processing in an uncontrolled environment, considering rapid hand motions, is a very difficult requirement for real-time hand tracking models. Since it is difficult to satisfy this requirement, some hand tracking models apply restrictions on the user or the environment to facilitate the processing. For example, some models consider only a uniform or static background, avoid rapid hand motions, or assume the hand is a skin-colored object. These limitations may not be applicable to real-world systems. We think that using faster hardware to process the input data, a pre-trained model for hand detection and localization trained on a large amount of data, an effective and fast refinement approach to correct false hand detections, and also benefiting from multi-modal inputs could improve hand tracking accuracy.
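As a hedged illustration of the detector-based route mentioned in the discussion above (and not the implementation of any surveyed model), the following Python sketch shows how an off-the-shelf torchvision Faster R-CNN could be re-headed for a single "hand" class before fine-tuning on a hand dataset; the image size, box coordinates, and class index are placeholder assumptions.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_hand_detector(num_classes: int = 2):
    # Detector pre-trained on COCO; replace its box head with a
    # two-class head (background + hand). Older torchvision versions
    # use pretrained=True instead of weights="DEFAULT".
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

model = build_hand_detector()
model.train()
images = [torch.rand(3, 480, 640)]                       # one RGB frame (assumed size)
targets = [{"boxes": torch.tensor([[100., 120., 220., 260.]]),
            "labels": torch.tensor([1])}]                # class 1 = hand (assumption)
loss_dict = model(images, targets)                       # dict of detection losses
loss = sum(loss_dict.values())
loss.backward()                                          # one illustrative training step
```

A real hand tracker would additionally associate the per-frame detections over time, for example with a simple IoU-based matcher or a Kalman filter, which is outside this sketch.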
recognition. A fully-connected graph, including a self-attention mechanism for automatically learning the node and edge features in both the spatial and temporal domains, is constructed from a hand skeleton. A novel spatial–temporal mask is applied to significantly reduce the computational cost. Evaluation results on DHG-14/28 and SHREC'17 confirm the superior performance of this model in comparison with the state-of-the-art models in hand gesture recognition, with 0.9% and 3% relative accuracy improvements (Chen et al., 2019). Canuto dos Santos et al. developed a deep-based model using two ResNet models and a soft-attention ensemble layer for dynamic gesture recognition. A condensing technique, namely star RGB, is proposed to summarize an input RGB video into only one RGB image. This image is passed to the rest of the model, including two ResNets, a soft-attention ensemble, and a fully connected layer for final classification. Experimental results on the Montalbano and GRIT datasets show relative state-of-the-art accuracy improvements of 0.78% and 6.68% (Canuto-dos Santos, Leonid-Aching-Samatelo, & Frizera-Vassallo, 2020).
4.4.2. Depth-based hand gesture models
Wang et al. proposed a CNN model for gesture recognition, evaluated it on the Large-scale Isolated Gesture Recognition task of the ChaLearn Looking at People (LAP) challenge 2016, and verified the effectiveness of the proposed method. In their model, three representations of depth sequences are constructed from a sequence of depth maps using bidirectional rank pooling to provide the spatio-temporal information. Using these representations, they fine-tuned a CNN trained on image data for the classification of depth sequences without learning a large number of parameters. This model achieved a relative state-of-the-art accuracy improvement of 16.34% on the IsoGD dataset (Wang et al., 2017).
4.4.3. Skeleton-based hand gesture recognition
Devineau et al. provided a deep CNN model for hand gesture recognition using only hand-skeletal data. They used parallel convolutions to train on sequences of hand-skeletal joint positions at different time resolutions. The evaluation results on the DHG dataset showed that the model achieved state-of-the-art results with a 3% relative improvement. Their model demonstrated that the parallel processing of sequences using CNNs can be competitive with neural architectures that use cells specifically designed for sequences, such as GRU and LSTM cells. The biggest drawback of the model is that it only works on complete sequences of input data. Furthermore, due to the weight sharing between all channels, the overall model performance decreases. So, a trade-off is needed between the model accuracy and its total parameter count (Devineau, Xi, Moutarde, & Yang, 2018).
4.4.4. Multi-modal hand gesture models
Duan et al. provided a convolutional Two-Stream Consensus Voting Network (2SCVN) to explicitly model the short-term and long-term structure of the RGB sequences. To decrease the complexity of the background, a 3D Depth-Saliency CNN stream (3DDSN) is used in parallel to provide the motion features. These two networks, 2SCVN and 3DDSN, are fused in one framework to improve the recognition accuracy. The evaluation results of the multi-modal model on the ChaLearn IsoGD benchmark and the RGBD-HuDaAct dataset showed that the model outperformed the state-of-the-art models on these datasets with relative improvements of 4.47% and 0.61% (Duan et al., 2016). Molchanov et al. proposed a recurrent 3D CNN-based model for dynamic hand gesture recognition from multi-modal inputs. Four input modalities, including RGB, depth, OF, and stereo-IR sensor data, are fused to boost the recognition accuracy of the model. They achieved state-of-the-art results on the SKIG and ChaLearn datasets with relative accuracy improvements of 0.9% and 1% (Molchanov, Yang, Gupta, Kim, Tyree, & Kautz, 2016). Wu et al. provided a two-stream CNN-based model for hand gesture recognition and identification from depth map and OF input information. Their model has the capability of generalizing over multiple gestures for a single person or over multiple persons for a single gesture. The most interesting contribution of this model is its ability to verify and identify unseen gestures. Results on the MSR Action3D dataset show a relative state-of-the-art gesture recognition improvement of 18.91% (Wu, Chen, Ishwar, & Konrad, 2016). Rastgoo et al. have proposed a hand sign recognition model using RBMs from visual data. Two modalities, RGB and depth, are considered as model input in three forms: original image, cropped image, and noisy cropped image. In the first step, the hand in each crop is detected using a CNN. After that, for each modality, the three forms of the input image are fed to RBMs. The outputs of the RBMs for the two modalities are fused in another RBM in order to recognize the output sign label. The proposed multi-modal model is trained on four publicly available datasets: Massey University Gesture Dataset 2012, the Fingerspelling Dataset from the University of Surrey's Center for Vision, Speech and Signal Processing, NYU, and ASL Fingerspelling A. Results showed that the model achieved state-of-the-art with relative accuracy improvements of 27.31%, 28.56%, 2.9%, and 11.13%, respectively (Rastgoo et al., 2018).
4.4.5. Discussion
Hand gestures and gesticulations are a common form of human communication. Vision-based gesture recognition has attracted much attention from both the academic and the industrial communities due to its prominent applications in HCI and the sign language area. Many deep-based models have been proposed over the last few years. A CNN-based model along with a simple Gaussian skin color model and background subtraction (Han et al., 2016), a two-stage deep model for hand gesture segmentation and recognition using CNN (Dadashzadeh et al., 2018), a CNN model using only hand-skeletal data (Devineau et al., 2018), a stacked RBM model (Rastgoo et al., 2018), and a recurrent 3D CNN-based model (Molchanov et al., 2016) are just some of the proposed models for hand gesture recognition using deep learning approaches. Similar to other sub-areas of sign language recognition, there are many more gesture recognition models for the depth modality than for RGB, due to the advancement and availability of depth sensors such as Kinect. To benefit from the complementary advantages of all input modalities, some models used multi-modal inputs to fuse the structural information from the depth channel, the high-resolution pixel information of RGB inputs, motion features, and also skeletal input data (Duan et al., 2016; Molchanov et al., 2016; Rastgoo et al., 2018). There are still some limitations in the hand gesture recognition area due to hand gesture variations, illumination, background complexity, and the large diversity in how people perform gestures. Another challenge is related to real-time gesture recognition. Since the time people take to perform gestures in the real world can vary between gestures, detecting and classifying gestures immediately upon or even before their completion, in order to provide rapid feedback, is challenging. Furthermore, using new hardware capabilities and also recognizing unseen gestures could receive more consideration, especially for real-time applications. We think that a more adaptive selection of the optimal hyper-parameters of the model can improve the recognition speed of real-time models. Joint features of face gesture, body gesture, face pose, and body pose, as complementary features, can also be useful to improve hand gesture recognition accuracy. Furthermore, an effective spatio-temporal data augmentation method that deforms the input volumes of hand gestures in order to facilitate the final recognition can be helpful. Since data is a central part of deep learning models, using different forms of input data, such as image or video, OF, skeleton, text, and so on, can provide more accurate hand gesture recognition models.
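Several of the multi-modal models above fuse per-modality streams before classification. The following Python sketch (PyTorch) is a hedged illustration of that general pattern, not the architecture of any specific surveyed paper: two small 3D-CNN streams, one for RGB clips and one for depth clips, whose clip-level features are concatenated before the classifier. Layer sizes, clip shape, and the number of gesture classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ClipStream(nn.Module):
    """A small 3D-CNN stream over a video clip of shape (B, C, T, H, W)."""
    def __init__(self, in_channels: int, feat_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),          # global spatio-temporal pooling
        )
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, clip):
        f = self.features(clip).flatten(1)    # (B, 32)
        return self.proj(f)                   # (B, feat_dim)

class TwoStreamGestureNet(nn.Module):
    """Late fusion of an RGB stream and a depth stream for isolated gestures."""
    def __init__(self, num_classes: int = 20):
        super().__init__()
        self.rgb_stream = ClipStream(in_channels=3)
        self.depth_stream = ClipStream(in_channels=1)
        self.classifier = nn.Linear(2 * 128, num_classes)

    def forward(self, rgb_clip, depth_clip):
        fused = torch.cat([self.rgb_stream(rgb_clip),
                           self.depth_stream(depth_clip)], dim=1)
        return self.classifier(fused)

# Example: 8-frame clips at 64x64 resolution (shapes are assumptions).
model = TwoStreamGestureNet(num_classes=20)
logits = model(torch.rand(2, 3, 8, 64, 64), torch.rand(2, 1, 8, 64, 64))
print(logits.shape)  # torch.Size([2, 20])
```

Fusing at an intermediate layer, as some of the reviewed models do, would amount to concatenating feature maps inside the streams instead of the final clip-level vectors.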
4.5. Hand pose recovery
Hand pose recovery has attracted special attention in recent years due to the availability of low-cost depth cameras such as Microsoft Kinect. Although the hand pose estimation area has seen remarkable improvements using deep learning approaches, 3D hand pose estimation and recovery still cope with several challenges. In this sub-section, we present deep-based models for hand pose recovery from the four recent years.
4.5.1. Depth-based hand pose recovery
Tompson et al. provided a four-part model including a randomized decision forest classifier for image segmentation, a robust method for labeled dataset generation, a CNN for feature extraction, and an inverse kinematics step for real-time pose recovery. They used intermediate heat-map features to extract accurate and reliable 3D pose information at interactive frame-rates using inverse kinematics. While the evaluation results on their own dataset showed that the model achieved a recovery error of 33.0, this model can track two hands only if they are not interacting (Tompson et al., 2014). Madadi et al. proposed a hierarchical tree-like structured CNN for hand pose recovery using end-to-end training. They fused the branches over predefined subsets of hand joints to learn the higher-order dependencies among joints in the final pose. Furthermore, some appearance and physical constraints of hand motion and deformation were defined in a loss function. Evaluation results on the NYU dataset showed the proposed model outperformed state-of-the-art models in hand pose recovery with a relative improvement of 1.3 mm (Madadi et al., 2017).
4.5.2. Discussion
The hand pose recovery area has been studied intensively in recent years, and different deep-based models have been proposed by different researchers. Most of them are CNN-based models, which are highly data-dependent and have powerful capabilities in coping with a highly nonlinear output space. Fusing CNN outputs with another deep approach has also received considerable attention among the proposed models.
Although the availability of affordable depth cameras has allowed researchers to use non-invasive, precise approaches to hand pose recovery that are robust to illumination and color changes, and has led to significant advances in this area, there are still several open challenges to tackle. Finger self-occlusions, hand-body occlusions, low-resolution or noisy depth images, and the inherent complexity of modeling hand motion due to its highly articulated nature are some of these challenges. The current datasets mainly provide front-facing hand deformations, which are not appropriate to compare state-of-the-art approaches against hard cases with high occlusions. Furthermore, little attention has been paid to embedding temporal motion information in hand pose recovery problems. We think that we need a system for efficient hand pose recovery in non-controlled settings involving self-occlusions of hands. While the proposed models must be robust against highly variable hand poses, they also need to be able to recover occluded joints both efficiently and accurately using a powerful refinement approach. Also, using novel hardware capabilities for input data recording, along with different fusions of input data types and features, could improve hand pose recovery accuracy. In addition, given the impressive progress in human pose recovery in recent years, using these complementary features can be considered more in order to propose a more accurate hand pose recovery model.
5. Sign language recognition using hand and face
Humans mainly look at the face during sign language communication. So, the movement of different parts of the face plays a significant role and constitutes natural patterns with large variability. In this section, we review deep sign language recognition models using facial features.
5.1. RGB-based models for sign language recognition using hand and face
Koller et al. suggested a combined model including a CNN and an HMM for weakly supervised learning of mouth shapes in the input frames without explicit frame labels. Experimental results on the RWTH-PHOENIX-Weather corpus showed a relative state-of-the-art recognition accuracy improvement of 8% (Koller, Ney et al., 2015). Rao et al. provided a CNN-based model using head and hand movements along with their constantly changing shape features. Different CNN architectures were applied and the best one was selected. They proposed an Indian dataset including sign language videos for 200 signs from 5 different viewing angles under various background environments. Experimental results on this dataset showed a recognition accuracy of 92.88% (Rao et al., 2018).
5.2. Discussion
Facial signs in sign language happen simultaneously with head pose changes and hand signs. So, the proposed models try to learn the shape features of the hand, face, and head in order to fuse these features and improve the recognition accuracy. Because of the impressive capabilities of CNNs for feature extraction from input images, they have been used in most of the models with image input. For video input, a CNN is not as effective as for image input, so it is usually combined with another approach, such as an RNN, to better cover the sequence information. In addition, using the 3D shape features of the face and also combining deep learning approaches with traditional methods have been used in the proposed models.
Due to the fast motion of the head and face in different signs, and also face occlusions by the hands during signing, tracking the facial features is challenging. Hence, the communication of deaf persons is usually considered only as a set of hand movements, and the facial and body features are ignored. Although a model has been proposed using the fused features of hand and head (Rao et al., 2018), this model is limited to a constant background and predefined environmental conditions that cannot be generalized to real-world applications. Furthermore, this model has been evaluated only on a private dataset, not on public datasets. We think that fusing the features of the hand with the face, head, and body features in an unconstrained environment can improve the recognition accuracy. Also, other forms of input data, such as flow information, skeleton, thermal, text, and so on, can be considered more in order to benefit from the fusion of these inputs. Due to the lack of a large and diverse dataset including both hand and face annotations, using the human body pose datasets can be helpful here (Andriluka et al., 2014; Ionescu, Papava, Olaru, & Sminchisescu, 2014; Sapp & Taskar, 2013; Varol, Romero, Martin, Mahmood, Black, Laptev, & Schmid, 2017). Furthermore, using the symmetrical face appearance in order to consider just half of the input face, decreasing the parameters and complexity of the model, can also be effective for recognition accuracy improvement.
6. Sign language recognition using hand, face, and human body
Human body pose estimation is one of the fundamental tools in the areas related to understanding people's behavior, especially action and sign language recognition. Using the body features can improve the recognition accuracy under heavy occlusions and severe deformations. In this section, we review deep human pose estimation models proposed for sign language recognition. Furthermore, we present some of the accurate human pose estimation models of the recent four years that could be applied in the sign language recognition area.
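The discussion in Section 5.2 suggests fusing hand features with face, head, and body cues. Purely as a hedged illustration of such feature-level fusion (not a method from the surveyed papers), the Python sketch below concatenates a CNN embedding of a cropped hand image with a flattened vector of 2D body keypoints before classification; the number of keypoints, feature sizes, and class count are assumptions.

```python
import torch
import torch.nn as nn

class HandEncoder(nn.Module):
    """Tiny CNN that embeds a cropped hand image of shape (B, 3, 64, 64)."""
    def __init__(self, out_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, x):
        return self.fc(self.net(x).flatten(1))

class HandBodyFusionClassifier(nn.Module):
    """Concatenates hand-image features with 2D body keypoints (x, y per joint)."""
    def __init__(self, num_keypoints: int = 13, num_classes: int = 50):
        super().__init__()
        self.hand = HandEncoder(out_dim=64)
        self.head = nn.Sequential(
            nn.Linear(64 + 2 * num_keypoints, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, hand_crop, body_keypoints):
        fused = torch.cat([self.hand(hand_crop), body_keypoints.flatten(1)], dim=1)
        return self.head(fused)

# Random tensors stand in for a hand crop and 13 body keypoints per sample.
model = HandBodyFusionClassifier()
out = model(torch.rand(4, 3, 64, 64), torch.rand(4, 13, 2))
print(out.shape)  # torch.Size([4, 50])
```

The body keypoints themselves would come from one of the pose estimation models reviewed in the subsections that follow.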
6.1. RGB-based human pose estimation models
Newell et al. proposed a CNN-based model, namely the stacked hourglass, for human pose estimation using different scales of the input image in order to capture the spatial relationships associated with different parts of the human body. They used successive steps of pooling and up-sampling to provide the prediction. Evaluation results on the FLIC and MPII datasets showed that the model achieved state-of-the-art results for human pose estimation with average relative improvements of 1.7% and 2.4% (Newell et al., 2016). Wei et al. proposed a CNN-based model, namely the Convolutional Pose Machine (CPM), for articulated pose estimation using long-term dependencies among the variables of the model. The model includes a sequential architecture composed of convolutional networks that directly operate on the belief maps from previous stages. A refined estimation process is used for part locations without the need for explicit graphical model-style inference. They achieved state-of-the-art performance on the MPII, LSP, and FLIC datasets with relative accuracy improvements of 9%, 6.11%, and 3%, respectively (Wei et al., 2016). Toshev and Szegedy designed a DNN-based cascade model to estimate the joints by defining a joint regression problem. The proposed model is capable of not only capturing the full context of each body joint but also estimating the location of each joint. Evaluation results on the FLIC and LSP datasets showed that the model achieved state-of-the-art results in human pose estimation with approximate relative accuracy improvements of 17% and 2% (Toshev & Szegedy, 2014). Gattupalli et al. proposed a dataset, namely SLR, and provided a baseline for human pose estimation to evaluate the performance of two deep-based hand pose estimation methods on the proposed dataset. Furthermore, they analyzed the impact of transfer learning on the pose estimation and confirmed the improvement of the estimation accuracy using transfer learning (Gattupalli et al., 2016). Madadi et al. proposed a deep model for human pose recovery using the Skinned Multi-Person Linear Model (SMPL) and deep neural networks on a still RGB image. 3D joints, as an intermediate representation, are used to regress the SMPL parameters. In addition, a denoising autoencoder connects the CNN to the SMPLR model for structural error recovery. Evaluation results on the SURREAL and Human3.6M datasets showed a relative improvement over SMPL-based state-of-the-art models of about 4 mm and 12 mm (Madadi, Bertiche, & Escalera, 2020). Bin et al. suggested a model, namely the Pose Graph Convolutional Network (PGCN), to capture the structural connections between 3D human body keypoints. An attention mechanism is employed to focus on the most crucial structural information and refine both short-range and long-range correlations between 3D human keypoints. Evaluation results on single-person and multi-person estimation datasets confirm the superiority of the model, achieving relative state-of-the-art estimation accuracy improvements of 1.3%, 0.7%, and 3.4% on the MPII, LSP, and COCO datasets, respectively (Bin, Chen, Wei, Chen, Gao, & Sang, 2020).
6.2. Depth-based human pose models
Haque et al. provided an end-to-end 3D human pose estimation model using a CNN and recurrent network architecture with a top-down error feedback mechanism to self-correct the previous pose estimations. The model maps local regions into a learned viewpoint-invariant feature space to selectively estimate partial poses in the presence of noise and occlusion. Evaluation results on their own dataset confirm the performance improvement in human pose estimation (Haque et al., 2016). Wang et al. provided a two-stage Depth Ranking Pose 3D estimation (DRPose3D) model using a Pairwise Ranking Convolutional Neural Network (PRCNN) and a 3D Pose Network (DPNet). The model extracts depth rankings of human joints from input images and estimates the 3D poses from both the depth rankings and the 2D human joint locations. Evaluation results on the Human3.6M benchmark showed that the proposed model outperformed state-of-the-art models in the 3D pose estimation area with a relative improvement of 6.2 mm (Wang et al., 2018). Marín-Jiménez et al. provided a deep-based model, namely Deep Depth Pose (DDP), for 3D human pose estimation from depth maps. A depth map including a person and a set of predefined 3D prototype poses are input to the DDP model to estimate the 3D position of the body joints. The proposed model outperformed state-of-the-art models on the ITOP and UBC3V datasets with relative improvements of 11.3% and 13.1% (Marín-Jiménez et al., 2018).
6.3. Multi-modal human sign language recognition
Huang et al. proposed a deep sign recognition model using a 3D CNN from multi-modal inputs. Three input modalities, including RGB, depth, and skeleton data, are used as multi-channel video streams to boost the recognition accuracy. They validated the model on their own dataset and reported the effectiveness of the model, obtaining a recognition accuracy of 94.2% (Huang et al., 2015).
6.4. Discussion
Human pose estimation is an important area related to a variety of applications, especially the sign language recognition area. Many human pose estimation models have been proposed in recent years using deep learning approaches, especially CNNs and RNNs. Using the 3D information of depth maps has led to significant improvement in this area. The proposed models tried to improve the estimation accuracy of the human pose using different approaches such as cascading, tree structures, 3D estimation, constraint definition, and so on. Although the experimental results of different models confirm that the estimation accuracy has improved in recent years, much more work is necessary to solve the remaining challenges.
Human pose estimation in unconstrained conditions is an intensive research area in computer vision. Using body features can help to improve the recognition accuracy of human pose estimation, but the estimation of different joints and parts of the human body remains challenging under partial occlusions and in noisy situations. Localization of different body parts and joints is the core of human pose estimation. While the 3D information of the human body has been used by many models in recent years, this information can be ambiguous, as several 3D poses may project to the same 2D joints. Furthermore, annotating 3D joints is very hard and needs sophisticated tracking devices. Another challenge regards 3D pose regression, which is not possible without accurately modeling the correlation of joints. Although some of the proposed models for human pose and shape estimation consider only a few constraints, including pose angles, shape priors, and 3D joint localization, they are quite accurate in unconstrained and real environments. We think that we can benefit from these models in the sign language recognition area to improve the accuracy of the sign models. Furthermore, using new hardware for 3D joint labeling of the human body is crucial in order to use the 3D information for more accurate human pose and shape estimation. Also, providing an accurate model of the joint correlations and using other input forms of data, such as thermal or synthetic data, as auxiliary data, can be helpful for accuracy improvement of human pose estimation. In addition, there are some large human pose datasets with accurate annotations and few constraints (Ionescu et al., 2014; Varol et al., 2017) that can be used in the sign language recognition area to improve the recognition accuracy and support real applications.
7. Continuous dynamic sign language recognition
In recent years, with the availability of large datasets, such as RWTH-PHOENIX-Weather-2014 (Forster, Schmidt, Koller, Bellgardt, & Ney, 2014), some researchers have been attracted to continuous dynamic sign language recognition (Cihan Camgöz, Hadfield, Koller, & Bowden, 2017; Cui, Liu, & Zhang, 2019; Koller, Zargaran, Ney, & Bowden, 2016; Mocialov, Turner, Lohan, & Hastie, 2017; Pu, Zhou, & Li, 2018; Wei, Zhou, Pu, & Li, 2019). While isolated sign language recognition uses an image or video including only one sign as the model input, continuous dynamic sign language recognition concerns multiple signs per video input. One of the main challenges in continuous dynamic sign language recognition is video segmentation, that is, producing multiple video segments each including only one sign. In this section, we present the recent works in continuous sign language recognition and related areas. Table 8 shows the details of these models.
7.1. RGB-based continuous dynamic sign language recognition
Pu et al. proposed a CNN-based model for continuous dynamic sign language recognition from RGB video input. They combined a 3D residual convolutional network (3D-ResNet) with a stacked dilated CNN and a Connectionist Temporal Classification (CTC) loss for visual feature extraction and for mapping the sequential features to the text sentence. They used an iterative optimization strategy to overcome the limited contribution of the CTC loss to the CNN parameters. After generating an initial label for a video clip using CTC, they fine-tune the CNN to refine the generated label. Evaluation results on the RWTH-PHOENIX-Weather dataset demonstrate the superiority of the model, obtaining a relative state-of-the-art word error rate improvement of 1.4% (Pu et al., 2018). Mocialov et al. employed the combination of a heuristic approach with stacked LSTMs for video segmentation, using epenthesis identification and automatic classification of the segmented videos, respectively. They analyzed the number of signs and the efficiency of different features for the final recognition. Evaluation results were reported only for different sign classes, without comparison with the state-of-the-art. In the best case, this model achieved a recognition accuracy of 95% on the NGT dataset, which includes 40 sign classes (Mocialov et al., 2017). Wei et al. defined continuous sign language recognition as a grammatical-rule-based classification problem using the combination of a 3D convolutional residual network and a bidirectional LSTM. They used two modules, a word-independent classifiers (WIC) module and an n-gram classifier (NGC) module, to split a sentence into a sequence of consecutive words. The confidence scores provided by these modules are used to concatenate the features of the words in a sentence. Evaluation results on the CSL SPLIT I dataset demonstrated a relative state-of-the-art precision improvement of 2% (Wei et al., 2019).
7.2. Depth-based continuous dynamic sign language recognition
Camgoz et al. presented a deep-based, end-to-end framework for continuous dynamic sign language recognition. They employed the SubUNets approach to improve the learning of the intermediate representations. This model uses multi-channel information, such as hand shape, motion, body pose, and facial gestures, from the input data to generate a sequence of outputs from a given video. Evaluation results on One-Million Hands demonstrated that this model obtained a sign recognition rate comparable to previous research works on this dataset, achieving a word error rate of 40.7 (Cihan Camgöz et al., 2017).
7.3. Multi-modal continuous dynamic sign language recognition
Cui et al. developed a continuous sign language recognition framework using the combination of a CNN and a Bi-LSTM. They used an iterative optimization process to obtain representative features from the CNN. The model performance is iteratively improved using the training and tuning procedures of the recognition model. Besides, this model benefits from the multi-modal fusion of RGB images and OF information. Experimental results on two public benchmarks, RWTH-PHOENIX-Weather 2014 and SIGNUM, confirm that it outperforms the state-of-the-art by a relative improvement of more than 15% (Cui et al., 2019).
7.4. RGB-based continuous dynamic human pose estimation
Ye et al. suggested a 3D Recurrent Convolutional Neural Network (3DRCNN) for gesture recognition and joint localization, using their temporal boundaries within continuous videos and fusing multi-modal features. The original videos are divided into short video clips, and the extracted features of the relations among these clips are considered as the temporal information. A sliding window approach is used to merge the temporal information of consecutive clips with the same semantic meaning. To evaluate the method, they proposed a dataset including word and sentence videos and reported the effectiveness of the proposed model on this dataset, achieving an estimation accuracy of 69.2% (Ye et al., 2018).
7.5. Discussion
Sign language recognition includes two main categories: isolated sign language recognition and continuous sign language recognition. The supervision information is a key difference between the two categories. While isolated sign language recognition is similar to the action recognition area, continuous sign language recognition concerns not only the recognition task but also the accurate alignment between the input video segments and the corresponding sentence-level labels. Generally, continuous sign language recognition is more challenging than isolated sign language recognition. Indeed, isolated sign language recognition can be considered as a subset of continuous sign language recognition. Two factors play a key role in the performance of continuous sign language recognition: feature extraction from the frame sequences of the input video, and alignment between the features of each video segment and the corresponding sign label. Obtaining more descriptive and discriminative features from the video frames could result in a better performance for a continuous sign language recognition system. While recent models in continuous sign language recognition show a rising trend in performance, relying on deep learning capabilities in computer vision and NLP, there is still much room for improvement in this area. Considering the attention mechanism, using multiple input modalities to benefit from multi-channel information, learning structured spatio-temporal patterns (such as Graph Neural Network models), and employing prior knowledge of sign language are only some of the possible future directions in this area.
8. Hybrid models using deep and traditional methods
In this section, we present the recent works in sign language recognition and related areas that use the combination of deep-based descriptors and classic classifiers for training. Table 9 shows the details of these models. While some works use traditional descriptors and classic classifiers for training (Cheron, Laptev, & Schmid, 2015; Escobedo-Cardenas & Camara-Chavez, 2015), these works are not within the scope of this survey.
8.1. RGB-based hybrid hand sign language recognition
Rastgoo et al. proposed a deep-based model for hand sign language recognition using SSD, CNN, and LSTM, benefiting from hand pose features. They improved the hand detection accuracy of the SSD model using five online sign dictionaries. Furthermore, they employed some hand-crafted features and combined them with the features extracted from the CNN model. Evaluation results on the RKS-PERSIANSIGN and isoGD datasets confirm the superiority of the proposed model, achieving state-of-the-art on the isoGD dataset with a 4.25% relative margin (Rastgoo et al., 2020b). Koller et al. used the embedding of a CNN into an HMM for continuous sign language recognition. They used the CNN outputs as true Bayesian posteriors and trained the model end-to-end as a hybrid CNN-HMM. Evaluation results on the RWTH-PHOENIX-Weather 2014 Multi-signer dataset showed that this model decreased the error rates from 51.6/50.2 to 38.3/38.8 on dev/test in comparison with state-of-the-art models (Koller, Zargaran et al., 2016).
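Several of the continuous pipelines reviewed above (e.g., 3D-ResNet with CTC, or CNN with Bi-LSTM) share a common skeleton: a per-frame visual encoder, a recurrent layer over time, and a CTC loss that aligns frame-level predictions with the unsegmented sentence-level gloss labels. The following Python sketch (PyTorch) is a hedged, minimal version of that skeleton, not the implementation of any specific surveyed model; the vocabulary size, feature sizes, and sequence lengths are assumptions.

```python
import torch
import torch.nn as nn

class ContinuousSignRecognizer(nn.Module):
    """Frame-wise CNN encoder -> BiLSTM -> per-frame gloss logits for CTC."""
    def __init__(self, num_glosses: int = 100, feat_dim: int = 128):
        super().__init__()
        self.frame_encoder = nn.Sequential(          # applied to every frame
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, feat_dim),
        )
        self.temporal = nn.LSTM(feat_dim, 64, bidirectional=True, batch_first=True)
        self.gloss_head = nn.Linear(2 * 64, num_glosses + 1)  # +1 for the CTC blank

    def forward(self, video):                         # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)                  # (B*T, 3, H, W)
        feats = self.frame_encoder(frames).view(b, t, -1)
        feats, _ = self.temporal(feats)               # (B, T, 128)
        return self.gloss_head(feats)                 # (B, T, num_glosses + 1)

# CTC aligns the per-frame logits with an unsegmented gloss sequence.
model = ContinuousSignRecognizer(num_glosses=100)
video = torch.rand(2, 16, 3, 64, 64)                  # 2 clips of 16 frames (assumed)
logits = model(video).log_softmax(-1).transpose(0, 1) # (T, B, C) as CTCLoss expects
targets = torch.randint(1, 101, (2, 5))               # 5 glosses per sentence (assumed)
loss = nn.CTCLoss(blank=0)(logits, targets,
                           input_lengths=torch.full((2,), 16),
                           target_lengths=torch.full((2,), 5))
loss.backward()
```

The surveyed models differ mainly in what replaces each block here (e.g., a 3D-ResNet as the encoder, an HMM instead of CTC, or an iterative refinement loop on top), but the alignment problem they solve is the same.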
Table 8
Deep continuous dynamic sign language recognition models; CDSLR: Continuous Dynamic Sign Language Recognition, CDHPE: Continuous Dynamic Human Pose Estimation.
Modality Year Ref. Goal Model Dataset Results
RGB video 2018 (Pu et al., 2018) CDSLR 3DCNN RWTH-PHOENIX-Weather 2014 37.3 (error rate)
RGB video 2017 (Mocialov et al., 2017) CDSLR heuristic approach and LSTM NGT 80.70 (accuracy)
RGB video 2019 (Wei et al., 2019) CDSLR 3DCNN and Bi-LSTM Chinese dataset 94.90
Depth video 2017 (Cihan Camgöz et al., 2017) CDSLR CNN One-Million Hands 40.8 (error rate)
RGB video, OF 2019 (Cui et al., 2019) CDSLR 3DCNN, Bi-LSTM RWTH-PHOENIX-Weather 2014, SIGNUM 22.86, 2.80 (error rate)
RGB video 2018 (Ye et al., 2018) CDHPE 3DRCNN own dataset 69.2
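The error rates in Table 8, like the word error rate quoted in Section 7, are commonly computed as an edit distance between the predicted and reference gloss sequences, normalized by the reference length. A minimal Python sketch of this standard metric (not code from the surveyed works; the example glosses are made up):

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("WETTER MORGEN REGEN", "WETTER REGEN"))  # 0.33 (one deletion)
```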
Table 9
Hybrid sign language recognition models; SLR: Sign Language Recognition, GR: Gesture Recognition, PE: Pose Estimation.
Modality Year Ref. Goal Model Dataset Results
RGB video 2020 (Rastgoo et al., 2020b) SLR SSD, CNN, LSTM, hand-crafted features RKS-PERSIANSIGN, isoGD 98.42, 86.32 (accuracy)
RGB video 2016 (Koller, Zargaran et al., 2016) SLR CNN, HMM RWTH-PHOENIX-Weather 2014 Multi-signer dataset 38.3, 38.8 (error rate)
RGB image 2018 (Chen, Ting, Wu, & Fu, 2018) GR CNN, SVM own dataset 49.88
RGB video 2020 (Escobedo-Cardenas & Camara-Chavez, 2020) GR CNN, HCM UTD-MHAD, isoGD, UFOP-LIBRAS 94.81, 67.36, 64.33 (accuracy)
RGB, Depth image 2016 (Ma, Chen, & Wu, 2016) GR CNN, SVM ASL 96.1 (accuracy)
Depth image 2018 (Chen, Ting, Wu, & Fu, 2018) PE CNN, SPM ICVL, NYU, NTU 8.64, 15.90, 12.81 (error)
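The hybrid pattern summarized in Table 9 and detailed below typically pairs a deep feature extractor with a classic classifier such as an SVM. The following Python sketch is a hedged illustration of that pattern only, using a small CNN as a fixed feature extractor and scikit-learn's SVC; the data shapes and class count are placeholder assumptions, and it is not the pipeline of any specific surveyed paper.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

# CNN used purely as a feature extractor (left untrained here, for brevity).
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                               # -> 32-dimensional descriptor
)

def describe(images: torch.Tensor) -> np.ndarray:
    with torch.no_grad():
        return feature_extractor(images).numpy()

# Placeholder data: 40 gesture images covering 10 classes (random stand-ins).
images = torch.rand(40, 3, 64, 64)
labels = np.repeat(np.arange(10), 4)

svm = SVC(kernel="rbf")                          # classic classifier on deep features
svm.fit(describe(images), labels)
print(svm.predict(describe(images[:2])))
```

In practice the CNN would first be trained (or fine-tuned) on the target data, and hand-crafted descriptors could be concatenated with the CNN descriptor before the SVM, as some of the hybrid models below do.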
Chen et al. proposed a hybrid model for hand gesture recognition, including a CNN for automatic feature extraction and an SVM for final classification, from raw EMG image input. Experimental results on their own dataset confirmed a higher accuracy, with 2.5% and 9.7% margins, in comparison with the cases where only a CNN or only traditional methods were used (Chen, Tong et al., 2018).
8.3. Multi-modal hybrid gesture recognition
Cardenas and Chavez proposed a hybrid model for hand gesture recognition using the combination of a CNN and the Histogram of Cumulative Magnitudes (HCM). They used three input modalities: RGB, depth, and skeleton. A skeleton estimation method, along with a sampling method, is employed to select a fixed number of keyframes from the input video. They fuse the extracted spatio-temporal features and feed them into a linear Support Vector Machine (SVM) classifier for the final recognition. Evaluation results on UTD-MHAD, ChaLearn LAP IsoGD, and UFOP-LIBRAS confirmed the effectiveness of this model, achieving accuracies of 94.81%, 67.36%, and 64.33%, respectively. While this model performed comparably with state-of-the-art methods on the isoGD and UFOP-LIBRAS datasets, it outperformed the state-of-the-art on UTD-MHAD with a relative improvement of 0.16% (Escobedo-Cardenas & Camara-Chavez, 2020). Ma et al. deployed a CNN-based model for hand gesture recognition from two modalities, RGB and depth images. The gesture region is extracted using a depth image-based segmentation method. After feature extraction using a CNN, the final recognition is done using the SVM method. Experimental results on their own dataset confirm that the proposed model, benefiting from the combination of CNN and SVM, achieved a recognition accuracy of 96.1% (Ma et al., 2016).
8.4. Depth-based hybrid hand pose estimation
Chen et al. proposed a vision-based framework for 3D hand pose estimation using the combination of a Spherical Part Model (SPM) and a deep CNN. In this framework, prior knowledge of the human hand is used for accurate hand pose estimation from a depth map. Using the hand-centric coordinate system, the SPM is employed to obtain skeletal configurations from the most stable joints using a spherical representation. Results on the NYU and NTU datasets demonstrated relative estimation improvements of 0.063 mm and 3.358 mm in comparison with state-of-the-art methods (Chen, Ting et al., 2018).
Recently, deep learning-based models have attracted more attention in the research community than traditional computer vision techniques. However, this does not mean that traditional computer vision techniques have become obsolete. Some problems may benefit from a trade-off between the powerful capabilities of deep learning (in particular when large amounts of data are available) and the specific, problem-tailored design of handcrafted features. To this end, using the capabilities of both categories, as a hybrid model, has been considered by some researchers in recent years (Chen, Ting et al., 2018; Chen, Tong et al., 2018; Escobedo-Cardenas & Camara-Chavez, 2020; Koller, Zargaran et al., 2016; Ma et al., 2016; Rastgoo et al., 2020b). The combination of a deep-based automatic feature extractor, such as a CNN, with a classical classifier, such as a Support Vector Machine (SVM), has been widely used in the hybrid models. Furthermore, employing some hand-crafted features along with the CNN-based features is another approach in the hybrid models (Chen, Ting et al., 2018; Rastgoo et al., 2020b). Combining prior knowledge with deep-based features can help to develop systems with lower complexity and maybe higher accuracy in some special domains.
9. Discussion and conclusion
Sign language and different forms of sign-based communication are important to large groups in society. With the advent of deep learning approaches, the sign language recognition area has seen a significant accuracy improvement in recent years. In this survey, we reviewed the proposed models of the sign language recognition area using deep learning in the recent four years, based on a proposed taxonomy. Many models have been proposed by researchers in recent years. Most of the models have used a CNN for feature extraction from the input image due to the impressive capabilities of CNNs for this goal. In the case of video input, RNN, LSTM, and GRU have been used in most of the models to cover the sequence information. Also, some models have combined two or more approaches in order to boost the recognition accuracy. Moreover, different types of input data, such as RGB, depth, thermal, skeleton, and flow information, have been used in the models. Tables 4–9 show the details of the proposed models for sign language recognition in the recent four years. Furthermore, the pros and cons of these models are presented in Table 10. Here, we present the challenges based on our taxonomy:
• Feature fusion: While many models have been proposed by different researchers, there are still several challenges in this area that need to be solved. Feature fusion can be applied in input data
Table 10
Summary of deep sign language recognition models.
Year Ref. Feature Input Dataset Description.
fusion modality
2014 (Neverova et al., 2014) Hand static, Depth own dataset Pros: This model significantly improved hand gesture
recognition accuracy using unlabeled real-world samples.
Cons: Need to adapt with real data.
2014 (Tompson et al., 2014) Hand static, Depth own dataset Pros: Acceptable generalization performance in coping with
hand shape changes. Cons: The recognition accuracy of the
hand pose model has not been evaluated. Furthermore, the
model is not robust against hand occlusion.
2014 (Toshev & Szegedy, 2014) Hand static, RGB FLIC, LSP Pros: High precision pose estimation using a simple but yet
powerful formulation. Cons: Need to propose a new
architecture which could be potentially better tailored
towards localization problems and especially in pose
estimation in particular because they used a generic model
which was originally designed for classification tasks and
applied it for hand joints localization.
2015 (Kang et al., 2015) Hand static, Depth own dataset Pros: High recognition performance for observed signers in
real-time. Cons: Need to include more data from different
subjects to improve the results.
2015 (Oberweger et al., 2015) Hand static, Depth ICVL, NYU Pros: Accurate and fast. Cons: Location prediction of the
joint is constrained by the learned hand model.
2015 (Tang et al., 2015) Hand static, Depth, own dataset Pros: Robust to occlusion and RGB insensitive to movement,
scaling and rotation. Cons: training process of DBN is
difficult to parallelize.
2015 (Tagliasacchi et al., 2015) Hand dynamic, Depth real video Pros: Novel, robust and accurate hand tracking algorithm.
Cons: This model track only hand and suffers from low
accuracy in two-hand tracking.
2015 (Koller, Ney et al., 2015) Face dynamic, RGB RWTH-PHOENIX Pros: Accurate and robust modeling of mouth shapes. Cons:
Need to be more generalize with more data because the
recognition accuracy of the model depends on the shuffling
of the training samples.
2015 (Huang et al., 2015) Body dynamic, RGB, Own dataset Pros: Boosting the recognition accuracy of the model using
multi-channels of video streams. Cons: The evaluation results
of the model highly depends on own dataset. Need to be
more general by evaluating on some public datasets.
Depth, Skeleton
2016 (Oberweger et al., 2016) HP dynamic, Depth MSRA Pros: Accurate 3D hand pose annotation and estimation.
Cons: Estimation accuracy of the model highly depends on
the annotation accuracy of input data. The model is
semi-automated and need some human users to annotate the
visible joints in frames. In addition, the model is highly
complex.
2016 (Han et al., 2016) Hand static, RGB own dataset Pros: Some simple pre-processing methods applied on input
data that improved the recognition performance of the
model. Cons: Need to increase the gesture labels and input
data for more generalization of the model.
2016 (Duan et al., 2016) Hand static, Depth, Chalearn IsoGD, Pros: Using the spatial and temporal information
complementary to improve the recognition accuracy and
reduce the estimation variance. Cons: While two input
modality are used in the model, the temporal information of
only one modality, RGB, are employed.
RGB RGBD-HuDaAct
2016 (Newell et al., 2016) Hand static, RGB FLIC, MPII Pros: Accurate and robust against heavy occlusion and
multiple people in close proximity. Cons: No robust to some
complicated poses.
2016 (Wei et al., 2016) Human static, RGB MPII, LSP, FLIC Pros: Robust to create an effective communication between
joints to accurately pose estimation. Cons: The model is not
accurate in handling multiple people in close proximity.
2016 (Haque et al., 2016) Hand static, Depth EVAL, ITOP Pros: Accurate pose estimation on alternate viewpoints and
partially robust to noise and occlusion. Cons: Not accurate
in joints localization.
2016 (Molchanov et al., 2016) Hand dynamic, RGB, SKIG, ChaLearn Pros: Accuracy improvement of hand gesture model using
effective modality fusion.
Depth, OF, Cons: Temporal information between all clips of each video
can be used in an efficient way to improve hand gesture
recognition accuracy.
stereo-IR sensors
2016 (Wu et al., 2016) Hand dynamic, Depth, MSR Action 3D Pros: Proposing an accurate two-stream CNN model for
gesture verification and identification.
OF Cons: Generalization of the model is poor for unseen
gestures.
2017 (Deng et al., 2017) Hand static, Depth ICVL, NYU Pros: Integrating both local 3D feature and global context
without any further post-processing. Cons: Estimation
accuracy of hand pose model highly depends on data
augmentation.
(continued on next page)
21
R. Rastgoo et al. Expert Systems With Applications 164 (2021) 113794
Table 10 (continued).
Year Ref. Feature Input Dataset Description.
fusion modality
2017 (Guo et al., 2017) Hand static, Depth ICVL, NYU, Pros: Simple but accurate and fast model. Cons: No robust
to occlusion.
MSRA, ITOP
2017 (Simon et al., 2017) Hand static, RGB own dataset Pros: Robust to accurately generate the annotations for
keypoint detection. Cons: Model accuracy highly depends on
using multiple cameras in controlled environments.
2017 (Zimmermann & Brox, 2017) Hand static, RGB Stereo Hand Pros: Approximately accurate to predict 3D hand poses from
Pose 2D keypoints. Cons: The performance seems mostly limited
by the lack of an annotated large scale dataset with
real-world images and diverse pose statistics.
Tracking, Dexter
2017 (Fang & Lei, 2017) Hand dynamic, Depth NYU, ICVL Pros: Accurate estimation of hand pose by exploiting
dependencies between hand joints. Cons: No robust to
occlusion and noisy inputs.
2017 (Yuan et al., 2017) Hand Static, Depth ICVL, NYU, Pros: A suitable evaluation cross-benchmark for different
MSRC, BigHand models. Cons: Proposed tracking system highly depends on
6D magnetic sensors and inverse kinematics to obtain hand
joints annotations.
MSRC, BigHand
2017 (Wang et al., 2017) Hand dynamic, Depth ChaLearn Pros: Accuracy improvement of gesture recognition by using
different presentations of input images. Cons: Recognition
accuracy can be improved using more efficient way for
features fusion in the model.
2017 (Madadi et al., 2017) Hand static, Depth NYU, MSRA Pros: High accurate capability to local pose recovery. Cons:
Need to consider data complexity reduction of the model.
2017 (Dibra et al., 2017) Hand static, Depth NYU, ICVL Pros: Accurately estimation of 3D hand pose with the ability
to refine on unlabeled depth images. Cons: Model has
adopted to a single hand shape only.
2018 (Chen et al., 2020) Hand static, Depth ICVL, NYU, Pros: Accurately estimation of 3D hand pose.
MSRA Cons: No robust and accurate when hands are interacting
with other hands or objects.
2018 (Rao et al., 2018) Face dynamic, RGB own dataset Pros: Approximately accurate and robust against 5 various
orientations. Cons: Recognition accuracy of the model can be
improved using accurate and pre-trained CNN model instead
of a shallow CNN.
2018 (Ye et al., 2018) Face dynamic, RGB own dataset Pros: Capability to learn the complementary information
from multi-modal inputs via different fusions. Cons: Poor
performance for signs including facial information.
2018 (Wang et al., 2018) Hand static, Depth Human3.6M Pros: Accurate capturing context and reasoning about pose in
a holistic manner. Cons: Not accurate in joints localization.
2018 (Marín-Jiménez et al., 2018) Hand static, Depth UBC3V, ITOP Pros: Accurate and robust to different viewpoints. Cons:
Constrained on some especial types of poses.
2018 (Rastgoo et al., 2018) Hand static, RGB, Massey2012, Pros: Robust to noise and accurate by providing a
generalization in instances of low amounts of annotated data.
Depth ASL Surrey,
NYU,ASL Cons: Need to decrease the complexity of the model by
sharing the parameters.
Fingerspelling A
2018 (Dadashzadeh et al., 2018) Hand static, Depth OUHANDS Pros: Accurate pixel-level semantic segmentation into hand
region. Cons: Not accurate in recognition stage.
2018 (Devineau et al., 2018) Hand dynamic, DHG Pros: Efficient recognition performance in time and accuracy.
Skeleton Cons: The model only works on complete sequences of input
data.
2018 (Moon et al., 2018) Hand static, Depth ICVL, MSRA, Pros: Estimation accuracy improvement of the model by
converting 2D depth map into the 3D voxel representation.
NYU Cons: Model complexity is high due to doubling the number
of channels of each feature map.
2019 (Chen et al., 2019) Hand dynamic, RGB DHG-14/28, Pros: They developed a general framework that can be used
for other tasks aiming to learn spatial and temporal
information from graph-based data.
SHREC’17 Cons: Need to generalize to additional datasets.
2019 (Gomez-Donoso et al., 2019) Hand static, RGB own dataset, Pros: Accurately prediction of 3D positions of hand joints.
Cons: Suffering from a minor jittering when the results are
rendered over time.
Stereo Hand
Pose Tracking
2019 (Lim et al., 2019) Hand dynamic, RGB ASLLVD, Pros: Accurate and compact sign language hand
representation with a good discriminating power.
RWTH-Boston-50 Cons: No robust against different skin colors and hands
occlusions.
2019 (Li et al., 2019) Hand static, RGB STB, RSTB Pros: Estimation accuracy improvement of 3D hand pose
benefiting from stereo cameras capabilities. Cons: No robust
to multiple hands or cases with hand/object interaction.
2020 (Wadhawan & Kumar, 2020) Hand static, RGB own dataset Pros: Extensive evaluation results on 50 deep learning models using different optimizers. Cons: The recognition method needs to be fine-tuned using more real data.
2020 (Elboushaki et al., 2020) Hand dynamic, RGB, Depth isoGD, SKIG, NATOPS, SBU Pros: Improved recognition accuracy by capturing the fine-grained motion details encoded in multiple adjacent frames of the input video. Cons: Human gestures are highly related to different modalities.
Table 11
State-of-the-art models on the datasets corresponding to the sign language and related areas.
Dataset Year Ref. Goal Model Modality Results
NYU 2020 (Rastgoo et al., 2020a) HSR SSD, 2DCNN, 3DCNN, LSTM Depth 4.64 mm
ICVL 2018 (Moon et al., 2018) HP CNN Depth 6.28 mm
MSRA 2016 (Oberweger et al., 2016) HP CNN Depth 5.58 mm (Ave. err.)
FLIC 2016 (Newell et al., 2016) HP CNN RGB 99.0 (Elbow)
LSP 2016 (Wei et al., 2016) HP CNN RGB 84.32
isoGD 2020 (Rastgoo et al., 2020b) HSR SSD, CNN, LSTM RGB 86.32
MPII 2016 (Newell et al., 2016) HP CNN RGB 90.90 (total)
ITOP 2018 (Marín-Jiménez et al., 2018) HP CNN Depth 97.5 (AUC)
RGBD-HuDaAct 2016 (Duan et al., 2016) HG CNN Depth, RGB 96.74
STB 2018 (Spurr et al., 2018) HP VAE RGB, Depth 0.983 (AUC)
EVAL 2016 (Haque et al., 2016) HP CNN Depth 74.10
Dexter 2017 (Zimmermann & Brox, 2017) HP CNN RGB 49.0 (AUC)
RWTH-PHOENIX-Weather 2012 2015 (Koller, Ney et al., 2015) HSR CNN RGB 55.70 (Precision)
RWTH-PHOENIX-Weather 2014 2019 (Cui et al., 2019) CDSLR 3DCNN, Bi-LSTM RGB 22.86
BigHand2.2M 2018 (Baek et al., 2018) HP GAN Depth 13.7 mm
Human3.6M 2018 (Wang et al., 2018) HP CNN Depth 62.8 mm
NGT 2017 (Mocialov et al., 2017) CDSLR heuristic, LSTM RGB 80.70 (accuracy)
UBC3V 2018 (Marín-Jiménez et al., 2018) HP CNN Depth 88.2 (AUC)
Massey 2012 2018 (Rastgoo et al., 2018) HR RBM RGB, Depth 99.31
ASL Surrey 2018 (Rastgoo et al., 2018) HR RBM RGB, Depth 97.56
ASL Fingerspelling A 2018 (Rastgoo et al., 2018) HR RBM RGB, Depth 98.13
OUHANDS 2018 (Dadashzadeh et al., 2018) HG CNN Depth 86.46
Egohands 2017 (Dibia, 2017) HT CNN RGB 0.9686 (mAP)
Dexter 2018 (Mueller et al., 2018) HT CNN RGB 0.64 (AUC)
EgoDexter 2018 (Mueller et al., 2018) HT CNN RGB 0.54 (AUC)
RHD 2018 (Spurr et al., 2018) HP VAE RGB, Depth 0.849 (AUC)
B2RGB-SH 2019 (Li et al., 2019) HP CNN RGB 7.18 (err)
DHG-14/28 Dataset 2019 (Chen et al., 2019) HG CNN RGB 91.9
SHREC’17 Track Dataset 2019 (Chen et al., 2019) HG CNN RGB 94.4
RWTH-BOSTON-50 2019 (Lim et al., 2019) HS CNN RGB 89.33
ASLLVD 2019 (Lim et al., 2019) HS CNN RGB 31.50
EgoGesture 2019 (Kopuklu et al., 2019) HG CNN RGB 94.03
NVIDIA benchmarks 2019 (Kopuklu et al., 2019) HG CNN RGB 83.83
SKIG 2020 (Elboushaki et al., 2020) HG CNN RGB, Depth 99.72
NATOPS 2020 (Elboushaki et al., 2020) HG CNN RGB, Depth 95.87
SBU 2020 (Elboushaki et al., 2020) HG CNN RGB, Depth 97.51
First-Person 2020 (Rastgoo et al., 2020a) HSR SSD, 2DCNN, 3DCNN, LSTM RGB 91.12
RKS-PERSIANSIGN 2020 (Rastgoo et al., 2020a) HSR SSD, 2DCNN, 3DCNN, LSTM RGB 99.80
Table 12
A summary of the main characteristics of the reviewed models.
Query Available choices Most used
Sign languages USA, Germany, Greek, Poland, China, Argentina, Korea, Iran USA
Modalities RGB, Depth, Skeleton Depth
Static (Image), Dynamic (Video) Static (Image)
Architectures Static: CNN, RBM, GAN, AE CNN
Dynamic: RNN, LSTM, GRU, 3DCNN LSTM
Datasets with lowest performance RWTH-PHOENIX-Weather, Human3.6M, isoGD isoGD
Datasets with highest performance FLIC, SKIG, Massey 2012, RKS-PERSIANSIGN FLIC
Generative models RBM, AE, VAE, GAN VAE
Traditional classifiers SVM, SPM, HCM SVM
Traditional descriptors Heuristics, HOG, HMM HMM
Recognition modalities Isolated, Continuous Isolated
Evaluation metrics Accuracy, Error rate, Precision, Recall, mAP, AUC Accuracy
Features fusion Hand, Face, Body Hand
Feature types HP, HG, HSR, HD, HT HP
Input dimensions 2D, 3D 2D
or human body parts features. In input data fusion, different types of input data, such as RGB, depth, skeleton, flow information, text, and synthetic data, can be fused to obtain much more powerful features and improve the recognition accuracy. In the fusion of human body part features, there are three parts, hand, face, and body, whose features can be fused. Although there are many challenges in using hand features in the hand sign language recognition area, most of the models have relied on hand features and have tried to improve sign language recognition accuracy using hand features alone. The hand sign language recognition area, including hand detection, hand pose estimation, hand gesture recognition, real-time hand tracking, and hand pose recovery, copes with many challenges such as high variation of hand shapes and gestures, self-occlusion (between fingers), close similarity between fingers, low resolution, varying illumination conditions, different hand gestures, and complex interactions between hands and objects or other hands. So, some models fused the hand features with face features to decrease the effect of these challenges and improve the recognition accuracy. However, tracking facial features is also challenging due to the fast motion of the head and face in different signs and also face occlusions by the hands during signing. In this regard, human body features are also used as complementary features along with the hand and face features to boost the recognition accuracy. So, the sign language recognition area can benefit from the accurate human pose estimation models, some of which we reviewed in this survey, to improve the recognition accuracy (see the minimal fusion sketch after this list).
• Input modality: While most of the models have used the advantages of the depth modality, other models have benefited from the high-resolution pixel information of the RGB modality for sign language recognition. Furthermore, some models have utilized flow information, in the form of optical flow (OF) and scene flow (SF), the skeleton modality, or synthetic data. While the thermal modality is not as common as RGB, depth, and the other modalities, combining it with another, more common modality could be considered further to improve the feature quality. Regarding the different types of signs, static, dynamic, and continuous dynamic, we think that the research community will move towards designing sign language recognition models that are useful in practice, concentrating on learning unsegmented signs from long-term video streams. Current progress in deep learning for video understanding shows the feasibility of moving in this direction.
• Datasets and different sign languages: There are many datasets, with different data modalities and languages, for hand, face, and human body sign language recognition. While the hand sign recognition area suffers from the lack of a large, diverse, and realistic dataset with accurately annotated data in unconstrained environments, there are some accurately annotated datasets for human pose estimation, with the capability of being applied in real-world applications, which could be used in the sign language recognition area. Furthermore, we need models trained on long untrimmed videos containing sign sentences from real communication, not just one sign, word, letter, or action per video, in order to obtain more realistic models that can be applied in real-world communication. In other words, sign language recognition systems, learned on real input data, have to be usable in real communication in unconstrained environments. This is complicated, but we think that the research community will address it in the future, especially based on deep learning approaches.
• Task complexity: Different deep-based models with different levels of complexity have been developed in recent years. Using various representations of the input data along with different input modalities has led to models with different levels of complexity. While static signs use the image modality in the model, dynamic and continuous dynamic signs must tackle the challenges of video input. Furthermore, continuous dynamic signs, as the most complex signs, face the challenge of movement epenthesis, that is, managing the transitions between signs in the input video. As a result, while there are many research works on static and dynamic sign language recognition (Deng et al., 2017; Gomez-Donoso et al., 2019; Guo et al., 2017; Li et al., 2019; Oberweger et al., 2015; Spurr et al., 2018; Zimmermann & Brox, 2017), few works have been developed for continuous dynamic sign language (Cihan Camgöz et al., 2017; Cui et al., 2019; Koller, Zargaran et al., 2016; Mocialov et al., 2017; Pu et al., 2018; Wei et al., 2019). Such models may require mechanisms for spatio-temporal modeling of the structures and patterns that localize signs, gestures, and poses. While these mechanisms increase the model complexity, they help to improve the model performance by providing discriminative features. Benefiting from the parallel computation capabilities of deep learning, using more accurate fusion methods to integrate multiple input modalities, especially in continuous dynamic sign language, and employing combinations of fast traditional methods with deep-based models can help to decrease the complexity of the models in sign language recognition and related areas.
• Applications: With respect to applications, deep learning approaches have been successfully applied in many areas related to sign language recognition, such as machine translation, voice assistants, text assistants, and so on (Deng et al., 2017; John, Boyali, Mita, Imanishi, & Sanma, 2016; Mittal, 2018; Supancic et al., 2018). We expect that the sign language application area will be extended in the future, not only for deaf and speech-impaired people but also for other people in society who rely on signing as a complementary language to verbal communication in daily interactions.
• Hybrid models: While there is a rising trend towards employing deep learning-based models in the research community, this does not mean that traditional computer vision techniques have become obsolete. Some problems may benefit from a trade-off between the powerful capabilities of deep learning (in particular in those cases with large amounts of data) and the specific problem-tailored design of handcrafted features. This combination can help to develop more accurate systems in some special domains.
• Other challenges: One of the most important challenges in the sign language recognition area is occlusion. Since each sign may consist of the whole or part of the hand, face, and body movements, occlusion of these parts during signing can lead to more complicated situations. Another important challenge is real-time sign language recognition. We think that the research community will pay much more attention to deep learning models for real-time sign language recognition in the future. To have more sophisticated and realistic models for applications in the deaf and speaking communities, we need a real-time translation system to connect these people to other people in the community. Some efforts have been made in this area and some models have been suggested, but much more improvement is needed. Another challenge is multi-person sign language recognition, which could be considered further in future models. Most of the proposed models only consider the signs of one individual person, without taking into account the signs of other persons.
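As a minimal illustration of the multi-modal fusion discussed in the fusion item above, the following sketch (our own simplified PyTorch example; module names, layer sizes, and the number of classes are assumptions, not taken from any reviewed model) encodes RGB and depth inputs separately and concatenates the resulting features before classification.

```python
# Minimal sketch of feature-level RGB + depth fusion for isolated sign
# classification (illustrative only; layer sizes and names are assumptions,
# not taken from any specific reviewed model).
import torch
import torch.nn as nn

class SmallEncoder(nn.Module):
    """Tiny CNN that maps an input image to a fixed-length feature vector."""
    def __init__(self, in_channels: int, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),   # global average pooling
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

class RGBDFusionClassifier(nn.Module):
    """Late fusion: encode each modality separately, then concatenate."""
    def __init__(self, num_classes: int, feat_dim: int = 128):
        super().__init__()
        self.rgb_encoder = SmallEncoder(in_channels=3, feat_dim=feat_dim)
        self.depth_encoder = SmallEncoder(in_channels=1, feat_dim=feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb_encoder(rgb), self.depth_encoder(depth)], dim=1)
        return self.classifier(fused)

# Example: a batch of 4 RGB frames (3x224x224) and aligned depth maps (1x224x224).
model = RGBDFusionClassifier(num_classes=26)
logits = model(torch.randn(4, 3, 224, 224), torch.randn(4, 1, 224, 224))
print(logits.shape)  # torch.Size([4, 26])
```

More elaborate fusion strategies (earlier fusion of raw channels, attention-based weighting, or per-modality streams merged at several depths) follow the same pattern of combining modality-specific features before the final decision.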
Finally, we presented aggregated information about all of the works covered in this survey. Table 11 shows the state-of-the-art models on different datasets in sign language recognition and related areas. As this table shows, the trends of the proposed models on different datasets in sign language and related areas indicate that deep learning approaches have successfully improved model performance by a high margin. However, more effort is necessary for some challenging datasets, such as isoGD, LSP, and EVAL. On most of the existing datasets, such as NYU, ICVL, MSRA, ASL Fingerspelling A, and RKS-PERSIANSIGN, the performance achieved by deep-based models is higher than on the other, more challenging datasets. The reported experimental results of different deep-based models confirm the effective role of using multi-modal and multi-channel information (Duan et al., 2016; Elboushaki et al., 2020; Rastgoo et al., 2018; Spurr et al., 2018). Furthermore, the proposed hybrid models successfully improved model performance by benefiting from the combination of some hand-crafted features with deep-based features (Chen et al., 2020; Escobedo-Cardenas & Camara-Chavez, 2020; Ma et al., 2016; Rastgoo et al., 2020b). These models benefit from a trade-off between the powerful capabilities of deep learning (in particular in those cases with large amounts of data) and the specific problem-tailored design of handcrafted features.
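To make this hand-crafted/deep trade-off concrete, the following simplified sketch (our own illustrative example, not the pipeline of any specific cited work; the HOG parameters, CNN layout, and toy data are assumptions) concatenates handcrafted HOG descriptors with features from a frozen CNN and feeds them to an SVM classifier, in the spirit of hybrid CNN-SVM models.

```python
# Simplified sketch of a hybrid pipeline: handcrafted HOG descriptors are
# concatenated with (frozen) CNN features and classified with an SVM.
# All names and sizes here are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn
from skimage.feature import hog
from sklearn.svm import SVC

# Frozen CNN used purely as a feature extractor (pretrained weights could be
# loaded here instead of the random initialization used for brevity).
cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
).eval()

def extract_features(gray_image: np.ndarray) -> np.ndarray:
    """Concatenate HOG and CNN features for one grayscale hand crop."""
    hog_feat = hog(gray_image, pixels_per_cell=(16, 16), cells_per_block=(2, 2))
    with torch.no_grad():
        tensor = torch.from_numpy(gray_image).float()[None, None]  # (1, 1, H, W)
        cnn_feat = cnn(tensor).squeeze(0).numpy()
    return np.concatenate([hog_feat, cnn_feat])

# Toy data: 20 random 64x64 "hand crops" with binary labels.
rng = np.random.default_rng(0)
images = rng.random((20, 64, 64))
labels = rng.integers(0, 2, size=20)

features = np.stack([extract_features(img) for img in images])
clf = SVC(kernel="rbf").fit(features, labels)
print(clf.predict(features[:5]))
```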
Due to the undeniable power of CNN models for feature extraction from visual inputs, most of the proposed deep-based models employ a CNN or a combination of a CNN with other deep-based models. Generative models, such as the RBM and the VAE, showed comparable or better performance than other deep alternatives when coping with limited data for sign language recognition (Rastgoo et al., 2018; Spurr et al., 2018). Since the dynamic modality is more challenging than the static one, most of the proposed models employed an LSTM or a 3DCNN for analyzing temporal dynamics. Table 12 shows a summary of the main characteristics relevant to sign language recognition regarding the reviewed models.
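As an illustration of the dominant pattern just described, a per-frame CNN encoder followed by an LSTM over the frame features, the sketch below shows a minimal isolated sign classifier (sizes, names, and the toy input are illustrative assumptions, not the architecture of any particular reviewed model).

```python
# Minimal CNN+LSTM sketch for isolated (video-level) sign classification.
# Illustrative example of the common pattern noted above; sizes and names
# are assumptions, not the exact architecture of any reviewed model.
import torch
import torch.nn as nn

class CNNLSTMSignClassifier(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int = 128, hidden: int = 256):
        super().__init__()
        # Per-frame spatial encoder (a pretrained 2D CNN backbone could be used instead).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Temporal model over the sequence of frame features.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, video):                              # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.frame_encoder(video.flatten(0, 1))    # (B*T, feat_dim)
        feats = feats.view(b, t, -1)                       # (B, T, feat_dim)
        _, (h_n, _) = self.lstm(feats)                     # h_n: (1, B, hidden)
        return self.head(h_n[-1])                          # (B, num_classes)

# Example: a batch of 2 clips, 16 frames each, 112x112 RGB.
model = CNNLSTMSignClassifier(num_classes=64)
logits = model(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 64])
```

A 3DCNN-based variant would replace the per-frame encoder and LSTM with spatio-temporal convolutions over short clips; both designs serve the same purpose of modeling temporal dynamics on top of spatial features.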
Next, we describe possible limitations of this survey and discuss some future recommendations for advancing research in the field.
• Limitations: In this survey, we reviewed the vision-based models proposed for sign language recognition and related areas using deep learning approaches over the last five years. The main goal of this survey is to compactly summarize the vision-based sign language recognition models together with their achieved results. This can facilitate the work of other researchers in the field by giving access to the latest developments, advantages, limitations, and future directions in sign language recognition. While we present the hybrid models in sign language recognition, we did not include the sensor-based models nor the traditional models. There are many sensor-based modalities considered for sign language recognition. Furthermore, other modalities obtained from other data collection devices can also be considered for possible usage. We presented a brief review in terms of the application domain of sign language recognition. Due to the importance of this domain for the deaf and speech-impaired community, more details of this domain must be studied to open new windows for proposing applications compatible with real-world conditions. This may include considering in more detail the real needs in practice of this technology and its associated usability, privacy, generalization to different populations, and ethical dimensions.
• Future directions: While many models have been proposed for sign language recognition, more effort is required to provide models that are more accurate and useful in practice. We envision that a multi-modal integration of face, body, and hand visual cues will significantly enhance the recognition performance of current models, providing the fine-grained recognition analysis required in practice. We foresee that most of the challenges in the sign language recognition area will be solved with the support of deep learning, faster hardware to process the input data, accurate multi-modal approaches, and new data covering the real variability and distribution of the problem at hand. While most of the presented models are in the scope of isolated sign language recognition, we expect the community to move in the near future towards addressing the challenges of continuous sign language recognition, including continuous annotated datasets, tokenization, and long-term multi-modal modeling of data, especially benefiting from the integration of vision and language models.

CRediT authorship contribution statement

Razieh Rastgoo: Methodology, Software, Validation, Data curation, Writing - original draft, Visualization. Kourosh Kiani: Conceptualization, Data curation, Writing - review & editing, Supervision, Project administration. Sergio Escalera: Conceptualization, Writing - review & editing, Supervision, Project administration.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work has been partially supported by the Spanish project PID2019-105093GB-I00 (MINECO/FEDER, UE) and CERCA Programme/Generalitat de Catalunya, ICREA under the ICREA Academia programme, and High Intelligent Solution (HIS) company in Iran.

Funding

This research received no external funding.

References

Acton, B., & Koum, J. (2009). WhatsApp. Yahoo, www.whatsapp.com.
Adaloglou, N., Chatzis, T., Papastratis, I., Stergioulas, A., Papadopoulos, G., Zacharopoulou, V., Xydopoulos, G., Atzakas, K., Papazachariou, D., & Daras, P. (2019). A comprehensive study on sign language recognition methods. IEEE Transactions on Multimedia.
Andriluka, M., Pishchulin, L., Gehler, P., & Bernt, S. (2014). 2D human pose estimation: New benchmark and state of the art analysis. In CVPR. Columbus, Ohio.
Asadi-Aghbolaghi, M., Clapés, A., Bellantonio, M., Jair Escalante, H., Ponce-López, V., Baró, X., Guyon, I., Kasaei, S., & Escalera, S. (2017). Deep learning for action and gesture recognition in image sequences: A survey. Gesture Recognition, 539–578.
Baek, S., Kim, K., & Kim, T.-K. (2018). Augmented skeleton space transfer for depth-based hand pose estimation. In CVPR (pp. 8330–8339). Salt Lake City, Utah, United States.
Bambach, S., Lee, S., Crandall, D., & Yu, C. (2015). Lending A hand: Detecting hands and recognizing activities in complex egocentric interactions. In ICCV. Las Condes, Chile.
Baró, X., Gonzàlez, J., Fabian, J., Bautista, M., Oliu, M., Escalante, H., Guyon, I., & Escalera, S. (2015). ChaLearn Looking at People 2015 challenges: action spotting and cultural event recognition. In CVPR 2015. Boston, Massachusetts.
Barsoum, E. (2016). Articulated hand pose estimation review. arXiv:1604.06195.
Bin, Y., Chen, Z., Wei, X., Chen, X., Gao, C., & Sang, N. (2020). Structure-aware human pose estimation with graph convolutional networks. Pattern Recognition, 106, Article 107411.
Camgoz, N., Hadfield, S., Koller, O., Ney, H., & Bowden, R. (2018). Neural sign language translation. In CVPR (pp. 7784–7793). Utah, United States.
Cao, Z., Simon, T., Wei, S., & Sheikh, Y. (2017). Real-time multi-person 2D pose estimation using part affinity fields. In CVPR. Hawaii, United States.
Chai, X., Guang, L., Lin, Y., Xu, Z., Tang, Y., Chen, X., & Zhou, M. (2013). Sign language recognition and translation with kinect.
Chen, T. Y., Ting, P. W., Wu, M. Y., & Fu, L. C. (2018). Learning a deep network with spherical part model for 3D hand pose estimation. Pattern Recognition, 80, 1–20.
Chen, H., Tong, R., Chen, M., Fang, Y., & Liu, H. (2018). A hybrid CNN-SVM classifier for hand gesture recognition with surface EMG signals. In 2018 international conference on machine learning and cybernetics (ICMLC) (pp. 619–624).
Chen, X., Wang, G., Guo, H., & Zhang, C. (2020). Pose guided structured region ensemble network for cascaded hand pose estimation. Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2018.06.097.
Chen, C., Zhang, B., Hou, Z., Jiang, J., Liu, M., & Yang, Y. (2017). Action recognition from depth sequences using weighted fusion of 2D and 3D auto-correlation of gradients features. Multimedia Tools and Applications, 76, 4651–4669.
Chen, Y., Zhao, L., Peng, X., Yuan, J., & Metaxas, D. N. (2019). Construct dynamic graphs for hand gesture recognition via spatial-temporal attention. In BMVC, UK (pp. 1–13).
Cheok, M., Omar, Z., & Jaward, M. (2017). A review of hand gesture and sign language recognition techniques. International Journal of Machine Learning and Cybernetics, 1–23.
Cheron, G., Laptev, I., & Schmid, C. (2015). P-CNN: Pose-based CNN features for action recognition. In IEEE international conference on computer vision (ICCV). Chile.
Cihan Camgöz, N., Hadfield, S., Koller, O., & Bowden, R. (2017). SubUNets: End-to-end hand shape and continuous sign language recognition. In IEEE international conference on computer vision (ICCV) 2017. Venice, Italy.
Cooper, H., Ong, W., Pugeault, N., & Bowden, R. (2012). Sign language recognition using sub-units. Journal of Machine Learning Research, 13, 2205–2231.
Cui, R., Liu, H., & Zhang, C. (2019). A deep neural framework for continuous sign language recognition by iterative training. IEEE Transactions on Multimedia, 21(7), 1880–1891.
Dadashzadeh, A., Tavakoli Targhi, A., & Tahmasbi, M. (2018). HGR-Net: A two-stage convolutional neural network for hand gesture segmentation and recognition. arXiv:1806.05653.
Deng, X., Yang, S., Zhang, Y., Tan, P., Chang, L., & Wang, H. (2017). Hand3D: Hand pose estimation using 3D neural network. arXiv:1704.02224.
Devineau, G., Xi, W., Moutarde, F., & Yang, J. (2018). Deep learning for hand gesture recognition on skeletal data. In 13th IEEE conference on automatic face and gesture recognition. China.
Dibia, V. (2017). Handtrack: A library for prototyping real-time hand tracking interfaces using convolutional neural networks. GitHub Repository, https://github.com/victordibia/handtracking/tree/master/docs/handtrack.pdf.
Dibra, E., Wolf, T., Oztireli, C., & Gross, M. (2017). How to refine 3D hand pose estimation from unlabelled depth data?. In International conference on 3D vision (3DV). Qingdao, China.
Doersch, C. (2016). Tutorial on variational autoencoders. arXiv:1606.05908.
Doosti, B. (2019). Hand pose estimation: A survey. arXiv:1903.01013.
Duan, J., Zhou, S., Wan, J., Guo, X., & Li, S. Z. (2016). Multi-modality fusion based on consensus-voting and 3D convolution for isolated gesture recognition. arXiv:1611.06689.
Elboushaki, A., Hannane, R., Afdel, K., & Koutti, L. (2020). MultiD-CNN: A multi-dimensional feature learning approach based on deep convolutional networks for gesture recognition in RGB-D image sequences. Expert Systems With Applications, 139.
Escalera, S., Athitsos, V., & Guyon, I. (2016). Challenges in multi-modal gesture recognition. Journal of Machine Learning Research, 17, 1–54.
Escalera, S., Gonzàlez, J., Baró, X., Reyes, M., Lopés, O., Guyon, I., Athitsos, V., & Escalante, H. (2013). Multi-modal gesture recognition challenge 2013: dataset and results. In 15th ACM conference, Sydney, Australia. http://dx.doi.org/10.1145/2522848.2532595.
Escobedo-Cardenas, E., & Camara-Chavez, G. (2015). A robust gesture recognition using hand local data and skeleton trajectory. In 2015 IEEE international conference on image processing (ICIP), Quebec City, QC, 2015 (pp. 1240–1244).
Escobedo-Cardenas, E., & Camara-Chavez, G. (2020). Multi-modal hand gesture recognition combining temporal and pose information based on CNN descriptors and histogram of cumulative magnitudes. Journal of Visual Communication and Image Representation.
Fang, X., & Lei, X. (2017). Hand pose estimation on hybrid CNN-AE model. In Proceedings of the 2017 IEEE international conference on information and automation (ICIA), China.
Ferreira, P., Cardoso, J., & Rebelo, A. (2019). On the role of multi-modal learning in the recognition of sign language. Multimedia Tools and Applications, 78, 10035–10056.
Fischer, A., & Igel, C. (2012). An introduction to restricted Boltzmann machines. In Proceedings of the 17th Iberoamerican congress on pattern recognition (CIARP 2012), LNCS 7441, Buenos Aires, Argentina. http://dx.doi.org/10.1007/978-3-642-33275-3_2.
Forster, J., Schmidt, C., Hoyoux, T., Koller, O., Zelle, U., Piater, J., & Ney, H. (2012). RWTH-PHOENIX-weather: A large vocabulary sign language recognition and translation corpus. In International conference on language resources and evaluation. Istanbul, Turkey.
Forster, J., Schmidt, C., Koller, O., Bellgardt, M., & Ney, H. (2014). Extensions of the sign language recognition and translation corpus RWTH-PHOENIX-weather. In International conference on language resources and evaluation (LREC), Harpa conference centre, Reykjavik, Iceland.
Frederic, B., Lamblin, P., Pascanu, R., et al. (2012). Theano: new features and speed improvements. In NIPS workshop, Canada. http://deeplearning.net/sofware/theano/.
Ganapathi, V., Plagemann, C., Koller, D., & Thrun, S. Real-time human pose tracking from range data. In ECCV (pp. 738–751). Italy.
Gattupalli, S., Ghaderi, A., & Athitsos, V. (2016). Evaluation of deep learning based pose estimation for sign language recognition. arXiv:1602.09065.
Ge, L., Liang, H., Yuan, J., & Thalmann, D. (2017). 3D convolutional neural networks for efficient and robust hand pose estimation from single depth images. In CVPR (pp. 1991–2000). Hawaii, United States.
Ge, L., Liang, H., Yuan, J., & Thalmann, D. (2018). Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. IEEE Transactions on Image Processing.
Ge, L., Ren, Z., & Yuan, J. (2018). Point-to-point regression pointnet for 3D hand pose estimation. In ECCV (pp. 1–17). Munich, Germany.
Girshick, R. (2015). Fast R-CNN. In 2015 IEEE international conference on computer vision (ICCV), Santiago, Chile. http://dx.doi.org/10.1109/ICCV.2015.169.
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2015). Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1), 142–158.
Gomez-Donoso, F., Orts-Escolano, S., & Cazorla, M. (2019). Accurate and efficient 3D hand pose regression for robot hand tele-operation using a monocular RGB camera. Expert Systems With Applications, 136, 327–337.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In NIPS. Montreal, Canada.
Grosse, R. (2017). CSC321 Lecture 20: Autoencoders. Toronto University, http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/slides/lec20.pdf.
Guo, H., Wang, G., & Chen, X. (2017). Towards good practices for deep 3D hand pose estimation. arXiv:1707.07248.
Han, M., Chen, J., Li, L., & Chang, Y. (2016). Visual hand gesture recognition with convolution neural network. In 17th IEEE/ACIS international conference on software engineering, artificial intelligence, networking and parallel/distributed computing (SNPD). China.
Haque, A., Peng, B., Luo, Z., Alahi, A., Yeung, S., & Fei-Fei, L. (2016). Towards viewpoint invariant 3D human pose estimation. In ECCV. Amsterdam, Netherlands.
Hinton, G. (2007). Deep belief nets. In NIPS. Vancouver, B.C., Canada.
Hinton, G., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Huang, J., Zhou, W., Li, H., & Li, W. (2015). Sign language recognition using 3D convolutional neural networks. In IEEE international conference on multimedia and expo (ICME). Turin, Italy.
Ionescu, C., Papava, D., Olaru, V., & Sminchisescu, C. (2014). Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Marín-Jiménez, M., Romero-Ramirez, F., Munoz-Salinas, R., & Medina-Carnicer, R. (2018). 3D human pose estimation from depth maps using a deep combination of poses. Journal of Visual Communication and Image Representation, 627–639.
John, V., Boyali, A., Mita, S., Imanishi, M., & Sanma, N. (2016). Deep learning-based fast hand gesture recognition using representative frames. In International conference on digital image computing: techniques and applications (DICTA). Australia.
Kang, B., Tripathi, S., & Nguyen, T. (2015). Real-time sign language finger-spelling recognition using convolutional neural networks from depth map. In 3rd IAPR Asian conference on pattern recognition (ACPR). Kuala Lumpur, Malaysia.
Kapuscinski, T., Oszust, M., Wysocki, M., & Warchol, D. (2015). Recognition of hand gestures observed by depth cameras. International Journal of Advanced Robotic Systems, 12(4).
Kazakos, E., Nikou, C., & Kakadiaris, I. (2018). On the fusion of rgb and depth information for hand pose estimation. In 25th IEEE international conference on image processing (ICIP) (pp. 868–872). Athens, Greece.
Kim, S., Ban, Y., & Lee, S. (2017). Tracking and classification of in-air hand gesture based on thermal guided joint filter. Sensors.
Kocabas, M., Karagoz, S., & Akbas, E. (2018). MultiPoseNet: Fast multi-person pose estimation using pose residual network. In CVPR. Utah, United States.
Koller, O., Forster, J., & Ney, H. (2015). Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding, 141, 108–125.
Koller, O., Ney, H., & Bowden, R. (2015). Deep learning of mouth shapes for sign language. In IEEE international conference on computer vision workshop (ICCVW), Santiago, Chile.
Koller, O., Zargaran, S., Ney, H., & Bowden, R. (2016). Deep sign: Hybrid CNN-HMM for continuous sign language recognition. In BMVC, UK.
Kopuklu, O., Gunduz, A., Kose, N., & Rigoll, G. (2019). Real-time hand gesture detection and classification using convolutional neural networks. arXiv:1901.10323.
Le, T., Jaw, D., Lin, I., Liu, H., & Huang, S. (2018). An efficient hand detection method based on convolutional neural network. In 7th IEEE international symposium on next-generation electronics. Taipei, Taiwan.
Li, Y., Xue, Z., Wang, Y., Ge, L., Ren, Z., & Rodriguez, J. (2019). End-to-end 3D hand pose estimation from stereo cameras. In BMVC. UK.
Lifshitz, I., Fetaya, E., & Ullman, S. (2016). Human pose estimation using deep consensus voting. In ECCV (pp. 246–260).
Lim, K., Tan, A., Lee, C., & Tan, S. (2019). Isolated sign language recognition using convolutional neural network hand modelling and hand energy image. Multimedia Tools and Applications, 78, 19917–19944.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., & Berg, A. (2016). SSD: Single shot multibox detector. In ECCV (pp. 21–37). Amsterdam, Netherlands.
Liu, L., & Shao, L. (2013). Learning discriminative representations from RGB-D video data. In Proceedings of the twenty-third international joint conference on artificial intelligence (IJCAI). Beijing, China.
Ma, M., Chen, Z., & Wu, J. (2016). A recognition method of hand gesture with CNN-SVM model. In International conference on bio-inspired computing: theories and applications (pp. 399–404). Harbin, China.
Madadi, M., Bertiche, H., & Escalera, S. (2020). SMPLR: Deep SMPL reverse for 3D human pose and shape recovery. Pattern Recognition, 106, http://dx.doi.org/10.1016/j.patcog.2020.107472.
Madadi, M., Escalera, S., Baro, X., & Gonzalez, J. (2017). End-to-end global to local CNN learning for hand pose recovery in depth data. arXiv:1705.09606.
Matilainen, M., Sangi, P., Holappa, J., & Silven, O. (2016). OUHANDS database for hand detection and pose recognition. In International conference on image processing theory, tools and applications, Finland. http://dx.doi.org/10.1109/IPTA.2016.7821025.
McCulloch, W., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biology, 5, 115–133.
Mittal, V. (2018). Top 15 deep learning applications that will rule the world in 2018 and beyond. www.medium.com.
Mocialov, B., Turner, G., Lohan, K., & Hastie, H. (2017). Towards continuous sign language recognition with deep learning. semanticscholar. https://www.semanticscholar.org/paper/Towards-Continuous-Sign-Language-Recognition-with-Mocialov-Turner/f24c82e85906bc7325b296d37370febd65833fdd.
Molchanov, P., Gupta, S., Kim, K., & Kautz, J. (2015). Hand gesture recognition with 3D convolutional neural networks. In IEEE conference on computer vision and pattern recognition workshops (CVPRW). Boston, Massachusetts.
Molchanov, P., Yang, X., Gupta, S., Kim, K., Tyree, S., & Kautz, J. (2016). Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks. In IEEE conference on computer vision and pattern recognition (CVPR). Las Vegas, NV, USA.
Moon, G., Chang, J., & Lee, K. (2018). V2V-PoseNet: Voxel-to-voxel prediction network for accurate 3D hand and human pose estimation from a single depth map. In CVPR. Salt Lake City, Utah, United States.
Mueller, F., Bernard, F., Sotnychenko, O., Mehta, D., Sridhar, S., Casas, D., & Theobalt, C. (2018). GANerated hands for real-time 3D hand tracking from monocular RGB. In CVPR, Salt Lake City, Utah, United States (pp. 1–11). http://dx.doi.org/10.1109/CVPR.2018.00013.
Murray, J. (2018). World Federation of the Deaf. Rome, Italy. Retrieved from http://wfdeaf.org/our-work/. (Accessed 30 January 2020).
MXNET (2020). MXNET. Available online (accessed June 2020).
Neverova, N., Wolf, C., Taylor, G., & Nebout, F. (2014). Hand segmentation with structured convolutional learning. In Asian conference on computer vision (ACCV) 2014: Computer vision (pp. 687–702). Singapore.
Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. In European conference on computer vision (ECCV) (pp. 483–499).
Oberweger, M., Riegler, G., Wohlhart, P., & Lepetit, V. (2016). Efficiently creating 3D training data for fine hand pose estimation. In CVPR. Nevada, United States.
Oberweger, M., Wohlhart, P., & Lepetit, V. (2015). Hands deep in deep learning for hand pose estimation. In Proceedings of 20th computer vision winter workshop (CVWW) (pp. 21–30).
Oszust, M., & Wysocki, M. (2013). Polish sign language words recognition with Kinect. In 6th international conference on human system interactions (HSI). Sopot, Poland.
Pagebites, I. (2018). Imo. United States. http://www.imo.com.
Pu, J., Zhou, W., & Li, H. (2018). Dilated convolutional network with iterative optimization for continuous sign language recognition. In IJCAI18: Proceedings of the 27th international joint conference on artificial intelligence. Stockholm.
Pugeault, N., & Bowden, R. (2011). Spelling it out: Real-time ASL finger-spelling recognition. In Proceedings of the 1st IEEE workshop on consumer depth cameras for computer vision, jointly with ICCV'2011. Barcelona, Spain.
Rao, G., Syamala, K., Kishore, P., & Sastry, A. (2018). Deep convolutional neural networks for sign language recognition. In Conference on signal processing and communication engineering systems (SPACES). India.
Rastgoo, R., Kiani, K., & Escalera, S. (2018). Multi-modal deep hand sign language recognition in still images using restricted Boltzmann machine. Entropy.
Rastgoo, R., Kiani, K., & Escalera, S. (2020a). Hand sign language recognition using multi-view hand skeleton. Expert Systems With Applications, 150.
Rastgoo, R., Kiani, K., & Escalera, S. (2020b). Video-based isolated hand sign language recognition using a deep cascaded model. Multimedia Tools and Applications, http://dx.doi.org/10.1007/s11042-020-09048-5.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. arXiv:1506.02640.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS. Quebec, Canada.
Ronchetti, F., Quiroga, F., Estrebou, C., & Lanzarini, L. (2016). Handshape recognition for Argentinian sign language using ProbSOM. Journal of Computer Science & Technology, 16(1).
Ronchetti, F., Quiroga, F., Estrebou, C., Lanzarini, L., & Rosete, A. (2016). LSA64: An Argentinian sign language dataset. In Congreso Argentino de Ciencias de la Computación (CACIC 2016).
Canuto-dos Santos, C., Leonid-Aching-Samatelo, J., & Frizera-Vassallo, R. (2020). Dynamic gesture recognition by using CNNs and star RGB: A temporal information condensation. Neurocomputing, 400, 238–254.
Sapp, B., & Taskar, B. (2013). MODEC: Multi-modal decomposable models for human pose estimation. In CVPR. Portland, Oregon.
Simon, T., Joo, H., Matthews, I., & Sheikh, Y. (2017). Hand keypoint detection in single images using multi-view bootstrapping. arXiv:1704.07809.
Sinha, A., Choi, C., & Ramani, K. (2016). DeepHand: Robust hand pose estimation by completing a matrix imputed with deep features. In CVPR (pp. 4150–4159). Las Vegas, NV, USA.
Smedt, Q., Wannous, H., & Vandeborre, J. (2016). Dynamic hand gesture recognition using skeleton-based features. In CVPRW. Las Vegas, Nevada, United States.
Spurr, A., Song, J., Park, S., & Hilliges, O. (2018). Cross-modal deep variational hand pose estimation. In CVPR (pp. 89–98). Salt Lake City, Utah, United States.
Supancic, J., Rogez, G., Yang, Y., Shotton, J., & Ramana, D. (2018). Depth-based hand pose estimation: methods, data, and challenges. International Journal of Computer Vision, 1180–1198.
Tagliasacchi, A., Schröder, M., Tkach, A., Bouaziz, S., Botsch, M., & Pauly, M. (2015). Robust articulated-ICP for real-time hand tracking. In Eurographics symposium on geometry processing.
Tang, A., Lu, K., Wang, Y., Huang, J., & Li, H. (2015). A real-time hand posture recognition system using deep neural networks. In ACM transactions on intelligent systems and technology (TIST) - special section on visual understanding with RGB-D sensors.
TensorFlow (2020). TensorFlow. Available online (accessed June 2020).
Thangali, A., Nash, J., Sclaroff, S., & Neidle, C. (2011). Exploiting phonological constraints for handshape inference in ASL video. In CVPR. USA.
Tompson, J., Stein, M., Lecun, Y., & Perlin, K. (2014). Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics, 33, 1–10.
Toshev, A., & Szegedy, C. (2014). DeepPose: Human pose estimation via deep neural networks. arXiv:1312.4659.
Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M., Laptev, I., & Schmid, C. (2017). Learning from synthetic humans. In CVPR. Hawaii, United States.
Voulodimos, A., Doulamis, N., Doulamis, A., & Protopapadakis, E. (2018). Deep learning for computer vision: A brief review. Hindawi Computational Intelligence and Neuroscience, 1–13. http://dx.doi.org/10.1155/2018/7068349.
Wadhawan, A., & Kumar, P. (2020). Deep learning-based sign language recognition system for static signs. Neural Computing and Applications, 1–12. http://dx.doi.org/10.1007/s00521-019-04691-y.
Wan, J., Zhao, Y., Zhou, S., Guyon, I., Escalera, S., & Li, S. (2016). ChaLearn looking at people RGB-D isolated and continuous datasets for gesture recognition. In CVPRW 2016. Nevada, United States.
Wang, T. (2016). Recurrent neural network. Machine Learning Group, University of Toronto, for CSC 2541, Sport Analytics, https://www.cs.toronto.edu/~tingwuwang/rnn_tutorial.pdf.
Wang, M., Chen, X., Liu, W., Qian, C., Lin, L., & Ma, L. (2018). DRPose3D: Depth ranking in 3D human pose estimation. In Proceedings of the twenty-seventh international joint conference on artificial intelligence (IJCAI-18) (pp. 978–984).
Wang, P., Li, W., Liu, S., Gao, Z., Tang, C., & Ogunbona, P. (2017). Large-scale isolated gesture recognition using convolutional neural networks. arXiv:1701.01814.
Wei, S., Ramakrishna, V., Kanade, T., & Sheikh, Y. (2016). Convolutional pose machines. In CVPR. Las Vegas, Nevada.
Wei, C., Zhou, W., Pu, J., & Li, H. (2019). Deep grammatical multi-classifier for continuous sign language recognition. In 2019 IEEE fifth international conference on multimedia big data (BigMM). Singapore.
Wu, J. (2019). Convolutional neural networks. LAMDA Group, National Key Lab for Novel Software Technology, Nanjing University, China, https://cs.nju.edu.cn/wujx/teaching/15_CNN.pdf.
Wu, J., Chen, J., Ishwar, P., & Konrad, J. (2016). Two-stream CNNs for gesture-based verification and identification: learning user style. In Computer vision and pattern recognition (CVPR). Las Vegas, Nevada.
Yan, S., Xia, Y., Smith, J., Lu, W., & Zhang, B. (2017). Multi-scale convolutional neural networks for hand detection. Applied Computational Intelligence and Soft Computing, 2017.
Yang, Y., Li, Y., Fermuller, C., & Aloimonos, Y. (2015). Robot learning manipulation action plans by "watching" unconstrained videos from the world wide web. In Proceedings of the twenty-ninth AAAI conference on artificial intelligence.
Ye, Y., Tian, Y., Huenerfauth, M., & Liu, J. (2018). Recognizing American sign language gestures from within continuous videos. In CVPR. Utah, United States.
Yuan, S., Ye, Q., Stenger, B., Jain, S., & Kim, T.-K. (2017). BigHand2.2M benchmark: Hand pose dataset and state of the art analysis. In CVPR. Honolulu, Hawaii, USA.
Zheng, L., Liang, B., & Jiang, A. (2017). Recent advances of deep learning for sign language recognition. In 2017 international conference on digital image computing: techniques and applications (DICTA), Sydney, NSW, Australia. IEEE.
Zhou, X., Wan, Q., Zhang, W., Xue, X., & Wei, Y. (2016). Model-based deep hand pose estimation. In IJCAI.
Zimmerman, T., Lanier, J., Blanchard, C., Bryson, S., & Harvill, Y. (1987). A hand gesture interface device. In 1987 Proceedings of the SIGCHI/GI conference on human factors in computing systems and graphics interface, Toronto, Ontario, Canada (pp. 189–192).
Zimmermann, C., & Brox, T. (2017). Learning to estimate 3D hand pose from single RGB images. In ICCV. Venice, Italy.