Video-Based Sign Language Recognition With R21D and LSTM Networks
Abstract—Sign language, serving as a crucial means of communication for individuals with hearing impairments, plays a pivotal role in achieving barrier-free communication with the broader population. Despite advancements in utilizing deep learning for automatic recognition of dynamic sign language, challenges persist, especially in video-based sign language recognition, which requires careful consideration of spatial and temporal features along with the inherent long-term characteristics of the sequence. In this paper, we introduce a novel sign language recognition framework that integrates the R(2+1)D and LSTM networks. Initially, the R(2+1)D network is employed to extract spatial and temporal features from sign language sequences. Subsequently, these features are input into the LSTM network to capture long-term features and eliminate redundant information. The recognition of the video sequence is accomplished through a fully connected layer. We evaluate the proposed network on two distinct sign language datasets, achieving 96.21% accuracy on CSL and 99.69% on LSA64. Our method demonstrates significant improvements over existing methods.

Keywords—Sign Language Recognition, R(2+1)D convolution, LSTM, Deep Learning

I. INTRODUCTION

Language serves as the cornerstone of social interactions, yet hearing-impaired individuals face global challenges in seamlessly integrating into social contexts due to communication barriers [1]. Sign language, a natural mode of communication extensively employed by people with hearing and language impairments, relies on visual elements like gestures, body movements, and facial expressions. Despite its significance, there is no universally recognized sign language, and even within the same country, diverse regions may employ distinct sign language dialects, creating substantial hurdles for users in their daily interactions [2]. Sign language, being relatively abstract, requires systematic training for comprehension; Fig. 1 shows the manifestation of sign language words. To address this, visual-based sign language recognition technology has emerged. This technology can automatically identify expressed sign language words from video streams, employing non-invasive technical methods. By doing so, it offers a convenient auxiliary tool for sign language users, facilitating smoother communication.

(a) "New Year's Day"  (b) "Valentine's Day"
Figure 1. The diagram shows the manifestation of single dynamic sign language words and the rules they form. Numbers represent the display order, sign language words are composed of postures, and dashed lines with arrows represent the rules of gesture movement. The left is a sign language word based on gestures, and the right is a sign language word based on a composite posture combination; the meanings are "New Year's Day" and "Valentine's Day" [19].

Figure 2. The image shows the keyframes of sign language poses for each video; the sign language meaning is "Partial".

With the ascent of deep learning, data-driven approaches to isolated sign language word recognition have demonstrated commendable performance on widely-used datasets [3]. Due to constraints in computing resources and data volume, prevailing continuous sign language recognition methods typically leverage two-dimensional convolutional neural networks (CNNs) to extract visual features frame by frame. Subsequently, CNNs are employed to extract temporal features for recognition [4]. This sequential extraction of visual and temporal features not only mitigates the demand for computing resources but also diminishes overfitting issues, resulting in improved generalization performance. Fig. 2 shows the keyframes contained in sign language words, and we use deep learning networks to extract features from these keyframes. Koller et al. [5] achieved notable recognition rates on the PHOENIX-2014 dataset by integrating CNNs with Hidden Markov Models (HMMs) for continuous sign language sentences. Addressing the temporal dynamics of sign language videos, Tran et al. extended traditional 2D convolution to 3D convolution, capturing temporal features between video frames [6]. Pigou et al. [7] utilized a
CNN structure to capture hand features of the human body, successfully constructing an Italian sign language recognition system with an accuracy of 91.7% on Italian sign language datasets. Cui et al. [8] adopted Connectionist Temporal Classification (CTC) to label time segments; they combined CNNs and Recurrent Neural Networks (RNNs) to enhance the recognition rate of sign language videos, extracting advanced features from the temporal information embedded in video sequences.

While the aforementioned methods prove effective in sign language recognition, the current feature extraction approach still grapples with several limitations. One notable issue lies in the substantial redundancy within video data, where frame-by-frame visual feature extraction introduces a plethora of redundant calculations. Moreover, the reliance on highly abstract visual features for temporal feature extraction raises concerns about insufficient modeling of temporal motion. Given that sign language is inherently a visual language rich in fine-grained hand movements, how to efficiently model the dynamic information conveyed by actions and facial expressions remains a worthwhile research topic.

To analyze hierarchical information within sign language and enhance the efficiency of sign language recognition algorithms, this paper introduces an approach that integrates R(2+1)D [9] and Long Short-Term Memory (LSTM) networks. Our proposed method aims to extract high-quality sign language features through a multi-step process. Initially, the R(2+1)D network is employed to extract spatial and temporal features from sign language video sequences. Subsequently, the LSTM network is leveraged to capture long-term sequence features, allowing for the learning of key information embedded in the temporal sequences. Ultimately, these processed sign language sequences are input for classification, resulting in effective sign language recognition.

The notable contributions of this paper can be succinctly summarized as follows:

• A new end-to-end network fusing R(2+1)D and LSTM is proposed for video-based sign language sequence recognition. It can extract key spatiotemporal features, eliminate redundant information, and distinguish small differences between signs to achieve better classification performance.

• Compared with current methods, the proposed fusion network generalizes better and can be trained not only on larger sign language datasets but also on other sign language databases.

• Through network training, our method validates the effectiveness of the proposed network and achieves efficient video-based sign language recognition.

The organization of this paper is as follows. Section 2 presents a description of the related work, Section 3 describes our method, and Section 4 shows the results. Section 5 draws some conclusions.

II. RELATED WORK

A. 3D Residual Convolutional Neural Network

Deep convolutional neural networks have many parameters, so training them on large-scale datasets is very important. Early 3D convolutional networks were shallow, and when a video dataset is relatively small they easily overfit. Hara et al. [10] proposed the 3D ResNet (R3D) to train deeper networks on larger video datasets; their experiments show that accuracy gradually improves as network depth increases. Liao et al. [11] presented a multimodal dynamic sign language recognition method based on a deep 3-dimensional residual network and bi-directional LSTM networks, named the BLSTM-3D residual network (B3D ResNet).

B. R(2+1)D Convolution Network

While R3D has demonstrated favorable outcomes, optimizing and training it pose challenges due to the parameter and computational complexity associated with 3D convolution. 3D convolution extends two-dimensional spatial convolution to three-dimensional spatiotemporal convolution: its operation covers pixels at corresponding positions in a single video frame as well as in its preceding and following frames, effectively preserving both temporal and spatial information. In response to these challenges, Tran et al. [9] introduced the R(2+1)D convolution network, which decomposes a 3D convolutional kernel into a 2D spatial kernel and a 1D temporal kernel, allowing for the addition of an extra nonlinear layer (activation function) to enhance the model's expressive capacity. Interestingly, experiments following the decomposition revealed that, even when the decomposed network and the 3D network have the same number of parameters, the decomposed network is more amenable to optimization. In a related development, Wang et al. [12] proposed a (2+1)D_SLR network based on R(2+1)D convolution. Distinguishing itself from other methods, this network achieves higher accuracy with faster processing speeds, showcasing its efficiency in sign language recognition.

C. LSTM Network

The Long Short-Term Memory (LSTM) network, introduced by Hochreiter and Schmidhuber [13], is a specialized Recurrent Neural Network (RNN) model distinguished by its unique structural design, adept at mitigating long-term dependency issues. Unlike conventional RNNs, LSTM inherently possesses the ability to retain early information, incurring no additional computational cost for this default behavior. LSTM achieves this by employing four distinct neural network layers that interact in a specialized manner. These layers are strategically designed to facilitate the learning of feature information in sequences. The incorporation of forgetting gates, memory gates, and output gates allows LSTM to selectively control the retention and
transmission of sequence information. The cell state, serving as a repository of information, is utilized to store and transmit relevant data to subsequent LSTM units. This intricate process culminates in the reflection of the learned information into the cell state and output, enabling effective handling of sequential data with long-term dependencies.
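To make the gating behaviour described above concrete, the following is a minimal sketch of a single LSTM cell step in PyTorch-style Python. The feature sizes, initialization, and the fused gate projection are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """Single LSTM step with forget, memory (input), and output gates.

    Illustrative only: dimensions and initialization are assumptions.
    """

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # One linear map produces all four gate pre-activations at once.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        self.hidden_size = hidden_size

    def forward(self, x_t, h_prev, c_prev):
        z = self.gates(torch.cat([x_t, h_prev], dim=-1))
        f, i, o, g = z.chunk(4, dim=-1)
        f = torch.sigmoid(f)        # forgetting gate: what to discard from the old cell state
        i = torch.sigmoid(i)        # memory gate: how much new information to keep
        o = torch.sigmoid(o)        # output gate: what to expose as the hidden output
        g = torch.tanh(g)           # candidate cell update
        c_t = f * c_prev + i * g    # updated cell state
        h_t = o * torch.tanh(c_t)   # hidden output for this time step
        return h_t, c_t

# Usage: one step over a batch of 8 feature vectors of size 512.
cell = LSTMCellSketch(input_size=512, hidden_size=256)
x = torch.randn(8, 512)
h0 = torch.zeros(8, 256)
c0 = torch.zeros(8, 256)
h1, c1 = cell(x, h0, c0)
```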
III. PROPOSED METHOD

A. Overview

In the proposed network, the decomposed 2D spatial and 1D temporal convolutions are connected through ReLU activation functions to form a convolutional network.
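As an illustration of this factorization, the sketch below shows one (2+1)D convolution block in Python/PyTorch. The kernel sizes, channel counts, and the intermediate width (mid_ch) follow the general recipe of Tran et al. [9] but are assumptions for illustration, not the exact configuration of our network.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """A (2+1)D block: a 2D spatial convolution followed by a 1D temporal
    convolution, with an extra ReLU in between (the added nonlinearity that
    motivates the factorization). Hyperparameters are illustrative.
    """

    def __init__(self, in_ch: int, out_ch: int, mid_ch: int,
                 spatial_k: int = 3, temporal_k: int = 3):
        super().__init__()
        # Spatial convolution: 1 x k x k over (frames, height, width) volumes.
        self.spatial = nn.Conv3d(
            in_ch, mid_ch,
            kernel_size=(1, spatial_k, spatial_k),
            padding=(0, spatial_k // 2, spatial_k // 2),
            bias=False)
        self.bn1 = nn.BatchNorm3d(mid_ch)
        # Temporal convolution: k x 1 x 1.
        self.temporal = nn.Conv3d(
            mid_ch, out_ch,
            kernel_size=(temporal_k, 1, 1),
            padding=(temporal_k // 2, 0, 0),
            bias=False)
        self.bn2 = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        x = self.relu(self.bn1(self.spatial(x)))   # extra nonlinearity between the two convolutions
        x = self.relu(self.bn2(self.temporal(x)))
        return x

# Usage: a batch of two 8-frame, 112x112 RGB clips.
block = Conv2Plus1D(in_ch=3, out_ch=64, mid_ch=45)
clip = torch.randn(2, 3, 8, 112, 112)
features = block(clip)   # -> (2, 64, 8, 112, 112)
```

Stacking such blocks with residual connections yields the R(2+1)D backbone used as the spatiotemporal feature extractor.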
The extracted spatiotemporal features are then passed to the LSTM network to achieve the desired prediction results.

Figure 5. The diagram shows the structural information of the LSTM, where C represents the cell state, used to store the current state information and transmit it to the next LSTM unit; h represents the output information of the previous time step, produced by the output gate; f_t represents the forgetting gate, i_t represents the memory gate, and c_t represents the information that needs to be updated.

The equations for an LSTM cell are as follows:

f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + b_f)    (2)
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + b_i)    (3)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + b_o)    (4)
g_t = tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)    (5)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t    (6)
h_t = o_t ⊙ tanh(c_t)    (7)

where σ is the sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes element-wise multiplication. The forgetting gate f_t selects the information to be discarded from the previous cell state.

1) Chinese Sign Language Dataset: For each sign language word, the performers in the training set and the test set are different. Fig. 2 shows samples from the Chinese sign language dataset (https://ustc-slr.github.io/datasets/).

2) Argentinian Sign Language Dataset: LSA64 [14] contains 64 everyday sign language words, each repeated 5 times by 10 sign language users, so the entire dataset contains 3,200 videos. In order to better demonstrate the trajectory characteristics of sign language, each performer wears a pink glove on the right hand and a fluorescent glove on the left hand. In order to verify the generalization ability of the algorithm to non-specific individuals and compare it with other methods, this paper selects samples from 8 signers as the training set and the remaining 2 signers as the test set. Fig. 6 shows samples from LSA64 (http://facundoq.github.io/datasets/lsa64/).

Figure 6. The keyframes of six different sign languages in the LSA64 database. The movement trajectory of the gesture overlaps in position and shape, and the frame on the left comes from the first recorded state.
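Before turning to the results, the following sketch shows how the components described so far fit together: clip-level features from an R(2+1)D backbone are treated as a temporal sequence, passed through an LSTM, and classified with a fully connected layer. The use of torchvision's r2plus1d_18 backbone, the 512-d feature size, the hidden size, and the fixed-length clip split are assumptions for illustration, not the exact architecture trained in this paper.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

class R2Plus1D_LSTM(nn.Module):
    """Sketch of the fused pipeline: R(2+1)D clip features -> LSTM -> FC."""

    def __init__(self, num_classes: int, hidden_size: int = 256):
        super().__init__()
        self.backbone = r2plus1d_18()        # spatiotemporal feature extractor (assumed backbone)
        self.backbone.fc = nn.Identity()     # keep 512-d clip features, drop its own classifier
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size,
                            batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):
        # clips: (batch, num_clips, 3, frames_per_clip, height, width)
        b, n, c, t, h, w = clips.shape
        feats = self.backbone(clips.reshape(b * n, c, t, h, w))   # (b*n, 512)
        feats = feats.reshape(b, n, -1)                           # sequence of clip features
        _, (h_n, _) = self.lstm(feats)                            # last hidden state summarizes the sequence
        return self.classifier(h_n[-1])                           # (batch, num_classes)

# Usage: a batch of 2 videos, each split into 4 clips of 8 frames at 112x112.
model = R2Plus1D_LSTM(num_classes=64)        # e.g. the 64 LSA64 classes
video = torch.randn(2, 4, 3, 8, 112, 112)
logits = model(video)                        # -> (2, 64)
```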
D. Results

In these experiments, we assessed the effectiveness of the network on both CSL and LSA64. To gauge the robustness of the results, we varied the batch sizes, obtaining the average test accuracy for each iteration, and fixed the number of epochs at 50. As presented in Tab. 1, different batch sizes have little impact on the results. On the CSL dataset, the testing accuracy peaked at 96.21% with 50 epochs and a batch size of 16; on the LSA64 dataset, the testing accuracy reached its maximum of 99.69% with 50 epochs and a batch size of 32. From the overall results, as the number of iterations increases, the loss gradually converges and the validation results tend to stabilize within a manageable range, indicating that the model has achieved stability in training.

In terms of verification accuracy, as illustrated in Fig. 9, our method demonstrated promising results. The verification accuracy leveled off after around 500 iterations, showcasing effective recognition outcomes, and in the subsequent verification process the recognition results remained stable.
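As a concrete reference for the protocol above, a minimal training and evaluation loop with the reported settings (50 epochs, batch size 16 or 32) might look as follows. The dataset objects, optimizer choice, and learning rate are assumptions, since they are not specified here.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train_and_evaluate(model, train_set, test_set,
                       epochs: int = 50, batch_size: int = 16,
                       lr: float = 1e-4,
                       device: str = "cuda" if torch.cuda.is_available() else "cpu"):
    """Assumed protocol: cross-entropy loss, Adam optimizer, per-epoch test accuracy."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=batch_size)

    for epoch in range(epochs):
        model.train()
        for clips, labels in train_loader:
            clips, labels = clips.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()

        # Measure test accuracy after each epoch, averaged over the test set.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for clips, labels in test_loader:
                clips, labels = clips.to(device), labels.to(device)
                preds = model(clips).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        print(f"epoch {epoch + 1}: test accuracy {correct / total:.4f}")
```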
To demonstrate the generalization of the proposed model, we validated it on the LSA64 sign language dataset and compared it with several existing LSA64 recognition models: 3DCNN_LSA64, Transformer, and CNN_LSTM. The experimental results are shown in Tab. 3; their recognition accuracies are 93.90%, 98.25%, and 95.20%, respectively. Our method reaches an accuracy of 99.69%, outperforming the other recognition models, which demonstrates the generalization and effectiveness of our model.

TABLE III. COMPARISON WITH DIFFERENT METHODS ON LSA64.

Method                 Modality    LSA64
3DCNN_LSA64 [16]       RGB         93.90%
Transformer [17]       RGB         98.25%
CNN_LSTM [18]          RGB         95.20%
R(2+1)D_LSTM (Ours)    RGB         99.69%

V. CONCLUSION AND FUTURE WORK

In this paper, we leverage the distinctive characteristics of video sequences to extract key features from sign language videos using both R(2+1)D networks and Long Short-Term Memory (LSTM). The R(2+1)D network excels in learning spatial and spatiotemporal features, enhancing its non-linear capability through the incorporation of rectified linear units (ReLU). Throughout the training process, the spatial and temporal decomposition aids optimization, leading to reductions in both training and testing losses. We apply the LSTM after obtaining the feature space from the R(2+1)D network; this step extracts long-term features while eliminating redundant information. The processed features are then classified through fully connected layers to derive the prediction results. Empirical analysis validates the effectiveness of our proposed framework. However, the proposed framework exhibits limitations, particularly in capturing subtle expressions of sign language actions. In future research, we suggest exploring multi-feature fusion frameworks to learn features from different segments and subsequently fuse them. Additionally, incorporating attention mechanisms into the model could further enhance the feature space, leading to improved recognition accuracy.

REFERENCES

[1] Jie Huang, Wengang Zhou, Houqiang Li, and Weiping Li. Attention-based 3D-CNNs for large-vocabulary sign language recognition. IEEE Transactions on Circuits and Systems for Video Technology, 29(9):2822–2832, 2019.
[2] Jie Huang, Wengang Zhou, Qilin Zhang, Houqiang Li, and Weiping Li. Video-based sign language recognition without temporal segmentation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI'18), AAAI Press, 2018.
[3] Oscar Koller, Sepehr Zargaran, and Hermann Ney. Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3416–3424, 2017.
[4] Ka Leong Cheng, Zhaoyang Yang, Qifeng Chen, and Yu-Wing Tai. Fully convolutional networks for continuous sign language recognition. In Computer Vision – ECCV 2020: 16th European Conference, Proceedings, Part XXIV, pages 697–714, Springer-Verlag, 2020.
[5] Oscar Koller, Sepehr Zargaran, Hermann Ney, and Richard Bowden. Deep sign: Enabling robust statistical continuous sign language recognition via hybrid CNN-HMMs. International Journal of Computer Vision, 126(12):1311–1325, 2018.
[6] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2015.
[7] Lionel Pigou, Sander Dieleman, Pieter-Jan Kindermans, and Benjamin Schrauwen. Sign language recognition using convolutional neural networks. In Computer Vision – ECCV 2014 Workshops, pages 572–578, Springer, 2015.
[8] Runpeng Cui, Hu Liu, and Changshui Zhang. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1610–1618, 2017.
[9] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6450–6459, 2018.
[10] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6546–6555, 2018.
[11] Yanqiu Liao, Pengwen Xiong, Weidong Min, Weiqiong Min, and Jiahao Lu. Dynamic sign language recognition based on video sequence with BLSTM-3D residual networks. IEEE Access, 7:38044–38054, 2019.
[12] Fei Wang, Yu Du, Guorui Wang, Zhen Zeng, and Lihong Zhao. (2+1)D-SLR: an efficient network for video sign language recognition. Neural Computing and Applications, 34:2413–2423, 2021.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[14] Franco Ronchetti, Facundo Quiroga, Cesar Estrebou, Laura Lanzarini, and Alejandro Rosete. LSA64: A dataset of Argentinian sign language. In Proceedings of the Congreso Argentino de Ciencias de la Computación (CACIC), 2016.
[15] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733, 2017.
[16] Geovane Neto, Geraldo Junior, João Almeida, and Anselmo Paiva. Sign language recognition based on 3D convolutional neural networks. Pages 399–407, 2018.
[17] Sarah Alyami, Hamzah Luqman, and Mohammad Hammoudeh. Isolated Arabic sign language recognition using a transformer-based model and landmark keypoints. ACM Transactions on Asian and Low-Resource Language Information Processing, 2023.
[18] Sarfaraz Masood, Adhyan Srivastava, Harish Chandra Thuwal, and Musheer Ahmad. Real-time sign language gesture (word) recognition from video sequences using CNN and RNN, 2018.
[19] https://blog.csdn.net/cungudafa/article/details/10