Video-based Sign Language Recognition with R(2+1)D and LSTM Networks


Jiayu Huang, Department of Computer Science, Faculty of Science, Chiang Mai University, Chiang Mai, Thailand ([email protected])
Jeerayut Chaijaruwanich, Department of Computer Science, Faculty of Science, Chiang Mai University, Chiang Mai, Thailand ([email protected])
Varin Chouvatut*, Department of Computer Science, Faculty of Science, Chiang Mai University, Chiang Mai, Thailand ([email protected])

2024 16th International Conference on Knowledge and Smart Technology (KST) | 979-8-3503-7073-7/24/$31.00 ©2024 IEEE | DOI: 10.1109/KST61284.2024.10499646

Abstract—Sign language, serving as a crucial means of communication for individuals with hearing impairments, plays a pivotal role in achieving barrier-free communication with the broader population. Despite advancements in utilizing deep learning for automatic recognition of dynamic sign language, challenges persist, especially in the context of video-based sign language recognition. This requires careful consideration of spatial and temporal features, along with the inherent long-term characteristics of the sequence. In this paper, we introduce a novel sign language recognition framework that integrates the R(2+1)D and LSTM networks. Initially, the R(2+1)D network is employed to extract spatial and temporal features from sign language sequences. Subsequently, these features are input into the LSTM network to capture long-term features and eliminate redundant information. The recognition of the video sequence is accomplished through a fully connected layer. We evaluate the proposed network on two distinct sign language datasets, achieving 96.21% on CSL and 99.69% on LSA64. Our method demonstrates significant improvements over existing methods.

Keywords—Sign Language Recognition, R(2+1)D convolution, LSTM, Deep Learning

I. INTRODUCTION

Language serves as the cornerstone of social interactions, yet hearing-impaired individuals face global challenges in seamlessly integrating into social contexts due to communication barriers [1]. Sign language, a natural mode of communication extensively employed by people with hearing and language impairments, relies on visual elements like gestures, body movements, and facial expressions. Despite its significance, there is no universally recognized sign language, and even within the same country, diverse regions may employ distinct sign language dialects, creating substantial hurdles for users in their daily interactions [2]. Sign language, being relatively abstract, requires systematic training for comprehension. Fig. 1 shows the manifestation of sign language words. To address this, visual-based sign language recognition technology has emerged. This technology can automatically identify expressed sign language words from video streams, employing non-invasive technical methods. By doing so, it offers a convenient auxiliary tool for sign language users, facilitating smoother communication.

Figure 1. The diagram shows the manifestation of a single dynamic sign language and the rules it forms. Numbers represent the display order, sign language words are composed of postures, and dashed lines with arrows represent the rules of gesture movement. Left: (a) a gestural sign language word; right: (b) a sign language word based on a composite posture combination. The meanings are "New Year's Day" and "Valentine's Day" [19].

Figure 2. The image shows the keyframes of sign language poses for each video; the sign language meaning is "Partial".

With the ascent of deep learning, data-driven approaches to isolated sign language word recognition have demonstrated commendable performance on widely-used datasets [3]. Due to constraints in computing resources and data volume, prevailing continuous sign language recognition methods typically leverage two-dimensional convolutional neural networks (CNNs) to extract visual features frame by frame. Subsequently, CNNs are employed to extract temporal features for recognition [4]. This sequential extraction of visual and temporal features not only mitigates the demand for computing resources but also diminishes overfitting issues, resulting in improved generalization performance.
Fig. 2 shows the keyframes contained in sign language words, and we use deep learning networks to extract features from these keyframes. Koller et al. [5] achieved notable recognition rates on the PHOENIX-2014 dataset by integrating CNNs with Hidden Markov Models (HMMs) for continuous sign language sentences. Addressing the temporal dynamics of sign language videos, Tran et al. extended traditional 2D convolution to 3D convolution, capturing temporal features between video frames [6]. Pigou et al. [7] utilized a CNN structure to capture hand features of the human body, successfully constructing an Italian sign language recognition system with an accuracy of 91.7% on Italian sign language datasets. Cui et al. [8] adopted Connectionist Temporal Classification (CTC) to label time segments, combining CNNs and Recurrent Neural Networks (RNNs) to enhance the recognition rate of sign language videos and extracting advanced features from the temporal information embedded in video sequences.

While the aforementioned methods prove effective in sign language recognition, the current feature extraction approach still grapples with several limitations. One notable issue lies in the substantial redundancy within video data, where frame-by-frame visual feature extraction introduces a plethora of redundant calculations. Moreover, the reliance on highly abstract visual features for temporal feature extraction raises concerns about insufficient modeling of temporal motion. Given that sign language is inherently a visual language rich in fine-grained hand movements, a pertinent research challenge remains: how to efficiently model dynamic information in sign language through actions and facial expressions.

To analyze hierarchical information within sign language and enhance the efficiency of sign language recognition algorithms, this paper introduces an approach that integrates R(2+1)D [9] and Long Short-Term Memory (LSTM) networks. Our proposed method aims to extract high-quality sign language features through a multi-step process. Initially, the R(2+1)D network is employed to extract spatial and temporal features from sign language video sequences. Subsequently, the LSTM network is leveraged to capture long-term sequence features, allowing for the learning of key information embedded in temporal sequences. Ultimately, these processed sign language sequences are input for classification, resulting in effective sign language recognition.

The notable contributions of this paper can be succinctly summarized as follows:

• A new end-to-end fusion R(2+1)D and LSTM network is proposed for video-based sign language sequence recognition, which can extract key information from spatiotemporal features, eliminate redundant information, and distinguish small differences between sign languages to achieve better classification performance.

• Compared with current methods, the proposed fusion network is more generalized and can be trained not only on larger sign language datasets, but also on other sign language databases.

• Through network training, our method has validated the effectiveness of the proposed network and achieved efficient video-based sign language recognition.

The organization of this paper is as follows. Section 2 presents a description of the related work. Section 3 describes our method, and Section 4 shows the results. Section 5 draws some conclusions.

II. RELATED WORK

A. 3D Residual Convolutional Neural Network

Deep convolutional neural networks have many parameters, so using large-scale datasets is very important. 3D convolutional networks are comparatively shallow, and when a video dataset is relatively small, they easily overfit. Hara et al. [10] proposed 3D ResNet (R3D) to train deeper networks on larger video datasets; their experiments show that accuracy gradually improves as network depth increases. Liao et al. [11] presented a multimodal dynamic sign language recognition method based on a deep 3-dimensional residual network and bi-directional LSTM networks, named the BLSTM-3D residual network (B3D ResNet).

B. R(2+1)D Convolution Network

The extension of two-dimensional spatial convolution to three-dimensional spatiotemporal convolution characterizes 3D convolution. Its operation encompasses pixels at corresponding positions in a single frame of the video, as well as in its preceding and following frames, effectively preserving both temporal and spatial information. While R3D has demonstrated favorable outcomes, optimizing and training it pose challenges due to the parameter and computational complexities associated with 3D convolution. In response to these challenges, Tran et al. [9] introduced the R(2+1)D convolution network, which decomposes a 3D convolutional kernel into two, allowing for the addition of an extra nonlinear layer (activation function) to enhance the model's expressive capacity. Interestingly, experiments following the decomposition revealed that, even when the parameters of the decomposed network and the 3D network are identical, the decomposed network is more amenable to optimization. In a related development, Wang et al. [12] proposed a (2+1)D_SLR network based on R(2+1)D convolution. Distinguishing itself from other methods, this network achieves higher accuracy with faster processing speeds, showcasing its efficiency in sign language recognition.
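To make the factorization concrete, the sketch below shows one way a (2+1)D block can be written in PyTorch: the t × d × d 3D kernel is replaced by a 1 × d × d spatial convolution, a ReLU non-linearity, and a t × 1 × 1 temporal convolution. The intermediate width follows the parameter-matching rule from Tran et al. [9]; the class name and layer choices (batch normalization, bias-free convolutions) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SpatioTemporalConv(nn.Module):
    """(2+1)D block: factorizes a t x d x d 3D convolution into a
    1 x d x d spatial convolution followed by a t x 1 x 1 temporal
    convolution, with a ReLU in between (the extra non-linearity)."""
    def __init__(self, in_ch, out_ch, k_t=3, k_s=3, stride=(1, 1, 1)):
        super().__init__()
        # Intermediate width chosen so the parameter count roughly
        # matches the full 3D kernel (rule from Tran et al. [9]).
        mid = (k_t * k_s * k_s * in_ch * out_ch) // (
            k_s * k_s * in_ch + k_t * out_ch)
        self.spatial = nn.Conv3d(in_ch, mid,
                                 kernel_size=(1, k_s, k_s),
                                 stride=(1, stride[1], stride[2]),
                                 padding=(0, k_s // 2, k_s // 2),
                                 bias=False)
        self.bn = nn.BatchNorm3d(mid)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid, out_ch,
                                  kernel_size=(k_t, 1, 1),
                                  stride=(stride[0], 1, 1),
                                  padding=(k_t // 2, 0, 0),
                                  bias=False)

    def forward(self, x):          # x: (N, C, T, H, W)
        return self.temporal(self.relu(self.bn(self.spatial(x))))

# Example: a batch of two 16-frame RGB clips at 112 x 112
clip = torch.randn(2, 3, 16, 112, 112)
out = SpatioTemporalConv(3, 64)(clip)   # -> (2, 64, 16, 112, 112)
```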

C. LSTM Network

The Long Short-Term Memory (LSTM) network, introduced by Hochreiter and Schmidhuber [13], is a specialized Recurrent Neural Network (RNN) model distinguished by its unique structural design, adept at mitigating long-term dependency issues. Unlike conventional RNNs, LSTM inherently possesses the ability to retain early information, incurring no additional computational cost for this default behavior. LSTM achieves this by employing four distinct neural network layers that interact in a specialized manner. These layers are strategically designed to facilitate the learning of feature information in sequences. The incorporation of forget gates, memory gates, and output gates allows LSTM to selectively control the retention and transmission of sequence information. The cell state, serving as a repository of information, is utilized to store and transmit relevant data to subsequent LSTM units. This intricate process culminates in the reflection of the learned information into the cell state and output, enabling effective handling of sequential data with long-term dependencies.

III. PROPOSED METHOD

A. Overview

Figure 3. The framework flow of our proposed method: the video sequence is divided into 16-frame video clips, the R(2+1)D network is used to obtain the feature space, the feature space is input into the LSTM for learning, and the recognition result is finally obtained through a classifier.

At present, sign language recognition faces challenges in effectively extracting temporal features from video sequences, especially considering the inherently long-term nature of these actions. To solve this issue, we propose a dynamic sign language recognition method based on a fusion network comprising R(2+1)D and LSTM components, as illustrated in Fig. 3. This method consists of two main components. The first part is the feature extraction module for video sequences, which conducts spatiotemporal feature extraction by taking segmented video frames as input. Through training on the video sequence, this module enhances the learning of video spatial features. The second part is the long-term recognition module, which analyzes the long-term temporal dynamics. It predicts sign language video labels by leveraging the feature sequence obtained in the first part. Ultimately, by classifying the temporal sequence, the method produces output results, achieving effective video-based sign language recognition.

B. Method

A sign language dataset with N labeled training examples is denoted by $S_i = \{X_i, Y_i\}$, where $X_i \in \mathbb{R}^{C \times T \times H \times W}$; C is the number of channels of a frame, T is the number of frames, H and W are the height and width of the frame respectively, and $Y_i \in \mathbb{R}^{M}$ is a label over the M classes. Let X denote an input fragment of size $C \times T \times H \times W$. The first stage is feature extraction from the sign language video sequences. For video sequence frame recognition, we use the R(2+1)D network to extract spatial and temporal features of the video. By constructing (2+1)D convolutions, the spatial and temporal feature maps of each video segment are connected to capture the trajectory information of sign language. The design principle of (2+1)D decomposes 3D convolution kernels into 2D spatial convolution kernels and 1D temporal convolution kernels, which are connected through ReLU activation functions to form a convolutional network.

Figure 4. The R(2+1)D convolution network and the structure of the R(2+1)D convolution block: the 3D convolution is resolved into a 2D spatial convolution and a 1D temporal convolution, respectively, used to extract the spatial and temporal features in the video.

A (2+1)D neural network can extract spatial and temporal features of sign language videos from frame cubes using a (2+1)D convolutional kernel and form a feature space sequence. (2+1)D neural networks extract spatiotemporal information well, but as the network deepens, phenomena such as gradient explosion and vanishing may occur. Inspired by ResNet [12], we utilize additive residual functions to learn deeper spatiotemporal features through shortcut connections, thereby mitigating the aforementioned risks. Representing the expected underlying mapping as $h(x)$, we make the stacked nonlinear layers fit another mapping $f(x) = h(x) - x$. The original mapping is then recovered as in (1):

$h(x) = f(x, w) + x$    (1)

where x is the output value of a neuron from the previous layer, w is the weight of the neuron, and $h(x)$ is the output value from the activation function within the neuron. We develop the 3D residual units into a (2+1)D architecture for encoding spatiotemporal videos, with the basic residual units based on the (2+1)D convolution principle described above, as shown in Fig. 4. The (2+1)D convolution is applied using kernels with a spatial dimension of $1 \times H \times W$ and a temporal dimension of $T \times 1 \times 1$. For sign language recognition in videos, this model can automatically extract spatiotemporal features by applying residual connections to (2+1)D neural networks over the input video sequence (hence R(2+1)D). Along with these spatial features, it is expected to capture multiple consecutive frames. In addition, the R(2+1)D model is easy to train on larger video sequence datasets.
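As an illustration of the overall framework in Fig. 3 (R(2+1)D feature extraction followed by an LSTM and a fully connected classifier), a minimal PyTorch sketch is given below. It assumes torchvision's r2plus1d_18 as a stand-in backbone, 512-dimensional clip features and 256 hidden units (the sizes reported in Section IV-C), and classification from the last LSTM output; it is not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

class R2Plus1D_LSTM(nn.Module):
    """Sketch of the Fig. 3 pipeline: R(2+1)D clip features -> LSTM -> FC."""
    def __init__(self, num_classes, hidden=256, feat_dim=512):
        super().__init__()
        backbone = r2plus1d_18()            # optionally load Kinetics-400 weights
        backbone.fc = nn.Identity()         # keep 512-d clip embeddings
        self.backbone = backbone
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clips):               # clips: (N, n_clips, 3, 16, H, W)
        n, s = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))   # (N*s, 512)
        feats = feats.view(n, s, -1)                 # (N, s, 512)
        out, _ = self.lstm(feats)                    # (N, s, hidden)
        return self.classifier(out[:, -1])           # label from last time step

video = torch.randn(2, 4, 3, 16, 112, 112)   # 2 videos, 4 clips of 16 frames each
logits = R2Plus1D_LSTM(num_classes=100)(video)   # -> (2, 100)
```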
The method described above acquires short-term spatiotemporal features from a specific length of the video sequence. The entire sign language video contains multiple spatiotemporal features, and extracting long-term spatiotemporal features can improve recognition accuracy. We use LSTM units to obtain future and past information sequences through different units, as shown in Fig. 5. In theory, the memory cells store contextual information from the time series, while the forget cells can selectively retain important information from the time series and output updated cell states. The unique structure of LSTM can capture remote features of video sequences, so predicting video-based sign language sequences can achieve the desired prediction results. The equations for an LSTM cell are as follows:

Figure 5. The diagram shows the structure of the LSTM, where C represents the cell state, used to store the current state information and transmit it to the next LSTM unit; h represents the output information of the previous time step, passed through the output gate; $f_t$ represents the forget gate, $i_t$ represents the memory (input) gate, and $c_t$ represents the information that needs to be updated.

$f_t = \sigma(W_{xf} x_t + W_{hf} H_{t-1}^{k} + b_f)$    (2)

$i_t = \sigma(W_{xi} x_t + W_{hi} H_{t-1}^{k} + b_i)$    (3)

$g_t = \tanh(W_{xg} x_t + W_{hg} H_{t-1}^{k} + b_g)$    (4)

$o_t = \sigma(W_{xo} x_t + W_{ho} H_{t-1}^{k} + b_o)$    (5)

$c_t = f_t \odot c_{t-1} + i_t \odot g_t$    (6)

$h_t = o_t \odot \tanh(c_t)$    (7)

where $\sigma$ is the sigmoid function and $\tanh$ is the hyperbolic tangent function. The forget gate $f_t$ selects appropriate information for storage and discards unimportant information. The input gate $i_t$ determines when to update information. The tanh layer $g_t$ generates a set of candidate values, which are added to the storage unit if the input gate allows. According to (6), the storage unit $c_t$ is updated based on the output of the forget gate $f_t$, the input gate $i_t$, and the new candidate value $g_t$. In (7), the output gate $o_t$ controls the state and memory information of the hidden state. Finally, the hidden state is represented as the product of the storage unit state and the function of the output gate.
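Written out directly, Eqs. (2)-(7) correspond to a single LSTM step such as the plain PyTorch sketch below (equivalent in spirit to torch.nn.LSTMCell); the stacked-parameter layout is an implementation convenience, not part of the formulation above.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM step following Eqs. (2)-(7).

    x_t:    (N, D)  input at time t
    h_prev: (N, H)  previous hidden state
    c_prev: (N, H)  previous cell state
    W_x: (D, 4H), W_h: (H, 4H), b: (4H,)  stacked gate parameters
    """
    gates = x_t @ W_x + h_prev @ W_h + b
    f, i, g, o = gates.chunk(4, dim=1)
    f = torch.sigmoid(f)          # forget gate, Eq. (2)
    i = torch.sigmoid(i)          # input (memory) gate, Eq. (3)
    g = torch.tanh(g)             # candidate values, Eq. (4)
    o = torch.sigmoid(o)          # output gate, Eq. (5)
    c_t = f * c_prev + i * g      # cell state update, Eq. (6)
    h_t = o * torch.tanh(c_t)     # hidden state, Eq. (7)
    return h_t, c_t
```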
IV. EXPERIMENTS AND RESULTS

We validated the effectiveness of the proposed framework on two sign language datasets. Our experimental environment uses Ubuntu 18.04 with the deep learning framework PyTorch, and the hardware is an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz, 16 GB RAM, and 4 NVIDIA GeForce GTX 1080 Ti GPUs. The programming language used is Python 3.

A. Datasets

1) Chinese Sign Language Dataset: In the Chinese Sign Language Recognition Dataset (CSL), we used the RGB data, where each word was recorded 5 times by 50 sign language presenters. Each video sample is annotated by a professional Chinese sign language teacher. In this article, we use 100 categories of sign language words as the data source. We divided the dataset into training and testing sets with 80% and 20% of the samples, respectively. To demonstrate independence across the sign language data, the performers in the training set and the test set are different. Fig. 2 shows samples from the Chinese sign language dataset (https://ustc-slr.github.io/datasets/).

2) Argentinian Sign Language Dataset: LSA64 [14] contains 64 daily sign language words, each repeated 5 times by 10 signers, so the entire dataset contains 3200 videos. In order to better demonstrate the trajectory characteristics of sign language, each performer wears a pink glove on the right hand and a fluorescent glove on the left hand. In order to verify the generalization ability of the algorithm for non-specific individuals and compare it with other methods, this paper selected samples from 8 signers as the training set and the remaining 2 signers as the test set. Fig. 6 shows samples from LSA64 (http://facundoq.github.io/datasets/lsa64/).

Figure 6. The keyframes of six different sign languages in the LSA64 database. The movement trajectories of the gestures overlap in position and shape, and the frame on the left comes from the first recorded state.

B. Metrics

A classification result falls into one of four categories: true positive (TP), true negative (TN), false positive (FP), and false negative (FN), from which the accuracy, recall, and precision can be calculated. Our evaluation is mainly based on accuracy, recall, and precision. Accuracy is one of the most common classification evaluation metrics; it measures the classification accuracy of a classifier and refers to the proportion of correctly classified samples out of the total samples:

$ACC = \frac{TP + TN}{TP + FP + TN + FN}$    (8)
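Equation (8) is simply the fraction of correctly classified samples; for a batch of model outputs it can be computed as in the short sketch below (an illustrative helper, not tied to any particular evaluation script).

```python
import torch

def accuracy(logits, labels):
    """ACC = (TP + TN) / (TP + FP + TN + FN): correct predictions over all samples."""
    preds = logits.argmax(dim=1)
    return (preds == labels).float().mean().item()
```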
C. Network Training

During model training, the input consists of video data, specifically 16-frame video segments. Before inputting into the model, we randomly crop the frames to 128 × 128 and apply batch normalization. This step aims to reduce deviations caused by displacements within the frames and to expedite neural network training. Regarding model settings, the batch size is set to 16, the learning rate is 0.1, the weight decay parameter is 0.0005, and an adaptive learning rate adjustment method is employed for automatic adjustment. A label-smoothing cross-entropy loss function is utilized, and the number of epochs is set to 50. In the LSTM model, the input size is set to 512 and the hidden size is set to 256. Additionally, to further enhance accuracy, the R(2+1)D model is pretrained on the Kinetics-400 dataset [15].
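Putting the reported settings together (16-frame clips cropped to 128 × 128, batch size 16, learning rate 0.1, weight decay 0.0005, label-smoothing cross-entropy, 50 epochs, Kinetics-400 pretraining), a training loop in the spirit of Section IV-C might be configured as sketched below. The optimizer (SGD with momentum), the scheduler used as the "adaptive learning rate adjustment", the smoothing factor, and the data-loading code are assumptions where the text does not name exact components; the model class reuses the sketch from Section III.

```python
import torch
import torch.nn as nn

# Reuses the R2Plus1D_LSTM sketch from Section III; train_loader and
# validate() are user-supplied (data loading is dataset-specific).
model = R2Plus1D_LSTM(num_classes=100).cuda()
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # smoothing factor assumed
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)      # SGD/momentum assumed
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)  # adaptive LR (assumed)

for epoch in range(50):                        # 50 epochs, batch size 16 in the loader
    model.train()
    for clips, labels in train_loader:         # clips: (16, n_clips, 3, 16, 128, 128)
        clips, labels = clips.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()
    scheduler.step(validate(model))            # e.g. validation loss
```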
D. Results

In these experiments, we assessed the effectiveness of the network on both CSL and LSA64. To gauge the robustness of the results, we varied the batch sizes and obtained the average test accuracy for each setting. Additionally, we fixed the number of epochs at 50. As presented in Table I, different batch sizes have little impact on the results. On the CSL dataset, the testing accuracy peaked at 96.21% with 50 epochs and a batch size of 16. Conversely, on the LSA64 dataset, the testing accuracy reached its maximum of 99.69% with 50 epochs and a batch size of 32. Overall, as training progresses, the loss gradually converges and the validation results tend to stabilize.

TABLE I. THE ACCURACY OF VALIDATION ON CSL AND LSA64.

Epochs | Batch Size | Acc_CSL | Acc_LSA64
10     | 16         | 93.75%  | 95.62%
10     | 32         | 92.61%  | 94.53%
20     | 16         | 93.93%  | 98.59%
20     | 32         | 93.67%  | 99.38%
50     | 16         | 96.21%  | 99.53%
50     | 32         | 95.53%  | 99.69%

Fig. 7 depicts the training accuracy on the Chinese Sign Language dataset. With an increasing number of iterations, the recognition accuracy demonstrates a gradual upward trend. Notably, the accuracy stabilizes after approximately 1000 iterations, indicating convergence of the learning process. In the subsequent iterations, the accuracy fluctuations remain within a manageable range, indicating that the model has achieved stability in training.

Figure 7. Training accuracy: as the number of iterations increases, the accuracy of the trained network gradually increases and finally stabilizes.

Fig. 8 illustrates the training loss on CSL. Employing a label-smoothing cross-entropy loss function and introducing a regularization strategy to mitigate the impact of the real sample labels on the loss calculation (thus addressing overfitting concerns), the loss decreases progressively with an increasing number of iterations. The graph indicates that the loss stabilizes around 1000 iterations. In the subsequent iterations, the fluctuation amplitude of the loss remains within a manageable range, indicating that the model has achieved stability in training.

Figure 8. Training loss: as the number of iterations increases, the training loss gradually decreases.

In terms of validation accuracy, as illustrated in Fig. 9, our method demonstrated promising results. The validation accuracy stabilized around 500 iterations, showcasing effective recognition outcomes, and in the subsequent validation process the recognition results tended to stabilize.

Figure 9. Validation accuracy: as the number of iterations increases, the validation accuracy gradually increases and finally stabilizes.

The validation process, as depicted in Fig. 10, shows that the loss stabilizes around 500 iterations. The loss function quantifies the disparity between the predicted values and the true values of the model. The gradual stabilization of the loss indicates convergence of the model, suggesting that the predictions align more closely with the ground-truth values as training progresses.

Figure 10. Validation loss: as the number of iterations increases, the validation loss gradually decreases.

TABLE II. COMPARISON WITH DIFFERENT METHODS ON CSL.

Method               | Modality            | CSL
3DCNN [1]            | RGB+depth+skeleton  | 88.70%
Resnet_18 [10]       | RGB                 | 90.04%
B3D ResNet [11]      | RGB                 | 86.90%
R(2+1)D_LSTM (Ours)  | RGB                 | 96.21%

We utilized recognition accuracy as the performance metric for the network model. Compared with several existing sign language sequence recognition models on the Chinese sign language dataset, including 3DCNN, Resnet_18, and B3D ResNet, the results (shown in Table II) indicated recognition accuracies of 88.7%, 90.04%, and 86.9%, respectively. In contrast, our proposed method achieved a remarkable accuracy of 96.21%. Our test results are significantly higher than those of the other sign language recognition models. The experiment proves that the proposed model can effectively perform the task of sign language recognition.
To demonstrate the generalization of the proposed model, we validated it on the LSA64 sign language dataset and compared it with several existing LSA64 recognition models, namely 3DCNN_LSA64, Transformer, and CNN_LSTM. The experimental results are shown in Table III: their recognition accuracies are 93.90%, 98.25%, and 95.20%, respectively. Our method achieves an accuracy of 99.69%, and our test results are better than those of the other recognition models. This proves the generalization ability and effectiveness of our model.

TABLE III. COMPARISON WITH DIFFERENT METHODS ON LSA64.

Method               | Modality | LSA64
3DCNN_LSA64 [16]     | RGB      | 93.90%
Transformer [17]     | RGB      | 98.25%
CNN_LSTM [18]        | RGB      | 95.20%
R(2+1)D_LSTM (Ours)  | RGB      | 99.69%

V. CONCLUSION AND FUTURE WORK

In this paper, we leverage the distinctive characteristics of video sequences to extract key features from sign language videos using both R(2+1)D networks and Long Short-Term Memory (LSTM). The R(2+1)D network excels in learning spatial and spatiotemporal features, enhancing its non-linear capability through the incorporation of rectified linear units. Throughout the training process, the spatial and temporal decomposition aids optimization, leading to reductions in both training and testing losses. We utilize LSTM after obtaining the feature space from the R(2+1)D network; this step aims to extract long-term features while eliminating redundant information. The processed features, after undergoing LSTM analysis, are then classified through fully connected layers to derive prediction results. Empirical analysis validates the effectiveness of our proposed framework. However, the proposed framework exhibits limitations, particularly in capturing subtle expressions of sign language actions. In future research, we suggest exploring multi-feature fusion frameworks to learn features from different segments and subsequently fuse them. Additionally, incorporating attention mechanisms into the model could further enhance the feature space, leading to improved recognition accuracy.

REFERENCES

[1] Jie Huang, Wengang Zhou, Houqiang Li, and Weiping Li. Attention-based 3D-CNNs for large-vocabulary sign language recognition. IEEE Transactions on Circuits and Systems for Video Technology, 29(9):2822–2832, 2019.
[2] Jie Huang, Wengang Zhou, Qilin Zhang, Houqiang Li, and Weiping Li. Video-based sign language recognition without temporal segmentation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI'18/IAAI'18/EAAI'18). AAAI Press, 2018.
[3] Oscar Koller, Sepehr Zargaran, and Hermann Ney. Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3416–3424, 2017.
[4] Ka Leong Cheng, Zhaoyang Yang, Qifeng Chen, and Yu-Wing Tai. Fully convolutional networks for continuous sign language recognition. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV, pages 697–714. Springer-Verlag, Berlin, Heidelberg, 2020.
[5] Oscar Koller, Sepehr Zargaran, Hermann Ney, and Richard Bowden. Deep Sign: Enabling robust statistical continuous sign language recognition via hybrid CNN-HMMs. International Journal of Computer Vision, 126(12):1311–1325, December 2018.
[6] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2015.
[7] Lionel Pigou, Sander Dieleman, Pieter-Jan Kindermans, and Benjamin Schrauwen. Sign language recognition using convolutional neural networks. In Lourdes Agapito, Michael M. Bronstein, and Carsten Rother, editors, Computer Vision – ECCV 2014 Workshops, pages 572–578. Springer International Publishing, Cham, 2015.
[8] Runpeng Cui, Hu Liu, and Changshui Zhang. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1610–1618, 2017.
[9] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
[10] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018.
[11] Yanqiu Liao, Pengwen Xiong, Weidong Min, Weiqiong Min, and Jiahao Lu. Dynamic sign language recognition based on video sequence with BLSTM-3D residual networks. IEEE Access, 7:38044–38054, 2019.
[12] Fei Wang, Yu Du, Guorui Wang, Zhen Zeng, and Lihong Zhao. (2+1)D-SLR: an efficient network for video sign language recognition. Neural Computing and Applications, 34:2413–2423, 2021.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[14] Franco Ronchetti, Facundo Quiroga, Cesar Estrebou, Laura Lanzarini, and Alejandro Rosete. LSA64: A dataset of Argentinian sign language. In Proceedings of the Congreso Argentino de Ciencias de la Computación (CACIC), 2016.
[15] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733, 2017.
[16] Geovane Neto, Geraldo Junior, João Almeida, and Anselmo Paiva. Sign language recognition based on 3D convolutional neural networks, pages 399–407, June 2018.
[17] Sarah Alyami, Hamzah Luqman, and Mohammad Hammoudeh. Isolated Arabic sign language recognition using a transformer-based model and landmark keypoints. ACM Transactions on Asian and Low-Resource Language Information Processing, February 2023. Just Accepted.
[18] Sarfaraz Masood, Adhyan Srivastava, Harish Chandra Thuwal, and Musheer Ahmad. Real-time sign language gesture (word) recognition from video sequences using CNN and RNN. 2018.
[19] https://blog.csdn.net/cungudafa/article/details/10