A Survey of Sign Language Recognition
Article in INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT · October 2023
DOI: 10.55041/IJSREM26316
Abstract - Sign Language is mainly used by deaf (hard of hearing) and mute people to exchange information within their own community and with other people. It is a language in which people use hand gestures to communicate, as they cannot speak or hear. The goal of sign language recognition (SLR) is to identify acquired hand motions and to continue until the related hand gestures are translated into text and speech. Static and dynamic hand gestures for sign language can be distinguished here; the human community values both types of recognition, even though static hand gesture recognition is easier than dynamic hand gesture recognition. By creating Deep Neural Network designs (Convolutional Neural Network designs), in which the model learns to detect hand-motion images over the training epochs, we use deep learning and computer vision to recognize the hand gestures. After the model successfully recognizes a motion, an English text file is created that can subsequently be translated to speech. The user can choose from a variety of translations for this text. The application works entirely offline, without an internet connection. With this model's improved efficiency, communication will be easier for deaf (hard of hearing) and disabled people. We discuss the use of deep learning for sign language recognition in this paper.

Key Words: sign language, convolutional neural network, computer vision.

1. INTRODUCTION

The application of sign language to multilingual text and voice output is an innovative nexus of technology, linguistic accessibility, and inclusivity. For Deaf and hard of hearing people, sign language is a crucial means of communication that opens up the outside world to them. However, this language has frequently encountered difficulties when dealing with spoken and written languages, posing obstacles in daily life, in education, and in the workplace.

Innovative solutions have surfaced in response to these issues, utilizing technology to close the communication gap. A diverse, multilingual world will benefit from these applications' increased accessibility, comprehension, and inclusivity of sign language. These applications are redefining how sign language is incorporated into our global culture by utilizing cutting-edge advancements in natural language processing, computer vision, and machine learning.

Various techniques and methods for sign language recognition have been developed by different researchers. One example is the use of Recurrent Neural Networks (RNNs), which are commonly used for sign language recognition systems that rely on sequential data [1]. One of the most common types of RNN used for sign language recognition is the Long Short-Term Memory (LSTM) network, which was created to solve the vanishing gradient problem of traditional RNNs, where the gradient becomes too small to be useful during backpropagation, resulting in poor training and performance [1]. The study in [1], which used an LSTM model to recognize Indian sign language, reports high accuracy.

A vision-based system uses a camera to sense the information conveyed by finger motions; this is the most commonly used visual-based method, and tremendous effort has gone into the development of vision-based sign recognition systems worldwide [8].

In recent years there has been increasing interest in deep learning applied to various fields, and it has contributed to technological improvement [10]. Numerous recent studies in sign language recognition use deep learning to classify images or videos. We chose sign language recognition because it combines the characteristics of motion recognition with those of time-series language translation. Deep learning models that classify images have low complexity compared to models that classify videos [10].

A multilayer perceptron (MLP) is a deep artificial neural network in which the first layer, the input layer, receives the signal and the last layer, the output layer, predicts the class of the input. Between these two layers there is an arbitrary number of hidden layers that form the true computational engine of the MLP [7].
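As a concrete illustration of this architecture, the short Keras sketch below builds such an MLP classifier; the input size (a flattened vector of hand-landmark coordinates) and the number of sign classes are assumptions made for the example and are not taken from the surveyed papers.

```python
# Minimal MLP sketch (illustrative only): input layer, hidden layers, output layer.
from tensorflow import keras
from tensorflow.keras import layers

NUM_FEATURES = 42   # assumed: 21 hand landmarks x 2 coordinates
NUM_CLASSES = 15    # assumed: number of sign words to recognize

mlp = keras.Sequential([
    layers.Input(shape=(NUM_FEATURES,)),              # input layer receives the signal
    layers.Dense(128, activation="relu"),             # hidden layers: the "computational engine"
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),  # output layer predicts the class
])
mlp.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
mlp.summary()
```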
Researchers have used several image-capturing tools to classify images. These include a camera or webcam, a data glove, a Kinect, and Leap Motion controllers. In contrast to data-glove-based systems, a camera or webcam is the instrument most researchers employ, since it offers better and more natural interaction with no need for extra equipment. Data gloves have proven more accurate for data collection, despite being relatively expensive and cumbersome [4].

The system described in [9] has three main modules: a feature extraction module, a processing module, and a classification module. The feature extraction module uses MMDetection to detect hand or body bounding boxes, depending on the dataset's characteristics: for full-body images, body bounding boxes are extracted, whereas for the hands-only dataset, hand bounding boxes are extracted. The detected bounding boxes are then forwarded to HRNet, a CNN-based model, to determine the key points, which are normalized in the processing module. In addition, for the whole-body dataset, the hand bounding boxes are estimated in the processing module from the farthest key point to the left, the farthest key point to the right, the farthest key point at the top, and the farthest key point at the bottom. The hand gestures are identified in the classification module, which takes key points and hand bounding boxes as inputs [9].
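A rough sketch of this three-module flow is given below. The callables detect_boxes, estimate_keypoints, and classify_gesture are hypothetical stand-ins for the MMDetection detector, the HRNet keypoint model, and the classifier described in [9]; only the key-point normalization and hand-box derivation steps are spelled out.

```python
import numpy as np

def normalize_keypoints(keypoints: np.ndarray, box: np.ndarray) -> np.ndarray:
    """Processing module: scale key points into the [0, 1] range of their bounding box."""
    x0, y0, x1, y1 = box
    scale = np.array([x1 - x0, y1 - y0])
    return (keypoints - np.array([x0, y0])) / np.maximum(scale, 1e-6)

def hand_box_from_keypoints(keypoints: np.ndarray) -> np.ndarray:
    """Whole-body case: hand box from the farthest key points left/right/top/bottom."""
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    return np.array([x_min, y_min, x_max, y_max])

def recognize(image, detect_boxes, estimate_keypoints, classify_gesture, full_body=True):
    """End-to-end flow: detection -> key points -> normalization -> classification."""
    boxes = detect_boxes(image)                    # feature extraction module (e.g. MMDetection)
    results = []
    for box in boxes:
        kpts = estimate_keypoints(image, box)      # e.g. HRNet key-point model
        if full_body:
            box = hand_box_from_keypoints(kpts)    # derive hand box from extreme key points
        results.append(classify_gesture(normalize_keypoints(kpts, box), box))
    return results
```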
A. MediaPipe

Google has created an open-source framework called MediaPipe that allows developers to build machine learning and computer vision pipelines for multimedia applications. It provides pre-built components and tools for processing, analyzing, and visualizing multimedia data. The framework's modular architecture enables the proponents to create pipelines for recognizing static and dynamic Filipino Sign Language gestures [1].

The first step is to generate a dataset, since no datasets for these words are publicly available. Signs are captured from a webcam and a dataset is generated. The commonly used words in which only one hand represents a particular word are 'okay', 'yes', 'peace', 'thumbs up', 'call me', 'stop', 'live long', 'fist', 'smile', 'thumbs down', and 'rock'; the words that use both hands are 'alright', 'hello', 'good', and 'no'. After 2,536 images are captured with a stable camera, they are converted into frames. The dataset images are divided into 75% for training and 25% for testing. The second step is to pass the video frames to the MediaPipe framework. Google's MediaPipe Hands is a solution for accurate hand and finger tracking. It uses machine learning (ML) to deduce 21 3D hand landmarks from a single frame. Various existing state-of-the-art approaches rely on desktop environments for inference, whereas this approach achieves real-time performance even on a mobile phone and scales to multiple hands [2].
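A minimal example of extracting these 21 landmarks with the MediaPipe Hands Python API is shown below; the image file name and the confidence threshold are assumptions chosen for illustration.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

# Extract 21 3-D hand landmarks per detected hand from a single frame.
with mp_hands.Hands(static_image_mode=True, max_num_hands=2,
                    min_detection_confidence=0.5) as hands:
    image = cv2.imread("sign_frame.jpg")   # assumed file name for the example
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for hand in results.multi_hand_landmarks:
            # Each hand yields 21 (x, y, z) landmarks normalized to the image size.
            coords = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
            print(len(coords), "landmarks detected")
```

The resulting landmark vectors, rather than raw pixels, can then serve as the per-frame features fed to the recognition model.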
B. LSTM

Long Short-Term Memory is a kind of recurrent neural network, designed by Hochreiter and Schmidhuber. It tackles the long-term dependency problem of RNNs, in which an RNN cannot predict a word stored in long-term memory but gives more accurate predictions from recent information; as the gap length increases, the RNN no longer performs efficiently. An LSTM can, by default, retain information for a long period of time. It is used for processing, predicting, and classifying on the basis of time-series data.

Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) specifically designed to handle sequential data, such as time series, speech, and text. LSTM networks are capable of learning long-term dependencies in sequential data, which makes them well suited for tasks such as language translation, speech recognition, and time-series forecasting.

LSTMs can be employed to process sign language video sequences, which are essentially sequential data frames. Each video frame can be considered a time step, and the LSTM network can analyze the temporal patterns and dependencies between these frames. For example, it can capture the dynamic movement of hands and facial expressions in sign language gestures.

LSTMs can also be used for feature extraction from the video data. They can learn to represent important temporal features from the video frames, such as the trajectory of hand movements, handshapes, and the order of signs in a sentence. These features can then be used as input to the overall recognition model.

LSTMs can be employed for recognizing sign language gestures as sequences. As a user signs a phrase or sentence, the LSTM network processes each video frame sequentially and maintains context. This allows it to make predictions about the sign language signs being performed in real time.

LSTMs are often used in combination with CNNs in a hybrid architecture. CNNs are suitable for processing static visual features in individual frames, while LSTMs excel at handling temporal sequences. The output of the CNNs can be fed as sequences into the LSTM network, allowing the model to consider both spatial and temporal information for sign language recognition.

The LSTM network is trained using labeled sign language data, where sequences of video frames are associated with specific sign language signs or phrases. The LSTM learns to capture the dynamics and context needed for accurate recognition.

Once trained, the LSTM model can be used for real-time sign language recognition. It takes video frames as input and provides predictions on the signs being performed as the user signs. The model's ability to maintain context and consider temporal dependencies is particularly valuable for this application.
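The sketch below shows one possible Keras LSTM classifier operating on sequences of per-frame hand-landmark vectors (for example, those produced by MediaPipe Hands); the sequence length, feature size, and number of classes are assumed values, not figures taken from the surveyed systems.

```python
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN = 30        # assumed: frames per gesture clip
NUM_FEATURES = 63   # assumed: 21 landmarks x (x, y, z) for one hand
NUM_CLASSES = 15    # assumed: number of sign words

lstm_model = keras.Sequential([
    layers.Input(shape=(SEQ_LEN, NUM_FEATURES)),  # one landmark vector per video frame
    layers.LSTM(64, return_sequences=True),       # keeps per-frame context across the clip
    layers.LSTM(128),                             # summarizes the whole gesture
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
lstm_model.compile(optimizer="adam",
                   loss="categorical_crossentropy",
                   metrics=["accuracy"])
```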
C. CNN

CNNs are primarily used for image feature extraction. In this system, each frame of the sign language video can be considered an image. CNNs are employed to capture and analyze the spatial features within these images; they can identify handshapes, facial expressions, and the position of the hands in the frame. CNNs consist of convolutional layers that apply filters to the input images. These filters detect various patterns and features, such as edges, corners, and textures in the sign language video frames. The network learns to recognize the most relevant features for sign language recognition.

Pooling layers downsample the feature maps, reducing the spatial dimension while preserving the essential features. This helps reduce computational complexity and enhances the model's invariance to small variations in hand positions or orientations.

The convolutional filters can be trained to identify key aspects of sign language gestures, such as the shapes made by the fingers and the positions of the hands relative to the face. This helps the CNN understand the visual characteristics of signs. CNNs often require preprocessing techniques, such as image resizing, normalization, and data augmentation. Preprocessing ensures that the input data is appropriately prepared for the network, and data augmentation increases the robustness of the model by introducing variations in the training data.

CNNs are often used in combination with Long Short-Term Memory (LSTM) networks in a hybrid architecture. While CNNs capture spatial features, LSTMs handle the temporal aspects of sign language gestures. The output of the CNNs can be passed as sequences to the LSTM network, allowing the model to consider both spatial and temporal information for recognition. The CNN model is trained using labeled sign language image data, and it learns to recognize the important visual features and patterns associated with each sign.
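The following Keras sketch illustrates one way such a hybrid can be wired: a small CNN is applied to every frame through a TimeDistributed wrapper and an LSTM models the frame order. The frame size, clip length, and class count are assumptions for the example, not parameters from the surveyed papers.

```python
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN, H, W, C = 30, 64, 64, 3   # assumed: frames per clip and frame size
NUM_CLASSES = 15                   # assumed: number of signs

# Per-frame CNN: convolution + pooling layers extract spatial features.
frame_cnn = keras.Sequential([
    layers.Input(shape=(H, W, C)),
    layers.Rescaling(1.0 / 255),               # normalization preprocessing
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
])

# Hybrid model: the CNN runs on every frame, the LSTM models the temporal order.
hybrid = keras.Sequential([
    layers.Input(shape=(SEQ_LEN, H, W, C)),
    layers.TimeDistributed(frame_cnn),
    layers.LSTM(128),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
hybrid.compile(optimizer="adam",
               loss="categorical_crossentropy",
               metrics=["accuracy"])
```

With more training data, the small convolutional stack could be replaced by a pretrained image backbone while keeping the same overall structure.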
3. CONCLUSION AND FUTURE WORK

Sign language recognition has greatly benefited from advancements in machine learning, particularly in computer vision and natural language processing. Deep learning models such as convolutional neural networks (CNNs) and Long Short-Term Memory (LSTM) networks have demonstrated impressive results in recognizing signs accurately.

Sign language recognition technology is poised to make a significant impact on the lives of Deaf and hard of hearing individuals. This report underscores the importance of ongoing research and collaboration between experts in machine learning, computer vision, and the Deaf community to drive innovation and make sign language recognition more accurate, accessible, and inclusive. One of the most important directions for future work is to improve the response time and accuracy of the textual and speech outputs. With further advancements and increased awareness, we can look forward to a future where communication barriers are significantly reduced for the Deaf community.

ACKNOWLEDGEMENT

In the present world of competition there is a race for existence in which those who have the will to come forward succeed. A project is like a bridge between theoretical and practical work, and with this willingness we joined this particular project. First of all, we would like to thank the supreme power, the Almighty God, who has always guided us to work on the right path of life. We sincerely thank Prof. R. H. Borhade, Head of the Department of Computer Engineering of Smt. Kashibai Navale College of Engineering, for all the facilities provided to us in the pursuit of this project.

We are indebted to our project guide, Prof. P. V. Bhaskare, Department of Computer Engineering of Smt. Kashibai Navale College of Engineering. We feel it is a pleasure to be indebted to our guide for his valuable support, advice, and encouragement, and we thank him for his superb and constant guidance towards this project.

We are deeply grateful to all the staff members of the Computer Department for supporting us in all aspects. We acknowledge our deep sense of gratitude to our loving parents for being a constant source of inspiration and motivation.

REFERENCES

[1] Carmela Louise L. Evangelista, Criss Jericho R. Geli, Marc Marion V. Castillo: Long Short-Term Memory-based Static and Dynamic Filipino Sign Language Recognition (2023)

[2] Roli Kushwaha, Gurjit Kaur, Manjeet Kumar: Hand Gesture Based Sign Language Recognition Using Deep Learning (2023)

[3] Rinki Gupta, Roohika Manodeep Dadwal: Deep Learning based Sign Language Recognition robust to Sensor Displacement (2023)

[4] Subhangi Kumari, Ernest Tarlue, Aissatou Diallo, Megha Chhabra, Gouri Shankar Mishra, Mayank Kumar Goyal: A Review of Segmentation and Recognition Techniques for Indian Sign Language using Machine Learning and Computer Vision (2023)

[5] Jashwanth Peguda, V Sai Sriharsha Santosh, Y Vijayalata, Ashlin Deepa R N, Vaddi Mounish: Speech to Sign Language Translation for Indian Languages (2022)

[6] Dr. Aruna Bhat, Vinay Yadav, Vishesh Dargan, Yash: Sign Language to Text Conversion using Deep Learning (2022)

[7] Jaya Nirmala: Sign Language Translator using Machine Learning (2022)

[8] Mrs. Aerpula Swetha, Vamja Pooja, Vundi Vedavyas, Challa Datha Venkata Naga Sai Kiran, Sadu Sravan: Sign Language to Speech Translation using Machine Learning (2022)

[9] Tuan Linh Dang, Sy Dat Tran, Thuy Hang Nguyen, Suntae Kim, Nicolas Monet: An Improved Hand Gesture Recognition System using Keypoints and Hand Bounding Boxes (2022)

[10] Sang-Geun Choi, Yeonji Park and Chae-Bong Sohn: Dataset Transformation System for Sign Language Recognition Based on Image Classification Network (2022)

[11] E. B. Villagomez, R. A. King, M. J. Ordinario, J. Lazaro and J. F. Villaverde, "Hand Gesture Recognition for Deaf-Mute using Fuzzy-Neural Network," 2019 IEEE International Conference on Consumer Electronics - Asia (ICCE-Asia), 2019, pp. 30-33, doi: 10.1109/ICCEAsia46551.2019.8942220.

[12] G. K. R. Madrid, R. G. R. Villanueva and M. V. C. Caya, "Recognition of Dynamic Filipino Sign Language using MediaPipe and Long Short-Term Memory," 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 2022, pp. 1-6, doi: 10.1109/ICCCNT54827.2022.9984599.

[13] M. B. D. Jarabese, C. S. Marzan, J. Q. Boado, R. R. M. F. Lopez, L. G. B. Ofiana and K. J. P. Pilarca, "Sign to Speech Convolutional Neural Network-Based Filipino Sign Language Hand Gesture Recognition System," 2021 International Symposium on Computer Science and Intelligent Controls (ISCSIC), Rome, Italy, 2021, pp. 147-153, doi: 10.1109/ISCSIC54682.2021.00036.

[14] K. E. Oliva, L. L. Ortaliz, M. A. Tobias and L. Vea, "Filipino Sign Language Recognition for Beginners using Kinect," 2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM), Baguio City, Philippines, 2018, pp. 1-6, doi: 10.1109/HNICEM.2018.8666346.

[15] M. Allen Cabutaje, K. Ang Brondial, A. Franchesca Obillo, M. Abisado, S. Lor Huyo-a and G. Avelino Sampedro, "Ano Raw: A Deep Learning Based Approach to Transliterating the Filipino Sign Language," 2023 International Conference on Electronics, Information, and Communication (ICEIC), Singapore, 2023, pp. 1-6, doi: 10.1109/ICEIC57457.2023.10049890.

[16] A. S. M. Miah, J. Shin, M. A. M. Hasan, and M. A. Rahim, "BenSignNet: Bengali Sign Language Alphabet Recognition Using Concatenated Segmentation and Convolutional Neural Network," Applied Sciences, vol. 12, no. 8, p. 3933, Apr. 2022, doi: 10.3390/APP12083933.

[17] D. Li, X. Yu, C. Xu, L. Petersson, and H. Li, "Transferring Cross-Domain Knowledge for Video Sign Language Recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6204-6213, 2020, doi: 10.1109/CVPR42600.2020.00624.

[18] L. Pigou, A. van den Oord, S. Dieleman, M. van Herreweghe, and J. Dambre, "Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video."