
Dynamic Gesture Recognition for Sign Language Using Long Short-Term Memory Networks

Rohan Singh, Rahul Sharma, Sushil Kumar Gautam, Youddha Beer Singh
Department of CSE, Galgotias College of Engineering & Technology, Greater Noida, India
rohansinghrohansingh64@gmail.com, [email protected], [email protected], youddhabeersingh@gmail.com

Abstract— A significant flaw in our society is the social divide that exists between people with disabilities and those without. Communication is perhaps one of the most significant characteristics of humans, who are thought of as social animals. One of the biggest challenges for those with hearing and vocal impairments is communication: this inability to communicate causes regular difficulties and interferes with everyday tasks. In this research work we propose a way to overcome this communication obstacle. The solution is easy for anyone to use and, with a few tweaks, can be made to run on most systems that have a camera module. Our method employs an integrated camera module to record hand movements in real time, based on hand landmarks or key points. The proposed LSTM-based model has been evaluated in a real-time environment, and significant results are reported in the results section.

Keywords— Deep Learning, Image Processing, Long Short-Term Memory, Sign Language Recognition.

I. INTRODUCTION

Sign language recognition is rapidly emerging as a significant field of research, particularly in the context of enhancing human-machine interactions. Gestures represent a form of non-verbal communication through which individuals convey messages and express themselves without the use of spoken language. In everyday life, we frequently encounter various hand gestures used for communication. Examples include the thumbs-up sign, which typically indicates approval or agreement, the thumbs-down gesture for disapproval, the victory sign often associated with success or triumph, and directional signals used to indicate movement or guidance. Such gestures are not only prevalent in casual interactions but also play crucial roles in specific contexts. For instance, in cricket, umpires utilize distinct hand gestures to denote different events occurring during the match. Similarly, traffic police officers use hand signals to manage and direct vehicular movement.

Sign language is a visual language expressed through physical movements instead of spoken words. The language relies on visible cues from hands, eyes, facial expressions, and movements to communicate. Although sign language is used primarily by people who are deaf or hard of hearing, it is also used by many hearing people.

As with any spoken language, sign language has grammar and structure rules, and it has evolved over time. Just like with spoken languages, there is no "universal" sign language. Different countries typically have their own version of sign language, which is unique to their region and culture. For example, American Sign Language (ASL) is different from Australia's Auslan, which is different from the British Sign Language (BSL) used in the United Kingdom. A person fluent in ASL may travel to Sydney, Australia, and have trouble understanding someone using the local sign language: instead of the different dialects or accents apparent in oral languages, the signs and gestures themselves are different. Today, there are more than 300 different sign languages in the world, used by more than 72 million deaf or hard-of-hearing people worldwide [2]. Hands are the most important objects in the inputs of sign language recognition models, and tracking the detected hands is one of the substantial challenges for video inputs due to the high occlusion of hand fingers and joints.

Investing in research and innovation in this domain is not merely a technological pursuit but a profound step towards building a compassionate, inclusive, and equitable society. By breaking down communication barriers, these advancements empower individuals with speech and hearing impairments to fully participate in every aspect of life. As we continue to develop and refine these systems, we move closer to creating a world where no individual is limited by their ability to communicate, fostering a stronger sense of connection, understanding, and shared progress that benefits all of humanity. This journey represents not only a triumph of technology but also a testament to the boundless potential of human ingenuity and empathy.
The main contributions of this paper are as follows:

• Database: We create our own sign language dataset for model testing and validation in this work. The dataset contains 50 images for each of the 26 classes, tagged 'A' to 'Z', captured by ourselves with a static camera.

• Novel approach: The aim of the present study is to develop an LSTM-based system that recognizes Indian Sign Language and that can be used offline.

The rest of the paper is organized as follows: Section II describes the related work. Section III provides an overview of the materials and tools used. Section IV details the implementation and methodologies. Section V presents the experiments and results. Finally, Section VI concludes the paper and highlights future research directions.

II. RELATED WORK

Indian Sign Language is a visual-spatial language that was developed in India. It is a natural language with its own phonology, morphology, and grammar. It uses arm motions, hands, facial expressions, and the body/head, which generate semantic information conveying words and emotions. Hand gesture recognition problems have been addressed in various ways by researchers.

Ankita Wadhawan and Parteek Kumar [3] present a deep learning-based system for Indian Sign Language (ISL) recognition using Convolutional Neural Networks (CNNs). A dataset comprising 35,000 images of 100 static signs was developed, including alphabets, digits, and commonly used words, captured under varied environmental conditions. The system, tested on 50 CNN architectures with different hyperparameters and optimizers, achieved a maximum training accuracy of 99.90% and validation accuracy of 98.70% using the Stochastic Gradient Descent (SGD) optimizer. The approach outperformed previous methods in precision, recall, and F1-score metrics, showcasing its robustness for ISL recognition tasks.

Mittal et al. [4] propose a Modified Long Short-Term Memory (LSTM) model integrated with Leap Motion sensors for continuous Indian Sign Language (ISL) recognition. This system segments and recognizes connected sign gestures using CNN-extracted spatial features and a reset gate in the LSTM to handle transitions between signs. Evaluated on 942 signed sentences involving 35 unique words, the model achieved an accuracy of 72.3% for sentences and 89.5% for isolated sign words, significantly outperforming traditional LSTM architectures.

Kothadiya et al. [5] propose a deep learning model combining LSTM and GRU for Indian Sign Language recognition, achieving 97% accuracy on the custom IISL2020 dataset. Using InceptionResNetV2 for feature extraction, the model processes video frames under natural conditions without specialized equipment, enabling real-time communication for speech and hearing-impaired individuals.

Raheja et al. [6] present an Indian Sign Language recognition system using Support Vector Machines (SVM) for classifying dynamic hand gestures. The approach pre-processes video frames in HSV colour space, extracts features such as Hu moments and hand trajectory, and uses an SVM for classification. Tested with an MS Kinect and webcams, the system achieves a recognition accuracy of 97.5% for four signs. This real-time solution is aimed at enhancing communication for hearing and speech-impaired individuals.

Adithya et al. [7] propose a vision-based method for recognizing Indian Sign Language using artificial neural networks (ANN). The system employs image pre-processing, hand segmentation in YCbCr colour space, and feature extraction using distance transformation and Fourier descriptors. A feed-forward neural network trained with the extracted features achieves an average recognition accuracy of 91.11% across 36 signs, offering a non-intrusive and computationally efficient approach to assist communication for hearing-impaired individuals.

The paper by Yogeshwar I. Rokade and Prashant M. Jadav [8] presents a vision-based Indian Sign Language (ISL) recognition system leveraging Artificial Neural Networks (ANN) and Support Vector Machines (SVM). The method includes skin segmentation, binary conversion, and feature extraction through Euclidean distance transformation, central moments, and Hu moments. Using ANN, the system achieves an accuracy of 94.37%, compared to 92.12% with SVM, demonstrating its capability to effectively recognize 17 ISL alphabets and enhance communication for hearing and speech-impaired individuals.

The paper by Divya Deora and Nikesh Bajaj [9] proposes an Indian Sign Language (ISL) recognition system using Principal Component Analysis (PCA). It incorporates a fingertip detection algorithm to improve feature selection, and images are segmented using colour masks for red and blue gloves. The system achieved 94% accuracy for static signs, with future proposals suggesting neural networks and the integration of fingertip data with PCA to enhance robustness and handle motion-based gestures.

The paper by Piyusha Vyavahare, Sanket Dhawale, Priyanka Takale, Vikrant Koli, Bhavana Kanawade, and Shraddha Khonde [10] proposes an Indian Sign Language (ISL) recognition system using Long Short-Term Memory (LSTM) networks. It utilizes computer vision algorithms for feature extraction and employs LSTM networks to capture temporal dependencies in dynamic gestures. A custom dataset of 40 ISL actions was created, including annotated video recordings of native ISL users. The system achieved a training accuracy of 96% and a test accuracy of 87% for recognizing individual words. Future directions include developing sentence-level recognition, addressing environmental variations, and expanding the dataset to improve robustness and generalizability.

The paper by Ahmed Mateen Buttar, Usama Ahmad, Abdu H. Gumaei, Adel Assiri, Muhammad Azeem Akbar, and Bader Fahad Alkhamees [11] proposes a hybrid approach for American Sign Language (ASL) recognition using LSTM and YOLOv6 models. The study combines dynamic sign detection using a skeleton-based LSTM for temporal gesture recognition with static sign recognition using YOLOv6. A custom dataset of ASL signs, including static and dynamic gestures, was used, achieving 92% accuracy for dynamic signs with LSTM and 96% for static signs with YOLOv6. The authors highlight the advantages of the hybrid approach in enhancing recognition accuracy for different sign types and propose future work integrating fuzzy logic to handle noise and lighting variations.

Table 1 presents a comparative analysis of the approaches used for sign language recognition in terms of methodology and average accuracy.

Table 1: Comparative analysis of various learning models/methods.

Author | Methodology | Accuracy
Wadhawan & Kumar [3] | CNN | 99.90% (training), 98.70% (validation)
Mittal et al. [4] | Modified LSTM with Leap Motion | 72.3% (sentences), 89.5% (words)
Kothadiya et al. [5] | LSTM-GRU hybrid | 97%
Raheja et al. [6] | SVM | 97.5%
Adithya et al. [7] | ANN | 91.11%
Rokade & Jadav [8] | ANN and SVM | 94.37% (ANN), 92.12% (SVM)
Deora & Bajaj [9] | PCA with fingertip detection | 94%

From Table 1 it is clear that the majority of researchers have used deep learning-based approaches to obtain better results for sign language recognition.

III. MATERIAL AND TOOLS

To provide a comprehensive understanding of the materials and tools used in this work, the following list details the software libraries, frameworks, and tools employed in the development and implementation of the sign language recognition system.

• Programming Language: Python, used for its extensive libraries and frameworks that support machine learning and computer vision tasks.
• Deep Learning Frameworks: TensorFlow, a popular open-source framework for building and training deep learning models, including CNNs.
• PyTorch: Another powerful deep learning library with dynamic computation graph support, making it easier to implement CNNs.
• Keras: A high-level API built on TensorFlow, used for creating network architectures with minimal code.
• Computer Vision Libraries: OpenCV, used for capturing video input, image processing, and augmenting gesture data for training.
• MediaPipe: A framework by Google for real-time hand tracking and landmark detection, which assists in gesture recognition.
• Data Preprocessing and Augmentation: NumPy, for numerical computations such as normalizing pixel values and reshaping data.
• Pandas: For organizing and manipulating datasets.
• Visualization Tools: Matplotlib, for visualizing data distributions, training metrics (accuracy, loss), and results.

IV. METHODOLOGY

Data Collection: The system captures real-time video frames from a webcam and stores them in categorized directories labeled 'A' to 'Z' (26 alphabets in total). The system dynamically counts the number of images in each directory and assigns a unique filename to each saved image, ensuring no duplication. The images are stored in specific folders corresponding to the user-selected categories, which can be triggered by pressing a key (from 'a' to 'z'). For every key pressed, the captured frame is stored in the appropriate folder, and the filename reflects the current count of images in that folder.

A designated region of interest (ROI) is displayed, which shows a focused area of the video frame. The frame is captured from the ROI, providing the user with an updated preview of what is being saved. As the system runs, it provides real-time feedback by displaying the number of images in each directory on the screen, so users can track how many frames have been collected for each category. The system continues to capture and store frames until manually stopped, making it suitable for applications such as machine learning dataset creation and image analysis. Sample collected data images are shown in Figure 1.

Figure 1: Samples of collected data images
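
The paper does not include source code, but the data collection loop described above can be sketched with OpenCV roughly as follows. The folder layout, ROI coordinates, and window names here are illustrative assumptions, not the authors' exact implementation:

```python
import os
import string
import cv2

DATA_DIR = "dataset"                      # assumed root folder for the collected images
LETTERS = list(string.ascii_uppercase)    # class folders 'A'..'Z'
for letter in LETTERS:
    os.makedirs(os.path.join(DATA_DIR, letter), exist_ok=True)

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    # Region of interest (ROI) shown to the user for consistent hand placement;
    # the exact coordinates are an assumption.
    x1, y1, x2, y2 = 100, 100, 400, 400
    roi = frame[y1:y2, x1:x2]
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)

    # Overlay current per-class image counts so the dataset stays balanced.
    counts = {l: len(os.listdir(os.path.join(DATA_DIR, l))) for l in LETTERS}
    cv2.putText(frame, f"A:{counts['A']}  B:{counts['B']}  ...", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 1)
    cv2.imshow("Data collection", frame)

    key = cv2.waitKey(1) & 0xFF
    if key == 27:                         # Esc stops the collection loop
        break
    if ord('a') <= key <= ord('z'):       # pressing 'a'..'z' saves the ROI to that class
        letter = chr(key).upper()
        filename = os.path.join(DATA_DIR, letter, f"{counts[letter]}.jpg")
        cv2.imwrite(filename, roi)

cap.release()
cv2.destroyAllWindows()
```

The filename is derived from the current image count in the class folder, which mirrors the duplication-free naming scheme described above.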
Key Point Extraction and Data Preparation for Hand Gesture Recognition: This process utilizes a computer vision framework to capture hand landmarks from a video stream and store key point data for gesture recognition tasks. The system processes a sequence of images representing different hand gestures, where each gesture corresponds to a specific action, such as 'A', 'B', 'C', etc. For each action, multiple sequences of frames are collected, and key point data from the hand landmarks are extracted for training purposes. The images are processed by converting them from BGR to RGB format and passing them through a hand tracking model that detects the positions of the various hand landmarks.

During this procedure, for every sequence and frame, the hand landmarks are drawn on the captured image to provide a visual representation of the detection. Additionally, the key point (landmark) data for each frame are flattened into a one-dimensional array and stored as a NumPy file in a structured directory. The images and key point data are organized into separate folders for each action and sequence, enabling the collection of labeled data for machine learning tasks. This setup ensures efficient storage and retrieval of data for later use in training models for gesture recognition or similar tasks. Key point extraction and data preparation for hand gesture recognition are shown in Figure 2.
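
A minimal sketch of this extraction step is given below, assuming MediaPipe Hands as the hand tracking model and an MP_Data/<action>/<sequence> directory layout; the directory names, number of sequences, sequence length, and the extract_keypoints helper are assumptions for illustration, not the authors' code:

```python
import os
import cv2
import numpy as np
import mediapipe as mp

DATA_PATH = "MP_Data"          # assumed root folder for extracted key points
ACTIONS = [chr(c) for c in range(ord('A'), ord('Z') + 1)]  # gesture classes 'A'..'Z'
NUM_SEQUENCES = 50             # assumed number of sequences per action
SEQUENCE_LENGTH = 30           # assumed frames per sequence

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

def extract_keypoints(results):
    """Flatten 21 hand landmarks (x, y, z) into a 63-value vector; zeros if no hand."""
    if results.multi_hand_landmarks:
        hand = results.multi_hand_landmarks[0]
        return np.array([[lm.x, lm.y, lm.z] for lm in hand.landmark]).flatten()
    return np.zeros(21 * 3)

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    for action in ACTIONS:
        for seq in range(NUM_SEQUENCES):
            os.makedirs(os.path.join(DATA_PATH, action, str(seq)), exist_ok=True)
            for frame_num in range(SEQUENCE_LENGTH):
                ok, frame = cap.read()
                if not ok:
                    continue
                # MediaPipe expects RGB input, while OpenCV delivers BGR frames
                results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                if results.multi_hand_landmarks:
                    mp_drawing.draw_landmarks(frame, results.multi_hand_landmarks[0],
                                              mp_hands.HAND_CONNECTIONS)
                cv2.imshow("Collecting keypoints", frame)
                cv2.waitKey(1)
                # Store the flattened landmark vector for this frame as a .npy file
                np.save(os.path.join(DATA_PATH, action, str(seq), f"{frame_num}.npy"),
                        extract_keypoints(results))
cap.release()
cv2.destroyAllWindows()
```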

Figure 2: Key Point Extraction

Model Training: The system utilizes a deep learning model for gesture recognition, leveraging LSTM (Long Short-Term Memory) networks to process sequential data from hand landmarks. The labeled dataset is prepared by extracting key points from various hand gestures, which are then transformed into sequences of data. These sequences are divided into training and testing sets, with the labels one-hot encoded to match the gesture categories.

A neural network is designed using a sequential architecture, consisting of three LSTM layers and two dense layers, to learn the temporal patterns in the data. The model is compiled with the Adam optimizer and the categorical cross-entropy loss function, optimizing the network for categorical accuracy. Training is performed for 200 epochs, with a TensorBoard callback to monitor the model's performance during the training process. After training, the model architecture is saved in JSON format, while the trained model's weights are stored in an H5 file for later use in predictions. This setup ensures that the system can efficiently recognize and classify hand gestures in real-time applications. The proposed model architecture and its parameters are shown in Figure 3.

Figure 3: Model Architecture with Parameters

Mathematical Calculation of Model Parameters: For an LSTM layer, the total number of parameters P_LSTM is calculated using the formula

P_LSTM = 4 × (n_units × (n_inputs + n_units) + n_units)    (1)

where n_inputs is the number of input features to the layer and n_units is the number of hidden units in the LSTM layer. The factor of four accounts for the four weight matrices in the LSTM (input gate, forget gate, cell state, and output gate). The parameters for each LSTM layer are as follows:

First LSTM layer (n_inputs = 63, i.e., 21 hand landmarks × 3 coordinates; n_units = 64): P_LSTM = 4 × (64 × (63 + 64) + 64) = 32,768
Second LSTM layer (n_inputs = 64, n_units = 128): P_LSTM = 4 × (128 × (64 + 128) + 128) = 98,816
Third LSTM layer (n_inputs = 128, n_units = 64): P_LSTM = 4 × (64 × (128 + 64) + 64) = 49,408

Dense Layer Parameters: The total number of parameters P_Dense for a dense layer is calculated using the formula

P_Dense = (n_inputs × n_units) + n_units    (2)

where n_inputs is the number of input features to the layer and n_units is the number of units in the layer. The parameters for each dense layer are given below:

First dense layer (n_inputs = 64, n_units = 64): P_Dense = (64 × 64) + 64 = 4,160
Second dense layer (n_inputs = 64, n_units = 26): P_Dense = (64 × 26) + 26 = 858
Third dense layer (n_inputs = 64, n_units = 26): P_Dense = (64 × 26) + 26 = 858
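
For concreteness, a minimal Keras sketch consistent with the description above is shown below. The 63 input features (21 MediaPipe hand landmarks × 3 coordinates), the 30-frame sequence length, and the placeholder training data are assumptions for illustration rather than the authors' exact configuration; calling model.summary() prints per-layer parameter counts that can be checked against Equations (1) and (2):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.callbacks import TensorBoard

SEQUENCE_LENGTH = 30   # assumed frames per gesture sequence
NUM_FEATURES = 63      # 21 hand landmarks x (x, y, z)
NUM_CLASSES = 26       # letters 'A'..'Z'

# Stacked LSTM model as described: three LSTM layers followed by dense layers,
# ending in a softmax over the 26 letters.
model = Sequential([
    Input(shape=(SEQUENCE_LENGTH, NUM_FEATURES)),
    LSTM(64, return_sequences=True),
    LSTM(128, return_sequences=True),
    LSTM(64, return_sequences=False),
    Dense(64, activation='relu'),
    Dense(NUM_CLASSES, activation='softmax'),
])

model.compile(optimizer='Adam', loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
model.summary()  # per-layer parameter counts; compare with Eqs. (1) and (2)

# X: (num_sequences, SEQUENCE_LENGTH, NUM_FEATURES) key point sequences
# y: one-hot encoded labels of shape (num_sequences, NUM_CLASSES)
X = np.random.rand(100, SEQUENCE_LENGTH, NUM_FEATURES)               # placeholder data
y = np.eye(NUM_CLASSES)[np.random.randint(0, NUM_CLASSES, 100)]      # placeholder labels

# 200 epochs with a TensorBoard callback, as described in the paper.
model.fit(X, y, epochs=200, callbacks=[TensorBoard(log_dir='logs')])

# Persist the architecture (JSON) and the learned weights (HDF5), as in the paper.
with open('model.json', 'w') as f:
    f.write(model.to_json())
model.save_weights('model.weights.h5')
```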
Real-Time Hand Gesture Recognition and Prediction: After training, the model is integrated into the real-time gesture recognition system. Sample output of the real-time prediction is shown in Figure 4. The key steps in real-time recognition, sketched in the code below, include:

• Live Video Capture: A webcam captures frames in real time.
• Hand Landmark Detection: MediaPipe is used to detect and track the landmarks of the user's hand.
• Gesture Prediction: The real-time frames are processed and fed into the trained model to predict the corresponding gesture.
• Gesture Output: The recognized gesture is displayed on the screen.
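
A minimal sketch of this loop is shown below, reusing the extract_keypoints helper and the 30-frame sequence length assumed in the earlier sketches (both assumptions, not the authors' code). It buffers the most recent key point vectors, runs the trained model on the buffer, and overlays the predicted letter on the frame:

```python
import cv2
import numpy as np
import mediapipe as mp
from tensorflow.keras.models import model_from_json

SEQUENCE_LENGTH = 30
ACTIONS = [chr(c) for c in range(ord('A'), ord('Z') + 1)]

# Load the architecture (JSON) and weights (H5) saved after training.
with open('model.json') as f:
    model = model_from_json(f.read())
model.load_weights('model.weights.h5')

mp_hands = mp.solutions.hands

def extract_keypoints(results):
    if results.multi_hand_landmarks:
        hand = results.multi_hand_landmarks[0]
        return np.array([[lm.x, lm.y, lm.z] for lm in hand.landmark]).flatten()
    return np.zeros(21 * 3)

sequence = []  # rolling buffer of the last SEQUENCE_LENGTH key point vectors
cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        sequence.append(extract_keypoints(results))
        sequence = sequence[-SEQUENCE_LENGTH:]
        if len(sequence) == SEQUENCE_LENGTH:
            # Predict over the buffered sequence and pick the most likely letter
            probs = model.predict(np.expand_dims(sequence, axis=0), verbose=0)[0]
            letter = ACTIONS[int(np.argmax(probs))]
            cv2.putText(frame, letter, (10, 40),
                        cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
        cv2.imshow("Real-time prediction", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()
```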

Figure 4: Real-Time Prediction

V. EXPERIMENTS AND RESULTS

5.1. DATA SET

In this study, we compile a custom sign language dataset intended for model evaluation and validation. The dataset covers all 26 alphabets with 50 images each, giving a total of 1,300 alphabet images used in this work. A webcam was used to capture images of hand gestures representing each letter of the English alphabet.

A designated area on the screen, known as the region of interest (ROI), helped ensure consistency in gesture positioning. By pressing specific keys, users could save images of gestures to the corresponding folder for each letter. A counter displayed the number of images collected for each letter in real time, ensuring a balanced dataset. Samples from the custom sign language dataset are shown in Figure 5. This structured approach provides a well-organized dataset suitable for training and testing machine learning models for sign language recognition. The created dataset has been uploaded to Kaggle, and anyone can access it for research purposes through the following link: https://www.kaggle.com/datasets/rahulsh123/sign-language-recognition-dataset [1].

Figure 5: Sample Dataset

5.2. RESULTS AND DISCUSSION

The system achieved an average accuracy of over 90% in real-time gesture recognition, with most letters performing consistently well. Slight variations in accuracy were observed for similar gestures or challenging conditions, highlighting areas for improvement. Overall, the system demonstrates strong potential for practical applications. Experimental results of the proposed model evaluated in the real-time environment are given in Table 2 in terms of the average accuracy for each alphabet.

Table 2: Experimental results (average accuracy per alphabet, %)

Alphabet | A | B | C | D | E | F | G | H | I
Avg. accuracy (%) | 99.98 | 97.93 | 99.93 | 94.54 | 87.52 | 92.45 | 86.42 | 82.49 | 93.44

Alphabet | J | K | L | M | N | O | P | Q | R
Avg. accuracy (%) | 92.47 | 94.33 | 92.41 | 86.45 | 80.55 | 82.47 | 86.42 | 90.61 | 88.24

Alphabet | S | T | U | V | W | X | Y | Z
Avg. accuracy (%) | 93.20 | 87.24 | 84.22 | 92.41 | 87.06 | 80.86 | 80.92 | 86.47
The accuracy graph shown in Figure 6 illustrates the model's performance improvement over the training epochs. Initially, accuracy is low, reflecting the model's limited understanding of the task. As epochs progress, accuracy increases gradually, showing that the model is learning patterns in the data. The curve eventually plateaus near the 100% mark, indicating that the model is consistently making correct predictions on the training data. The absence of abrupt fluctuations or dips suggests a stable and effective learning process, without interruptions or anomalies.

Figure 6: Model Accuracy over Epochs

The loss graph shown in Figure 7 represents the model's error reduction throughout the training process. At the beginning, the loss is significantly high, as the model starts with random or poorly initialized weights. Over the epochs, the loss steadily decreases, reflecting the model's ability to minimize prediction errors. By the final epochs, the loss approaches zero, signaling near-perfect predictions. The smooth downward trajectory indicates that the optimization process is efficient, with no evidence of overfitting, underfitting, or erratic learning behavior. Figure 8 shows the captured output results.

Figure 7: Model Loss over Epochs

Figure 8: Output

VI. CONCLUSIONS AND FUTURE WORK

In conclusion, the developed sign language recognition system demonstrates the potential of computer vision and machine learning to bridge communication barriers for individuals with hearing and speech impairments. By effectively recognizing and interpreting hand gestures corresponding to letters of the alphabet, the system offers a foundation for more advanced solutions that can support real-time communication. This work highlights the importance of accessible technologies and provides a framework for future enhancements, such as the inclusion of dynamic gestures, words, and phrases, to create a comprehensive tool for inclusive interaction.

Future work could involve expanding the system to recognize a broader range of sign languages, as well as improving its real-time performance and accuracy through the integration of more advanced machine learning models. Additionally, incorporating user feedback could help tailor the system to better meet the needs of diverse individuals and communities.
REFERENCES

[1] Dataset link: https://www.kaggle.com/datasets/rahulsh123/sign-language-recognition-dataset
[2] Razieh Rastgoo, Kourosh Kiani, Sergio Escalera, "Sign Language Recognition: A Deep Survey," Expert Systems with Applications, vol. 164, 2021, 113794, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2020.113794.
[3] Wadhawan, A., Kumar, P., "Deep learning-based sign language recognition system for static signs," Neural Computing and Applications, vol. 32, pp. 7957-7968, 2020, https://doi.org/10.1007/s00521-019-04691-y.
[4] A. Mittal, P. Kumar, P. P. Roy, R. Balasubramanian and B. B. Chaudhuri, "A Modified LSTM Model for Continuous Sign Language Recognition Using Leap Motion," IEEE Sensors Journal, vol. 19, no. 16, pp. 7056-7063, 15 Aug. 2019, doi: 10.1109/JSEN.2019.2909837.
[5] Kothadiya, D., Bhatt, C., Sapariya, K., Patel, K., Gil-González, A.-B., & Corchado, J. M., "Deepsign: Sign Language Detection and Recognition Using Deep Learning," Electronics, vol. 11, no. 11, 1780, 2022, https://doi.org/10.3390/electronics11111780.
[6] Raheja, J. L., Mishra, A., & Chaudhary, A., "Indian sign language recognition using SVM," Pattern Recognition and Image Analysis, vol. 26, pp. 434-441, 2016, https://doi.org/10.1134/S1054661816020164.
[7] V. Adithya, P. R. Vinod and U. Gopalakrishnan, "Artificial neural network-based method for Indian sign language recognition," 2013 IEEE Conference on Information & Communication Technologies, Thuckalay, India, 2013, pp. 1080-1085, doi: 10.1109/CICT.2013.6558259.
[8] Rokade, Yogeshwar & Jadav, Prashant, "Indian Sign Language Recognition System," International Journal of Engineering and Technology, vol. 9, pp. 189-196, 2017, doi: 10.21817/ijet/2017/v9i3/170903S030.
[9] D. Deora and N. Bajaj, "Indian sign language recognition," 2012 1st International Conference on Emerging Technology Trends in Electronics, Communication & Networking, Surat, India, 2012, pp. 1-5, doi: 10.1109/ET2ECN.2012.6470093.
[10] Vyavahare, P., Dhawale, S., Takale, P., Koli, V., Kanawade, B., & Khonde, S., "Detection and interpretation of Indian Sign Language using LSTM networks," Journal of Intelligent Systems and Control, vol. 2, no. 3, pp. 132-142, 2023.
[11] Buttar, A. M., Ahmad, U., Gumaei, A. H., Assiri, A., Akbar, M. A., & Alkhamees, B. F., "Deep learning in sign language recognition: a hybrid approach for the recognition of static and dynamic signs," Mathematics, vol. 11, no. 17, 3729, 2023.
