Dynamic Gesture Recognition For Sign Language Using Long Short Term Memory Networks
Rohan Singh, Rahul Sharma, Sushil Kumar Gautam, Youddha Beer Singh
Department of CSE, Galgotias College of Engineering & Technology, Greater Noida, India
[email protected], [email protected], [email protected], [email protected]
Abstract— A significant flaw in our society is the social divide that exists between people with disabilities and those without. Communication is perhaps one of the most significant characteristics of humans, who are thought of as social animals. One of the biggest challenges for those with hearing and vocal impairments is communication. For someone who has hearing and voice impairments, this incapacity to communicate causes regular issues and interferes with their everyday tasks. We have suggested a way to get beyond this communication obstacle in our research work. Everyone can use this solution with ease, and with a few tweaks, it can be made to function on the majority of systems having camera modules. Our method employs an integrated camera module to record hand movements in real time, based on landmarks or hand key points. The proposed LSTM-based model has been evaluated in a real-time environment, and significant results are reported in the results section.

Keywords— Deep Learning, Image Processing, Long Short-Term Memory, Sign Language Recognition.

I. INTRODUCTION

Sign language recognition is rapidly emerging as a significant field of research, particularly in the context of enhancing human-machine interactions. Gestures represent a form of non-verbal communication through which individuals convey messages and express themselves without the use of spoken language. In everyday life, we frequently encounter various hand gestures used for communication. Examples include the thumbs-up sign, which typically indicates approval or agreement, the thumbs-down gesture for disapproval, the victory sign often associated with success or triumph, and directional signals used to indicate movement or guidance. Such gestures are not only prevalent in casual interactions but also play crucial roles in specific contexts. For instance, in cricket, umpires utilize distinct hand gestures to denote different events occurring during the match. Similarly, traffic police officers use hand signals to manage and direct vehicular movement.

Sign language is a visual language expressed through physical movements instead of spoken words. The language relies on visible cues from hands, eyes, facial expressions, and movements to communicate. Although sign language is used primarily by people who are deaf or hard of hearing, it is also used by many hearing people. As with any spoken language, sign language has grammar and structure rules, and it has evolved over time. Just like with spoken languages, there is no "universal" sign language. Different countries typically have their own version of sign language, which is unique to their region and culture. For example, American Sign Language (ASL) is different from Australia's Auslan sign language, which is different from the British Sign Language (BSL) used in the United Kingdom. A person fluent in ASL may travel to Sydney, Australia, and have trouble understanding someone using the local sign language: rather than the different dialects or accents apparent in oral language, the signs and gestures themselves are different. Today, there are more than 300 different sign languages in the world, used by more than 72 million deaf or hard-of-hearing people worldwide [2]. Hands are the most important objects in the inputs of sign language recognition models, and tracking the detected hands is one of the substantial challenges for video inputs due to the high occlusion of fingers and joints.

Investing in research and innovation in this domain is not merely a technological pursuit but a profound step towards building a compassionate, inclusive, and equitable society. By breaking down communication barriers, these advancements empower individuals with speech and hearing impairments to fully participate in every aspect of life. As we continue to develop and refine these systems, we move closer to creating a world where no individual is limited by their ability to communicate, fostering a stronger sense of connection, understanding, and shared progress that benefits all of humanity. This journey represents not only a triumph of technology but also a testament to the boundless potential of human ingenuity and empathy.
The main contributions of the paper are as follows:

• Database: In this paper, we create our own sign language dataset for the purpose of model testing and validation. The dataset contains 50 images for each of 26 classes, tagged 'A' to 'Z', and includes sign language data captured by ourselves with a static camera.

• Novel approach: The aim of the present study is to develop an LSTM-based system that recognizes Indian Sign Language and that can be used offline.

The rest of the paper is organized as follows: Section II describes the related work. Section III provides an overview of the materials and tools used. Section IV details the implementation and methodologies. Section V presents the experiments and results. Finally, Section VI concludes the paper and highlights future research directions.

II. RELATED WORK

Indian Sign Language is a visual–spatial language that was developed in India. Indian Sign Language is a natural language with its own phonology, morphology, and grammar. It uses arm motions, hands, facial expressions, and the body/head, which generate semantic information conveying words and emotions. Hand gesture recognition problems have been addressed in many different ways by researchers.

Ankita Wadhawan and Parteek Kumar [3] present a deep learning-based system for Indian Sign Language (ISL) recognition using Convolutional Neural Networks (CNNs). A dataset comprising 35,000 images of 100 static signs was developed, including alphabets, digits, and commonly used words, captured under varied environmental conditions. The system, tested on 50 CNN architectures with different hyperparameters and optimizers, achieved a maximum training accuracy of 99.90% and validation accuracy of 98.70% using the Stochastic Gradient Descent (SGD) optimizer. The approach outperformed previous methods in precision, recall, and F1-score metrics, showcasing its robustness for ISL recognition tasks.

Anshul Mittal [4] proposes a Modified Long Short-Term Memory (LSTM) model integrated with Leap Motion sensors for continuous Indian Sign Language (ISL) recognition. This novel system segments and recognizes connected sign gestures using CNN-extracted spatial features and a reset gate in the LSTM to handle transitions between signs. Evaluated on 942 signed sentences involving 35 unique words, the model achieved an accuracy of 72.3% for sentences and 89.5% for isolated sign words, significantly outperforming traditional LSTM architectures.

Kothadiya et al. propose a deep learning model combining LSTM and GRU for Indian Sign Language recognition [5], achieving 97% accuracy on the custom IISL2020 dataset. Using InceptionResNetV2 for feature extraction, the model processes video frames under natural conditions without specialized equipment, enabling real-time communication for speech and hearing-impaired individuals.

Raheja et al. [6] present an Indian Sign Language recognition system using Support Vector Machines (SVM) for classifying dynamic hand gestures. The approach pre-processes video frames in HSV colour space, extracts features such as Hu moments and hand trajectory, and uses an SVM for classification. Tested with MS Kinect and webcams, the system achieves a recognition accuracy of 97.5% for four signs. This real-time solution is aimed at enhancing communication for hearing and speech-impaired individuals.

Adithya et al. [7] propose a vision-based method for recognizing Indian Sign Language using artificial neural networks (ANN). The system employs image pre-processing, hand segmentation in YCbCr colour space, and feature extraction using distance transformation and Fourier descriptors. A feed-forward neural network trained with the extracted features achieves an average recognition accuracy of 91.11% across 36 signs, offering a non-intrusive and computationally efficient approach to assist communication for hearing-impaired individuals.

The paper by Yogeshwar I. Rokade and Prashant M. Jadav presents a vision-based Indian Sign Language (ISL) recognition system leveraging Artificial Neural Networks (ANN) and Support Vector Machines (SVM) [8]. The method includes skin segmentation, binary conversion, and feature extraction through Euclidean distance transformation, central moments, and Hu moments. Using ANN, the system achieves an accuracy of 94.37%, compared to 92.12% with SVM, demonstrating its capability to effectively recognize 17 ISL alphabets, enhancing communication for hearing and speech-impaired individuals.

The paper by Divya Deora and Nikesh Bajaj proposes an Indian Sign Language (ISL) recognition system using Principal Component Analysis (PCA) [9]. It incorporates a fingertip detection algorithm to improve feature selection. Images are segmented using colour masks for red and blue gloves. The system achieved 94% accuracy for static signs, with future proposals suggesting neural networks and integrating fingertip data with PCA to enhance robustness and handle motion-based gestures.

The paper by Piyusha Vyavahare, Sanket Dhawale, Priyanka Takale, Vikrant Koli, Bhavana Kanawade, and Shraddha Khonde [10] proposes an Indian Sign Language (ISL) recognition system using Long Short-Term Memory (LSTM) networks. It utilizes computer vision algorithms for feature extraction and employs LSTM networks to capture temporal dependencies in dynamic gestures. A custom dataset of 40 ISL actions was created, including annotated video recordings of native ISL users. The system achieved a training accuracy of 96% and a test accuracy of 87% for recognizing individual words. Future directions include developing sentence-level recognition, addressing environmental variations, and expanding the dataset to improve robustness and generalizability.

The paper by Ahmed Mateen Buttar, Usama Ahmad, Abdu H. Gumaei, Adel Assiri, Muhammad Azeem Akbar, and Bader Fahad Alkhamees [11] proposes a hybrid approach for American Sign Language (ASL) recognition using LSTM and YOLOv6 models. The study combines dynamic sign detection with a skeleton-based LSTM for temporal gesture recognition and static sign recognition with YOLOv6. A custom dataset of ASL signs, including static and dynamic gestures, was used, achieving 92% accuracy for dynamic signs with the LSTM and 96% for static signs with YOLOv6. The authors highlight the advantages of the hybrid approach in enhancing recognition accuracy for different sign types and propose future work integrating fuzzy logic to handle noise and lighting variations.
Table 1 presents a comparative analysis of the approaches used for sign language recognition in terms of methodology and average accuracy.

Table 1: Comparative analysis of various learning models/methods.

Author | Methodology | Accuracy
Wadhawan & Kumar | CNN | 99.90% (train), 98.70% (val)
Mittal | Modified LSTM with Leap Motion | 72.3% (sentences), 89.5% (words)
Kothadiya et al. | LSTM-GRU hybrid | 97%
Raheja et al. | SVM | 97.5%
Adithya et al. | ANN | 91.11%
Rokade & Jadav | ANN and SVM | 94.37% (ANN), 92.12% (SVM)
Deora & Bajaj | PCA with fingertip detection | 94%

From Table 1, it is clear that the majority of researchers have used deep learning-based approaches to obtain better results for sign language recognition.
III. MATERIAL AND TOOLS
To provide a comprehensive understanding of the materials and tools used in the paper "Sign Language Recognition System Using Deep Learning", the following sections detail the various software libraries, frameworks, and tools employed in the development and implementation of the sign language recognition system.

• PyTorch: A powerful deep learning library with dynamic computation graph support, making it easier to implement deep learning models.
• Keras: A high-level API built on TensorFlow, used for creating the network architecture with minimal code.
• Computer Vision Libraries: OpenCV, used for capturing video input, image processing, and augmenting gesture data for training.
• MediaPipe: A framework by Google for real-time hand tracking and landmark detection, which can assist in gesture recognition.
• Data Preprocessing and Augmentation: NumPy, for numerical computations, such as normalizing pixel values and reshaping data.
• Pandas: For organizing and manipulating datasets.
• Visualization Tools: Matplotlib, for visualizing data distributions, training metrics (accuracy, loss), and results.

IV. METHODOLOGY

Data Collection: The system captures real-time video frames from a webcam and stores them in categorized directories, labeled from 'A' to 'Z' (26 alphabets in total). The system dynamically counts the number of images in each directory and assigns a unique filename to each saved image, ensuring no duplication. The images are stored in specific folders corresponding to the user-selected categories, which can be triggered by pressing a key (from 'a' to 'z'). For every key pressed, the captured frame is stored in the appropriate folder, and the filename reflects the current count of images in that folder.

A designated region of interest (ROI) is displayed, which shows a focused area of the video frame. The frame is captured from the ROI, providing the user with an updated preview of what is being saved. As the system continues to run, it provides real-time feedback to the user by displaying the number of images in each directory on the screen. This ensures users can track how many frames have been collected for each category. The system continues to capture and store frames until manually stopped, making it suitable for applications such as machine learning dataset creation and image analysis. Sample data images collected in this way are shown in figure 1.
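The collection script itself is not listed in the paper; the following is a minimal sketch of the capture loop described above, assuming OpenCV, the default webcam, a fixed ROI, and per-letter folders named 'A' to 'Z' under a hypothetical "dataset" root directory.

import os
import string
import cv2

DATA_DIR = "dataset"            # assumed root folder for the collected images
ROI = (100, 100, 400, 400)      # assumed region of interest: x1, y1, x2, y2

# Create one folder per letter ('A' to 'Z') if it does not exist yet.
for letter in string.ascii_uppercase:
    os.makedirs(os.path.join(DATA_DIR, letter), exist_ok=True)

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    x1, y1, x2, y2 = ROI
    roi = frame[y1:y2, x1:x2]
    # Draw the ROI and show how many images the 'A' folder already holds as on-screen feedback.
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    count_a = len(os.listdir(os.path.join(DATA_DIR, "A")))
    cv2.putText(frame, f"A: {count_a}", (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 2)
    cv2.imshow("Data collection", frame)

    key = cv2.waitKey(1) & 0xFF
    if key == 27:                            # Esc stops the capture loop
        break
    if ord('a') <= key <= ord('z'):          # pressing 'a'..'z' saves the current ROI
        folder = os.path.join(DATA_DIR, chr(key).upper())
        filename = os.path.join(folder, f"{len(os.listdir(folder))}.jpg")
        cv2.imwrite(filename, roi)           # filename reflects the current image count

cap.release()
cv2.destroyAllWindows()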
Model Training: The system utilizes a deep learning model for gesture recognition, leveraging LSTM (Long Short-Term Memory) networks to process sequential data from hand landmarks. The labeled dataset is prepared by extracting key points from various hand gestures, which are then transformed into sequences of data. These sequences are divided into training and testing sets, with the labels one-hot encoded to match the categories of gestures.

A neural network is designed using a sequential architecture, consisting of three LSTM layers and three dense layers, to learn the temporal patterns in the data. The model is compiled with the Adam optimizer and the categorical cross-entropy loss function, optimizing the network for categorical accuracy. Training is performed for 200 epochs, with a TensorBoard callback to monitor the model's performance during the training process. After training, the model architecture is saved in JSON format, while the trained model's weights are stored in an H5 file for later use in predictions. This setup ensures that the trained model can be reloaded for prediction without retraining.
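The training code is not reproduced in the paper; the sketch below is a minimal model consistent with the description above, assuming 30-frame sequences of 63 hand-keypoint values (21 landmarks × 3 coordinates), ReLU activations in the hidden dense layers, and a 32-unit hidden dense layer inferred from the parameter counts derived below.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import TensorBoard

SEQ_LEN, N_FEATURES, N_CLASSES = 30, 63, 26   # assumed sequence length; 21 landmarks x 3 coords; 'A'-'Z'

model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(SEQ_LEN, N_FEATURES)),
    LSTM(128, return_sequences=True),
    LSTM(64, return_sequences=False),
    Dense(64, activation="relu"),
    Dense(32, activation="relu"),
    Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="Adam", loss="categorical_crossentropy", metrics=["categorical_accuracy"])
model.summary()   # reproduces the per-layer parameter counts listed below

# X_train: (num_sequences, SEQ_LEN, N_FEATURES), y_train: one-hot labels (num_sequences, N_CLASSES)
# model.fit(X_train, y_train, epochs=200, callbacks=[TensorBoard(log_dir="logs")])

# Persist the architecture as JSON and the weights as H5 for later inference.
# with open("model.json", "w") as f:
#     f.write(model.to_json())
# model.save_weights("model.h5")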
LSTM Layer Parameters: The total parameters (PLSTM) of an LSTM layer are calculated using the formula:

PLSTM = 4 × (nunits × (ninputs + nunits) + nunits) (1)

where ninputs is the number of input features to the layer and nunits is the number of units in the layer. The total number of parameters for each LSTM layer is given below:

First LSTM Layer (ninputs = 63, nunits = 64): PLSTM = 4 × (64 × (63 + 64) + 64) = 32,768.
Second LSTM Layer (ninputs = 64, nunits = 128): PLSTM = 4 × (128 × (64 + 128) + 128) = 98,816.
Third LSTM Layer (ninputs = 128, nunits = 64): PLSTM = 4 × (64 × (128 + 64) + 64) = 49,408.

Dense Layer Parameters: The total parameters (PDense) of a Dense layer are calculated using the formula:

PDense = (ninputs × nunits) + nunits (2)

where ninputs is the number of input features to the layer and nunits is the number of units in the layer. The total number of parameters for each Dense layer is given below:

First Dense Layer (ninputs = 64, nunits = 64): PDense = (64 × 64) + 64 = 4,160.
Second Dense Layer (ninputs = 64, nunits = 32): PDense = (64 × 32) + 32 = 2,080.
Third Dense Layer (ninputs = 32, nunits = 26): PDense = (32 × 26) + 26 = 858.
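As a quick check of the arithmetic above, both formulas can be evaluated directly; the layer sizes are those assumed in the training sketch (the 32-unit hidden dense layer is inferred from the 858-parameter output layer).

def lstm_params(n_inputs: int, n_units: int) -> int:
    # Equation (1): four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (n_units * (n_inputs + n_units) + n_units)

def dense_params(n_inputs: int, n_units: int) -> int:
    # Equation (2): one weight per input-output pair plus one bias per unit.
    return n_inputs * n_units + n_units

assert lstm_params(63, 64) == 32_768     # first LSTM layer
assert lstm_params(64, 128) == 98_816    # second LSTM layer
assert lstm_params(128, 64) == 49_408    # third LSTM layer
assert dense_params(64, 64) == 4_160     # first dense layer
assert dense_params(64, 32) == 2_080     # second dense layer
assert dense_params(32, 26) == 858       # output layer ('A' to 'Z')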
Real-Time Hand Gesture Recognition and Prediction: After training, the model is integrated into the real-time gesture recognition system. Sample outputs of the real-time prediction are shown in figure 4. The key steps in real-time recognition include capturing webcam frames, extracting hand key points from each frame, forming fixed-length sequences, and passing these sequences to the trained LSTM model to predict the signed letter.
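The real-time loop is not listed in the paper; the sketch below illustrates these steps, assuming MediaPipe Hands for landmark extraction, a 30-frame sliding window, and the model.json / model.h5 files produced by the training sketch above.

import cv2
import numpy as np
import mediapipe as mp
from tensorflow.keras.models import model_from_json

LABELS = [chr(c) for c in range(ord("A"), ord("Z") + 1)]
SEQ_LEN = 30                                       # assumed sequence length

with open("model.json") as f:                      # assumed file names from the training step
    model = model_from_json(f.read())
model.load_weights("model.h5")

mp_hands = mp.solutions.hands

def extract_keypoints(results):
    # Flatten the 21 detected hand landmarks into a 63-value vector (zeros if no hand is found).
    if results.multi_hand_landmarks:
        hand = results.multi_hand_landmarks[0]
        return np.array([[p.x, p.y, p.z] for p in hand.landmark]).flatten()
    return np.zeros(21 * 3)

sequence = []
cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        sequence.append(extract_keypoints(results))
        sequence = sequence[-SEQ_LEN:]             # keep only the most recent 30 frames
        if len(sequence) == SEQ_LEN:
            probs = model.predict(np.expand_dims(sequence, axis=0), verbose=0)[0]
            cv2.putText(frame, LABELS[int(np.argmax(probs))], (10, 40),
                        cv2.FONT_HERSHEY_SIMPLEX, 1.5, (0, 255, 0), 2)
        cv2.imshow("Sign language recognition", frame)
        if cv2.waitKey(1) & 0xFF == 27:            # Esc quits
            break
cap.release()
cv2.destroyAllWindows()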
V. EXPERIMENTS AND RESULTS

Figure 1 shows samples of the images used in the paper "Sign Language Recognition using Deep Learning"; in total, 1300 images of alphabets were used in this work. A webcam was used to capture images of hand gestures representing each letter of the English alphabet. A designated area on the screen, known as the Region of Interest (ROI), helped ensure consistency in gesture positioning. By pressing specific keys, users could save images of gestures to corresponding folders for each letter.

The organization of the collected data is shown in figure 5. This structured approach provides a well-organized dataset suitable for training and testing machine learning models for sign language recognition. The created dataset has been uploaded to Kaggle, and anyone can access it for research purposes through the following link: https://www.kaggle.com/datasets/rahulsh123/sign-language-recognition-dataset [1].

Average recognition accuracy (%) per alphabet, J–R: 97.93, 99.93, 94.54, 87.52, 92.45, 86.42, 82.49, 93.44; S–Z: 92.47, 94.33, 92.41, 86.45, 80.55, 82.47, 86.42, 90.61, 88.24.