Real-Time Hand Gesture Recognition
Alok Kumar
MCA Student, School of Computing, GEHU, Dehradun
Abstract: Hand gestures are a form of nonverbal communication that can be used in several
fields such as communication between deaf-mute people, robot control, human–computer
interaction (HCI), home automation and medical applications.
This paper explores the techniques and methodologies involved in recognizing hand gestures,
focusing on key stages such as hand detection, feature extraction, and gesture classification.
Research on hand gestures has adopted many different techniques, including those based on
instrumented sensor technology and computer vision. Hand signs can also be classified under
several headings, such as posture and gesture, dynamic and static, or a hybrid of the two.
This paper reviews the literature on hand gesture techniques and introduces their merits and
limitations under different circumstances. In addition, it tabulates the performance of these
methods, focusing on computer vision techniques, and compares their similarities and
differences, the hand segmentation technique used, the classification algorithms and their
drawbacks, the number and types of gestures, the dataset used, the detection range (distance)
and the type of camera used. The paper thus offers a general overview of hand gesture
methods together with a brief discussion of some possible applications.
Keywords: hand gesture; hand posture; computer vision; human–computer interaction (HCI)
1. Introduction
Hand gestures are an aspect of body language that can be conveyed through the center of the
palm, the finger positions and the shape constructed by the hand. Hand gestures can be
classified as static or dynamic: a static gesture refers to a stable hand shape, whereas a
dynamic gesture comprises a series of hand movements, such as waving. There is also
variation within a given gesture; for example, a handshake varies from one person to another
and changes according to time and place. The main difference between posture and gesture is
that posture focuses more on the shape of the hand, whereas gesture focuses on the hand
movement. The main approaches to hand gesture research can be classified into the wearable
glove-based sensor approach and the camera vision-based sensor approach.
In recent years, the Internet of Things (IoT) paradigm has been widely adopted to develop
wearable IoT systems, thanks to the ever-increasing interest in many fields of application,
such as health care monitoring, fitness tracking, gaming, and virtual and augmented reality, to
name a few. According to Grand View Research, smart wearable technology is driving
industry growth.
1.1 Background
Human activity recognition (HAR) has gained a lot of attention in recent years because of its
broad range of real-world applications. Early diagnosis, rehabilitation, and patient assistance
can be supported in medical decision processes for healthcare monitoring purposes. Industrial
applications, gaming, and sport/fitness tracking are of great interest as well. Two main
approaches are leveraged for HAR: camera-based and sensor-based recognition. Cameras and
inertial sensors allow a set of daily human activities to be detected via computer vision
techniques and acceleration/location sensors.
1.2 Deep learning
The earliest approaches to HAR come from learning theory and leverage classification
algorithms (such as support vector machines) to identify patterns. Since HAR can be cast as a
multi-class classification problem, this approach showed promising results. However, a
limitation of this type of method is that the feature selection phase, carried out before the
classification algorithm is applied, must be performed manually, typically by a domain
expert.
1.3 Motivation
Deploying HAR systems on low-power devices has many practical applications:
Healthcare: Real-time activity monitoring for patients.
Fitness: Tracking and analyzing physical activities.
Industrial safety: Detecting falls or unusual activity in hazardous environments.
The need for energy-efficient and accurate deep learning (DL) models drives this research.
1.4 TensorFlow and Keras
TensorFlow (TF) is an end-to-end machine learning platform developed by Google. TF
provides tools and libraries to process and load data, build custom models or leverage
existing ones, and run and deploy models in several environments, including production
systems. TF also provides TensorFlow Lite (TFL), a library for deploying network models on
mobile devices, microcontrollers, and edge devices. In particular, TFL allows a base TF
model to be converted into a compressed version via the so-called TFLite converter. Keras is
a deep learning framework built on top of TF (version 2.12.1), which provides a Python API
that simplifies network model creation and experimentation. In this work, we used Keras, TF,
and TFL to carry out our investigation.
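As a brief illustration of this conversion step, the following Python sketch converts a small Keras model into a compressed TFLite model; the model architecture and file names are placeholders chosen for illustration, not the models used in this work.

import tensorflow as tf
from tensorflow import keras

# A small placeholder Keras model; the layers are illustrative only.
model = keras.Sequential([
    keras.layers.Input(shape=(300, 300, 3)),
    keras.layers.Conv2D(16, 3, activation="relu"),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(26, activation="softmax"),  # e.g., one class per alphabet sign
])

# Convert the base TF/Keras model into a compressed TFLite version.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # default size/latency optimizations
tflite_model = converter.convert()

# Save the resulting flat buffer so it can be deployed on mobile or edge devices.
with open("gesture_model.tflite", "wb") as f:
    f.write(tflite_model)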
1.5 Scope
The scope of this work covers the use of computer vision and deep learning techniques for
hand gesture recognition, targeting applications such as sign language interpretation, AR/VR
control and gesture-based robotics, within the constraints imposed by the data collection
method and the available hardware.
The concept can be extended to blind persons or those with other disabilities to help them
access everyday necessities. It can also be applied to home automation, sign language
interpretation and consumer electronics, for example to unlock smartphones.
2. Literature Review
American Sign Language (ASL) is not merely a language of hand gestures; it is a rich,
complex language that employs signs made with the hands, movements of the face, and body
postures. It is the primary language of many people who are deaf or hard of hearing and is
used by a significant population worldwide. The importance of ASL lies in its role as a
critical communication tool, offering deaf individuals a means of education, expression, and
community building. ASL is also recognized for its contribution to linguistic studies, social
inclusion, cultural diversity, and the promotion of equal opportunities for the deaf
community.
Almost every virtual assistant available today is voice automated, which makes it unusable
by deaf-mute people and people with certain disabilities. This leads to the need for a system
that can help people with speaking or hearing disabilities make use of such virtual personal
assistants. Artificial Neural Networks are used in the majority of cases where static
recognition is performed, but they have drawbacks related to the efficiency of recognizing
distinctive features from images, which can be addressed by using Convolutional Neural
Networks. Compared to its predecessors, a Convolutional Neural Network recognizes
important distinctive features more efficiently and without any human supervision.
Many systems are designed in such a way that their application is limited to only a certain
sign language or series of hand gestures, whereas the proposed system is designed to give the
flexibility of switching to any standard sign language simply by changing the dataset and
training the model accordingly.
2.1 Categories of Hand Gesture Recognition
(a) Static Gesture Recognition
Techniques for recognizing gestures represented by a single frame/image (e.g., hand
signs in sign language).
Examples of traditional methods like contour-based or feature-based techniques.
Modern approaches using deep learning (e.g., CNNs).
(b) Dynamic Gesture Recognition
Techniques for recognizing sequences of gestures over time (e.g., waving, pointing).
Use of models like RNNs, LSTMs, GRUs, or transformers for temporal sequence
modelling.
(c) Combined Recognition
Hybrid approaches that integrate static and dynamic gesture recognition for
comprehensive systems.
2.2 Recognition Algorithms
(a) Traditional Machine Learning
Earlier techniques include Support Vector Machines (SVMs), k-Nearest Neighbors
(kNN), Decision Trees, and Hidden Markov Models (HMMs).
These methods face limitations in scalability and accuracy.
(b) Deep Learning
Convolutional Neural Networks (CNNs) for spatial data processing (a minimal sketch
follows this list).
Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and
transformers for dynamic gestures.
Combined CNN-RNN architectures for both static and dynamic gestures.
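As an illustration of the deep learning route above, the following is a minimal Keras sketch of a CNN classifier for static gesture images; the input size, layer sizes and the assumption of 26 alphabet classes are illustrative, not the exact architecture used later in this work.

import tensorflow as tf
from tensorflow import keras

num_classes = 26  # assumption: one class per alphabet sign

# A small CNN for static gesture classification; layer sizes are illustrative only.
model = keras.Sequential([
    keras.layers.Input(shape=(300, 300, 3)),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_data=(val_images, val_labels), epochs=10)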
2.3 Real-Time Gesture Recognition Challenges
Discussion of challenges in achieving real-time performance:
o Latency and inference time.
o Limited computational resources for embedded or edge devices.
o Dynamic environmental conditions (e.g., lighting, occlusion).
3. Data acquisition
A real-time sign language detection system is being developed for Indian Sign Language.
For data acquisition, images are captured by a webcam using Python and OpenCV. OpenCV
provides functions that are primarily aimed at real-time computer vision. It accelerates the
use of machine perception in commercial products and provides a common infrastructure for
computer vision-based applications. The OpenCV library has more than 2500 efficient
computer vision and machine learning algorithms, which can be used for face detection and
recognition, object identification, classification of human actions, tracking camera and object
movements, extracting 3D object models, and much more.
For every alphabet, 250 images are captured to build the dataset. The images are captured
every second, giving time to record the gesture with a slight difference each time, and a break
of five seconds is given between two individual signs, i.e., a five-second interval is provided
to change from the sign of one alphabet to the sign of a different alphabet. The captured
images are stored in their respective folders.
For data acquisition, dependencies such as cv2 (i.e., OpenCV), os and time have been
imported. The os dependency is used to help work with file paths; it is one of the standard
utility modules of Python and provides functions for interacting with the operating system.
With the help of the time module in Python, time can be represented in multiple ways in
code, such as objects, numbers and strings. Apart from representing time, it can be used to
measure code efficiency or to wait during code execution.
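A minimal sketch of this acquisition loop is given below. The timing values and the number of images per sign mirror the description above, while the folder layout, label subset and file names are illustrative assumptions.

import cv2
import os
import time

labels = ["A", "B", "C"]        # illustrative subset of the 26 alphabet signs
images_per_label = 250          # 250 images are captured for every sign
capture = cv2.VideoCapture(0)   # open the default webcam

for label in labels:
    folder = os.path.join("dataset", label)
    os.makedirs(folder, exist_ok=True)
    print(f"Collecting images for sign '{label}'")
    time.sleep(5)               # five-second break to change to the next sign
    for i in range(images_per_label):
        ret, frame = capture.read()
        if not ret:
            continue
        cv2.imwrite(os.path.join(folder, f"{label}_{i}.jpg"), frame)
        cv2.imshow("Data acquisition", frame)
        cv2.waitKey(1)
        time.sleep(1)           # one-second gap so each capture differs slightly

capture.release()
cv2.destroyAllWindows()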
Once all the images have been captured, they are labelled one by one using the Labelling
package, a free open-source tool for graphically labelling images. The hand gesture portion
of the image is enclosed in a box and labelled with what the gesture or sign represents, as
shown in Fig. 1 and Fig. 2. On saving a labelled image, an XML annotation file is created
containing the details of the image, including the labelled portion. After labelling all the
images, their XML files are available; these are used for creating the TF (TensorFlow)
records. All the images, along with their XML files, are then divided into training data and
validation data in the ratio 80:20. From the 250 images of every alphabet, 200 (80%) were
taken and stored as the training dataset and the remaining 50 (20%) were stored as the
validation dataset. This task was performed for all the images of all 26 alphabets and the
other sign gestures. This research aims to create a model that recognizes each gesture and
combines them into fingerspelling-based hand gestures. The gestures that this project wishes
to teach are depicted in the figures below.
Fig. 1: Sign language database image
Fig. 2: Image of data collection
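As a rough sketch of the 80:20 split described above (the directory names are assumptions; the paired .jpg/.xml files correspond to the labelling step), the images and their annotation files can be divided as follows.

import os
import random
import shutil

source_dir = "dataset/A"          # illustrative: one alphabet folder at a time
train_dir = "dataset/train"
val_dir = "dataset/val"
os.makedirs(train_dir, exist_ok=True)
os.makedirs(val_dir, exist_ok=True)

# Collect the image names and shuffle them before splitting 80:20.
images = [f for f in os.listdir(source_dir) if f.endswith(".jpg")]
random.shuffle(images)
split = int(0.8 * len(images))    # 200 of 250 images go to training

for i, name in enumerate(images):
    target = train_dir if i < split else val_dir
    xml_name = name.replace(".jpg", ".xml")   # annotation produced by the labelling tool
    shutil.copy(os.path.join(source_dir, name), target)
    shutil.copy(os.path.join(source_dir, xml_name), target)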
4. Methodology
The basic workflow of the system is as follows. A hand gesture is performed in front of the
webcam. This sign gesture is converted to text, the text output is converted to audio, and the
audio is served as input to the assistant. The assistant processes the question and responds in
audio format. This audio is converted back to text, and the text output is then displayed on
the display screen.
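The two conversion steps in this workflow can be sketched as follows. The choice of pyttsx3 for text-to-speech and the SpeechRecognition package (with the Google web recognizer) for speech-to-text is an assumption made for illustration; the paper does not prescribe particular libraries.

import pyttsx3                   # text-to-speech (assumed library choice)
import speech_recognition as sr  # speech-to-text (assumed library choice)

def text_to_audio(text):
    # Speak the recognized gesture text so it can serve as input to a voice assistant.
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

def audio_to_text():
    # Capture the assistant's spoken reply from the microphone and transcribe it.
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)

text_to_audio("What is the weather today")  # example recognized gesture sentence
print(audio_to_text())                      # display the transcribed response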
4.1 Data Collection
The data was collected through a hardware webcam into labelled categories of sign language.
Each category has at least 300 images of 300x300 pixels. Some examples are given in Table 1.
Table 1: Categories based on gesture
Gesture ID | Label
1 | A
2 | B
3 | Good luck
4 | I love you
Each input image is subjected to image alteration by the algorithm to produce the processed
images. For this, the basic threshold histogram of the hand is computed.
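A minimal sketch of this step is shown below. It assumes a grayscale intensity histogram combined with Otsu's method as the thresholding strategy and an illustrative image path; the exact procedure used in the system may differ.

import cv2

image = cv2.imread("dataset/A/A_0.jpg")   # illustrative path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Histogram of pixel intensities for the hand image (256 bins over [0, 256)).
hist = cv2.calcHist([gray], [0], None, [256], [0, 256])

# Otsu's method picks a global threshold from the intensity distribution automatically.
ret, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print("Selected threshold:", ret)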
A. OpenCV
OpenCV is an open-source library for computer vision. Once the training and classification
components are ready to be executed, the designed system needs an "eye" to capture real-time
images of hand gestures, which can then be sent for classification and identification; OpenCV
provides this capability and supplies the image processing and visualization support used by
the deep learning models.
B. Deep Learning
Deep learning is a subset of machine learning consisting of algorithms that make use of
multilayer neural networks. Deep learning most often uses neural networks to implement its
functionality. A neural network is a collection of layers that transform the input in some way
to produce an output.
C. TensorFlow
An advantage of using the TensorFlow library is that it is an open-source library with many
predesigned models that are useful in machine learning and especially deep learning.
Understanding the conceptual use of TensorFlow requires understanding two terms: a tensor
is an N-dimensional array, and flow refers to a graph of operations. Every mathematical
computation in TensorFlow is treated as a graph of operations, where the nodes in the graph
are operations and the edges are tensors.
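To make this terminology concrete, the toy example below builds a tiny computation in which the addition and matrix-multiplication operations are the graph nodes and the constant tensors flowing between them are the edges; the values are arbitrary.

import tensorflow as tf

# Two constant tensors (N-dimensional arrays) flow along the edges of the graph.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

# Each operation below is a node in the computation graph.
added = tf.add(a, b)        # element-wise addition
product = tf.matmul(a, b)   # matrix multiplication

print(added.numpy())
print(product.numpy())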
4.2 Workflow of the Project
Fig. 3: Workflow of the project
The first step in implementing the suggested system involves collecting data; researchers
typically make use of sensors or cameras to capture and process images of hand movements.
A. Input Data from Webcam
Images of hand movements made while performing hand gestures and American Sign
Language (ASL) gestures are captured through a webcam and serve as input data for the
proposed system. The dataset includes a total of 2000 images of ASL gestures and other hand
movements, which have been separated into training and test datasets.
B. Image Processing
Noise Removal: filters (e.g., Gaussian or median filters) are applied to reduce noise in the
image.
Normalization: standardizes the pixel values to ensure consistency in brightness and contrast.
Background Subtraction: separates the foreground (hand) from the background.
Segmentation: identifies and isolates the hand region using techniques like skin-color
detection, contour detection, or deep learning-based segmentation (a minimal sketch of these
stages follows this list).
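A rough OpenCV sketch of these preprocessing stages is given below. The image path, filter size and skin-colour HSV range are illustrative assumptions rather than tuned values.

import cv2
import numpy as np

frame = cv2.imread("dataset/A/A_0.jpg")      # illustrative input image

# Noise removal with a Gaussian filter.
blurred = cv2.GaussianBlur(frame, (5, 5), 0)

# Normalization: scale pixel values to [0, 1] for consistent brightness/contrast.
normalized = blurred.astype(np.float32) / 255.0

# Segmentation via a simple skin-colour mask in HSV space (range is an assumption).
hsv = cv2.cvtColor(blurred, cv2.COLOR_BGR2HSV)
lower_skin = np.array([0, 30, 60], dtype=np.uint8)
upper_skin = np.array([20, 150, 255], dtype=np.uint8)
mask = cv2.inRange(hsv, lower_skin, upper_skin)

# Keep only the hand region (foreground) by applying the mask to the image.
hand_region = cv2.bitwise_and(frame, frame, mask=mask)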
C. Classification
Traditional Classifiers: use models like Support Vector Machines (SVM), k-Nearest
Neighbors (k-NN), or Decision Trees.
Deep Learning Models: employ advanced architectures like CNNs, RNNs, or
transformer-based models to classify gestures.
Hybrid Models: combine CNNs (feature extraction) and RNNs (sequence recognition) for
dynamic gestures (see the sketch after this list).
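For the hybrid option above, a rough Keras sketch of a per-frame CNN wrapped in TimeDistributed layers and followed by an LSTM is given below; the sequence length, frame size and class count are illustrative assumptions.

import tensorflow as tf
from tensorflow import keras

seq_len, height, width, channels = 16, 64, 64, 3   # assumed clip of 16 resized frames
num_classes = 10                                    # assumed number of dynamic gestures

model = keras.Sequential([
    keras.layers.Input(shape=(seq_len, height, width, channels)),
    # CNN applied to every frame of the sequence to extract spatial features.
    keras.layers.TimeDistributed(keras.layers.Conv2D(16, 3, activation="relu")),
    keras.layers.TimeDistributed(keras.layers.MaxPooling2D()),
    keras.layers.TimeDistributed(keras.layers.Flatten()),
    # LSTM models the temporal evolution of the per-frame features.
    keras.layers.LSTM(64),
    keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])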
5. Experimental Evaluation
5.1 Dataset and Experimental Setup
The dataset is created for Indian Sign Language, where the signs are the alphabets of the
English language. The dataset is created following the data acquisition method described in
Section 3. The experimentation was carried out on a system with an Intel i3 7th generation
2.70 GHz processor, 8 GB memory and a webcam (HP TrueVision HD camera with 0.31 MP
and 640x480 resolution), running the Windows 10 operating system. The programming
environment includes Python (version 3.11.7), Jupyter Notebook, OpenCV (version 4.2.0)
and the TensorFlow (version 2.12.1) Object Detection API.
5.2 Results and Discussion
The developed system is able to detect Sign Language alphabets and hand gestures in real
time. The system has been created using the TensorFlow Object Detection API. The pre-
trained model taken from the TensorFlow model zoo is SSD MobileNet v2 320x320. It has
been trained using transfer learning on the created dataset, which contains 2000 images in
total, with 300 images for each alphabet and gesture.
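As a hedged illustration of how such a fine-tuned detector can be run in real time, the loop below assumes the trained model has been exported as a TensorFlow SavedModel; the paths, the 0.5 confidence threshold and the omission of label-map handling are simplifications for illustration.

import cv2
import numpy as np
import tensorflow as tf

# Load the exported detection model (the path is an assumption for illustration).
detect_fn = tf.saved_model.load("exported_model/saved_model")

cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # The exported detector expects a batched uint8 RGB tensor.
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    input_tensor = tf.convert_to_tensor(rgb[np.newaxis, ...], dtype=tf.uint8)
    detections = detect_fn(input_tensor)

    boxes = detections["detection_boxes"][0].numpy()
    scores = detections["detection_scores"][0].numpy()
    h, w, _ = frame.shape
    for box, score in zip(boxes, scores):
        if score < 0.5:          # assumed confidence threshold
            continue
        ymin, xmin, ymax, xmax = box
        cv2.rectangle(frame, (int(xmin * w), int(ymin * h)),
                      (int(xmax * w), int(ymax * h)), (0, 255, 0), 2)

    cv2.imshow("Sign detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()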
6. Conclusion and Future Works
Sign languages are visual languages that employ movements of the hands, body and facial
expressions as a means of communication. Sign languages are important for specially-abled
people to have a means of communication; through them, they can communicate, express
themselves and share their feelings with others. The drawback is that not everyone possesses
knowledge of sign languages, which limits communication. This limitation can be overcome
by automated Sign Language Recognition systems, which can translate sign language
gestures into a commonly spoken language. In this paper, this has been done using the
TensorFlow Object Detection API. The system has been trained on the Indian Sign Language
alphabet dataset and detects sign language in real time. For data acquisition, images have
been captured by a webcam using Python and OpenCV, which keeps the cost low. The
developed system shows an average confidence rate of 85.45%.
Though the system has achieved a high average confidence rate, the dataset it has been
trained on is small in size and limited. In the future, the dataset can be enlarged so that the
system can recognize more gestures. The TensorFlow model that has been used can be
interchanged with another model as well. The system can be implemented for different sign
languages by changing the dataset.
Sign Language Recognition: Hand gesture recognition (HGR) will make communication
easier for people with hearing and speech impairments, enabling real-time translation of
sign languages.
Rehabilitation Programs: Gestures can track patient movements for physical therapy,
ensuring exercises are performed correctly.
Assistive Devices: Gesture recognition can enhance devices for disabled individuals, such
as wheelchair controls, prosthetics, and robotic arms.
Robot Control: Robots in manufacturing, healthcare, or personal assistance can be
operated using hand gestures for intuitive human-robot interaction.
Drones: Gesture recognition can simplify drone navigation, allowing users to control
drones with simple hand movements.
Autonomous Systems: Cars or machines can incorporate gesture recognition to
improve user safety and convenience, such as managing vehicle controls without
physical buttons.
REFERENCES:
1. Yusnita, L., Rosalina, R., Roestam, R. and Wahyu, R., 2017. Implementation of Real-
Time Static Hand Gesture Recognition Using Artificial Neural Network. CommIT
(Communication and Information Technology) Journal, 11(2), p.85
2. Kapur, R.: The Types of Communication. MIJ. 6, (2020).
3. Suharjito, Anderson, R., Wiryana, F., Ariesta, M.C., Kusuma, G.P.: Sign Language
Recognition Application Systems for Deaf-Mute People: A Review Based on
InputProcess-Output. Procedia Comput. Sci. 116, 441–448 (2017).
https://doi.org/10.1016/J.PROCS.2017.10.028.
4. Dutta, K.K., Bellary, S.A.S.: Machine Learning Techniques for Indian Sign Language
Recognition. Int. Conf. Curr. Trends Comput. Electr. Electron. Commun. CTCEEC
2017. 333–336 (2018). https://doi.org/10.1109/CTCEEC.2017.8454988.
5. Bragg, D., Koller, O., Bellard, M., Berke, L., Boudreault, P., Braffort, A., Caselli, N.,
Huenerfauth, M., Kacorri, H., Verhoef, T., Vogler, C., Morris, M.R.: Sign Language
Recognition, Generation, and Translation: An Interdisciplinary Perspective. 21st Int.
ACM SIGACCESS Conf. Comput. Access. (2019). https://doi.org/10.1145/3308561.
6. Kumar, P., Gauba, H., Roy, P. P., & Dogra, D. P. (2018). A multi-sensor approach to
automate translation of Indian sign language. IEEE Sensors Journal, 18(12), 5169-
5176. https://doi.org/10.1109/JSEN.2018.2832648.
7. Rosero-Montalvo, P.D., Godoy-Trujillo, P., Flores-Bosmediano, E., Carrascal-Garcia, J.
8. S. Hussain, R. Saxena, X. Han, J. A. Khan and H. Shin, "Hand gesture recognition using
deep learning," 2017 International SoC Design Conference (ISOCC).