(Sign Language Detection) : ACE Engineering College
BACHELOR OF TECHNOLOGY
In
CSE (DATA SCIENCE)
Mk. Revan (22AG1A6739)
M. Rithvik (22AG1A6735)
N. Harikrishna (22AG1A6746)
DEPARTMENT OF CSE (DATA SCIENCE)
CERTIFICATE
This is to certify that the Real Time Project report entitled "Sign Language
Detection" is a bonafide work done by Mk. Revan, M. Rithvik, and N. Harikrishna,
bearing roll numbers 22AG1A6739, 22AG1A6735, and 22AG1A6746, in partial
fulfillment for the award of the Degree of BACHELOR OF TECHNOLOGY in CSE (Data
Science) from JNTUH University, Hyderabad, during the academic year 2023-2024.
This is a record of bonafide work carried out by them under our guidance and
supervision.
The results embodied in this report have not been submitted by the students to
any other University or Institution for the award of any degree or diploma.
ACKNOWLEDGEMENT
We would like to express our gratitude to all the people behind the screen who
have helped us transform an idea into a real-time application.
We would like to express our heartfelt gratitude to our parents, without whom
we would not have been privileged to achieve and fulfill our dreams.
A special thanks to our General Secretary, Prof. Y. V. Gopala Krishna Murthy,
for having founded such an esteemed institution. Sincere thanks to our Joint
Secretary, Mrs. M. Padmavathi, for her support in our project work. We are also
grateful to our beloved Principal, Dr. B. L. Raju, for permitting us to carry
out this project.
We profoundly thank Dr. P. Chiranjeevi, Associate Professor and Head of the
Department of Computer Science and Engineering (Data Science), who has been an
excellent guide and a great source of inspiration for our work.
We sincerely thank Mrs. B. Saritha and Mr. M. Hari Krishna, Assistant
Professors and Project Coordinators, who helped us in every way in fulfilling
all aspects of the completion of our Mini-Project.
We are very thankful to our internal guide, Mrs. A. Sarala Devi, who has been
an excellent guide and has given continuous support for the completion of our
project work.
The satisfaction and euphoria that accompany the successful completion of a
task would be great, but incomplete without the mention of the people who made
it possible, whose constant guidance and encouragement crown all efforts with
success. In this context, we would like to thank all the other staff members,
both teaching and non-teaching, who have extended their timely help and eased
our task.
Mk. Revan (22AG1A6739)
M. Rithvik (22AG1A6735)
N. Harikrishna (22AG1A6746)
SIGN LANGUAGE DETECTION
ABSTRACT
Sign Language is mainly used by deaf (hard of hearing) and dumb people to
exchange information within their own community and with other people. It is a
language in which people use hand gestures to communicate, as they cannot speak
or hear. Sign Language Recognition (SLR) deals with recognizing hand gestures,
starting from gesture acquisition and continuing until text or speech is
generated for the corresponding hand gestures. Hand gestures for sign language
can be classified as static and dynamic. Static hand gesture recognition is
simpler than dynamic hand gesture recognition, but both types of recognition
are important to the human community. We can use Deep Learning and Computer
Vision to recognize hand gestures by building Deep Neural Network architectures
(Convolutional Neural Network architectures), where the model learns to
recognize the hand gesture images over a number of epochs. Once the model
successfully recognizes a gesture, the corresponding English text is generated,
and the text can then be converted to speech. This model will be more
efficient, and hence communication for deaf (hard of hearing) and dumb people
will be easier. In this report, we discuss how Sign Language Recognition is
done using Deep Learning.
CONTENTS
1 INTRODUCTION
2 EXISTING SYSTEM
4 LITERATURE REVIEW
6 SOFTWARE AND HARDWARE REQUIREMENTS
7 SYSTEM ANALYSIS
8 METHODOLOGY
9 OVERALL STRUCTURE
10 CONCLUSION
1. INTRODUCTION
Deaf (hard of hearing) and dumb people use Sign Language (SL) [1] as their
primary means to express their ideas and thoughts, within their own community
and with other people, using hand and body gestures. It has its own vocabulary,
meaning, and syntax, which is different from spoken or written language. Spoken
language is produced by articulate sounds mapped to specific words and
grammatical combinations to convey meaningful messages. Sign language instead
uses visual hand and body gestures to convey meaningful messages. There are
somewhere between 138 and 300 different types of sign language used around the
globe today. In India, there are only about 250 certified sign language
interpreters for a deaf population of around 7 million. This makes it difficult
to teach sign language to deaf and dumb people, as only a limited number of
sign language interpreters exist today. Sign Language Recognition is an attempt
to recognize these hand gestures and convert them to the corresponding text or
speech. Today, Computer Vision and Deep Learning have gained a lot of
popularity, and many State-of-the-Art (SOTA) models can be built. Using Deep
Learning algorithms and Image Processing, we can classify these hand gestures
and produce the corresponding text. For example, the sign language gesture for
the alphabet "A" can be converted to the English text or speech "A".
2. EXISTING SYSTEM
1. Computer Vision: Many sign language detection systems use computer vision
techniques to analyze video input and identify hand gestures and movements.
They often rely on techniques like image segmentation, feature extraction, and
object detection to locate and track the signer's hands and other relevant features.
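To illustrate this kind of pipeline, the sketch below shows a minimal hand-localization step using OpenCV: skin-colour thresholding followed by contour detection. The HSV threshold values, morphological kernel, and webcam usage are illustrative assumptions rather than the settings of any particular existing system.

# A minimal sketch of a computer-vision hand-localization step (assumed values).
import cv2
import numpy as np

def locate_hand(frame):
    """Return the bounding box of the largest skin-coloured contour, if any."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Rough skin-tone range in HSV (assumed; adjust per camera and lighting).
    lower, upper = np.array([0, 30, 60]), np.array([20, 150, 255])
    mask = cv2.inRange(hsv, lower, upper)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    hand = max(contours, key=cv2.contourArea)
    return cv2.boundingRect(hand)  # (x, y, w, h) of the tracked hand region

cap = cv2.VideoCapture(0)          # webcam stream
ret, frame = cap.read()
if ret:
    print("hand bounding box:", locate_hand(frame))
cap.release()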
Such systems aim to enable real-time communication between signers and
non-signers. Low-latency processing is essential to facilitate smooth and
natural interactions between users.
Existing systems for sign language detection have made significant progress, but
they still face several drawbacks:
4. Real-time performance: Some systems may face challenges in achieving real-
time performance, especially when dealing with complex hand movements or
processing high-resolution video streams.
4. LITERATURE REVIEW
This work focuses on static fingerspelling in American Sign Language. One
reviewed method implements a sign-language-to-text/voice conversion system
without using handheld gloves and sensors, by capturing the gestures
continuously and converting them to voice; in this method, only a few images
were captured for recognition. Another work presents the design of a
communication aid for the physically challenged.
That system was developed in the MATLAB environment. It consists mainly of two
phases, namely the training phase and the testing phase. In the training phase,
the authors used feed-forward neural networks. The problem here is that MATLAB
is not very efficient, and integrating the concurrent attributes as a whole is
difficult.
Another work describes an American Sign Language interpreter system for deaf
and dumb individuals. The discussed procedures could recognize 20 out of 24
static ASL alphabets; the alphabets A, M, N, and S could not be recognized due
to the occlusion problem. The authors used only a limited number of images.
In Machine Learning, we have ensemble techniques where we train multiple
sub-models and average them; the Random Forest algorithm is an example, as it
uses multiple Decision Tree models. Similarly, we can build ensembles of
Neural Networks. There are many ensemble techniques for Neural Networks, such
as stacked generalization, ensemble learning via negative correlation, and
probabilistic modelling with Neural Networks. We have implemented the
Horizontal Voting Ensemble method, which averages the predictions of the
models saved over the last few training epochs, to improve the performance of
the neural network.
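As an illustration, the sketch below shows one way Horizontal Voting can be realized in PyTorch: the checkpoints saved at the last few epochs of a single run are loaded and their softmax outputs averaged. The checkpoint paths and the SignNet model class are hypothetical placeholders, not the exact code used in this project.

# A minimal sketch of Horizontal Voting over checkpoints from contiguous epochs.
import torch
import torch.nn.functional as F

def horizontal_vote(model_paths, make_model, images):
    """Average class probabilities from models saved at the last few epochs."""
    probs = None
    for path in model_paths:
        model = make_model()
        model.load_state_dict(torch.load(path, map_location="cpu"))
        model.eval()
        with torch.no_grad():
            p = F.softmax(model(images), dim=1)
        probs = p if probs is None else probs + p
    return (probs / len(model_paths)).argmax(dim=1)  # ensemble prediction

# Usage (assumed checkpoints from the last 5 epochs of one training run):
# paths = [f"checkpoints/epoch_{e}.pt" for e in range(26, 31)]
# preds = horizontal_vote(paths, lambda: SignNet(num_classes=29), batch_of_images)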
1. Expanded vocabulary: The proposed system could aim to recognize a wider range
of signs and gestures, including both common and less frequently used signs. This
would increase its usefulness and applicability in real-world scenarios.
6. SOFTWARE AND HARDWARE REQUIREMENTS
4. Developing Environment: You can use any code editor or integrated
development environment (IDE).
6. CPU: A modern CPU with multiple cores (e.g., Intel Core i5 or AMD Ryzen 5)
would provide adequate processing power.
8. RAM: Adequate RAM is essential, especially when working with large datasets
or complex models (at least 8 GB of RAM).
9. Storage: Sufficient storage space is required for storing datasets, code,
and model checkpoints. SSDs are preferred over HDDs for faster data access and
reduced loading times.
10. Webcam (optional): Most modern laptops come with built-in webcams, but you
can also use external webcams for desktop computers.
7. SYSTEM ANALYSIS
• Data Processing: The load_data.py script contains functions to load the raw
image data and save it as numpy arrays in file storage. The process_data.py
script loads the image data from data.npy and preprocesses each image by
resizing/rescaling it and applying filters and ZCA whitening to enhance
features. During training, the processed image data is split into training,
validation, and testing sets and written to storage. Training also uses a
load_dataset.py script that loads the relevant data split into a Dataset
class. To use the trained model for classifying gestures, an individual image
is loaded from the filesystem and processed in the same way.
• Training: The training loop for the model is contained in train_model.py.
The model is trained with hyperparameters obtained from a config file that
lists the learning rate, batch size, image filtering, and number of epochs.
The configuration used to train the model is saved along with the model
architecture for future evaluation and tweaking for improved results. Within
the training loop, the training and validation datasets are loaded as
DataLoaders, and the model is trained using the Adam optimizer with
cross-entropy loss. The model is evaluated on the validation set every epoch,
and the model with the best validation accuracy is saved to storage for
further evaluation and use. Upon finishing training, the training and
validation error and loss are saved to disk, along with a plot of error and
loss over training. A minimal sketch of this loop is given after this list.
• Classify Gesture: After a model has been trained, it can be used to classify
a new ASL gesture available as a file on the filesystem. The user inputs the
filepath of the gesture image, and the test_data.py script passes the filepath
to process_data.py to load and preprocess the file in the same way the
training data was processed.
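The following is a minimal sketch of the training loop described above, using the Adam optimizer with cross-entropy loss and keeping the checkpoint with the best validation accuracy. The dataset objects, hyperparameter values, and checkpoint filename are assumptions for illustration rather than the project's exact train_model.py.

# A minimal sketch of the training loop (Adam + cross-entropy, best-accuracy checkpoint).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_set, val_set, epochs=30, lr=1e-3, batch_size=64):
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_acc = 0.0
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        # Evaluate on the validation split every epoch.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                preds = model(images).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        acc = correct / total
        if acc > best_acc:  # keep the checkpoint with the best validation accuracy
            best_acc = acc
            torch.save(model.state_dict(), "best_model.pt")  # assumed filename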
8. METHODOLOGY
1. Data Collection:
- Gather a diverse dataset of sign language videos, covering various sign languages,
gestures, and signers.
- Ensure that the dataset includes annotations specifying the signs performed in each
video frame.
2. Preprocessing:
- Normalize the data to account for differences in scale, rotation, and perspective.
- Augment the dataset to increase its size and variability, for example, by applying
transformations like rotation, scaling, and flipping.
3. Model Selection:
4. Training:
- Split the dataset into training, validation, and test sets to evaluate model
performance.
- Train the selected model on the training data, optimizing the model
parameters to minimize a chosen loss function (e.g., cross-entropy loss).
- Monitor the model's performance on the validation set and adjust
hyperparameters as needed to prevent overfitting. (A small sketch of this
split-and-evaluate workflow, including the metrics listed under Evaluation,
follows this list.)
5. Evaluation:
- Evaluate the trained model on the test set to assess its performance in real-world
scenarios.
- Measure metrics such as accuracy, precision, recall, and F1-score to quantify the
model's effectiveness in detecting sign language gestures.
- Fine-tune the model based on the evaluation results, addressing any weaknesses or
areas for improvement identified during testing.
- Optimize the model for deployment, considering factors such as inference speed,
memory usage, and energy efficiency, especially for real-time applications.
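As a concrete reference for steps 4 and 5, the sketch below splits a dataset into training, validation, and test sets and reports accuracy, precision, recall, and F1-score with scikit-learn. The feature/label files and the placeholder classifier are assumptions for illustration; in this project the classifier would be the CNN described later.

# A minimal sketch of the split-train-evaluate workflow (assumed files, placeholder model).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.linear_model import LogisticRegression

X, y = np.load("features.npy"), np.load("labels.npy")   # assumed, flattened features

# Split into training, validation, and test sets (60/20/20).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # placeholder classifier

# Report the metrics named above on the held-out test set.
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="macro")
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")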
The primary source of data for this project was the compiled dataset of
American Sign Language (ASL) called the ASL Alphabet from Kaggle user Akash
[3]. The dataset comprises 87,000 images of 200x200 pixels. There are 29
classes in total, each with 3,000 images: 26 for the letters A-Z and 3 for
space, delete, and nothing. The data consists solely of the user Akash
gesturing in ASL, with the images taken from his laptop's webcam. These photos
were then cropped, rescaled, and labelled for use.
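A minimal sketch of loading this dataset is given below, assuming the Kaggle directory layout of one folder per class (e.g. A-Z, space, del, nothing) and using torchvision; the directory name and target image size are assumptions.

# A minimal sketch of loading the ASL Alphabet dataset from a class-per-folder layout.
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize((64, 64)),   # rescale the 200x200 images (target size assumed)
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("asl_alphabet_train", transform=preprocess)
print(len(dataset), "images in", len(dataset.classes), "classes")  # expect 87000, 29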
Figure 2: Examples of images from the Kaggle dataset used for training. Note
the difficulty of distinguishing fingers in the letter E.
A self-generated test set was created in order to investigate the neural
network's ability to generalize. Five different test sets of images were taken
with a webcam under different lighting conditions, backgrounds, and use of the
dominant/non-dominant hand. These images were then cropped and preprocessed.
The data preprocessing was done using the PILLOW image processing library and
the sklearn.decomposition library, which is useful for its matrix optimization
and decomposition functionality.
Figure 3: Examples of the signed letters A and T from test sets with differing
lighting and backgrounds: (a) nonuniform background, (b) plain white
background, (c) darker lighting, (d) plain white background.
Image Enhancement: A combination of brightness, contrast, sharpness, and color
enhancement was used on the images. For example, the contrast and brightness
were adjusted so that fingers could be distinguished even when the image was
very dark.
Edge Enhancement: Edge enhancement is an image filtering technique that makes
edges more defined. This is achieved by increasing the contrast in local
regions of the image that are detected as edges. It makes the border of the
hand and fingers, versus the background, much clearer and more distinct, which
can help the neural network identify the hand and its boundaries.
Image Whitening: ZCA, or image whitening, is a technique that uses the
singular value decomposition of a matrix. The algorithm decorrelates the data
and removes redundant, or obvious, information from it. This allows the neural
network to look for more complex and sophisticated relationships and to
uncover the underlying structure of the patterns it is being trained on. The
covariance matrix of the image is set to the identity, and the mean to zero.
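The sketch below illustrates two of the preprocessing steps named above: brightness/contrast enhancement with PILLOW and ZCA whitening via the singular value decomposition, here shown with NumPy's SVD rather than sklearn.decomposition for brevity. The enhancement factors, epsilon regularizer, and file path are assumed values for illustration.

# A minimal sketch of image enhancement (PILLOW) and ZCA whitening (NumPy SVD).
import numpy as np
from PIL import Image, ImageEnhance

img = Image.open("gesture.jpg")                    # path assumed
img = ImageEnhance.Contrast(img).enhance(1.5)      # boost contrast (factor assumed)
img = ImageEnhance.Brightness(img).enhance(1.2)    # brighten dark images (factor assumed)

def zca_whiten(images, eps=1e-5):
    """images: array of shape (n_samples, n_pixels), flattened grayscale images."""
    X = images - images.mean(axis=0)               # zero-mean the data
    cov = np.cov(X, rowvar=False)                  # pixel covariance matrix
    U, S, _ = np.linalg.svd(cov)                   # singular value decomposition
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T  # ZCA whitening matrix
    return X @ W                                   # decorrelated, whitened images

# Usage on a batch of flattened images (shape assumed):
# whitened = zca_whiten(batch.reshape(len(batch), -1))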
9. OVERALL STRUCTURE
A smaller model was first built in order to train faster and to establish a
baseline for problem complexity. This smaller model was built with only one
"block" of convolutional layers, consisting of two convolutional layers with
kernel sizes progressing from 5 x 5 to 10 x 10, ReLU activation, and the usual
max pooling and dropout. This fed into three fully connected layers which
output into the 29 classes of letters. The variation of the kernel sizes was
motivated by our dataset including the background, whereas the paper
preprocessed their data to remove the background. The design followed the
thinking that the first layer, with the smaller kernel, would capture smaller
features such as the hand outline, finger edges, and shadows, while the larger
kernel would capture combinations of the smaller features such as finger
crossings, angles, and hand location.
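A minimal sketch of this single-block architecture is given below. The channel counts, input resolution (64 x 64 RGB), dropout rate, and fully connected layer sizes are assumptions; the kernel-size progression from 5 x 5 to 10 x 10 and the 29 output classes follow the description above.

# A minimal sketch of the single-block CNN described above (assumed channel/layer sizes).
import torch
import torch.nn as nn

class SmallSignNet(nn.Module):
    def __init__(self, num_classes=29):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5),    # small kernel: hand outline, finger edges
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=10),  # larger kernel: finger combinations
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(0.25),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 25 * 25, 256),       # 64x64 input -> 25x25 feature maps
            nn.ReLU(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Sanity check with a dummy batch of 64x64 RGB images:
# out = SmallSignNet()(torch.randn(4, 3, 64, 64))   # output shape: (4, 29)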
10. CONCLUSION