SIGN LANGUAGE INTERPRETER
Mini Project Report
Submitted by
Kanapareddy Nithin Kumar (RA2111003010532)
Karella Surya Vardhan (RA2111003010549)
G. Murali Naren Phani Kumar (RA2111003010544)
BACHELOR OF TECHNOLOGY
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
(Under Section 3 of UGC Act, 1956)
BONAFIDE CERTIFICATE
Certified that the mini project report titled “SIGN LANGUAGE INTERPRETER” is the
bonafide work of Kanapareddy Nithin Kumar (RA2111003010532), Karella Surya
Vardhan (RA2111003010549), and G. Murali Naren Phani Kumar (RA2111003010544), who
carried out the mini project under my supervision. Certified further that, to the best of my
knowledge, the work reported herein does not form part of any other project report or
dissertation on the basis of which a degree or award was conferred on an earlier occasion on
this or any other candidate.
SIGNATURE
Mrs. D. Vathana
Assistant Professor
Computing Technologies

SIGNATURE
Dr. M. Pushpalatha
Head of the Department
Computing Technologies
ABSTRACT
The Sign Language Interpreter project aims to bridge the communication gap between the deaf or
hard-of-hearing community and the general public by developing a real-time sign language
translation system. Sign language is a primary mode of communication for many deaf individuals,
yet it presents a significant barrier for those who do not understand it. This project seeks to create a
solution that translates sign language gestures into spoken language or text in real time,
facilitating better communication between deaf individuals and the wider community.
The system comprises several components, including computer vision for gesture recognition,
machine learning models for classification, and natural language processing for translation. It
utilizes a camera to capture hand gestures, processes the video feed to recognize and interpret the
gestures, and then generates spoken or written translations. The project focuses on building a
robust and accurate system capable of recognizing a wide range of sign language gestures in
various environmental conditions.
TABLE OF CONTENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
ABBREVIATIONS
1 INTRODUCTION
2 LITERATURE SURVEY
3 SYSTEM ARCHITECTURE AND DESIGN
3.1 Architecture diagram of the proposed sign language interpreter
3.2 Description of modules and components
4 METHODOLOGY
4.1 Methodological Steps
5 CODING AND TESTING
6 SCREENSHOTS AND RESULTS
7 CONCLUSION AND FUTURE ENHANCEMENTS
REFERENCES
LIST OF FIGURES
ABBREVIATIONS
CHAPTER 1
INTRODUCTION
The goal of this project was to build a neural network able to classify which letter of the American
Sign Language (ASL) alphabet is being signed, given an image of a signing hand. This project is a
first step towards building a possible sign language translator, which can take communications in
sign language and translate them into written and oral language. Such a translator would greatly
lower the barrier for many deaf and mute individuals to be able to better communicate with others
in day-to-day interactions.
This goal is further motivated by the isolation that is felt within the deaf community. Loneliness
and depression exist at higher rates among the deaf population, especially when they are
immersed in a hearing world [1]. Large barriers that profoundly affect quality of life stem from the
communication disconnect between the deaf and the hearing. Some examples are information
deprivation, limitation of social connections, and difficulty integrating into society.
Most research implementations for this task have used depth maps generated by depth cameras and
high-resolution images. The objective of this project was to see whether neural networks can
classify signed ASL letters using simple images of hands taken with a personal device such as a
laptop webcam. This aligns with the motivation above, since it would make a future
implementation of a real-time ASL-to-oral/written language translator practical in everyday
situations.
CHAPTER 2
LITERATURE SURVEY
These studies demonstrate the effectiveness of CNN models in real-time sign language
recognition, providing a foundation for the development of accurate and efficient real-time sign
language interpreters using computer vision techniques.
CHAPTER 3
SYSTEM ARCHITECTURE AND DESIGN
The project is structured into three distinct functional blocks: Data Processing, Training, and
Classify Gesture. The block diagram is simplified to abstract away some of the minutiae:
Data Processing: The load_data.py script contains functions to load the Raw Image Data and save
the image data as NumPy arrays into file storage. The process_data.py script loads the image
data from data.npy and preprocesses each image by resizing/rescaling it and applying filters
and ZCA whitening to enhance features. During training, the processed image data is split into
training, validation, and testing sets and written to storage. Training also involves a
load_dataset.py script that loads the relevant data split into a Dataset class. For use of the trained
model in classifying gestures, an individual image is loaded and processed from the filesystem.
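A minimal sketch of this data-processing stage is shown below, assuming the raw images have already been saved to data.npy; the resize target, split ratios, and output file names are illustrative placeholders rather than the project's exact code, and the filtering and ZCA whitening steps are sketched separately in Chapter 4.

# process_data.py -- illustrative sketch of the resizing/rescaling and splitting stage
import numpy as np
from PIL import Image

def preprocess_images(npy_path="data.npy", size=(64, 64)):
    """Load raw image data, resize each image, and rescale pixels to [0, 1]."""
    raw = np.load(npy_path)                                   # shape: (N, H, W, 3)
    processed = [np.asarray(Image.fromarray(a.astype("uint8")).resize(size),
                            dtype=np.float32) / 255.0 for a in raw]
    return np.stack(processed)

def split_and_save(images, labels, train_frac=0.8, val_frac=0.1):
    """Shuffle, split into train/validation/test, and write each split to storage."""
    idx = np.random.permutation(len(images))
    images, labels = images[idx], labels[idx]
    n_train = int(train_frac * len(images))
    n_val = int(val_frac * len(images))
    np.save("train.npy", images[:n_train])
    np.save("train_labels.npy", labels[:n_train])
    np.save("val.npy", images[n_train:n_train + n_val])
    np.save("val_labels.npy", labels[n_train:n_train + n_val])
    np.save("test.npy", images[n_train + n_val:])
    np.save("test_labels.npy", labels[n_train + n_val:])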
Training: The training loop for the model is contained in train_model.py. The model is trained
with hyperparameters obtained from a config file that lists the learning rate, batch size, image
filtering, and number of epochs. The configuration used to train the model is saved along with the
model architecture for future evaluation and tweaking for improved results. Within the training
loop, the training and validation datasets are loaded as DataLoaders and the model is trained using
the Adam optimizer with cross-entropy loss. The model is evaluated every epoch on the validation
set, and the model with the best validation accuracy is saved to storage for further evaluation and use.
Upon finishing training, the training and validation error and loss are saved to disk, along with a
plot of error and loss over training.
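A condensed sketch of this training loop is given below; it assumes a PyTorch model and pre-built Dataset objects, and the config keys, checkpoint file name, and evaluation details are placeholders rather than the project's exact code.

# train_model.py -- illustrative sketch of the training loop
import torch
from torch import nn
from torch.utils.data import DataLoader

def train(model, train_ds, val_ds, config):
    train_loader = DataLoader(train_ds, batch_size=config["batch_size"], shuffle=True)
    val_loader = DataLoader(val_ds, batch_size=config["batch_size"])
    optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
    criterion = nn.CrossEntropyLoss()
    best_val_acc = 0.0

    for epoch in range(config["epochs"]):
        model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

        # Evaluate on the validation set every epoch and keep the best checkpoint.
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                preds = model(images).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        val_acc = correct / total
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), "best_model.pt")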
Classify Gesture: After a model has been trained, it can be used to classify a new ASL gesture
that is available as a file on the filesystem. The user inputs the file path of the gesture image, and
the test_data.py script passes the file path to process_data.py, which loads and preprocesses the
file in the same way as the training data.
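A sketch of this classification step is shown below, assuming a trained PyTorch model and the same 64x64 preprocessing used in training; the class ordering and helper names are assumptions.

# test_data.py -- illustrative sketch of classifying a single gesture image
import numpy as np
import torch
from PIL import Image

CLASSES = [chr(ord("A") + i) for i in range(26)] + ["space", "delete", "nothing"]

def classify_gesture(model, image_path, size=(64, 64)):
    """Load one image, preprocess it like the training data, and return the predicted class."""
    img = Image.open(image_path).convert("RGB").resize(size)
    x = torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0)
    x = x.permute(2, 0, 1).unsqueeze(0)    # (1, 3, H, W) batch of one
    model.eval()
    with torch.no_grad():
        pred = model(x).argmax(dim=1).item()
    return CLASSES[pred]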
CHAPTER 4
METHODOLOGY
Sources of Data:
Data Collection: The primary source of data for this project was the compiled dataset of
American Sign Language (ASL) called the ASL Alphabet from Kaggle user Akash [3]. The
dataset comprises 87,000 images of 200x200 pixels. There are 29 classes in total, each
with 3,000 images: 26 for the letters A-Z and 3 for space, delete, and nothing. This data consists solely of
the user Akash gesturing in ASL, with the images taken from his laptop’s webcam. These photos
were then cropped, rescaled, and labelled for use.
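As a quick sanity check on the class counts, the extracted archive can be enumerated as sketched below; the directory name is an assumption about how the Kaggle dataset is unpacked into one folder per class.

# Count images per class, assuming one sub-folder per class after extraction.
import os

data_dir = "asl_alphabet_train"        # assumed extraction path
for class_name in sorted(os.listdir(data_dir)):
    class_dir = os.path.join(data_dir, class_name)
    if os.path.isdir(class_dir):
        print(class_name, len(os.listdir(class_dir)))   # expected: 3,000 images each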
A self-generated test set was created in order to investigate the neural network’s ability to
generalize. Five different test sets of images were taken with a webcam under different lighting
conditions and backgrounds, and using both the dominant and non-dominant hand. These images were then
cropped and preprocessed.
Data Pre-processing: The data preprocessing was done using the Pillow library, an image
processing library, and the sklearn.decomposition module, which is useful for its matrix
decomposition functionality. The three steps below are illustrated by the sketch that follows the list.
1. Image Enhancement: A combination of brightness, contrast, sharpness, and color enhancement
was used on the images. For example, the contrast and brightness were changed such that fingers
could be distinguished when the image was very dark.
2. Edge Enhancement: Edge enhancement is an image filtering technique that makes edges more
defined. This is achieved by increasing the contrast in local regions of the image that are detected
as edges. This has the effect of making the border of the hand and fingers, versus the
background, much clearer and more distinct. This can potentially help the neural network identify
the hand and its boundaries.
3. Image Whitening: ZCA, or image whitening, is a technique that uses the singular value
decomposition of a matrix. This algorithm decorrelates the data and removes the redundant, or
obvious, information from the data. This allows the neural network to look for more complex
and sophisticated relationships and to uncover the underlying structure of the patterns it is being
trained on. After whitening, the covariance matrix of the data is the identity and the mean is zero.
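A minimal sketch of these three preprocessing steps is given below. The enhancement factors are illustrative values, and the whitening is shown with NumPy's SVD rather than the sklearn.decomposition routines actually used.

# Illustrative sketch of image enhancement, edge enhancement, and ZCA whitening
import numpy as np
from PIL import ImageEnhance, ImageFilter

def enhance(img, brightness=1.2, contrast=1.5, sharpness=1.3, color=1.2):
    """Steps 1 and 2: brightness/contrast/sharpness/color enhancement, then edge enhancement."""
    img = ImageEnhance.Brightness(img).enhance(brightness)
    img = ImageEnhance.Contrast(img).enhance(contrast)
    img = ImageEnhance.Sharpness(img).enhance(sharpness)
    img = ImageEnhance.Color(img).enhance(color)
    return img.filter(ImageFilter.EDGE_ENHANCE)

def zca_whiten(flat_images, eps=1e-5):
    """Step 3: ZCA whitening of images flattened to shape (n_samples, n_features)."""
    X = flat_images - flat_images.mean(axis=0)        # zero-mean each feature
    cov = np.cov(X, rowvar=False)                     # feature covariance matrix
    U, S, _ = np.linalg.svd(cov)                      # singular value decomposition
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T     # ZCA whitening matrix
    return X @ W                                      # whitened data: near-identity covariance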
We also decided to design a separate model to compare with the model in the paper. This model
was designed to train faster and to establish a baseline for problem complexity. The smaller
model was built with only one “block” of convolutional layers consisting of two convolutional
layers with kernel sizes progressing from 5x5 to 10x10, ReLU activation, and the
usual max pooling and dropout. This fed into three fully connected layers which output to the
29 classes. The variation in kernel sizes was motivated by our dataset including the
background, whereas the paper preprocessed its data to remove the background. The design
followed the reasoning that the first layer, with its smaller kernel, would capture fine features such
as the hand outline, finger edges, and shadows, while the larger kernel would capture combinations of
the smaller features such as finger crossings, angles, and hand location.
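A sketch of this smaller model in PyTorch is shown below. Only the kernel sizes (5x5 then 10x10), the strides of 1 and then 2 noted for the final model in Chapter 6, and the 29-way output follow the description above; the channel counts, dropout rate, hidden-layer sizes, and input resolution are assumptions.

# Illustrative sketch of the smaller model (channel counts and hidden sizes assumed)
import torch
from torch import nn

class SmallASLNet(nn.Module):
    def __init__(self, num_classes=29, image_size=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=1, padding=2),    # small kernel: outlines, edges, shadows
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=10, stride=2, padding=4),  # large kernel: combinations of features
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(0.3),
        )
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, 3, image_size, image_size)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flat, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),     # 29 classes: A-Z, space, delete, nothing
        )

    def forward(self, x):
        return self.classifier(self.features(x))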
CHAPTER 5
CODING AND TESTING
Source Code:
Dataset:
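A hedged sketch of a Dataset wrapper around the saved NumPy splits; the class and file names are placeholders rather than the project's exact code.

# Illustrative Dataset wrapping the saved NumPy splits
import numpy as np
import torch
from torch.utils.data import Dataset

class ASLDataset(Dataset):
    def __init__(self, images_path, labels_path):
        self.images = np.load(images_path)   # (N, H, W, 3) float32 pixels in [0, 1]
        self.labels = np.load(labels_path)   # (N,) integer class indices

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = torch.from_numpy(self.images[idx]).permute(2, 0, 1)  # to (C, H, W)
        return image, int(self.labels[idx])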
Model Building:
Labels:
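The label set, sketched as a Python mapping; the exact ordering and naming of the three non-letter classes are assumptions.

# 29 gesture classes: the letters A-Z plus space, delete, and nothing (ordering assumed)
LABELS = [chr(ord("A") + i) for i in range(26)] + ["space", "delete", "nothing"]
LABEL_TO_INDEX = {label: index for index, label in enumerate(LABELS)}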
Testing:
To determine whether our preprocessing of images actually results in a more robust model,
we evaluated the models on a test set comprising images from the original dataset and our own
collected image data. The performance of the models on the test set is shown in Table 2. We see that
the model trained on preprocessed images performs much better than the model trained on the
original images, likely because the latter overfits.
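A short sketch of how the test-set accuracy for each model could be computed; the names are placeholders and the exact evaluation code is not reproduced here.

# Illustrative test-set evaluation: overall accuracy of a trained model
import torch
from torch.utils.data import DataLoader

def test_accuracy(model, test_ds, batch_size=64):
    loader = DataLoader(test_ds, batch_size=batch_size)
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total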
CHAPTER 6
SCREENSHOTS AND RESULTS
Model Performance
Training and Validation: Our models were trained using the Adam optimizer and cross-entropy
loss. The Adam optimizer is known for converging more quickly than Stochastic
Gradient Descent (SGD), even when SGD uses momentum. However, Adam initially failed to
decrease our loss, so we abandoned it in favour of SGD. Debugging the Adam optimizer after our
final presentation taught us that significantly lowering the learning rate can help Adam converge
during training, which allowed us to train more models towards the end of our project. Of
our two models, the one based on the paper was shown not to be viable, as it took much
longer to train without showing any significant improvement in accuracy or loss. We believe this
is likely due to the more difficult nature of our classification task, with backgrounds included
in the images and lower resolution, making training harder.
Thus, we decided to focus on improving our smaller model, which initially trained to 40%
validation accuracy. Although we had a very large dataset to work with (3,000 samples for
each of the 29 classes), after processing the images into NumPy arrays we found that our personal
computers could load at most 50-100 samples per class and our Google Cloud server
could load 200 samples per class. The need to load small datasets actually led us to test the effect
of increasing the data available to our models. On our preliminary model, which used strides
of 2 on both layers (instead of the strides of 1 and then 2 on our final model), we found the
following relation between samples per class and model accuracy:
Training our initial models on less data led to the models quickly overfitting, as
shown in Figure 7. This is likely due to the small number of samples available for
training, leading to poor generalization and limited coverage of the sample space.
Increasing the size of our dataset to 200 samples per class led to better model results,
with a peak validation accuracy of 60.3% in epoch 17. However, looking at the loss,
we see that the validation loss is increasing, indicating overfitting of the model.
After we applied filtering, enhancement, and ZCA whitening to our dataset, the model
performance increased drastically, as shown in Figure 9. The peak validation accuracy
achieved is 77.25% in epoch 24. As shown by the plot of loss, the validation loss is
still decreasing, albeit at a slower rate than the training loss, indicating that the model
is not drastically overfitting. This shows that preprocessing our images by applying
filters and ZCA whitening helps to enhance relevant features for the model to learn.
CHAPTER 7
CONCLUSION AND FUTURE ENHANCEMENTS
Key Learnings
Train Early and Often: Due to the long training time of our model, it became
cumbersome to test various hyperparameters, architectures, image filtering, etc. In
future projects, starting training earlier would surface such issues sooner and leave
more time to be dedicated to training.
Experiment With Optimizers: Due to our initial difficulties with Adam, we
abandoned it for the slower SGD. However, further testing with Adam and varying
our hyperparameters allowed Adam to converge. The lesson to take away is to try
various optimizers early on and establish which ones converge the quickest, for
faster training (a small sketch of such a comparison follows this list).
Amount of Data: As shown in Chapter 6, increasing the training set size leads to a
drastic improvement in model performance. This likely increases the robustness of
the model by exposing it to more of the possible sample space.
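As referenced in the optimizer lesson above, the following is a small sketch of the kind of optimizer comparison worth setting up early in a project; the learning rates shown are illustrative values, not the project's tuned settings.

# Illustrative optimizer candidates to compare early on (learning rates assumed)
import torch

def candidate_optimizers(model):
    return {
        "adam_low_lr": torch.optim.Adam(model.parameters(), lr=1e-4),   # lower LR helped Adam converge
        "adam": torch.optim.Adam(model.parameters(), lr=1e-3),
        "sgd_momentum": torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9),
    }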
Advantages
Future Steps
Use Dynamic Loading for the Dataset: Our original dataset was quite large and is
impossible to use in full without a server with a lot of RAM and disk space. A possible
solution is to split the file names into training, validation, and test sets and to
dynamically load images in the Dataset class. Using such a loading technique
would allow us to train the model on many more of the samples in the dataset.
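A hedged sketch of such a dynamically loading Dataset follows: only file paths are kept in memory, and each image is read from disk on demand. The class name and resize target are assumptions.

# Illustrative lazily-loading Dataset: images are read from disk only when requested
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class LazyASLDataset(Dataset):
    def __init__(self, file_paths, labels, size=(64, 64)):
        self.file_paths = file_paths   # list of image paths for this split
        self.labels = labels           # matching integer class indices
        self.size = size

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        img = Image.open(self.file_paths[idx]).convert("RGB").resize(self.size)
        x = torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0).permute(2, 0, 1)
        return x, self.labels[idx]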