Final Report
MACHINE LEARNING
APRIL-2024
Report Approval
Internal Examiner
Name:
Designation
Affiliation
External Examiner
Name:
Designation
Affiliation
Declaration
Further, I/we declare that the content of this Project work, in full or in parts, has
neither been taken from any other source nor been submitted to any other
Institute or University for the award of any degree or diploma.
Certificate
I/We, Prof. Rashmi Choudhary, certify that the project entitled “SIGN
LANGUAGE RECOGNITION USING MACHINE LEARNING” submitted
in partial fulfillment for the award of the degree of Bachelor of Technology/Master of
Computer Applications by VIVEK PARIHAR, YAMAN KUMAR SAHOO,
YOGESH SEPTA is the record of work carried out by him/them under my/our guidance and
that the work has not formed the basis of award of any other degree elsewhere.
Acknowledgement
I would like to express my deepest gratitude to the Honorable Chancellor, Shri R C Mittal, who
has provided me with every facility to successfully carry out this project, and my profound
indebtedness to Prof. (Dr.) D. K. Patnaik, Vice Chancellor, Medi-Caps University, whose unfailing
support and enthusiasm has always boosted up my morale. I also thank Prof. (Dr.) Pramod S. Nair,
Dean, Faculty of Engineering, Medi-Caps University, for giving me a chance to work on this project.
I would also like to thank my Head of the Department Dr. Ratnesh Litoriya for his continuous
encouragement for the betterment of the project.
It is because of their help and support that we were able to complete the design and
technical report.
Without their support this report would not have been possible.
VIVEK PARIHAR
(EN21CS301874)
YAMAN KUMAR SAHOO
(EN21CS301878)
YOGESH SEPTA
(EN21CS301890)
B.Tech. III Year
Department of Computer Science & Engineering
Faculty of Engineering
Medi-Caps University, Indore
Abstract
The Sign Language Recognition System is a technology designed to understand and interpret
sign language gestures. It involves the collection of diverse sign language datasets using devices
such as sensor-equipped gloves or cameras. Preprocessing techniques clean and enhance the captured
data, and feature extraction identifies key aspects of the gestures. Machine learning models, such
as Convolutional Neural Networks or Recurrent Neural Networks, are trained on the data to
associate hand movements with specific meanings. The system is validated and tested for
accuracy, and a user interface is implemented for communication. Real-time processing enables
immediate recognition, and continuous improvement is achieved through updates and user
feedback. This technology facilitates communication between individuals using sign language
and those who may not understand it, contributing to inclusivity and accessibility.
The primary objective of this system is to alleviate communication challenges faced by the
hearing-impaired community by automating the recognition of sign language gestures in real-
time. By utilizing advanced machine learning algorithms, the system can interpret and translate
sign language gestures into meaningful and accessible information. This technological
innovation not only promotes inclusivity but also fosters independence for individuals with
hearing impairments, allowing them to communicate effectively and seamlessly in various
contexts. The integration of these powerful technologies showcases a holistic approach to
bridging communication gaps and creating a more inclusive environment for the hearing-
impaired population.
Keywords:
• Machine Learning
• User Interface
• CNN
• Hearing Impaired
• Innovation
• Accessibility
• Gesture Recognition.
Table of Contents
Report Approval
Declaration
Certificate
Acknowledgement
Abstract
Table of Contents
List of figures
Abbreviations
Notations & Symbols
Chapter 1 Introduction
1.1 Introduction
1.2 Literature Review
1.3 Objectives
1.4 Significance
1.5 Research Design
1.6 Source of Data
Chapter 2 REQUIREMENTS SPECIFICATION
2.1 User Characteristics
2.2 Functional Requirements
CHAPTER-1
Introduction
1.1 Introduction
Sign Language Recognition is a technology designed to bridge communication gaps between
individuals who use sign language and those who may not understand it. Sign language is a visual-
gestural language used by the deaf and hard of hearing community for communication. Sign
Language Recognition systems utilize advanced technologies such as sensors, machine learning,
and computer vision to interpret and translate sign language gestures into written or spoken
language.
The primary goal of Sign Language Recognition is to enable effective communication between
individuals who use sign language as their primary means of expression and those who rely on
spoken or written language. These systems play a crucial role in fostering inclusivity and
accessibility, breaking down barriers that may exist in everyday communication for individuals
with hearing impairments.
Validation and testing phases ensure the accuracy and reliability of the system, and a user
interface is implemented to convey the recognized gestures, making communication accessible
to a wider audience. Real-time processing capabilities enable immediate recognition, making the
technology practical for various applications.
Continuous improvement is a fundamental aspect of Sign Language Recognition systems,
allowing for updates, refinement, and adaptation over time. This iterative process ensures that the
system remains effective, accommodating different sign language variations and user needs.
1.2 Literature Review
Scholars investigate the linguistic properties of sign languages, treating them as complete
and unique languages with their own grammar, syntax, and semantics.
Research explores how the brain processes sign language, delving into cognitive and
neurological aspects to understand how sign language is perceived, produced, and
represented in the brain.
1.3 Objectives
Continuous Improvement: Evolve and improve over time by incorporating user feedback,
updating datasets, and refining algorithms to enhance the accuracy and effectiveness of sign
language recognition systems.
1.4 Significance
Sign language recognition is pivotal for accessibility, empowering deaf individuals to communicate
effectively with the broader community. It fosters inclusivity, breaking down communication
barriers and promoting equal participation. This technology enhances educational opportunities,
supports cultural preservation, and drives technological advancement. By acknowledging sign
language, societies affirm the rights and identities of deaf communities, ensuring legal recognition
and equal access to services. Overall, sign language recognition transforms lives, enabling
individuals to navigate the world more independently and fostering a more inclusive and
understanding society.
CHAPTER-2
REQUIREMENTS SPECIFICATION
5. Healthcare Professionals:
Medical personnel who may use sign language recognition technology to communicate with
deaf patients or provide healthcare information in sign language.
9. Accuracy and Reliability: Ensuring high accuracy and reliability in gesture recognition to
minimize misinterpretations and errors in communication.
10. User-friendly Interface: Providing an intuitive and easy-to-use interface for both deaf
users and communication partners to facilitate smooth interactions.
2.3 Dependencies
Dependencies of sign language recognition requirements include:
3. Hardware Compatibility: The system's performance may depend on the hardware used,
such as cameras or motion sensors, requiring compatibility and optimization for specific
devices.
4. User Feedback: Continuous user feedback is essential for refining and improving
recognition accuracy and usability based on real-world usage scenarios.
5. Ethical Considerations: Ensuring ethical data collection and usage practices, including
user consent and privacy protection, are crucial dependencies for responsible sign language
recognition development.
Performance requirements for sign language recognition systems aim to ensure efficient and
accurate communication between users. These requirements include:
1. Accuracy: The system must accurately recognize sign language gestures to facilitate
effective communication, with high precision and recall rates.
2. Real-time Processing: The system should process sign language gestures in real-time to
enable fluid and natural interactions without significant delays.
5. Adaptability: It should adapt to different users' signing styles and preferences, ensuring
accurate recognition for individuals with diverse communication styles.
6. Latency: Minimal latency in gesture recognition is essential for smooth and natural
communication, especially in interactive settings like video conferencing.
8. User Experience: It should provide a seamless and intuitive user experience, with clear
feedback mechanisms and minimal user effort required for interaction.
2. Processing Units: Powerful processors or dedicated hardware accelerators are needed for
real-time processing of video data and running machine learning algorithms for gesture
recognition. This may include CPUs, GPUs, or specialized chips like TPUs or FPGAs.
3. Memory: Sufficient RAM is necessary to store and process video frames, intermediate
data, and model parameters during gesture recognition tasks.
4. Storage: Adequate storage space may be required for storing training data, pre-trained
models, and application data, depending on the system's requirements.
5. Sensors: Additional sensors, such as depth sensors or accelerometers, may enhance gesture
recognition accuracy or provide contextual information about the user's movements.
Constraints:
2. Data Availability: Limited availability of diverse and high-quality sign language datasets
for training and testing may constrain the system's accuracy and robustness.
5. User Variability: Variability in signing styles, gestures, and hand shapes among different
individuals poses challenges for accurate recognition.
Assumptions:
1. Standardized Gestures: Assumes a standardized set of sign language gestures and
vocabulary for recognition, which may not fully capture the diversity of sign languages and
dialects.
2. Clear Line of Sight: Assumes an unobstructed view of the signer's hands for accurate
gesture detection, which may not always be feasible in practical scenarios.
4. Limited Vocabulary: Assumes a limited vocabulary of signs and gestures for recognition,
which may not cover all possible communication needs.
5. User Cooperation: Assumes user cooperation and willingness to adapt signing behavior or
provide feedback for system improvement, which may vary among individuals.
CHAPTER-3
DESIGN
3.1 Algorithm
Algorithm Layer 1:
1. Apply a Gaussian blur filter and threshold to the frame captured with OpenCV to obtain the
processed image after feature extraction.
2. This processed image is passed to the CNN model for prediction, and if a letter is detected for
more than 50 frames, the letter is printed and taken into consideration for forming the
word.
3. Space between the words is considered using the blank symbol.
Algorithm Layer 2:
1. We identify sets of symbols that produce similar results when detected.
2. We then distinguish between the symbols in each set using classifiers trained on that set only.
Layer 1:
CNN Model:
1. 1st Convolution Layer: The input picture has a resolution of 128x128 pixels. It is first processed
in the first convolutional layer using 32 filter weights (3x3 pixels each). This results in a
126x126 pixel image, one for each filter weight.
2. 1st Pooling Layer: The pictures are downsampled using 2x2 max pooling, i.e., we keep the
highest value in each 2x2 square of the array. Therefore, our picture is downsampled to 63x63 pixels.
3. 2nd Convolution Layer: The 63x63 output of the first pooling layer serves as the input to the
second convolutional layer, where it is processed using 32 filter weights (3x3 pixels each). This
results in a 60x60 pixel image.
4. 2nd Pooling Layer: The resulting images are downsampled again using 2x2 max pooling and
reduced to a resolution of 30x30 pixels.
5. 1st Densely Connected Layer: These images are then used as the input to a fully connected
layer with 128 neurons; the output of the second pooling layer is reshaped into an array of
30x30x32 = 28800 values, which forms the input to this layer. The output of this layer is fed
to the 2nd Densely Connected Layer, and a dropout layer with a rate of 0.5 is used to avoid
overfitting.
6. 2nd Densely Connected Layer: Now the output from the 1st Densely Connected Layer is used
as an input to a fully connected layer with 96 neurons.
7. Final Layer: The output of the 2nd Densely Connected Layer serves as the input to the final
layer, which has as many neurons as there are classes to classify
(alphabets + blank symbol).
Activation Function:
We have used ReLU (Rectified Linear Unit) in each of the layers (convolutional as well as
fully connected neurons).
ReLU calculates max(x, 0) for each input pixel. This adds nonlinearity and helps the network
learn more complicated features. It also mitigates the vanishing gradient problem and speeds
up training by reducing computation time.
Pooling Layer:
We apply max pooling to the input with a pool size of (2, 2) after the ReLU activation.
This reduces the number of parameters, thus lessening the computational cost and
reducing overfitting.
Dropout Layers:
The problem of overfitting, where after training, the weights of the network are so tuned to
the training examples they are given that the network doesn’t perform well when given new
examples. This layer “drops out” a random set of activations in that layer by setting them
to zero. The network should be able to provide the right classification or output for a
specific example even if some of the activations are dropped out [5].
Optimizer:
We have used Adam optimizer for updating the model in response to the output of the loss
function.
The Adam optimizer combines the advantages of two extensions of stochastic gradient
descent, namely the adaptive gradient algorithm (AdaGrad) and root mean square
propagation (RMSProp).
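A minimal Keras sketch of the network described above (the layer sizes, dropout rate, ReLU/SoftMax
activations, and Adam optimizer follow this section; the class count, loss function, and everything
else are assumptions made for illustration):

from tensorflow.keras import layers, models

NUM_CLASSES = 27  # assumed: 26 alphabets + blank symbol

def build_model():
    model = models.Sequential([
        # 1st convolution: 32 filters of 3x3 on a 128x128 grayscale input -> 126x126
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 1)),
        # 1st pooling: 2x2 max pooling -> 63x63
        layers.MaxPooling2D((2, 2)),
        # 2nd convolution: 32 filters of 3x3 (the report quotes a 60x60 result)
        layers.Conv2D(32, (3, 3), activation="relu"),
        # 2nd pooling: 2x2 max pooling -> 30x30
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),                      # 30x30x32 = 28800 values
        layers.Dense(128, activation="relu"),  # 1st densely connected layer
        layers.Dropout(0.5),                   # dropout to reduce overfitting
        layers.Dense(96, activation="relu"),   # 2nd densely connected layer
        layers.Dense(NUM_CLASSES, activation="softmax"),  # final prediction layer
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()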
Layer 2:
We use two layers of algorithms to verify and predict symbols that are very similar to each
other, so that we can get as close as possible to detecting the symbol actually shown. In our testing
we found that the following symbols were not being detected reliably and were confused with other
symbols:
1. For D : R and U
2. For U : D and R
3. For I : T, D, K and I
4. For S : M and N
So, to handle the above cases, we made three different classifiers for classifying these sets:
1. {D, R, U}
2. {T, K, D, I}
3. {S, M, N}
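A sketch of how these specialised classifiers could sit on top of the main model is given below (the
model file names, label orderings, and input handling are assumptions for illustration, not the
project's actual code):

import numpy as np
from tensorflow.keras.models import load_model

# Hypothetical model files: one main classifier plus three specialised ones.
main_model = load_model("model_main.h5")
sub_models = {
    frozenset("DRU"): (load_model("model_dru.h5"), list("DRU")),
    frozenset("TKDI"): (load_model("model_tkdi.h5"), list("TKDI")),
    frozenset("SMN"): (load_model("model_smn.h5"), list("SMN")),
}
main_labels = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + ["blank"]

def predict_symbol(img_128x128):
    # Layer 1 predicts a symbol; layer 2 re-checks it within its confusable set.
    x = img_128x128.reshape(1, 128, 128, 1) / 255.0
    letter = main_labels[int(np.argmax(main_model.predict(x, verbose=0)))]
    for group, (sub_model, sub_labels) in sub_models.items():
        if letter in group:
            letter = sub_labels[int(np.argmax(sub_model.predict(x, verbose=0)))]
            break
    return letter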
1. Whenever the count of a letter detected exceeds a specific value and no other letter is close to
it by a threshold, we print the letter and add it to the current string (In our code we kept the
value as 50 and difference threshold as 20).
2. Otherwise, we clear the current dictionary, which holds the detection counts of the present
symbols, to reduce the chance of a wrong letter being predicted.
3. Whenever the count of blank (plain background) detections exceeds a specific value and the
current buffer is empty, no space is printed.
4. Otherwise, it predicts the end of the word by printing a space, and the current word is appended
to the sentence below.
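The counting logic above can be sketched as a small, self-contained function over a stream of
per-frame predictions (the thresholds 50 and 20 come from this section; the symbol names and
everything else are illustrative):

def form_sentence(frame_predictions, count_threshold=50, diff_threshold=20):
    # Turn a stream of per-frame symbol predictions into words and a sentence.
    counts = {}
    word, sentence = "", ""
    for symbol in frame_predictions:
        counts[symbol] = counts.get(symbol, 0) + 1
        best = max(counts, key=counts.get)
        if counts[best] <= count_threshold:
            continue
        # Require a clear margin over every competing symbol, otherwise reset.
        if any(counts[best] - c < diff_threshold
               for s, c in counts.items() if s != best):
            counts = {}
            continue
        counts = {}
        if best == "blank":
            if word:                      # a blank after a word ends the word
                sentence += word + " "
                word = ""
        else:
            word += best                  # an accepted letter joins the current word
    return (sentence + word).strip()

# Example: 60 frames of 'H', 60 frames of 'I', then 60 blanks -> "HI"
print(form_sentence(["H"] * 60 + ["I"] * 60 + ["blank"] * 60))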
3.3.5 ER Diagram:
CHAPTER-4
Implementation, Testing, and Maintenance
4.1.1 Python:
Python is a high-level, interpreted programming language celebrated for its simplicity, readability,
and versatility. It was conceived by Guido van Rossum and introduced in 1991, emphasizing clean
syntax and ease of use for developers across skill levels. Noteworthy attributes of Python include its
straightforward and comprehensible syntax, which favors readability and clear code organization
through indentation rather than complex symbols. Being an interpreted language, Python executes
code line by line via an interpreter, enabling swift development and experimentation. Additionally,
Python offers an array of high-level data types like lists, dictionaries, tuples, sets, and strings,
simplifying data manipulation tasks. Its extensive standard library encompasses modules for various
functionalities such as file handling, networking, web development, and more, reducing reliance on
external dependencies. Python's dynamic typing determines variable types during runtime,
complemented by strong typing to catch type errors during execution. Furthermore, Python boasts
cross-platform compatibility, running seamlessly on diverse operating systems like Windows,
macOS, and Linux. Its vast and active community contributes to ongoing development, creates
libraries and frameworks, and offers support through various channels, solidifying Python's position
as a leading programming language. Widely utilized in web development, data analysis, artificial
intelligence, scientific computing, automation, and scripting, Python has gained immense popularity
for its simplicity, flexibility, and extensive ecosystem.
4.1.2 CNN:
Convolutional Neural Networks (CNNs) are a class of deep learning models specifically designed
for processing structured grid-like data, such as images. Introduced in the 1980s, CNNs have
revolutionized the field of computer vision and have become the cornerstone of various applications,
including image classification, object detection, facial recognition, and more.
CNNs are characterized by their hierarchical structure, consisting of multiple layers, including
convolutional layers, pooling layers, and fully connected layers. In a CNN, convolutional layers
apply convolutional operations to extract features from input images. These layers use learnable
filters or kernels to convolve over the input data, capturing local patterns and spatial dependencies.
Pooling layers then downsample the feature maps obtained from the convolutional layers, reducing
their spatial dimensions while retaining important information.
Through repeated application of convolutional and pooling layers, CNNs learn to hierarchically
extract increasingly abstract features from the input images. The final layers of a CNN typically
consist of one or more fully connected layers, which perform classification or regression tasks based
on the extracted features.
CNNs are trained using large datasets through the process of supervised learning, where input
images are labeled with corresponding classes or attributes. During training, the network learns to
optimize its parameters (such as filter weights and biases) to minimize the discrepancy between
predicted and actual labels, typically using backpropagation and gradient descent optimization
algorithms.
The success of CNNs can be attributed to their ability to automatically learn hierarchical
representations directly from raw data, without the need for handcrafted features. This makes CNNs
highly effective in a wide range of visual recognition tasks, leading to their widespread adoption in
both academic research and industrial applications.
Fig 4.1.2.1
4.1.3 VS Code:
Visual Studio Code (VS Code) is a free and open-source code editor developed by Microsoft.
Launched in 2015, it quickly gained popularity among developers for its lightweight yet powerful
features. Built on top of the Electron framework, VS Code is highly customizable, allowing
developers to tailor it to their preferences with extensions, themes, and settings.
VS Code supports a wide range of programming languages and features built-in support for syntax
highlighting, code completion, and debugging. It offers an integrated terminal, version control
through Git, and seamless integration with various tools and services, making it suitable for a diverse
range of development workflows.
One of the key strengths of VS Code is its extensive extension ecosystem, with thousands of
extensions available for enhancing functionality, adding new features, and supporting additional
languages and frameworks. These extensions are contributed by both Microsoft and the community,
further extending the capabilities of the editor.
Overall, VS Code provides developers with a highly productive and efficient environment for
writing code, debugging, and collaborating on projects. Its popularity continues to grow, making it
a top choice for developers across different platforms and programming languages.
4.1.4 Tensorflow:
TensorFlow is an end-to-end open-source platform for Machine Learning. It has a
comprehensive, flexible ecosystem of tools, libraries and community resources that lets
researchers push the state-of-the-art in Machine Learning and developers easily build and deploy
Machine Learning powered applications.
TensorFlow offers multiple levels of abstraction so you can choose the right one for your needs.
Build and train models by using the high-level Keras API, which makes getting started with
TensorFlow and machine learning easy.
If you need more flexibility, eager execution allows for immediate iteration and intuitive debugging.
For large ML training tasks, use the Distribution Strategy API for distributed training on different
hardware configurations without changing the model definition.
4.1.5 Keras:
Keras is a user-friendly, high-level deep learning library for Python. It simplifies the creation and
training of neural networks through an intuitive API, enabling rapid prototyping and
experimentation. With seamless integration with TensorFlow and other backends, Keras allows for
efficient execution of neural network computations, including GPU acceleration. Its modular design
and extensibility make it adaptable to diverse research needs and project requirements, contributing
to its widespread adoption in academia and industry for developing and training neural network
models.
4.1.6 OpenCV:
OpenCV (Open Source Computer Vision Library) is a powerful open-source library for computer
vision and image processing tasks in Python, C++, and other programming languages. It provides a
wide range of functionalities for tasks such as image and video processing, object detection and
tracking, feature extraction, and more. With its extensive collection of algorithms and tools,
OpenCV simplifies the development of computer vision applications, making it popular among
researchers, developers, and hobbyists alike. Its versatility, ease of use, and robustness have made
it a go-to choice for a wide range of projects, from simple image filtering to complex computer
vision applications in various domains like robotics, healthcare, automotive, and surveillance.
4.1.7 Tkinter:
Tkinter is a Python library for creating graphical user interfaces (GUIs). It simplifies the
development of desktop applications by providing widgets and event-driven programming. It's
widely used for building interactive interfaces due to its ease of use and integration with Python's
standard library.
4.1.8 Hunspell:
Hunspell is a spell checking and morphological analysis library used in various applications for
language processing and correction.
4.1.9 Pyttsx3:
Pyttsx3 is a Python library for text-to-speech (TTS) conversion. It provides a simple interface to
convert text strings into spoken audio using different speech engines, such as the Microsoft Speech
API (SAPI5) on Windows. Pyttsx3 supports various features like changing voice characteristics,
adjusting speech rate, and more, making it useful for creating speech-enabled applications, assistive
technologies, and automated systems.
4.1.10 NumPy:
NumPy is a powerful Python library for numerical computing that provides support for large, multi-
dimensional arrays and matrices, along with a collection of mathematical functions to operate on
these arrays efficiently. It is widely used in scientific computing, data analysis, and machine learning
due to its speed and ease of use. NumPy's array operations are implemented in C, making them fast
and suitable for handling large datasets and complex mathematical computations in Python.
In supervised learning, the algorithm adjusts its parameters based on the difference between its predictions and the true labels
in the training data. Supervised learning is used for tasks such as classification, where the output is
a category label, and regression, where the output is a continuous value.
We used the Open Computer Vision (OpenCV) library to produce our dataset.
Firstly, we captured around 800 images of each symbol in ASL (American Sign
Language) for training purposes and around 200 images per symbol for testing purposes.
First, we capture each frame shown by the webcam of our machine. In each frame we define a
Region Of Interest (ROI) which is denoted by a blue bounded square as shown in the image
below:
Figure -4.2.1.1
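A rough OpenCV sketch of this capture step (the ROI coordinates, key bindings, and directory
layout are assumptions made for illustration):

import os
import cv2

SAVE_DIR = "dataset/train/A"           # hypothetical folder for one symbol
os.makedirs(SAVE_DIR, exist_ok=True)

cap = cv2.VideoCapture(0)              # default webcam
count = len(os.listdir(SAVE_DIR))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Blue square marking the Region Of Interest (ROI)
    cv2.rectangle(frame, (300, 100), (600, 400), (255, 0, 0), 2)
    cv2.imshow("Capture", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("c"):                # press 'c' to save the current ROI
        roi = frame[100:400, 300:600]
        cv2.imwrite(os.path.join(SAVE_DIR, f"{count}.jpg"), roi)
        count += 1
    elif key == ord("q"):              # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()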
Then, we apply Gaussian Blur Filter to our image which helps us extract various features of our
image. The image, after applying Gaussian Blur, looks as follows:
Figure -4.2.1.2
Figure -4.2.2.1
A Python library, Hunspell_suggest, is used to suggest correct alternatives for each (incorrect)
input word, and we display a set of words matching the current word from which the user can select
one to append to the current sentence. This helps in reducing spelling mistakes and assists in
predicting complex words.
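One possible way to obtain such suggestions, assuming the pyhunspell bindings and an English
dictionary are installed (the dictionary paths below are typical Linux locations and are assumptions;
the report names the library only as Hunspell_suggest):

import hunspell  # pyhunspell bindings (assumed)

# Dictionary paths are system-dependent; these are typical Linux locations.
checker = hunspell.HunSpell("/usr/share/hunspell/en_US.dic",
                            "/usr/share/hunspell/en_US.aff")

word = "helo"
if not checker.spell(word):
    print(checker.suggest(word))   # e.g. a list of candidate corrections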
We convert our input images (RGB) into grayscale and apply gaussian blur to remove unnecessary
noise. We apply adaptive threshold to extract our hand from the background and resize our images
to 128 x 128.
We feed the input images after pre-processing to our model for training and testing after applying
all the operations mentioned above.
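A minimal OpenCV sketch of this pre-processing chain (the blur kernel size and threshold
parameters are assumptions):

import cv2

def preprocess(bgr_image):
    # Grayscale -> Gaussian blur -> adaptive threshold -> 128x128, as described above.
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 2)
    thresh = cv2.adaptiveThreshold(blurred, 255,
                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 11, 2)
    return cv2.resize(thresh, (128, 128))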
The prediction layer estimates how likely the image is to fall under each of the classes. The output
is therefore normalized between 0 and 1 such that the values across all classes sum to 1. We have
achieved this using the SoftMax function.
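As a small NumPy illustration of this normalisation (the raw scores are made up):

import numpy as np

scores = np.array([2.0, 1.0, 0.1])              # raw outputs of the prediction layer
probs = np.exp(scores) / np.exp(scores).sum()   # SoftMax
print(probs, probs.sum())                       # values between 0 and 1 that sum to 1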
At first, the output of the prediction layer will be somewhat far from the actual value. To make it
better, we have trained the network using labelled data and a loss function: a continuous function
that is positive whenever the prediction differs from the labelled value and is zero exactly when
they are equal.
There were many challenges faced during the project. The very first issue concerned the dataset.
We wanted to work with raw, square images, since the CNN in Keras is much more convenient to
work with when the images are square.
We couldn’t find any existing dataset that met our requirements, and hence we decided to make our
own dataset. The second issue was selecting a filter to apply to our images so that proper features
could be obtained, and the filtered image could then be provided as input to the CNN model.
We tried various filters, including binary threshold, Canny edge detection, and Gaussian blur, but
finally settled on the Gaussian blur filter.
Further issues related to the accuracy of the model trained in the earlier phases. This was eventually
improved by increasing the input image size and by improving the dataset.
1. Installation:
- Ensure you have Python installed on your system.
- Install the required libraries by running `pip install -r requirements.txt` in your terminal or
command prompt (a sample requirements file is sketched after this section).
6. Troubleshooting:
- If the application crashes or freezes, try restarting it.
- Ensure your system meets the minimum requirements for running the application.
- Check for any error messages displayed in the terminal or command prompt for troubleshooting
purposes.
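For reference, a sample requirements.txt covering the libraries named in this report might look like
the following (the package names and unpinned versions are assumptions; Tkinter ships with Python
itself):

tensorflow
keras
opencv-python
numpy
pyttsx3
hunspell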
CHAPTER-5
We used Python to create this UI, where the processed image is passed to the CNN model for
prediction; if a letter is detected for more than 50 frames, the letter is printed and taken into
consideration for forming the word.
After a word is formed, it is transferred to the sentence section, and clicking the Speak button lets
you listen to the sentence formed.
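A stripped-down sketch of such a Speak button using Tkinter and pyttsx3 (the widget layout and the
placeholder sentence are illustrative, not the project's actual UI code):

import tkinter as tk
import pyttsx3

engine = pyttsx3.init()   # uses SAPI5 on Windows, other engines elsewhere

root = tk.Tk()
root.title("Sign Language To Text")

sentence_var = tk.StringVar(value="HELLO WORLD")   # placeholder for the formed sentence
tk.Label(root, textvariable=sentence_var, font=("Arial", 16)).pack(padx=10, pady=10)

def speak():
    # Read the current sentence aloud when the button is clicked.
    engine.say(sentence_var.get())
    engine.runAndWait()

tk.Button(root, text="Speak", command=speak).pack(pady=10)
root.mainloop()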
Firstly, we captured around 800 images of each of the symbol in ASL (American Sign Language)
for training purposes and around 200 images per symbol for testing purpose.
First, we capture each frame shown by the webcam of our machine. In each frame we define a
Region Of Interest (ROI) which is denoted by a blue bounded square as shown in the image below:
Fig. 5.2.1.1
Then, we apply Gaussian Blur Filter to our image which helps us extract various features of our
image. The image, after applying Gaussian Blur, looks as follows:
Fig. 5.2.1.2
1. Overfitting: When you train a model using the same data that you test it on, the model may
simply memorize the training data without truly learning the underlying patterns. This can lead to
overfitting, where the model performs well on the training data but poorly on new, unseen data.
2. Misleading Evaluation: If the model has memorized the training data, it may perform
unrealistically well during testing, giving you a false sense of the model's performance. However,
this performance won't generalize to new data, and the model may fail to recognize sign language
gestures accurately in real-world scenarios.
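To avoid exactly this, the data can be split so that evaluation never touches the training images;
a small illustration using scikit-learn (scikit-learn is not listed among the project's libraries, and
the arrays here are random placeholders):

import numpy as np
from sklearn.model_selection import train_test_split

# X: pre-processed 128x128 images, y: their labels (random placeholder data here)
X = np.random.rand(1000, 128, 128, 1)
y = np.random.randint(0, 27, size=1000)

# Hold out 20% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)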
Fig. 5.2.2.1
CHAPTER-6
Evaluation metrics such as accuracy, precision, recall, and F1 score are employed to assess the
trained models' efficacy, guiding the optimization process. Once trained, the models are deployed
in production environments, enabling real-time ASL sign recognition. This deployment facilitates
integration into various applications and devices, thereby extending the system's accessibility and
usability. Furthermore, a feedback loop mechanism ensures continuous improvement by soliciting
user feedback and monitoring system performance in real-world scenarios. This iterative process is
fundamental in refining the system's accuracy, responsiveness, and user experience over time.
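For illustration, these metrics can be computed with scikit-learn (an assumption, since the report
does not name an evaluation library; the label arrays below are made up):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = ["A", "B", "B", "C", "A", "C"]   # ground-truth signs
y_pred = ["A", "B", "C", "C", "A", "B"]   # model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1 score :", f1_score(y_true, y_pred, average="macro"))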
CHAPTER-7
7.2 Appendix
1. Accuracy: In the context of classification models, accuracy represents the proportion of
correct predictions made by a model compared to the total number of predictions.
3. Cloud-ML: Cloud-ML is a platform that offers tools and services to developers, allowing them
to build and deploy custom machine learning models in cloud environments. It simplifies the
process of developing machine learning solutions by providing pre-built algorithms and
infrastructure.
4. Framework: In machine learning, a framework is a software tool or library that provides a set
of functionalities and tools for developing machine learning models. Frameworks like
TensorFlow, PyTorch, and scikit-learn offer APIs and tools for tasks such as data
preprocessing, model training, and deployment.
5. Gesture: A gesture refers to a physical movement or action, often made with hands or other
body parts, used to convey meaning or communicate information. In the context of sign
language recognition, gestures are the hand movements and expressions used to represent
words or concepts.
8. NumPy: NumPy is a Python library used for numerical computing, particularly for working
with arrays and matrices. It provides support for mathematical functions, linear algebra
operations, and random number generation, making it a fundamental library for scientific
computing in Python.
9. OpenCV: OpenCV (Open Source Computer Vision Library) is an open-source library for
computer vision and image processing tasks. It offers a wide range of functionalities for tasks
such as image manipulation, feature detection, object recognition, and video analysis.
10. Optimal Approach: An optimal approach refers to a decision or strategy that leads to the
best possible outcome among all available options. In machine learning, finding an optimal
approach often involves optimizing model parameters, choosing appropriate algorithms, and
selecting relevant features to maximize performance.
11. Pandas: Pandas is a Python library used for data manipulation and analysis. It provides data
structures and functions for working with structured data, such as tabular data or time series,
making it a powerful tool for data preprocessing and analysis tasks.
12. Deep Learning: Deep learning is a subset of machine learning that focuses on developing
artificial neural networks with multiple layers (deep neural networks) to learn from data and
make predictions. Deep learning algorithms can automatically learn features from data,
enabling them to perform complex tasks such as image recognition, speech recognition, and
natural language processing.
13. Computer Vision: Computer vision is a multidisciplinary field that focuses on enabling
computers to gain high-level understanding from digital images or videos. It involves tasks
such as image recognition, object detection, scene understanding, and image generation
using techniques from machine learning, image processing, and computer graphics.
7.3 Bibliography
[1] T. Yang and Y. Xu, "Hidden Markov Model for Gesture Recognition," CMU-RI-TR-94-10,
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, May 1994.
[2] Pujan Ziaie, Thomas Müller, Mary Ellen Foster, and Alois Knoll, "A Naïve Bayes Classifier
with Distance Weighting for Hand-Gesture Recognition," TU Munich, Dept. of Informatics VI,
Robotics and Embedded Systems, Boltzmannstr. 3, DE-85748 Garching, Germany.
[3] https://docs.opencv.org/2.4/doc/tutorials/imgproc/gausian_median_blur_bilateral_filter/gausian_median_blur_bilateral_filter.html
[4] Mohammed Waleed Kadous, Machine recognition of Auslan signs using PowerGloves: Towards
large-lexicon recognition of sign language.
[5] adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks-Part-2/
[6] http://www-i6.informatik.rwth-aachen.de/~dreuw/database.php
[7] Pigou L., Dieleman S., Kindermans PJ., Schrauwen B. (2015) Sign Language Recognition Using
Convolutional Neural Networks. In: Agapito L., Bronstein M., Rother C. (eds) Computer Vision -
ECCV 2014 Workshops. ECCV 2014. Lecture Notes in Computer Science, vol 8925. Springer,
Cham
[8] Zaki, M.M., Shaheen, S.I.: Sign language recognition using a combination of new vision-based
features. Pattern Recognition Letters 32(4), 572–577 (2011).
[10] Byeongkeun Kang, Subarna Tripathi, and Truong Q. Nguyen, "Real-time sign language
fingerspelling recognition using convolutional neural networks from depth map," 2015 3rd IAPR
Asian Conference on Pattern Recognition (ACPR).
[12] https://opencv.org/
[13] https://en.wikipedia.org/wiki/TensorFlow
[14] https://en.wikipedia.org/wiki/Convolutional_neural_network
[15] http://hunspell.github.io/