
TRIBHUVAN UNIVERSITY

LALITPUR ENGINEERING COLLEGE

MID TERM PROJECT REPORT


ON
“AMERICAN SIGN LANGUAGE DETECTION USING CNN”

SUBMITTED BY
Amrit Sapkota (076 BCT 05)
Asmit Oli (076 BCT 43)
Nischal Maharjan (076 BCT 20)
Sakshyam Aryal (076 BCT 29)

SUBMITTED TO:
DEPARTMENT OF COMPUTER ENGINEERING
LALITPUR ENGINEERING COLLEGE
LALITPUR, NEPAL

SUPERVISED BY
Er. Hemant Joshi

December, 2023
TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
LALITPUR ENGINEERING COLLEGE
DEPARTMENT OF COMPUTER ENGINEERING

MID TERM PROJECT REPORT ON:


“AMERICAN SIGN LANGUAGE DETECTION USING CNN”
IN PARTIAL FULFILLMENT FOR THE
AWARD OF
BACHELOR’S DEGREE IN COMPUTER ENGINEERING

SUBMITTED BY
Asmit Oli (076 BCT 43)
Amrit Sapkota (076 BCT 05)
Sakshyan Aryal (076 BCT 29)
Nischal Maharjan (076 BCT 20)

SUPERVISED BY
Er. Hemant Joshi

December, 2023
ACKNOWLEDGEMENT

First and foremost, we would like to thank our supervisor, Er. Hemant Joshi, who guided us in doing this project. He provided us with invaluable advice and helped us through difficult stages, and his motivation contributed tremendously to the successful completion of the project. We are grateful to our project coordinator, Er. Bibat Thokar, for advising us and introducing the project to us in an easy-to-understand way, which helped us complete our project easily, effectively, and on time. We would like to express our special gratitude to IOE as well as our principal, Dr. Surendra Tamrakar, who gave us the golden opportunity to do this wonderful project on the topic of Sign Language Detection; it led us to a great deal of research through which we came to know many new things, and we are truly thankful for that. Besides, we would like to thank all the teachers who helped us by advising us and providing the equipment we needed, and we humbly acknowledge our debt to all those who helped us turn these ideas into something concrete. We would also like to thank our family and friends for their support; without it, we would not have succeeded in completing this project. Last but not least, we would like to thank everyone who helped and motivated us to work on this project.

Sincerely,
Asmit Oli (076 BCT 43)
Amrit Sapkota (076 BCT 05)
Nischal Maharjan (076 BCT 20)
Sakshyan Aryal (076 BCT 29)

ABSTRACT

There is an undeniable communication problem between the Deaf community and the hearing majority. It is hard for deaf people to communicate because many people do not understand sign language. Through innovation in sign language recognition, we tried to tear down this communication barrier. In this report, we show how Artificial Intelligence can play a key role in providing a solution. Using the dataset and the front camera of a laptop, sign language is translated into text on the screen in real time, i.e. the input is in video format whereas the output is in text format. Extracting complex head and hand movements, along with their constantly changing shapes, for sign language recognition is considered a difficult problem in computer vision. MediaPipe provides the necessary key points, or landmarks, of the hands, face, and pose. The model is then trained using a Convolutional Neural Network (CNN), and the trained model is used to recognize sign language.

Keywords: Convolutional Neural Network (CNN), Recurrent Neural Network, Deep Learning, Gesture Recognition, Sign Language Recognition.

TABLE OF CONTENTS

ACKNOWLEDGEMENT i
ABSTRACT ii
TABLE OF CONTENTS iii
LIST OF FIGURES v
LIST OF ABBREVIATIONS vi
1 INTRODUCTION 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 System Requirement . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5.1 Functional Requirements . . . . . . . . . . . . . . . . . . . 2
1.5.2 Non-functional Requirement . . . . . . . . . . . . . . . . . . 3
2 LITERATURE REVIEW 4
3 BLOCK DIAGRAM 7
3.1 System Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Use Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Level 0 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4 Level 1 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.5 Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4 METHODOLOGY 12
4.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2.1 Video Acquisition . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2.2 Video Segmentation . . . . . . . . . . . . . . . . . . . . . . . 13
4.2.3 Frame Extraction . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2.4 Preprocessing Techniques . . . . . . . . . . . . . . . . . . . 13
4.2.5 Hand Segmentation . . . . . . . . . . . . . . . . . . . . . . . 14
4.2.6 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . 14

4.3 Convolution Neural Network . . . . . . . . . . . . . . . . . . . . . . 14
4.4 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5 IMPLEMENTATION PLAN 17
5.1 Gantt Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
6 REQUIREMENT ANALYSIS 18
6.1 Hardware Requirements . . . . . . . . . . . . . . . . . . . . . . . . 18
6.2 Software Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 18
6.3 User Requirement Definition . . . . . . . . . . . . . . . . . . . . . . 19
7 RESULT AND ANALYSIS 20
7.1 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 20
7.2 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 22
7.3 Comparison of CNN and LSTM . . . . . . . . . . . . . . . . . . . 23
8 EPILOGUE 24
8.1 Task Completed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
8.2 Remaining Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
REFERENCES 26

LIST OF FIGURES

3.1 System Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 7


3.2 Use Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Level 0 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4 Level 1 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.5 Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

4.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12


4.2 CNN Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

5.1 Gantt Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

6.1 Hand Landmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

7.1 Epoch Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20


7.2 Epoch Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
7.3 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
7.4 Output from Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

LIST OF ABBREVIATIONS

ConvNets Convolutional Networks
CNN Convolutional Neural Network
CPU Central Processing Unit
DFD Data Flow Diagram
DTW Dynamic Time Warping
FPS Frames Per Second
GPU Graphics Processing Unit
ReLU Rectified Linear Unit
RGB Red Green Blue
LSTM Long Short-Term Memory
1D One-Dimensional

CHAPTER 1

INTRODUCTION

1.1 Background

With the rapid growth of technology around us, Machine Learning and Artificial Intelligence have been used in various sectors to support mankind, including gesture, object, and face detection. With the help of Deep Learning, a machine imitates the way humans gain certain types of knowledge: Artificial Neural Networks simulate the human brain, and convolution layers extract the most important parts of an image to keep computation manageable. "Sign Language Detection", the name itself, specifies the gist of the project. Sign language recognition has been a major problem for people with speech and hearing disabilities in the community: most people do not understand sign language, and it is difficult for them to learn it. Apart from scoring grades for this minor project, the core idea is to make communication easier for deaf people. We set the bar of the project such that it would be beneficial to society as well. The main reason for us to choose this project is to aid people using Artificial Intelligence.[1]

1.2 Problem Definition

Though there is a lot of research going on regarding sign language recognition, very little of it is implemented in practical life. In the course of our team's research we came across this problem: although many sign language recognition software and hardware systems exist, people who do not understand sign language, or who cannot read, may still have trouble communicating, and glove-based recognition is neither portable nor affordable. This made us think of using image processing and deep learning for sign language recognition and providing the output as voice so that everyone can understand.[2]

1.3 Scope

The field of sign language recognition includes the development and application of
techniques for recognizing and interpreting sign language gestures. This involves
using computer vision and machine learning techniques to analyze video input and
identify gestures of sign language users. Sign language recognition has a wide range
of potential applications, including communication aids for deaf people, automatic
translation of sign language into spoken or written language, and an interactive
platform for learning sign language. The scope also extends to improving the
accuracy and efficiency of sign language recognition systems through advances in
algorithms, sensor technology, and data collection. Additionally, the scope includes addressing challenges related to sign language diversity, gestural variation,
lighting conditions, and the need for robust real-time performance in a variety of
environments. [3]

1.4 Objectives

The following are the objectives for sign language detection.

• To design and implement a system that can understand the sign language of
hearing-impaired people.

• To train the model with a variety of datasets using MediaPipe and CNN,
and provide the output in real-time.

• To recognize sign language and provide the output as voice or text.

1.5 System Requirement

The following is the desired functionality of the new system, which the proposed project would cover:

1.5.1 Functional Requirements

• Real-time Output.

• Accurate detection of gestures.

• Data sets comment.

1.5.2 Non-functional Requirement

• Performance Requirement.

• Design Constraints.

• Reliability.

• Usability.

• Maintainability.

CHAPTER 2

LITERATURE REVIEW

There are many articles and papers that have been published regarding Sign
Language Detection. Many of them used different algorithms and data sets of
their own. In 1992, researchers developed a camera that could focus on a person’s
hand because the signer wore a glove with markings on the tip of each finger and
later, in 1994, on a ring of color around each joint on the signer’s hand (Starner,
1996). In 1995, Starner began the development of a system that initially involved the signer wearing two different colored gloves, although eventually no gloves were required. A camera was placed on a desk or mounted in a cap worn by the signer in order to capture the movements (Starner, 1996). More recently, a wearable system has been developed that can function as a limited interpreter (Brashear, Starner, Lukowicz, and Junker, 2003) [4]. To this end, they used a camera vision system along with wireless accelerometers mounted in a bracelet or watch to measure hand rotation. Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation, and recognition of objects and regions in images. These were all tasks in which labeled data was relatively abundant, such as traffic sign recognition, the segmentation of biological images (particularly for connectomics), and the detection of faces, text, pedestrians, and human bodies in natural images [5].
A major recent practical success of ConvNets is face recognition. Toshev and Szegedy proposed a deep learning-based method which localizes body joints by solving a regression problem and further improves estimation precision by using a cascade of these pose regressors. Their work demonstrates that a general deep learning-based network originally designed for a classification problem can be fine-tuned and used to solve localization and detection problems [4]. Convolutional neural networks have likewise been applied directly to sign language recognition [6].

Table 2.1: Summary of Related Works on ASL Recognition

S.N. | Related Works | Results | Tools Used
1 | Real-time American Sign Language Recognition with Convolutional Neural Networks (2016) | accuracy 95.72% | CNN
2 | Sign Language Translation Using Deep Convolutional Neural Networks, Rahib H. (2019) | accuracy 94.91% | CNN, fingerspelling dataset
3 | American Sign Language Recognition and Training Method with Recurrent Neural Network, Lee (2021) | accuracy 93.36% (LSTM), 94.23% (SVM), 95.03% (RNN) | LSTM, SVM, RNN
4 | User-Independent American Sign Language Alphabet Recognition Based on Depth Image and PCANet Features, Walah Aly and Saleh Aly (2019) | accuracy 88% | Principal Component Analysis Network (PCANet)
5 | A New Benchmark on American Sign Language Recognition using Convolutional Neural Network, Md. Rahman (2019) | accuracy 95.9% | CNN
6 | American Sign Language Alphabet Recognition by Extracting Features from Hand Pose Estimation (2020) | accuracy 87% | SVM
7 | Real-time Recognition of American Sign Language using Long Short-Term Memory Neural Network and Hand Detection (2021) | accuracy 93.81% | LSTM
8 | Sign Language Recognition System for Communicating to People with Disabilities (2023) | accuracy 95.1% | CNN
9 | Real-time Assamese Sign Language Recognition using MediaPipe and Deep Learning (2023) | accuracy 96.21% | MediaPipe, Microsoft Kinect sensor
CHAPTER 3

BLOCK DIAGRAM

3.1 System Block Diagram

Figure 3.1: System Block Diagram

The overall workflow of the system is shown in the block diagram above. The data-set acts like the memory of the system: every detection that we view in real time is a result of the data-set. Data-sets are captured in real time from the front camera of the laptop. Using MediaPipe's live perception of simultaneous human pose, face landmarks, and hand tracking in real time, various modern applications, including sign language detection, can be enabled. With the help of the landmarks, or key points, of the features (face, pose, and hands) that we get from MediaPipe, we train our model. All the data collected from the data-sets and from the deep learning models is considered training data. These data are provided to the system so that it can detect sign language in real time. The input to this system is real-time (live) video from the front camera of the laptop. As the real-time input, i.e. sign language, is provided through the front camera, the corresponding live output can simultaneously be seen on the screen in text format. The screen acts as an interface for the sign language system, providing an environment in which the input data is processed and the output is presented.

3.2 Use Case Diagram

Figure 3.2: Use Case Diagram

3.3 Level 0 DFD

Figure 3.3: Level 0 DFD

3.4 Level 1 DFD

Figure 3.4: Level 1 DFD

3.5 Activity Diagram

Figure 3.5: Activity Diagram

CHAPTER 4

METHODOLOGY

4.1 Data Collection

In this project, we have collected around 20,000 sign language samples organized into 10 classes. These classes serve as labels, and predictions are made over these labels. High-quality video recording tools, including cameras and lighting setups that allow a good view of hand motions, are used to collect the data. We have used the MediaPipe library to extract key points from the images, which are stored as data. A sample of the data collection is shown below.

Figure 4.1: Data Collection
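As an illustration, the following sketch shows how key points might be extracted from the collected images with MediaPipe Hands and stored as data, as described above. The directory layout (dataset/<class>/<image>.jpg) and the output file name are assumptions made for the example, not the authors' exact code.

import glob
import os

import cv2
import mediapipe as mp
import numpy as np

DATA_DIR = "dataset"                   # assumed layout: dataset/<class>/<image>.jpg
mp_hands = mp.solutions.hands

features, labels = [], []
with mp_hands.Hands(static_image_mode=True, max_num_hands=2,
                    min_detection_confidence=0.5) as hands:
    for class_id, class_name in enumerate(sorted(os.listdir(DATA_DIR))):
        for path in glob.glob(os.path.join(DATA_DIR, class_name, "*.jpg")):
            image = cv2.imread(path)
            results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
            if not results.multi_hand_landmarks:
                continue               # skip images where no hand was detected
            # 21 landmarks per hand, (x, y, z) each
            pts = np.array([[lm.x, lm.y, lm.z]
                            for lm in results.multi_hand_landmarks[0].landmark])
            features.append(pts.flatten())
            labels.append(class_id)

np.savez("asl_keypoints.npz", x=np.array(features), y=np.array(labels))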

4.2 Data Preprocessing

Data preprocessing is the process of preparing raw data and making it suitable for a machine-learning model. It is the first and crucial step in creating such a model. Real-world data generally contains noise and missing values and may be in a format that cannot be fed directly to machine-learning models. Data preprocessing cleans the data and makes it suitable for a machine-learning model, which also increases the accuracy and efficiency of the model. In our project, we have used the MediaPipe library, which performs much of this preprocessing for our data.

4.2.1 Video Acquisition

The video data for sign language detection will be captured using a high-definition
camera with a decent resolution. The camera will be positioned to capture the
frontal view of the signer’s upper body, focusing on the hand region.

4.2.2 Video Segmentation

Then the acquired video data will be segmented into individual sign language
gestures. We will employ an automatic gesture detection algorithm based on motion
and hand region analysis. This algorithm will then detect significant changes in
motion and will use hand-tracking techniques to separate consecutive gestures from
video sequences.

4.2.3 Frame Extraction

From the segmented video data, frames will be extracted at a rate of one frame per second to capture key moments of each gesture. This guarantees a representative set of frames for further analysis.
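A minimal sketch of this one-frame-per-second extraction with OpenCV is given below; the video file name and the output naming scheme are illustrative assumptions.

import cv2

def extract_frames(video_path, out_prefix="frame"):
    # Save roughly one frame per second from the given gesture video.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0    # fall back if FPS is unknown
    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % int(round(fps)) == 0:
            cv2.imwrite(f"{out_prefix}_{saved:04d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example: extract_frames("gesture_hello.mp4")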

4.2.4 Preprocessing Techniques

Resizing and Cropping


To maintain consistency in the input data, the extracted frames will be resized to a fixed resolution, for example 640x480 pixels. Additionally, we will crop each frame to focus on the hand region, ensuring that irrelevant background information is eliminated.
Color Conversion
We will convert the frames from RGB to grayscale to reduce the computational
complexity and focus solely on hand shape and motion.
Noise Reduction
Gaussian smoothing with a kernel size of 3x3 will be applied to reduce noise in
the grayscale frames. This will help to enhance the clarity of hand contours and

minimize the impact of minor variations caused by lighting conditions.
Contrast Enhancement
Histogram equalization will be applied to the grayscale frames to improve the
visibility of hand features. This will enhance the contrast and increase the overall
dynamic range of pixel intensities.
Normalization
We will use min-max scaling to translate the intensity values from the [0, 255]
range to [0, 1], standardizing the pixel values across frames. By ensuring that the
input data has consistent ranges, this normalization step will help in convergence
during model training.
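The preprocessing steps above (resizing and cropping, grayscale conversion, Gaussian smoothing, histogram equalization, and min-max normalization) can be sketched as a single OpenCV function; the exact frame size and crop box are assumptions for illustration.

import cv2
import numpy as np

def preprocess_frame(frame, size=(640, 480), hand_box=None):
    # Return a normalized grayscale frame with values in the [0, 1] range.
    frame = cv2.resize(frame, size)                    # resizing
    if hand_box is not None:                           # optional crop to the hand region
        x, y, w, h = hand_box
        frame = frame[y:y + h, x:x + w]
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)     # color conversion
    gray = cv2.GaussianBlur(gray, (3, 3), 0)           # noise reduction (3x3 kernel)
    gray = cv2.equalizeHist(gray)                      # contrast enhancement
    return gray.astype(np.float32) / 255.0             # min-max normalization

# Example: processed = preprocess_frame(cv2.imread("frame_0001.jpg"))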

4.2.5 Hand Segmentation

We will use a hand segmentation technique based on color and region analysis be-
cause hand movements are important in sign language. To separate the hands from
the background and other unimportant items, this technique will use background
subtraction and skin color modeling.
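A simple illustration of this idea, combining background subtraction with a rough skin-color model in HSV space, is shown below; the HSV thresholds are assumptions and would need tuning for real lighting conditions.

import cv2
import numpy as np

bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

def segment_hand(frame_bgr):
    # Return a binary mask that keeps moving, skin-colored regions.
    motion_mask = bg_subtractor.apply(frame_bgr)                 # background subtraction
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    skin_mask = cv2.inRange(hsv, (0, 30, 60), (25, 180, 255))    # rough skin-color model
    mask = cv2.bitwise_and(motion_mask, skin_mask)
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)        # remove small noise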

4.2.6 Data Augmentation

We will use data augmentation techniques to broaden the variety and amount
of the training dataset. These will consist of randomizing the frames’ cropping,
rotation, translation, and flipping. The model’s ability to recognize sign gestures
in a variety of situations will be strengthened with the aid of data augmentation.
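One possible way to realize this augmentation is with Keras preprocessing layers applied on the fly during training; the specific rotation, shift, and zoom ranges below are illustrative assumptions.

import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),       # random flipping
    tf.keras.layers.RandomRotation(0.1),            # rotation of up to about ±36 degrees
    tf.keras.layers.RandomTranslation(0.1, 0.1),    # random translation
    tf.keras.layers.RandomZoom(0.2),                # random zoom (crop-like effect)
])

# Example: augmented_batch = augment(image_batch, training=True)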

4.3 Convolution Neural Network

The CNN layer is the most significant; it builds a convolved feature map by applying a filter to an array of image pixels. We developed a CNN with three layers, each layer using convolution, ReLU, and pooling. Because a CNN does not handle rotation and scaling by itself, a data augmentation approach was used: a few samples were rotated, enlarged, shrunk, thickened, and thinned manually. Convolution filters are applied to the input using 1D convolutions to extract the most significant characteristics; in 1D convolution the kernel slides in one dimension, which exactly suits the spatial properties of the data. Convolution sparsity, when used with pooling for location-invariant feature detection and parameter sharing, lowers overfitting.

Figure 4.2: CNN Architecture

As data travels through the network, the ReLU layer acts as an activation function, ensuring non-linearity; without it, the desired non-linearity would be lost. It introduces non-linearity, accelerates training, and reduces computation time. The pooling layer gradually decreases the dimension of the feature maps and the variation of the represented data. It reduces dimensions and computation, speeds up processing by reducing the number of parameters the network must compute, reduces overfitting by reducing the number of parameters, and makes the model more tolerant of changes and distortions. Pooling strategies include max pooling, min pooling, and average pooling; we used max pooling, which takes the maximum value of each region of the convolved feature. Flatten is used to transform the data into a one-dimensional array for input to the next layer. In the Dense layer, the input tensor is multiplied by a weight matrix (a matrix-vector multiplication), followed by an activation function; apart from the activation function, the essential argument defined here is units, an integer that selects the output size. The Dropout layer is a regularization approach that randomly removes neurons from layers, along with their input and output connections. As a consequence, generalization is improved and overfitting is avoided.
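A minimal Keras sketch of the network described above is given below: three convolution blocks (convolution, ReLU, max pooling) followed by Flatten, Dense, Dropout, and a softmax output over the 10 classes. The 32/64/128 filter counts, Adam optimizer with learning rate 0.001, and categorical cross-entropy loss follow the parameter table in Chapter 7; the 1D input shape (a flattened vector of 21 hand landmarks with x, y, z each) is an assumption based on the keypoint features described earlier.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10                                  # ten gesture classes (Section 4.1)

model = models.Sequential([
    layers.Input(shape=(63, 1)),                  # assumed: 21 landmarks x (x, y, z)
    layers.Conv1D(32, 3, activation="relu"),      # convolution + ReLU
    layers.MaxPooling1D(2),                       # max pooling
    layers.Conv1D(64, 3, activation="relu"),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 3, activation="relu"),
    layers.MaxPooling1D(2),
    layers.Flatten(),                             # to a one-dimensional array
    layers.Dense(128, activation="relu"),         # units selects the output size
    layers.Dropout(0.5),                          # randomly drops neurons (regularization)
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()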

4.4 Loss Function

Categorical cross-entropy is a widely used loss function in machine learning, particularly for multi-class classification tasks such as American Sign Language (ASL) detection. Tailored for scenarios where instances belong to one of several mutually exclusive classes, categorical cross-entropy measures the dissimilarity between the predicted probability distribution over classes and the true distribution. In ASL detection, where accurate classification of the various sign gestures is crucial, this loss function guides the training process: by penalizing deviations from the actual class probabilities, it steers the model towards more precise predictions. Its use ensures that the model learns to discern subtle differences among ASL gestures, ultimately contributing to higher accuracy and proficiency in sign language recognition. Categorical cross-entropy is therefore a central component of the ASL detection pipeline, optimizing the model's ability to interpret and classify a diverse range of sign language expressions accurately.

\[
\text{Categorical Cross-Entropy} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij}) \tag{4.1}
\]
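The small numeric sketch below shows equation (4.1) in code: N is the number of samples, M the number of classes, y_ij the one-hot true label, and p_ij the predicted probability. The example values are made up purely for illustration and are compared against the Keras loss used during training.

import numpy as np
import tensorflow as tf

y_true = np.array([[0., 1., 0.],      # sample 1 belongs to class 2
                   [1., 0., 0.]])     # sample 2 belongs to class 1
y_pred = np.array([[0.1, 0.8, 0.1],   # predicted probabilities (softmax output)
                   [0.6, 0.3, 0.1]])

# Direct implementation of equation (4.1)
manual = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# The same quantity via the Keras loss used during training
keras_loss = tf.keras.losses.CategoricalCrossentropy()(y_true, y_pred).numpy()

print(manual, keras_loss)             # both are approximately 0.367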

CHAPTER 5

IMPLEMENTATION PLAN

5.1 Gantt Chart

Figure 5.1: Gantt Chart

CHAPTER 6

REQUIREMENT ANALYSIS

6.1 Hardware Requirements

The hardware required for the project is:

• CPU

• GPU

• Storage

6.2 Software Requirements

The software required for the project is:


Python
Python is a high-level, general-purpose programming language developed by Guido van Rossum and first released in 1991 as Python 0.9.0. It supports programming paradigms such as structured, object-oriented, and functional programming.

TensorFlow
TensorFlow is a free, open-source library for machine learning and artificial intelligence. Among many other tasks, it can be used for training deep learning models.

Mediapipe
MediaPipe offers cross-platform, customizable machine learning solutions for live and streaming media, i.e. real-time video. Its features include end-to-end acceleration, build-once-deploy-anywhere portability, ready-to-use solutions, and a free and open-source license.

Figure 6.1: Hand Landmarks

6.3 User Requirement Definition

The user requirements for this system are that it should be fast, feasible, less prone to error, and time-saving, and that it should reduce the communication gap between hearing people and deaf people:

• The system can translate sign language into text.

• The system should have a user-friendly interface.

CHAPTER 7

RESULT AND ANALYSIS

The CNN model has been trained with the following parameters:


S.N. | Parameter Used | Value
1 | Number of Convolutional Layers | 3 (32, 64, 128)
2 | Activation Function | ReLU
3 | Learning Rate | 0.001
4 | Optimizer | Adam
5 | Batch Size | 32
6 | Epochs | 8
7 | Input Image Size | 640x480
8 | Loss Function | Categorical cross-entropy

Figure 7.1: Epoch Accuracy

Figure 7.2: Epoch Loss

7.1 Quantitative Analysis

In order to evaluate the effectiveness of the system being proposed, we have measured its performance using various metrics, including Accuracy, Precision, Recall, F1-Score, and Error Rate. Accuracy refers to how closely the measurements of the system align with a particular value, and is expressed as follows:

\[
\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{7.1}
\]

Precision is a metric that measures the accuracy of positive predictions made by a system. It can be obtained by dividing true positives by the sum of true positives and false positives.

\[
\text{Precision} = \frac{TP}{TP + FP} \tag{7.2}
\]

In machine learning, recall, also referred to as sensitivity or true positive rate, represents the likelihood that the model correctly identifies an actual positive instance.

\[
\text{Recall} = \frac{TP}{TP + FN} \tag{7.3}
\]

The F1-Score is a metric that combines precision and recall using their harmonic mean. It provides a single value for comparison, with higher values indicating better performance.

\[
\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7.4}
\]
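As a sketch of how these metrics might be computed on a held-out test set, the snippet below uses scikit-learn with macro averaging over the ten classes; the variable names (y_test, y_pred) are illustrative.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def report_metrics(y_test, y_pred):
    # y_test: true class indices, y_pred: argmax of the model's softmax output
    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred, average="macro"))
    print("Recall   :", recall_score(y_test, y_pred, average="macro"))
    print("F1-score :", f1_score(y_test, y_pred, average="macro"))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))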

Figure 7.3: Confusion Matrix

The photo shows the sign for the word 'Hello'. The landmarks of the hand are taken as a NumPy array and fed into our deep learning model. The model classifies the hand gesture and outputs the class label corresponding to that gesture.

7.2 Qualitative Analysis

Output from our model is shown below: hand gestures for 'Hello', 'No', and 'I Love You' were provided as input, and the corresponding output can be seen on the screen.

Figure 7.4: Output from Model

7.3 Comparison of CNN and LSTM

CHAPTER 8

EPILOGUE

8.1 Task Completed

A significant milestone was achieved through the development and


training of both Long Short-Term Memory (LSTM) and Convolu-
tional Neural Network (CNN) models. The LSTM model, renowned
for its proficiency in sequence prediction tasks, underwent meticulous
training to grasp intricate patterns within the data. Simultaneously,
the CNN model, tailored for image-related tasks, underwent a rigor-
ous training regimen to enhance its feature extraction capabilities.
Model evaluation played a pivotal role in assessing the efficacy of these trained models. Through a meticulous process of validation and testing, their performance metrics were scrutinized to ensure their aptitude for real-world applications.

To further elevate model efficiency, a multifaceted approach was im-


plemented. Firstly, the dataset was substantially expanded, exposing
the models to a more diverse range of examples, thereby augmenting
their ability to generalize. This step proved crucial in enhancing the
models’ robustness across a spectrum of scenarios. Additionally, the
number of units in both dense layers of the models was increased,
empowering them to capture more intricate relationships in the data.
This strategic augmentation of model complexity contributed to a
more nuanced understanding of the underlying patterns.

Furthermore, a deliberate increase in kernel size and the number of
filters in the CNN model was undertaken. This adjustment facilitated
a broader scope of feature extraction, enabling the model to discern
more complex spatial hierarchies within the input data. As a result,
the model’s capacity for recognizing subtle patterns in images was
greatly amplified. Collectively, these refinements, encompassing
dataset expansion, augmentation of model architecture, and fine-
tuning of hyperparameters, culminated in a substantial increase in
the overall efficiency of both the LSTM and CNN models, positioning
them as formidable tools in the realm of predictive analytics and
image processing.

8.2 Remaining Task

Developing a website that provides the best user experience.

REFERENCES

[1] S. Gattupalli, A. Ghaderi, and V. Athitsos, "Evaluation of deep learning based pose estimation for sign language recognition," in Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive Environments, pp. 1–7, 2016.

[2] G. A. Rao, K. Syamala, P. Kishore, and A. Sastry, "Deep convolutional neural networks for sign language recognition," in 2018 Conference on Signal Processing and Communication Engineering Systems (SPACES), pp. 194–197, IEEE, 2018.

[3] N. K. Zirzow, "Signing avatars: Using virtual reality to support students with hearing loss," Rural Special Education Quarterly, vol. 34, no. 3, pp. 33–36, 2015.

[4] A. Toshev and C. Szegedy, "DeepPose: Human pose estimation via deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660, 2014.

[5] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[6] L. Pigou, S. Dieleman, P.-J. Kindermans, and B. Schrauwen, "Sign language recognition using convolutional neural networks," in Computer Vision – ECCV 2014 Workshops, Zurich, Switzerland, September 6–7 and 12, 2014, Proceedings, Part I, pp. 572–578, Springer, 2015.

