
SIGN LANGUAGE RECOGNITION & TRANSLATOR USING DEEP LEARNING

A PROJECT REPORT

Submitted by

SHREYAS [710120205303]
MADHAN KUMAR [710120205026]
ASHOKAN [710120205006]

in partial fulfillment for the award of the degree of

BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY

ADITHYA INSTITUTE OF TECHNOLOGY

ANNA UNIVERSITY : CHENNAI 600 025

MAY 2023
BONAFIDE CERTIFICATE

This project report “SIGN LANGUAGE RECOGNITION & TRANSLATOR USING
DEEP LEARNING” is the bonafide work of “SREYAS (710120205303), MADHAN
KUMAR (710120205026), ASHOKAN (710120205006)”, who carried out the project
work under my supervision.

SIGNATURE                                  SIGNATURE
Dr. Mishmala Sushith, M.E., Ph.D.          Ms. Teitsana Devi, B.E., M.E.
HEAD OF THE DEPARTMENT                     SUPERVISOR
Department of Information Technology       Assistant Professor - IT
Adithya Institute of Technology,           Adithya Institute of Technology,
Kurumbapalayam,                            Kurumbapalayam,
Coimbatore - 641107.                       Coimbatore - 641107.

Submitted for the Anna University Viva-Voce examination

Held on

Internal Examiner External Examiner


ACKNOWLEDGEMENT

We are pleased to present the “SIGN LANGUAGE RECOGNITION &


TRANSLATOR USING DEEP LEARNING” project and take the opportunity to
express our profound gratitude to all of those people who helped us in the completion
of this project.
We would like to express our deep sense of gratitude to the management of
Adithya Institute of Technology, Mr. C. SUKUMARAN, Chairman, and Mr. S.
PRAVEEN KUMAR, Trustee, for providing us with all the facilities to carry out this
project.
We would like to express our sincere thanks to our CEO Madam, Dr.
SRINITHI PRAVEEN KUMAR, Adithya Institute of Technology, for providing us
all the facilities to carry out this project successfully.
We would like to express our gratitude to our Principal Madam, Dr. D.
SOMASUNDARESWARI, Adithya Institute of Technology, for providing us all the
facilities to carry out this project successfully.
We would like to express our profound thanks to Dr. Mishmala Sushith,
Head of the Department, Information Technology, Adithya Institute of Technology,
for her valuable suggestions and guidance throughout the course of the project.
We also express our sincere thanks to our guide, Ms. TEITSANA DEVI, M.E.,
Assistant Professor, Department of Information Technology. We also thank all the
faculty members of our department for their help in making this project a successful
one.
Finally, we take this opportunity to extend our deep appreciation to our family
and friends, for all they meant to us during the crucial times of the completion of our
project.
ABSTRACT
Sign language is a rich and complex language that uses hand gestures, facial
expressions and body language to convey meaning. However, not everyone is able to
understand or use sign language, which can create barriers to communication and
limit access to information. The inspiration for creating a sign language detector comes
from the desire to make communication more accessible and inclusive for the deaf and
hard-of-hearing community. A real-time sign language recognition model that works
as a communication bridge makes it easier for people to connect with each other
regardless of their hearing abilities. The model works by using deep learning
algorithms to analyze video footage of hand gestures and translate them into text or
speech.

The proposed system is a real-time sign language detector that uses computer
vision techniques and machine learning algorithms to recognize and translate
American Sign Language (ASL) gestures into text or speech. The system consists of a
camera that captures video footage of the user's hand gestures, which is then fed into a
deep-learning model using MediaPipe for analysis. The proposed system is designed to
be user-friendly and accessible, with a simple interface that allows users to easily
input their ASL gestures and receive real-time feedback. In the future, the proposed
system may be extended with a real-time video-call sign translation feature.
LIST OF FIGURES

FIG. NO.   FIGURE NAME
1          American Sign Language Symbols
2          Example of MediaPipe Holistic
3          MediaPipe Landmarks for Hands
4          Model Architecture Diagram
5          LSTM Computation Cell
6          Model Testing Accuracy & Loss
TABLE OF CONTENTS

CHAPTER NO.   TITLE

              ABSTRACT
              LIST OF FIGURES
1.            INTRODUCTION
2.            LITERATURE SURVEY
3.            EXISTING SYSTEM
4.            PROPOSED SYSTEM
5.            MODEL ARCHITECTURE
6.            SYSTEM SPECIFICATION
7.            IMPLEMENTATION AND OUTPUT RESULT
8.            CONCLUSION

              APPENDIX – I   SOURCE CODE
              APPENDIX – II  REFERENCES
CHAPTER 1
INTRODUCTION
Recently, deep learning algorithms have successfully addressed problems in
various fields such as image classification, machine translation, speech recognition
and other machine learning related areas. In this era of digital transformation, sign
language detection using deep learning has the potential to revolutionize
communication for the deaf and hard of hearing communities. Sign language
recognition using deep learning can be used in various applications, such as online
communication, educational settings, and even in public spaces like airports or
hospitals. The technology can also be used to create sign language avatars, which can
make it easier for people to interact with machines, such as virtual assistants or
robots. Sign language detection using deep learning is a rapidly developing field, with
ongoing research aimed at improving the accuracy and usability of the technology.
According to the report of the World Federation of the Deaf (WFD), over
5% of the world’s population (≈ 360 million people) has a hearing impairment,
including 328 million adults and 32 million children. Approximately 300 sign
languages are in use around the globe. Sign language recognition is a challenging
task, as sign language alphabets differ between sign languages. For instance, the
American Sign Language (ASL) alphabet varies widely from Indian Sign Language
or Italian Sign Language. Thus sign language varies from region to region. Moreover,
articulation of single as well as double hands is used to convey meaningful messages.
Sign language can be expressed in a compressed version, where a single gesture is
sufficient to describe a word. Sign language also has fingerspelling to describe
each letter of a word using a different sign corresponding to a particular letter. As
there are many words still not standardized in sign language dictionaries,
fingerspelling is often used to manifest a word. There are still about 150,000 words in
spoken English having no counterpart in ASL. Furthermore, names of people,
places, brands or titles do not have any standardized sign symbol. Besides, a user
might not be aware of the exact sign for a particular word, and in this scenario
fingerspelling comes in handy and any word can be easily described.
Previous works included sensor-based Sign Language Recognition (SLR)
systems, which were quite uncomfortable and more restrictive for signers. Specialized
hardware, for example sensors, was used, which was an expensive option as well.
In contrast, computer vision-based techniques use bare hands without any sensors or
coloured gloves. Due to the use of a single camera, computer vision-based techniques
are more cost-effective and highly portable compared to sensor-based techniques. In
computer vision-based methods, the most common approach for hand-tracking is skin
colour detection or background subtraction. Computer vision-based SLR systems often
deal with feature extraction, for example boundary modelling, contours, segmentation
of gestures and estimation of hand shapes. However, all these solutions are not
lightweight enough to run on real-time devices like mobile phone applications and are
thus restricted to platforms equipped with robust processors. Moreover, the challenge
of hand-tracking remained persistent in all these techniques. To address this drawback,
our proposed methodology uses an approach that involves Google’s innovative,
rapidly growing and open-source project MediaPipe together with a machine learning
algorithm on top of this framework to get a faster, simpler, cost-effective, portable and
easy-to-deploy pipeline which can be used as a sign language recognition system.
1.1 SCOPE OF THE PROJECT
The main scope of this project is to recognize sign language from human
gestures and translate it. It improves communication between hearing people and
deaf people, with the project acting as a bridge between both parties. The sign
language project can also be adapted to the sign languages of different regions.
1.2 OBJECTIVE OF THE PROJECT
The objective of this project is to create an automatic sign language translator
that can recognize and interpret sign language gestures and movements accurately
and generate spoken language output in real-time. The project aims to use state-of-
the-art computer vision and machine learning techniques to develop a reliable and
efficient system that can recognize a broad range of sign language gestures and
dialects.
The system's primary objective is to facilitate communication between
individuals who use sign language and those who do not, improving inclusivity and
accessibility for people with hearing impairments.
The project's secondary objective is to develop a user-friendly and accessible
interface that can be easily used by people with varying levels of technical knowledge.
Ultimately, the project's success will be measured by its ability to accurately translate
sign language into spoken language and its usefulness in promoting communication
and inclusivity in society.

CHAPTER 2
LITERATURE SURVEY
Hand gesture recognition is a relatively difficult problem to address in the field
of machine learning. Classification methods can be divided into supervised and
unsupervised methods. Based on these methods, the SLR system can recognize static
or dynamic sign gestures of the hands.
2.1 Neural Network for the First Time in Sign Language Recognition
Murakami and Taguchi, in 1991, published a research article using a neural
network for the first time in sign language recognition. With developments in the
field of computer vision, numerous researchers came up with novel approaches to help
the physically challenged community. Using coloured gloves, a real-time hand
tracking application was developed by Wang and Popovic. The colour pattern of the
gloves was recognized by the K-Nearest Neighbors (KNN) technique, but continuous
feeding of hand streams is required for the system.
2.2 Isolated Sign Recognition
The Support Vector Machine (SVM) outperformed the KNN approach in the research
findings of Rekha et al., Kurdyumov et al., Tharwat et al. and Baranwal and Nandi.
There are two types of sign language recognition: isolated sign recognition and
continuous sentence recognition. Likewise, whole-sign level modelling and subunit
sign level modelling exist in the SLR system. Visual-descriptive and linguistic-
oriented are two approaches that lead to subunit level sign modelling.
2.3 Pre-Processed Images for a Hand-Detection System
R. Sharma et al. used 80,000 individual numeric signs, with more than 500
pictures per sign, to train a machine learning model. Their system methodology
comprises a training database of pre-processed images for a hand-detection system and
a gesture recognition system. Image pre-processing included feature extraction to
normalize the input information before training the machine learning model. The
images are converted into grayscale for better object contours, maintaining a
standardized resolution, and then flattened into a smaller number of one-dimensional
components. The feature extraction technique helps to extract certain features about
the pixel data from images and feed them to a CNN for easier training and more
accurate prediction. Hand tracking in 2D and 3D space has been performed by W. Liu
et al. They used skin saliency, where skin tones within a specific range were extracted
for better feature extraction, and achieved a classification accuracy of around 98%.
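As a rough illustration (not the authors' exact code), the grayscale conversion, resizing and flattening described above can be sketched with OpenCV and NumPy as follows; the file name and the 64x64 target resolution are assumptions made only for this example.

import cv2
import numpy as np

# Illustrative pre-processing sketch: file name and 64x64 resolution are assumed
image = cv2.imread('sign_sample.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)      # grayscale for better object contours
resized = cv2.resize(gray, (64, 64))                # standardized resolution
normalized = resized.astype(np.float32) / 255.0     # scale pixel values to [0, 1]
flattened = normalized.flatten()                    # one-dimensional feature vector for the classifier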

2.4 Action Recognition Using CNN
Similar to action recognition, some recent works use CNNs to extract the
holistic features from image frames and then use the extracted features for
classification. Several approaches first extract body keypoints and then concatenate
their locations as a feature vector. The extracted features are then fed into a stacked
GRU for recognizing signs. These methods demonstrate the effectiveness of using
human poses in the word-level sign recognition task. Instead of encoding the spatial
and temporal information separately, recent works also employ 3D CNNs to capture
spatial-temporal features together. However, these methods are only tested on small-
scale datasets. Thus, the generalization ability of those methods remains unknown.
Moreover, due to the lack of a standard word-level large-scale sign language dataset,
the results of different methods evaluated on different small-scale datasets are not
comparable and might not reflect the practical usefulness of models.

2.5 Conceptual Video-Based SLT System


Conceptual video-based SLT systems were introduced in the early 2000s. There
have been studies which propose recognizing signs in isolation and then constructing
sentences using a language model. However, end-to-end SLT from video has not been
realized until recently.

The most important obstacle to vision based SLT research has been the
availability of suitable datasets. Curating and annotating continuous sign language
videos with spoken language translations is a laborious task. There are datasets
available from linguistic sources and sign language interpretations from broadcasts.
However, the available annotations are either weak (subtitles) or too few to build
models which would work on a large domain of discourse. In addition, such datasets
lack the human pose information which legacy Sign Language Recognition (SLR)
methods heavily relied on.

The relationship between sign sentences and their spoken language translations
is non-monotonic, as they have different ordering. Also, sign glosses and linguistic
constructs do not necessarily have a one-to-one mapping with their spoken language
counterparts. This made the use of available CSLR methods (that were designed to
learn from weakly annotated data) infeasible, as they are built on the assumption that
sign language videos and corresponding annotations share the same temporal order.

It is evident from all these previous methods that to recognize hand gestures
precisely with high accuracy, models require a large dataset and a complicated
methodology with complex mathematical processing. Pre-processing of images plays
a vital role in the gesture tracking process. Therefore, for our project, we used an open-
source framework from Google known as MediaPipe, which is capable of detecting
human body parts accurately.

CHAPTER 3
EXISTING SYSTEM
3.1 Sign Language Recognition
Early approaches for SLR rely on hand-crafted features (Tharwat et al., 2014;
Yang, 2010) and use Hidden Markov Models (Forster et al., 2013) or Dynamic Time
Warping (Lichtenauer et al., 2008) to model sequential dependencies. More recently,
2D convolutional neural networks (2D-CNN) and 3D convolutional neural networks
(3D-CNN) effectively model spatio-temporal representations from sign language
videos (Cui et al., 2017; Molchanov et al., 2016). Most existing work on CSLR
divides the task into three sub-tasks: alignment learning, single-gloss SLR, and
sequence construction (Koller et al., 2017; Zhang et al., 2014) while others perform
the task in an end-to-end fashion using deep learning (Huang et al., 2015; Camgoz et
al., 2017).

3.2 Sign Language Translation


SLT was formalized in Camgoz et al. (2018) where they introduce the
PHOENIX-Weather 2014T dataset and jointly use a 2D-CNN model to extract gloss-
level features from video frames, and a seq2seq model to perform German sign
language translation. Subsequent works on this dataset (Orbay and Akarun, 2020;
Zhou et al., 2020) all focus on improving the CSLR component in SLT. A
contemporaneous paper (Camgoz et al., 2020) also obtains encouraging results with
multi-task Transformers for both tokenization and translation, however their CSLR
performance is sub-optimal, with a higher Word Error Rate than baseline models.
Similar work has been done on Korean sign language by Ko et al. (2019) where they
estimate human keypoints to extract glosses, then use seq2seq models for translation.
Arvanitis et al. (2019) use seq2seq models to translate ASL glosses of the ASLG-
PC12 dataset (Othman and Jemni, 2012).

3.3 Neural Machine Translation


Neural Machine Translation (NMT) employs neural networks to carry out
automated text translation. Recent methods typically use an encoder-decoder
architecture, also known as seq2seq models. Earlier approaches use recurrent
(Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014) and convolutional networks
(Kalchbrenner et al., 2016; Gehring et al., 2017) for the encoder and the decoder.
However, standard seq2seq networks are unable to model long-term dependencies in
large input sentences without causing an information bottleneck. To address this issue,
recent works use attention mechanisms (Bahdanau et al., 2015; Luong et al., 2015)
that calculate context-dependent alignment scores between encoder and decoder
hidden states. Vaswani et al. (2017) introduce the Transformer, a seq2seq model
relying on self-attention that obtains state-of-the-art results in NMT.
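For reference, the scaled dot-product attention used in the Transformer of Vaswani et al. (2017) computes the alignment scores as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where Q, K and V are the query, key and value matrices derived from the hidden states and d_k is the key dimension. This is the standard formulation and is included here only as background.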

CHAPTER 4
PROPOSED SYSTEM
The deaf-mute community has undeniable communication problems in its
daily life. Recent developments in artificial intelligence help tear down this
communication barrier. The main purpose of this work is to demonstrate a methodology
that simplifies Sign Language Recognition using MediaPipe’s open-source framework
and a machine learning algorithm. The predictive model is lightweight and adaptable to
smart devices. Multiple sign language datasets, such as American, Indian, Italian and
Turkish, are used for training purposes to analyze the capability of the framework. With
an average accuracy of 99%, the proposed model is efficient, precise and robust. Real-
time accurate detection using the Long Short-Term Memory (LSTM) algorithm without
any wearable sensors makes the use of this technology more comfortable and easy.

4.1 American Sign Language (ASL)

Figure 1 – American Sign Language Symbols

ASL Basics: American Sign Language (ASL) is a complete, natural language


that has the same linguistic properties as spoken languages, with grammar that differs
from English.
Sign Language in Different Countries: There is no universal sign language.
Different sign languages are used in different countries or regions. Some countries
adopt features of ASL in their sign languages.

Origins of ASL: No person or committee invented ASL. The exact beginnings


of ASL are not clear, but some suggest that it arose more than 200 years ago from the
intermixing of local sign languages and French Sign Language (LSF, or Langue des
Signes Française).

ASL Compared to Spoken Language: ASL is a language completely separate


and distinct from English. It contains all the fundamental features of language, with
its own rules for pronunciation, word formation, and word order. Fingerspelling is
part of ASL and is used to spell out English words.

Neurobiology of Language Development: Study of sign language can also


help scientists understand the neurobiology of language development. Better
understanding of the neurobiology of language could provide a translational
foundation for treating injury to the language system, for employing signs or gestures
in therapy for children or adults, and for diagnosing language impairment in
individuals who are deaf.

Sign Languages Created Among Small Communities: The NIDCD is also funding
research on sign languages created among small communities of people with little to
no outside influence.

4.2 MediaPipe

MediaPipe Holistic Solution is a powerful, easy-to-use software tool that can
detect and track multiple human body parts and gestures in real-time video streams.
It is open-source and can run on a variety of platforms, including mobile devices,
making it an ideal solution for our application.

Real-time Perception: Real-time, simultaneous perception of human pose, face


landmarks, and hand tracking can enable impactful applications like fitness analysis,
gesture control, and sign language recognition.

Open-Source Framework: MediaPipe is an open-source framework designed


for complex perception pipelines.

State-of-the-Art Solution: MediaPipe Holistic is a solution that provides a


state-of-the-art human pose topology, consisting of optimized pose, face, and hand
components that each run in real-time.
Unified Topology: MediaPipe Holistic provides a unified topology for 540+
keypoints and is available on-device for mobile and desktop.

Separate ML Models For Separate Tasks: The pipeline integrates separate


models for pose, face, and hand components, treating the different regions using a
region-appropriate image resolution.

Significant Model Coordination: MediaPipe Holistic requires coordination
between up to 8 models per frame, together with optimized machine learning models
and pre- and post-processing algorithms, for performance benefits.

Performance Benefits: The multi-stage nature of the pipeline provides


performance benefits, as models are mostly independent and can be replaced with
lighter or heavier versions.

Figure 2 - Example of MediaPipe Holistic

Figure 3 - MediaPipe Landmarks for Hands

CHAPTER 5
MODEL ARCHITECTURE

Figure 4 – Model Architecture Diagram

5.1 Model Procedure

5.1.1 Stage 1: Pre-Processing of Images to get Multi-hand Landmarks using MediaPipe

MediaPipe is a framework that enables developers to build multi-modal (video,
audio, or any time-series data) cross-platform applied ML pipelines. MediaPipe has a
large collection of human body detection and tracking models, which are trained on
Google's massive and highly diverse datasets. They track key points on different parts
of the body as a skeleton of nodes and edges, or landmarks. All coordinate points are
three-dimensional and normalized. Models built by Google developers using
TensorFlow Lite allow the flow of information to be easily adapted and modified via
graphs. MediaPipe pipelines are composed of nodes on a graph, which are generally
specified in a pbtxt file. These nodes are connected to C++ files.

The base calculator class in MediaPipe is an expansion upon these files. This
class receives contracts of media streams, such as a video stream, from other nodes in
the graph and ensures that it is connected. Once the rest of the pipeline's nodes are
connected, the class generates its own processed output data. Packet objects
encapsulating many different types of information are used to send each stream of
information to each calculator. Side packets can also be imposed on a graph, where a
calculator node can be supplied with auxiliary data like constants or static properties.
This simplified dataflow structure in the pipeline enables additions or modifications
with ease, and the flow of data becomes more precisely controllable. The hand
tracking solution has an ML pipeline at its backend consisting of two models working
in tandem: a) a Palm Detection Model and b) a Hand Landmark Model. The Palm
Detection Model provides an accurately cropped palm image, which is then passed on
to the landmark model. This process diminishes the need for the data augmentation
(i.e. rotations, flipping, scaling) that is done in deep learning models and dedicates
most of its capacity to landmark localization.

The traditional way is to detect the hand in the frame and then perform landmark
localization over the current frame. The palm detector in this ML pipeline, however,
addresses the challenge with a different strategy. Detecting hands is a complex
procedure, as you have to perform image processing and thresholding and work with a
variety of hand sizes, which is time-consuming. Instead of directly detecting the hand
in the current frame, the palm detector is first trained to estimate bounding boxes
around rigid objects like palms and fists, which is simpler than detecting hands with
coupled fingers. Secondly, an encoder-decoder is used as an extractor for a bigger
scene context.
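A minimal per-frame landmark extraction sketch with the MediaPipe Holistic solution is given below. It mirrors the helper functions listed in Appendix I; the confidence thresholds and the webcam index are assumptions made for illustration only.

import cv2
import mediapipe as mp
import numpy as np

# Per-frame hand landmark extraction sketch (thresholds and camera index assumed)
holistic = mp.solutions.holistic.Holistic(min_detection_confidence=0.75,
                                          min_tracking_confidence=0.75)
cap = cv2.VideoCapture(0)
_, frame = cap.read()
results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

# 21 landmarks per hand with normalized (x, y, z); zeros when a hand is not detected
left = np.array([[p.x, p.y, p.z] for p in results.left_hand_landmarks.landmark]).flatten() \
    if results.left_hand_landmarks else np.zeros(63)
right = np.array([[p.x, p.y, p.z] for p in results.right_hand_landmarks.landmark]).flatten() \
    if results.right_hand_landmarks else np.zeros(63)
feature_vector = np.concatenate([left, right])   # 126 values per frame

cap.release()
holistic.close()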

5.1.2 Stage 2: Data Cleaning and Normalization

As in Stage 1 we consider only the x and y coordinates from the detector, each
image in the dataset is passed through Stage 1 to collect all the data points in one file.
This file is then scanned with pandas library functions to check for any null entries.
Sometimes, due to a blurry image, the detector cannot detect the hand, which leads to
a null entry in the dataset. Hence, it is necessary to clean these points, or they will
introduce bias into the predictive model. Rows containing these null entries are
located by their indexes and removed from the table. After the removal of unwanted
points, we normalize the x and y coordinates to fit our system. The data file is then
prepared for splitting into training and validation sets. 80% of the data is retained for
training the model with various optimization and loss functions, whereas 20% of the
data is reserved for validating the model.
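A hedged sketch of this cleaning and splitting step with pandas and scikit-learn is shown below; the file name, the 'label' column and the min-max normalization are assumptions used only to illustrate the description above.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical landmark file collected in Stage 1 (one row per image, plus a label column)
df = pd.read_csv('landmarks.csv')

# Remove null entries produced when the detector finds no hand in a blurry image
df = df.dropna().reset_index(drop=True)

# Min-max normalize the coordinate columns to [0, 1]
features = df.drop(columns=['label'])
features = (features - features.min()) / (features.max() - features.min())

# 80% for training, 20% for validation, as described above
X_train, X_val, y_train, y_val = train_test_split(
    features, df['label'], test_size=0.20, random_state=42)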

5.1.3 Stage 3: Prediction using LSTM


LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN)
architecture that is specifically designed to address the vanishing gradient problem,
which is a challenge faced by traditional RNNs when processing long sequences of
data. LSTMs are capable of learning long-term dependencies in sequential data by
utilizing a memory cell and various gates.

The key components of an LSTM are as follows:


Cell State (Ct): The cell state acts as a memory unit that can store information over
long sequences. It allows the LSTM to preserve relevant information and discard
irrelevant information as it flows through the network.

Figure 5 – LSTM Computation Cell

Hidden State (ht): The hidden state is the output of the LSTM at each time step. It
can be considered as the LSTM's "memory" of the previous time steps and contains
information that is relevant for prediction or classification tasks.
Input Gate (it), Forget Gate (ft), and Output Gate (ot): These gates regulate the
flow of information through the LSTM cell. The input gate determines how much new
information should be stored in the cell state, the forget gate controls what information
to discard from the cell state, and the output gate determines how much of the cell
state should be exposed as the hidden state.

Gate Activation Functions: Sigmoid activation functions are typically used for the
input, forget, and output gates to squash the gate values between 0 and 1. This allows
the gates to control the flow of information effectively.

Cell Activation Function: The cell state uses a hyperbolic tangent (tanh) activation
function to squish the values between -1 and 1, which helps in capturing the
relationships and dependencies between different time steps.
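In the standard formulation (given here for completeness, with x_t the input at time step t and W, U, b the learned weights and biases), the gates and states described above are computed as

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{C}_t &= \tanh(W_C x_t + U_C h_{t-1} + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

where σ is the sigmoid function and ⊙ denotes element-wise multiplication.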

During the forward pass of training or inference, the LSTM takes a sequence of
inputs (e.g., word embeddings or image features) and updates its cell state and hidden
state at each time step. The updated hidden state can be used for tasks like sequence
prediction, sentiment analysis, machine translation, and more.

LSTMs have been widely used in various domains, including natural language
processing (NLP), speech recognition, time series analysis, and computer vision, due
to their ability to model and understand complex sequential patterns and dependencies.
They have proven to be effective in capturing long-range dependencies and have
become a popular choice for tasks involving sequential data.

CHAPTER 6
SYSTEM SPECIFICATION
6.1 Hardware Requirements

Processor          AMD Ryzen 7

Operating System   Windows 10 (64-bit) / Linux (64-bit)

RAM                16 GB

Graphics Card      NVIDIA GeForce

6.2 Software Requirements & Library Dependencies

S.No   Software       Version   URL

1      Anaconda       3.7       https://www.anaconda.com/

2      Tensorflow     2.0       https://www.tensorflow.org/

3      Pandas         0.19.2    https://pandas.pydata.org/

4      Numpy          1.11.3    https://numpy.org/

5      OpenCV         4.7.0     https://opencv.org/

6      Scikit-learn   1.1.2     https://scikit-learn.org/stable/

7      MediaPipe      0.10.0    https://pypi.org/project/mediapipe/

8      Matplotlib     3.7.1     https://pypi.org/project/matplotlib/

Anaconda:

 Anaconda is a free and open-source distribution of Python and R for scientific


computing, data science, and machine learning.

 It includes over 1,500 data science packages and libraries, making it easy to get
started with data analysis and machine learning.

 It provides a package manager and environment manager for managing


dependencies and creating isolated environments for different projects.

 Anaconda is available for Windows, macOS, and Linux, and can be installed
through a graphical installer or command-line interface.

 It also includes a range of development tools, including Jupyter Notebook,


Spyder, and VS Code, for developing and experimenting with code.

TensorFlow:

 TensorFlow is an open-source software library for building and training machine


learning models.

 It was developed by Google and is widely used in industry and academia for a
variety of tasks, including image and speech recognition, natural language
processing, and more.

 TensorFlow provides a flexible and scalable framework for building deep neural
networks, including support for distributed training across multiple GPUs and
machines.

 It includes a range of pre-built models and tools for visualizing and analyzing
model performance.

 TensorFlow can be used with Python, C++, Java, and other programming
languages.

Pandas:

 Pandas is a Python library used for data manipulation and analysis.

 It offers data structures like DataFrames and Series, which enable easy
manipulation and analysis of structured data.

 Pandas provides powerful features for data cleaning, filtering, and aggregation.

 It also provides functionality for merging, pivoting, and reshaping data.

 Pandas supports a wide range of data formats, including CSV, Excel, SQL
databases, and more.

 It is a popular tool in data analysis and machine learning workflows.

NumPy:

 NumPy is a fundamental library for numerical computing in Python.

 It offers efficient data structures like arrays and matrices, which enable fast
computations.

 NumPy provides a wide range of mathematical functions and linear algebra


operations.

 It supports multi-dimensional data processing and manipulation.

 NumPy enables efficient data storage and retrieval using disk or memory-
mapped files.

 It is a foundation for many other scientific computing and data analysis


libraries.

OpenCV:

 OpenCV is a popular library for computer vision and image processing tasks.

 It offers a wide range of functions and algorithms for image and video I/O,
image manipulation, feature detection, and object recognition.

 OpenCV supports many programming languages, including Python, C++, and


Java.

 It provides a comprehensive set of tools for camera calibration and 3D
reconstruction.

 OpenCV enables real-time processing of multimedia data on a wide range of


platforms.

 It is widely used in computer vision applications, robotics, and machine


learning.

Scikit-learn:

 Scikit-learn is a comprehensive machine learning library in Python.

 It offers a wide range of supervised and unsupervised learning algorithms,


including classification, regression, clustering, and dimensionality reduction.

 Scikit-learn provides tools for data preprocessing, model selection, and


hyperparameter tuning.

 It also offers functionality for evaluating models and handling missing data.

 Scikit-learn is compatible with other Python libraries like NumPy, Pandas, and
Matplotlib.

 It is widely used in machine learning research and applications.

MediaPipe:

 MediaPipe is an open-source framework for building multimodal perceptual


computing pipelines.

 It provides pre-built components and algorithms for tasks like hand tracking,
face detection, pose estimation, and object detection.

 MediaPipe enables real-time processing of multimedia data on a wide range of


platforms.

 It supports many programming languages, including Python, C++, and Java.

 MediaPipe provides a flexible and scalable framework for building custom


pipelines.

 It is widely used in computer vision and augmented reality applications.

Matplotlib:

 Matplotlib is a popular plotting library in Python.

 It provides a wide range of functions for creating line plots, scatter plots, bar
plots, histograms, and more.

 Matplotlib offers extensive customization options for labels, titles, axes, colors,
and styles.

 It enables the creation of 2D and basic 3D visualizations.

 Matplotlib is compatible with other Python libraries like NumPy, Pandas, and
scikit-learn.

 It is widely used for data visualization and presentation of results in scientific


and data analysis projects.

6.3 Installation Procedure
6.3.1 Anaconda Navigator - Installing on Windows
1. Download the Anaconda installer.
2. Go to your Downloads folder and double-click the installer to launch. To
prevent permission errors, do not launch the installer from the Favorites folder.
Notes : If you encounter issues during installation, temporarily disable your
anti-virus software during install, then re-enable it after the installation
concludes. If you installed for all users, uninstall Anaconda and re-install it for
your user only.
3. Click Next.
4. Read the licensing terms and click I Agree.
5. It is recommended that you install for Just Me, which will install Anaconda
Distribution to just the current user account. Only select an install for All
Users if you need to install for all users’ accounts on the computer (which
requires Windows Administrator privileges).
6. Click Next.
7. Select a destination folder to install Anaconda and click Next. Install Anaconda
to a directory path that does not contain spaces or unicode characters.
8. Do not install as Administrator unless admin privileges are required.

9. Choose whether to add Anaconda to your PATH environment variable or
register Anaconda as your default Python. We don’t recommend adding
Anaconda to your PATH environment variable, since this can interfere with
other software. Unless you plan on installing and running multiple versions of
Anaconda or multiple versions of Python, accept the default and leave this box
checked. Instead, use Anaconda software by opening Anaconda Navigator or
the Anaconda Prompt from the Start Menu.
10. As of Anaconda Distribution 2022.05, the option to add Anaconda to the PATH
environment variable during an All Users installation has been disabled. This
was done to address a security exploit. You can still add Anaconda to the PATH
environment variable during a Just Me installation.
11. Click Install. If you want to watch the packages Anaconda is installing, click
Show Details.
12. Click Next.
13. Optional: To install DataSpell for Anaconda,
click https://www.anaconda.com/dataspell.
14. Or to continue without DataSpell, click Next.

1. After a successful installation you will see the “Thanks for installing Anaconda”
dialog box:

2. If you wish to read more about Anaconda.org and how to get started with
Anaconda, check the boxes “Anaconda Distribution Tutorial” and “Learn more
about Anaconda”. Click the Finish button.
3. Verify your installation.

6.3.2 Anaconda Navigator - Installing on Linux


1. In your browser, download the Anaconda installer for Linux.
2. Search for “terminal” in your applications and click to open.
3. (Recommended) Verify the installer’s data integrity with SHA-256. For more
information on hash verification, see cryptographic hash validation.
 In the terminal, run the following:
shasum -a 256 /PATH/FILENAME
# Replace /PATH/FILENAME with your installation's path and filename.
Install for Python 3.7 or 2.7 in the terminal:
 For Python 3.7, enter the following:

# Include the bash command regardless of whether or not you are using the Bash
shell
bash ~/Downloads/Anaconda3-2020.05-Linux-x86_64.sh
# Replace ~/Downloads with your actual path
# Replace the .sh file name with the name of the file you downloaded
 For Python 2.7, enter the following:
# Include the bash command regardless of whether or not you are using the Bash
shell
bash ~/Downloads/Anaconda2-2019.10-Linux-x86_64.sh
# Replace ~/Downloads with your actual path
# Replace the .sh file name with the name of the file you downloaded
Press Enter to review the license agreement. Then press and hold Enter to scroll.
Enter “yes” to agree to the license agreement.
Use Enter to accept the default install location, use CTRL+C to cancel the
installation, or enter another file path to specify an alternate installation directory.
If you accept the default install location, the installer
displays PREFIX=/home/<USER>/anaconda<2/3> and continues the installation.
It may take a few minutes to complete.
Notes: Anaconda recommends you accept the default install location. Do not choose
the path as /usr for the Anaconda/Miniconda installation.
The installer prompts you to choose whether to initialize Anaconda Distribution
by running conda init. Anaconda recommends entering “yes”.
 If you enter “no”, then conda will not modify your shell scripts at all. In
order to initialize after the installation process is done, first
run source [PATH TO CONDA]/bin/activate and then run conda init.
See FAQ.
The installer finishes and displays, “Thank you for installing Anaconda<2/3>!”
Optional: The installer describes the partnership between Anaconda and
JetBrains and provides a link to install Dataspell for Anaconda
at https://www.anaconda.com/dataspell.
Close and re-open your terminal window for the installation to take effect, or
enter the command source ~/.bashrc to refresh the terminal.

You can also control whether or not your shell has the base environment
activated each time it opens.
# The base environment is activated by default
conda config --set auto_activate_base True
# The base environment is not activated by default
conda config --set auto_activate_base False
# The above commands only work if conda init has been run first
# conda init is available in conda versions 4.6.12 and later
Verify your installation.
Notes: If you install multiple versions of Anaconda, the system defaults to the most
current version, as long as you haven’t altered the default install path.
Verifying Your Installation
Confirm that Anaconda is installed and working with Anaconda Navigator or
conda with the following instructions.
Anaconda Navigator
1. Anaconda Navigator is a graphical user interface (GUI) that is automatically
installed with Anaconda. Navigator will open if the installation was successful. If
Navigator does not open, review our help resources.
2. Windows: Click Start, search for Anaconda Navigator, and click to open.
3. macOS: Click Launchpad and select Anaconda Navigator. Or use Cmd+Space to
open Spotlight Search and type “Navigator” to open the program.
Conda
1. If you prefer using a command line interface (CLI), use conda to verify the
installation using Anaconda Prompt on Windows or the terminal on Linux and
macOS.
2. To open Anaconda Prompt:

3. Windows: Click Start, search for Anaconda Prompt, and click to open.

4. macOS: Use Cmd+Space to open Spotlight Search and type “terminal” to
open the program.

5. Linux–CentOS: Open Applications > System Tools > terminal.

6. Linux–Ubuntu: Open the Dash by clicking the Ubuntu icon, then type
“terminal”.

7. After opening Anaconda Prompt or the terminal, choose any of the following
methods to verify:

8. Enter conda list. If Anaconda is installed and working, this will display a list
of installed packages and their versions.

9. Enter the command python. This command runs the Python shell, also known
as the REPL. If Anaconda is installed and working, the version information it
displays when it starts up will include “Anaconda”. To exit the Python shell,
enter the command quit().

10. Open Anaconda Navigator with the command anaconda-navigator. If


Anaconda is installed properly, Anaconda Navigator will open.

6.3.3 Libraries Installation Code

Tensorflow:
pip install tensorflow
Pandas:
pip install pandas
NumPy:
pip install numpy
OpenCV:
pip install opencv-python
Scikit-learn:
pip install scikit-learn
MediaPipe:
pip install mediapipe
Matplotlib:
pip install matplotlib

CHAPTER 7
IMPLEMENTATION AND OUTPUT RESULT
The sign language recognition and translator model was trained for 100 epochs
using a sample training dataset extracted from real-time recorded sign symbols. The
resulting test accuracy of the sign language recognition model was 83.33%, with a
loss of 0.17.

 import matplotlib.pyplot as plt is added to import the required plotting library.

 The history variable is assigned the output of the fit method to store the training
history.

 The code for plotting training accuracy and loss is added after the model
training loop.

 The plt.show() command is used to display the generated plots.

 Finally, the test accuracy is printed using print("Test Accuracy:", accuracy).

The following figure shows the loss and accuracy results for each epoch while
training the sign language recognition model.

Figure 6 – Model Testing Accuracy & Loss

Test Accuracy: 0.8333333333333334

CHAPTER 8
CONCLUSION
With an average accuracy of 83% on most of the sign language datasets using
MediaPipe’s technology and machine learning, our proposed methodology shows that
MediaPipe can be used efficiently as a tool to detect complex hand gestures precisely.
Although sign language modelling using image processing techniques has evolved
over the past few years, the methods are complex and require high computational
power. The time consumed to train a model is also high. From that perspective, this
work provides new insights into the problem. Lower computing power requirements
and adaptability to smart devices make the model robust and cost-effective. Training
and testing with various sign language datasets show that this framework can be
adapted effectively to any regional sign language dataset and maximum accuracy can
be obtained. Faster real-time detection demonstrates the model’s efficiency over the
present state of the art. In the future, the work can be extended by introducing word
detection of sign language from videos using MediaPipe’s state-of-the-art models and
the best possible classification algorithms.

APPENDIX – I
SOURCE CODE

Data Collection
import os
import numpy as np
import cv2
import mediapipe as mp
import keyboard
from itertools import product
from my_functions import *

actions = np.array(['a', 'b'])
sequences = 30
frames = 10

PATH = os.path.join('data')

for action, sequence in product(actions, range(sequences)):
    # Create a folder for every action/sequence pair if it does not already exist
    try:
        os.makedirs(os.path.join(PATH, action, str(sequence)))
    except:
        pass

cap = cv2.VideoCapture(0)
if not cap.isOpened():
    print("Cannot access camera.")
    exit()

with mp.solutions.holistic.Holistic(min_detection_confidence=0.75,
                                    min_tracking_confidence=0.75) as holistic:
    for action, sequence, frame in product(actions, range(sequences), range(frames)):
        # Before the first frame of each sequence, pause until the spacebar is pressed
        if frame == 0:
            while True:
                if keyboard.is_pressed(' '):
                    break
                _, image = cap.read()

                results = image_process(image, holistic)
                draw_landmarks(image, results)

                cv2.putText(image, 'Recording data for the "{}". Sequence number {}.'.format(action, sequence),
                            (20, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1, cv2.LINE_AA)
                cv2.putText(image, 'Pause.', (20, 400), cv2.FONT_HERSHEY_SIMPLEX,
                            1, (0, 0, 255), 2, cv2.LINE_AA)
                cv2.putText(image, 'Press "Space" when you are ready.', (20, 450),
                            cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2, cv2.LINE_AA)
                cv2.imshow('Camera', image)
                cv2.waitKey(1)

                if cv2.getWindowProperty('Camera', cv2.WND_PROP_VISIBLE) < 1:
                    break
        else:
            _, image = cap.read()
            results = image_process(image, holistic)
            draw_landmarks(image, results)

            cv2.putText(image, 'Recording data for the "{}". Sequence number {}.'.format(action, sequence),
                        (20, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1, cv2.LINE_AA)
            cv2.imshow('Camera', image)
            cv2.waitKey(1)

            if cv2.getWindowProperty('Camera', cv2.WND_PROP_VISIBLE) < 1:
                break

        # Save the extracted hand keypoints for this frame
        keypoints = keypoint_extraction(results)
        frame_path = os.path.join(PATH, action, str(sequence), str(frame))
        np.save(frame_path, keypoints)

cap.release()
cv2.destroyAllWindows()

MY FUNCTION
import mediapipe as mp
import cv2
import numpy as np

def draw_landmarks(image, results):
    # Draw the detected left and right hand landmark connections onto the frame
    mp.solutions.drawing_utils.draw_landmarks(image, results.left_hand_landmarks,
                                              mp.solutions.holistic.HAND_CONNECTIONS)
    mp.solutions.drawing_utils.draw_landmarks(image, results.right_hand_landmarks,
                                              mp.solutions.holistic.HAND_CONNECTIONS)

def image_process(image, model):
    # Convert BGR to RGB, run the MediaPipe model, then convert back for OpenCV
    image.flags.writeable = False
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = model.process(image)
    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    return results

def keypoint_extraction(results):
    # Flatten the 21 (x, y, z) landmarks per hand; use zeros when a hand is not detected
    lh = np.array([[res.x, res.y, res.z] for res in results.left_hand_landmarks.landmark]).flatten() \
        if results.left_hand_landmarks else np.zeros(63)
    rh = np.array([[res.x, res.y, res.z] for res in results.right_hand_landmarks.landmark]).flatten() \
        if results.right_hand_landmarks else np.zeros(63)
    return np.concatenate([lh, rh])

MODEL
import tensorflow
import numpy as np
import os
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from itertools import product
from sklearn import metrics

from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import LSTM, Dense

PATH = os.path.join('data')

actions = np.array(os.listdir(PATH))
sequences = 30
frames = 10

label_map = {label:num for num, label in enumerate(actions)}

landmarks, labels = [], []

# Load the saved keypoint sequences and build the (sequence, label) pairs
for action, sequence in product(actions, range(sequences)):
    temp = []
    for frame in range(frames):
        npy = np.load(os.path.join(PATH, action, str(sequence), str(frame) + '.npy'))
        temp.append(npy)
    landmarks.append(temp)
    labels.append(label_map[action])

X, Y = np.array(landmarks), to_categorical(labels).astype(int)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.10,
                                                    random_state=34, stratify=Y)

model = Sequential()
model.add(LSTM(64, return_sequences=True, activation='relu',
               input_shape=(frames, X.shape[2])))
model.add(LSTM(32, return_sequences=False, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(actions.shape[0], activation='softmax'))

model.compile(optimizer='Adam', loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

# Train once and keep the history for plotting the accuracy and loss curves
history = model.fit(X_train, Y_train, epochs=100, validation_data=(X_test, Y_test))

model.save('my_model')

predictions = np.argmax(model.predict(X_test), axis=1)
test_labels = np.argmax(Y_test, axis=1)
accuracy = metrics.accuracy_score(test_labels, predictions)

# Plot training accuracy and loss
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(history.history['categorical_accuracy'])
plt.plot(history.history['val_categorical_accuracy'])
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(['Train', 'Test'], loc='upper left')

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(['Train', 'Test'], loc='upper right')

plt.tight_layout()
plt.show()

print("Test Accuracy:", accuracy)


MAIN

import numpy as np
import os
import mediapipe as mp
import cv2
from my_functions import *
import tensorflow as tf
from tensorflow import keras
from keras.models import load_model
import keyboard

PATH = os.path.join('data')

actions = np.array(os.listdir(PATH))

model = load_model('my_model')

# Create the MediaPipe Holistic detector (same thresholds as during data collection)
holistic = mp.solutions.holistic.Holistic(min_detection_confidence=0.75,
                                          min_tracking_confidence=0.75)

sentence, keypoints = [' '], []

cap = cv2.VideoCapture(0)
if not cap.isOpened():
    print("Cannot access camera.")
    exit()

while cap.isOpened():
    _, image = cap.read()
    results = image_process(image, holistic)
    draw_landmarks(image, results)
    keypoints.append(keypoint_extraction(results))

    # Run a prediction once 10 frames (one sequence) have been collected
    if len(keypoints) == 10:
        keypoints = np.array(keypoints)
        prediction = model.predict(keypoints[np.newaxis, :, :])
        keypoints = []

        if np.amax(prediction) > 0.9:
            if sentence[-1] != actions[np.argmax(prediction)]:
                sentence.append(actions[np.argmax(prediction)])

    # Keep only the last 7 recognized signs on screen
    if len(sentence) > 7:
        sentence = sentence[-7:]

    # Clear the sentence when the spacebar is pressed
    if keyboard.is_pressed(' '):
        sentence = [' ']

    textsize = cv2.getTextSize(' '.join(sentence), cv2.FONT_HERSHEY_SIMPLEX, 1, 2)[0]
    text_X_coord = (image.shape[1] - textsize[0]) // 2

    cv2.putText(image, ' '.join(sentence), (text_X_coord, 470),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2, cv2.LINE_AA)

    cv2.imshow('Camera', image)
    cv2.waitKey(1)
    if cv2.getWindowProperty('Camera', cv2.WND_PROP_VISIBLE) < 1:
        break

cap.release()
cv2.destroyAllWindows()

APPENDIX – II
REFERENCES
1. The world’s simplest facial recognition API for Python and the command line.
   https://github.com/ageitgey/face_recognition.

2. Runpeng Cui, Hu Liu, and Changshui Zhang. 2017. Recurrent convolutional neural
   networks for continuous sign language recognition by staged optimization. In CVPR.

3. Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, and Richard Bowden. 2020.
   Sign language transformers: Joint end-to-end sign language recognition and
   translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and
   Pattern Recognition (CVPR), June.

4. Pradyumna Narayana, J. Ross Beveridge, and Bruce A. Draper. 2018. Gesture
   recognition: Focus on the hands. In CVPR.

5. R. Sharma, R. Khapra, N. Dahiya. June 2020. Sign Language Gesture Recognition,
   pp. 14-19.

6. Sahoo, Ashok. 2014. Indian sign language recognition using neural networks and
   kNN classifiers. Journal of Engineering and Applied Sciences.

7. Elakkiya R, Selvamani K, Velumadhava Rao R, Kannan A. 2012. Fuzzy hand
   gesture recognition based human computer interface intelligent system. UACEE Int
   J Adv Comput Netw Secur 2(1):29–33 (ISSN 2250–3757).

8. Ahmed AA, Aly S. 2014. Appearance-based Arabic sign language recognition using
   hidden Markov models. In: IEEE International Conference on Engineering and
   Technology (ICET), pp 1–6. https://doi.org/10.1109/ICEngTechnol.2014.7016804.