Sign Language
BONAFIDE CERTIFICATE
Abstract
Communication is essential for exchanging information, knowledge, ideas, and views among people, but it has long been an obstacle for people with hearing and speech disabilities. Sign language is one method of communicating with deaf people. Although sign language allows communication with non-signing people, it is difficult for everyone to interpret and understand, and the performance of existing sign language recognition approaches is typically limited. Developing an assistive system that translates sign language into a readable format will help deaf-mute people communicate easily with the general public. Recent advances in deep learning and deep neural networks, especially convolutional neural networks (CNNs), have provided solutions for the communication of deaf and mute individuals. A CNN can effectively extract features from images and generalize to unseen images. The main objective of this project is to ease communication by implementing an automatic speaking system for deaf and mute people. It provides two-way communication for all classes of people (deaf-and-mute, hard of hearing, visually impaired, and non-signers) and can be scaled commercially. The proposed system uses a convolutional neural network (CNN) to convert sign language to speech. The proposed model is trained on alphabets from American Sign Language. We developed a web-based user interface for ease of deployment. It is equipped with text-to-speech, speech-to-text, and auto-correct features to support communication between deaf-and-mute, hard of hearing, visually impaired, and non-signing users. Experimental results on the MNIST sign language recognition dataset validate the superiority of the proposed framework. The CNN model gives an accuracy of 98.5%.
TABLE OF CONTENTS

CHAPTER NO   TITLE
             Abstract
             List of Tables
             List of Figures
             List of Abbreviations
1            INTRODUCTION
             1.1 Objective
2            LITERATURE SURVEY
3            SYSTEM ANALYSIS
             3.1 Existing System
             3.2 Proposed System
4            SYSTEM DESIGN
             4.1 System Architecture
             4.2 Training Phase
             4.3 Testing Phase
             4.4 Data Flow Diagram
5            SYSTEM IMPLEMENTATION
             5.1 System Description
             5.2 Dataset Description
             5.3 Module Description
6            UML DIAGRAMS
             6.1 Use Case Diagram
             6.2 Class Diagram
             6.3 Activity Diagram
             6.4 Sequence Diagram
             6.5 Collaboration Diagram
7            SYSTEM SPECIFICATION
8            SOFTWARE DESCRIPTION
9            SYSTEM TESTING
             9.1 Unit Testing
             9.2 Validation Testing
             9.3 Functional Testing
             9.4 Integration Testing
10           RESULTS AND CONCLUSION
11           CONCLUSION AND FUTURE ENHANCEMENT
             REFERENCES
LIST OF ABBREVIATIONS

S.NO   ABBREVIATION   EXPLANATION
1      DL             Deep Learning
2      CNN            Convolutional Neural Network
3      ASL            American Sign Language
4      CSOM           Convolutional Self-Organizing Map
5      SL             Sign Language
6      FSL            French Sign Language
7      AI             Artificial Intelligence
8      ML             Machine Learning
9      SLR            Sign Language Recognition
10     RNN            Recurrent Neural Network
11     DFD            Data Flow Diagram
LIST OF FIGURES

FIGURE       FIGURE NAME
Figure 1.1   Sign Language
CHAPTER 1
INTRODUCTION
1.1. Overview
Sign language is manual communication commonly used by people who are deaf. Sign language is not universal; deaf people from different countries use different sign languages. The gestures or symbols in sign language are organized in a linguistic way. Each individual gesture is called a sign, and each sign has three distinct parts: the handshape, the position of the hands, and the movement of the hands. American Sign Language (ASL) is the most commonly used sign language in the United States.
Conditions such as palsy, trauma, brain diseases, or speech difficulties may require a nonverbal mode of communication. In total, 6,909 spoken languages and 138 sign languages have been identified, but there is no universal sign language. Each has its own syntactic and grammatical structures to provide a definitive means of communication, primarily for deaf communities worldwide. Sign languages rely on the movement of the hands, arms, head, and body in a conceptually predetermined manner to construct a gestural language. Indian Sign Language (ISL) is the name given to the sign language used in India. According to the 2011 census, 2.7 million people in India cannot speak, and 1.8 million are deaf. They face difficulty communicating with others because most hearing people are unfamiliar with sign language. However, communication between them becomes inevitable in emergencies. Sign language interpreters are required to convert sign language to spoken language and vice versa, but their supply is limited and their services are expensive. As a result, automatic sign language recognition systems are needed to translate signs into corresponding text or voice without the assistance of interpreters. Through human-computer interaction, systems can be built to aid the deaf and other communities who rely on sign languages. In recent years, there have been ongoing efforts to develop automated methods for numerous linguistic tasks using advanced algorithms that can 'learn' from past experience. Sign language recognition (SLR) is an area where automation can provide tangible benefits and improve the quality of life for a significant number of people who rely on sign language to communicate daily. The successful introduction of such capabilities would allow for the creation of a wide array of specialized services, but it is paramount that automated SLR tools are sufficiently accurate to avoid producing confusing or dysfunctional responses. Recently, basic machine learning approaches have been largely replaced with deeper architectures that employ several layers and pass information in vector form between layers, gradually refining the estimate until a positive recognition is achieved. Such algorithms are usually described as "deep learning" systems or deep neural networks, and they operate on principles similar to the machine learning strategies described above, although with far greater complexity.
1.3. Deep Learning
Deep learning is a branch of machine learning, which is in turn a subset of artificial intelligence. It extracts information through the layers of its architecture and is used in image recognition, fraud detection, news analysis, stock analysis, self-driving cars, and healthcare applications such as cancer image analysis. Feeding more data into the network allows the layers to be trained better. Deep learning methods can be classified into supervised, semi-supervised, and unsupervised categories. Each layer extracts a specific kind of information; for example, in image recognition, the first layers detect edges and lines, while later layers detect higher-level structures such as eyes, ears, and noses.
Deep learning helps improve the efficiency of predictions, find the best possible outcomes, and optimize models. When data volumes are huge, it can reduce a company's costs in terms of insurance, sales, profit, and so on. Deep learning is especially useful when the data has no particular structure, such as audio, video, images, numbers, and document data.
Advantages of Deep Learning
- Solves complex problems such as audio processing (e.g. Amazon Echo) and image recognition.
- Reduces the need for manual feature extraction; tasks can be automated and predictions made quickly using Keras and TensorFlow.
- Supports parallel computing, reducing overheads.
- Models can be trained on huge amounts of data, and they improve as more data is added.
- Produces high-quality predictions compared with humans, because the model can be trained tirelessly.
- Works well with unstructured data such as video clips, documents, sensor data, and webcam data.
1.3.1. Deep Learning Algorithms
Deep learning algorithms work with almost any kind of data and require large amounts of computing power and information to solve complicated problems. Some of the most important application areas of deep learning are described below.
Healthcare
From medical image analysis to helping cure diseases, deep learning has played a huge role, especially where GPU processors are available. It also helps physicians, clinicians, and doctors keep patients out of danger and diagnose and treat them with appropriate medicines.
Stock Analysis
Quantitative equity analysts benefit from deep learning when determining whether the trend for a particular stock will be bullish or bearish, and they can use many more factors, such as the number of transactions, number of buyers, number of sellers, and the previous day's closing balance, when training the deep learning layers. Qualitative equity analysts use factors such as return on equity, P/E ratio, return on assets, dividends, return on capital employed, profit per employee, and total cash when training the deep learning layers.
Fraud Detection
These days, hackers, especially those operating on the dark web, have found ways to steal money digitally across the globe using different software. Deep learning can learn to identify such fraudulent transactions on the web using many factors such as router information and IP addresses. Autoencoders also help financial institutions save billions of dollars in costs. These fraudulent transactions can also be detected by finding outliers and investigating them.
Image Recognition
Suppose a city police department has a database of the city's residents and wants to know who is involved in crimes or violence at public gatherings, using the public webcams available in the streets. Deep learning with CNNs (convolutional neural networks) helps greatly in identifying the person involved in the act.
News Analysis
Governments now spend considerable effort controlling the spread of fake news and tracing its origin. During poll surveys, questions such as which candidate would win an election in terms of popularity, or which candidate is shared most by people on social media, can be addressed by analysing tweets and similar variables, and deep learning can be used to predict the outcomes. There are limitations, however: the authenticity of the data is unknown, i.e. whether it is genuine or fake, or whether the information was spread by bots.
Self-Driving Cars
Self-driving cars use deep learning by analysing data captured by cars driven in different terrains such as mountains, deserts, and plains. Data can be captured from sensors, public cameras, etc., which is helpful in the testing and implementation of self-driving cars. The system must be able to ensure that all scenarios are handled well during training.
1.4. Statement of the Problem
To design a real-time software system that can recognize ISL hand gestures using deep learning techniques. This project aims to predict the alphanumeric gestures of the ISL system.
1.5. Objective of the Project
The objective of the project is to propose a well-adapted deep architecture for automatic hand gesture recognition: a DCNN model that learns region-based spatiotemporal features for hand gestures. The input of this model is a sequence of RGB images captured by a basic camera; it does not require other input channels, coloured gloves, or a complex setup. The aim is to design the CNN model and architecture, train it on the pre-processed images, achieve the maximum possible accuracy, and propose a deep learning-based DeepCNN algorithm to recognize 0-1 numbers and A-D alphabets from ASL data.
CHAPTER 2
LITERATURE SURVEY
2.1. Sign Language Recognition Using Template Matching Technique
Author: Soma Shrenika; Myneni Madhu Bala
Year: 2021
Doi: 10.1109/ICCSEA49143.2020.9132899
Problem identified
Sign language is also useful for people suffering from Autism Spectrum Disorder (ASD). Normal people cannot understand the signs used by deaf people, as they do not know the meaning of a particular sign.
Objective
This project uses a camera that captures various gestures of the hand. First, pre-processing of the image takes place. The output is text, so one can easily interpret the meaning of a particular sign. The system is implemented using OpenCV-Python and uses various libraries.
Methodology
In this study, normal people cannot understand the signs used by deaf people because they do not know the meaning of a particular sign. The system uses a camera, which captures various gestures of the hand; the image is then processed using various algorithms. The system is implemented with OpenCV-Python and uses various libraries. RGB is a device-dependent model, and different systems produce different values, so this method takes 33% of red, 33% of green, and 33% of blue, i.e. the contribution from the three colours is equal. Gaussian filters are smoothing filters that reduce noise by using a Gaussian kernel; as input, the height, width, and standard deviation in both directions are provided. If a pixel's value is larger than the high threshold value, the pixel value is set to 255. The edge operator detects changes in intensities and gradient values in the horizontal and vertical directions, i.e. (x, y), and returns the first-order derivatives for both directions. The comparison of the template image with the images in the dataset uses the Sum of Absolute Differences (SAD) method. SAD computes intensity values by subtracting the pixel values of the two images and accumulating the result until no pixel is left. The results obtained after the experiment include a sample input image of a hand gesture (a three-dimensional image), the result after template matching using SAD, and the matched image from the dataset.
Dataset
The dataset has 70 samples for each of the 36 symbols. These 70 samples of each symbol cover all the hand shapes and movements. The entire dataset is available in both compressed and uncompressed formats. Every sample in the dataset is characterized by its equivalent sign; a unique sign letter corresponds to every sample.
Findings
This paper aims to help the deaf members of society communicate with normal people. The system is implemented using image-processing techniques and is intended for people who cannot use gloves, sensors, or other highly refined equipment. First, an image is acquired with a camera. The system is a two-way system in which conversion of sign to text and text to sign is possible. Future work includes developing a system that interprets dynamic gestures and extending the implementation to mobile phones.
2.2. Real-Time Bangla Sign Language Detection with Sentence and Speech
Generation
Author: Dipon Talukder; Fatima Jahara
Year: 2020
Doi:10.1109/ICECIT54077.2021.9641137
Problem identified
Sign language is the non-verbal language used by people with hearing and speaking disabilities, known as the deaf and mute, to bridge the communication gap with others. Being a visual means of communication, it prevents mute people from communicating with people who have a visual impairment. A medium that recognizes sign language and converts it into text and speech could fill this gap.
Objective
This paper proposes a system for BdSL recognition that can interpret BdSL from a sequence
of images or a video stream and generate both textual sentences and speech in real-time. We
have used YOLOv4 as the object detection model. We have also proposed three new signs for
the sentence generation task.
Methodology
This study addresses sign language, the non-verbal language used by people with hearing and speaking disabilities. Earlier work used the YOLOv3 algorithm for real-time sign language conversion. Finger-spelling of words is the leading alternative for communication for deaf people, and with the proposed technique finger-spelling can also be used to write documents. The sign set consists of the 10 digits and 36 characters used in standard BdSL, along with three proposed signs for generating sentences: compound characters, space, and end of sentence. An earlier Bangla Sign Language detection system converted training samples from the RGB colour space to the HSV colour space and then extracted features using the Scale Invariant Feature Transform (SIFT) algorithm before feeding them to the model; another extracted features using the Principal Component Analysis (PCA) algorithm to reduce dimensionality before feeding the data to the model, and another proposed a Bangla Sign Language to speech generation system using a pair of smart gloves, sensors, and a microcontroller. There are multiple models, such as VGG16 [15], SpineNet [19], CSPResNeXt50 [14], and CSPDarknet53 [14], that can be used as the backbone of the network. In this work CSPDarknet53 is used as the model backbone, since it significantly reduces the computational cost [8] of extracting features from the input image while retaining the CNN's ability to learn even after reducing the number of layers. The computation distribution property of CSPNet reduces the computational bottleneck and increases the utilization rate of each computational unit.
Dataset
An earlier dataset contained 25 signs to classify and 15,000 instances for training, achieving 82.06% mAP under complex backgrounds. The authors built a dataset of 12.5k BdSL images across 49 classes, of which 46 are manual alphabets of BdSL: ten Bangla digits, six Bangla vowels, thirty Bangla consonants, plus the three proposed signs. The images include shadowed captures, noisy images, and complex indoor backgrounds with shadows.
Findings
Sign language is the non-verbal language used by people with hearing and speaking disabilities, known as the deaf and mute, to bridge the communication gap with others. The system can help communication with blind and mute people and can assist mute and deaf persons in day-to-day communication tasks. It acts as a translator for BdSL, converting sign language into sentences and then into speech in real time. As the system captures over 30 frames per second, a person highly proficient in sign language can generate sentences very quickly.
2.3. An Efficient Approach for Interpretation of Indian Sign Language
using Machine Learning
Author:Dhivyasri S; Krishnaa Hari K B; Dr. Krishnaveni V
Year:2021
Doi: 10.1109/ICSPC51351.2021.9451692
Problem identified:
Sign language is used by people with hearing or speech disabilities to express their thoughts and feelings. However, most people find it difficult to understand the hand gestures of these specially challenged people because they do not know the meaning of the sign language gestures.
Objective
This paper uses image processing techniques and machine learning algorithms. Different neural network classifiers are developed, tested, and validated for their performance in gesture recognition, and the most efficient classifier is identified.
Methodology
The proposed system for ISL interpretation performs two major tasks: (i) gesture-to-text conversion and (ii) speech-to-gesture conversion. Feature extraction is done using the Speeded-Up Robust Features (SURF) method. Machine learning algorithms such as Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Support Vector Machine (SVM) are compared. The SVM classifier is used along with a K-means clustering classifier and a Bag of Visual Words (BoV) model to achieve better accuracy. For both CNN and RNN, the data in the dataset is divided into three parts. The combination of K-means clustering, BoV, and SVM classifiers has the highest accuracy in recognizing the hand gestures and is therefore more reliable for gesture recognition.
Findings
A user-friendly application that can interpret Indian Sign Language has been developed using the most efficient classifier, SVM (for gesture-to-text conversion), and the Google Speech Recognition API (for speech-to-gesture conversion). Thus, a more reliable sign language interpretation system has been developed.
2.4. Automated Sign Language Interpreter Using Data Gloves
Author:H S Anupama; B A Usha
Year:2021
Doi:10.1109/ICAIS50930.2021.9395749
Problem identified
Communication plays a very important role in the everyday life of human beings. Sign language is used for this form of communication, and it is an incredible advancement that has grown over the years. This project helps improve communication with the deaf and the dumb using flex sensor technology. The sign language is converted so that it is understood by common people, helping them communicate without any barriers.
Objective
The main purpose of this project is to eliminate the barrier between the deaf and dumb and others by using an automated system that recognizes hand gestures and interprets them appropriately. The system uses a combination of sensors fixed onto a glove to capture various sign gestures. The sensor values for these signs are recorded by an Arduino and processed using the k-nearest neighbours machine learning algorithm. There are many challenges that come with the usage of sign language.
Methodology
The algorithm used to implement the automated sign language interpreter is the k-nearest neighbours (KNN) algorithm. KNN is a simple ML algorithm based on supervised learning. It stores all the available data and classifies a new data point based on similarity: when new data appears, it is easily classified into a well-suited category by the KNN algorithm.
Dataset
Recognition of the alphabets A and B was successful for values of k up to 80; C, D, and E for k up to 20; and L, M, N, O, P, Q, and R for values of k less than 10. It was observed that the alphabets T, U, and V were misclassified as A and H respectively, and this error was handled by adding more samples to the dataset for those letters.
Findings
The sign language interpreter successfully identifies most of the letters of the English alphabet and a couple of basic phrases such as "Good Morning", "Hello", and "Bye" using the KNN algorithm. It is an economically feasible, easy-to-use device that can help the deaf and dumb communities overcome communication barriers.
2.5. Sign Languages to Speech Conversion Prototype using the SVM
Classifier.
Author:Malli Mahesh Chandra; Rajkumar S
Year: 2019
Doi:10.1109/TENCON.2019.8929356
Problem identified
Around 70 million people in this world are mute people. There are children who suffer from
Nonverbal Autism. Communication between the people with speech impairment and normal
people is very difficult. Normally people with speech impairment use Sign Language to
communicate with others.
Objective
In this paper, a prototype is proposed that gives speech output for sign language gestures to bridge the communication gap between people with speech impairment and normal people. Sensors capture the real-time gestures made by the user. An Arduino Nano microcontroller collects data from these sensors and sends it to the PC via Bluetooth. The PC processes the data sent by the Arduino and runs a machine learning algorithm to classify the sign language gestures and predict the word associated with each gesture.
Methodology
The MPU6050 sensor module is a complete 6-axis motion tracking device. It combines a 3-axis gyroscope, a 3-axis accelerometer, an on-chip temperature sensor, and a digital motion processor in a small package. The training dataset is fitted to a Support Vector Machine (SVM) classifier; different languages have different trained datasets, and the model is stored in the .pkl format. The data received at the computer has to be processed before being fed into the SVM classifier. The system predicts both ASL and ISL gestures, allowing the user to speak English and some Indian languages by making gestures.
Dataset
The received data corresponding to each gesture is repeatedly recorded at different positions. Both ASL and ISL gestures are trained to produce speech output in English and Indian languages. The default baud rate of the module is 38400, with 8 data bits, one stop bit, and no parity bits. It also supports baud rates of 9600, 19200, 38400, 57600, 115200, 230400, and 460800, and can be used in both master-mode and slave-mode configurations.
Findings
The user can speak English or some Indian languages through gestures using this prototype. Around 22 gestures in ASL (including control gestures) and 11 gestures in ISL (including control gestures) were trained and tested successfully. An accuracy of 100% was achieved on the ISL database with 25% test data and 75% training data. In this prototype only one glove was designed, so only gestures from a single hand can be sensed; however, ASL and ISL include both one-hand and two-hand gestures, so a single glove is not sufficient to make all sign language gestures.
2.6. Wearable Sensor-Based Sign Language Recognition: A Comprehensive
Review
Author: Karly Kudrinko, Emile Flavin, Xiaodan Zhu,
Year:2021
DOI: 10.1109/RBME.2020.3019769
Problem identified
Sign language is used as a primary form of communication by many people who are Deaf,
deafened, hard of hearing, and non-verbal. Communication barriers exist for members of
these populations during daily interactions with those who are unable to understand or use
sign language.
Objective
In this study, attributes including sign language variation, sensor configuration, classification method, study design, and performance metrics were analyzed and compared. Results from this literature review could aid the development of user-centred and robust wearable sensor-based systems for sign language recognition.
Methodology
Sign language recognition (SLR) techniques can be categorized into three groups: computer vision-based models, wearable sensor-based systems, and hybrid systems. Exclusion terms such as image, vision, and camera were used to narrow the results. Non-English articles and those not published in peer-reviewed sources were excluded from the study. Since the primary focus of this review was to examine methods for SLR used in mobile applications, only studies involving wearable devices were included. The VPL DataGlove has two fibre optic sensors on the back of each finger to detect joint flexion and a tracker on the back of the palm to detect the six degrees of freedom of orientation and position. Practical SLR applications require a device that maximizes ease of use and comfort while maintaining a high level of accuracy.
Dataset
Benchmark datasets are available for domain-specific machine learning applications (e.g. natural language processing, image processing), so algorithms can be directly compared with one another. However, it is difficult to have such a dataset when a variety of sensor configurations and sign language variations exist.
Findings
This paper reviewed wearable sensor systems for sign language recognition, a research area which has the potential to create profound socioeconomic impact. The analysis examined key aspects of the studies, including sensor configuration, study design, machine learning models, and evaluation metrics. The authors reviewed the two variations of recognition tasks and identified gaps that currently exist in the field.
2.7. Sign Language Recognition Based on Intelligent Glove Using Machine
Learning Techniques.
Author: Henry Benitez-Pereira and Diego H
Year:2019
DOI: 10.1109/ETCM.2018.8580268
Problem identified
This work presents an intelligent electronic glove system able to detect sign language numbers in order to automate communication between a deaf-mute person and others. This is done by translating hand-movement sign language into an oral language.
Objective
The system translates hand-movement sign language into an oral language. It is built into a glove with a flex sensor on each finger, used to collect data that are analysed through a methodology with the following stages: (i) data balancing with the Kennard-Stone (KS) algorithm; (ii) a comparison of prototype selection between the CHC evolutionary algorithm and the Decremental Reduction Optimization Procedure 3 (DROP3) to define the best one; and (iii) classification with the k-nearest neighbours (kNN) classifier. As a result, the amount of data from stage (i) reduced from storage within the system is 98%.
Methodology
The proposed methodology divides the data into a training set and a test set, balances the data, and compares prototype selection methods. The data balancing stage was performed because, on a standard computer running the different machine learning algorithms, specifically DROP3 and CHC, the processing time was very long even with a small dataset. The final training matrix is called V, of size q x n, where q is 268 instances; prototype selection is applied with reference to the training matrix V and the matrix U. The proposed methodology met the objective of reducing the greatest possible amount of data in the training set while still achieving high performance when a classifier is implemented. In this way it is possible to store the entire sign language alphabet on the device.
Dataset
The dataset has m = 5000 samples and n = 5 accumulated attributes. The tag vector, called C, contains the colours black (number 1), red (number 2), green (number 3), blue (number 4), and cyan (number 5).
Findings
CHC was the most adequate method for the sensor data, due to its training-set reduction capacity and the resulting classifier performance; this algorithm also has the advantage of working within the R environment. Regarding prototype selection, of the large amount of data acquired for the model, only 2% is used for training, with an accuracy of 85%.
2.8. Multimedia technologies to teach Sign Language in a written form
Author: Myasoedova M.A; Farkhadov M.P
Year:2020
Doi: 10.1109/AICT50176.2020.9368720
Problem identified
The notations allow us to record all Sign elements in the form of a sequence of characters
using alphabetic, digital, and various graphic elements. Also, we substantiate our choice of
Sign Writing system to record Russian Sign Language, where the signs and rules allow us to
set the spatio-temporal form of Signs compactly and precisely.
Objective
In this paper the authors give examples of Sign Writing using these characters and demonstrate the multimedia program «SWiSL», an online platform that helps people improve their skills in reading Sign Language in written form. The work addresses certain difficulties in creating systems that automatically recognize Sign speech.
Methodology
From the data obtained, the best Sign notation to use is the SW system, which gives the record most closely corresponding to the Sign. One of the features of the SW system is that each component of a Sign in written form corresponds to a certain identifier in the form of a code chain. When people communicate verbally or via a sign language, the information transmitted from person to person often changes its meaning; the reason may be poor-quality speech or fuzzy Signs. The research uses an established corpus of the most common gestures of Russian Sign Language (RSL) on various subjects (family, people, information technology, etc.). All elements of the corpus (dictionary) are presented in written form using the signs of the SW system. The multimedia program «SWiSL» is a self-study tool for reading Signs in written form, available anytime and anywhere via the Internet. The program can be a starting point for users to get acquainted with the SW system; time will tell whether it will be accepted by native RSL users.
Dataset
The program accompanies every sign with corresponding text, photo, and video commentary. If a sign is hard to recognize, this information helps the user focus on the subtle peculiarities of the sign, for example a dactyl (finger letter) and its corresponding written form. On a user's request, the program displays the Sign that corresponds to the written form along with video material, and the user can view the written Sign forms stored in the database.
Findings
The research conducted in the field of Sign Language, and the identified features and specifics of Sign notations, confirm the relevance of the chosen topic. A qualitatively new direction for introducing multimedia technologies into the educational process is to create multimedia-based training programs. The scientific novelty of this work is the integration of multimedia technologies into a single education system that teaches Sign Language in a symbolic (written) form.
2.9. Research on Dynamic Sign Language Algorithm Based on Sign
Language Trajectory and Key Frame Extraction
Author: Yufei Yan; Chenyu Liu
Year:2019
Doi:10.1109/ELTECH.2019.8839587
Problem identified
Based on the depth information from Kinect, this paper studies a real-time dynamic sign language recognition algorithm and improves the dynamic time warping (DTW) algorithm for the recognition of sign language trajectories. The traditional DTW algorithm is improved by using a path constraint with relaxed endpoints and by adding a lower-bound function to cull part of the candidate sequences and terminate matching early.
Objective
The dynamic hand-sign trajectory and key gesture type information are combined to obtain the final dynamic sign language recognition result. Experiments show that the average recognition rate of this method is more than 90%, which is better than the traditional DTW algorithm in both recognition speed and recognition accuracy.
Methodology
This paper uses a hand segmentation method that combines a depth threshold and a skin colour threshold. The TLD (Tracking-Learning-Detection) algorithm is selected as the hand tracking method. The DTW algorithm transforms the global optimization problem into a local one by scaling and warping the time series of the sample: it looks for an optimal path between two samples, taking the sum of the distances along the path as the distance between the two time series. Different joint points carry different amounts of information; for two-handed sign language, the amount of information carried by the two palms at different times differs [7], so the weight of each joint point can be set according to the magnitude of the motion of the hands in the different sign language samples to be tested. For the pre-processed gesture trajectory, the template sequence is matched with the improved DTW algorithm. A threshold value is set, and the DTW distances of the m template tracks T1, T2, ..., Tm that are closest to the trajectory T of the sign language to be tested are assumed to be d1, d2, ..., dm in descending order.
Dataset
Seventy sign language words were selected for the experiment: 20 one-handed and 50 two-handed words, varying in length with durations between 1 s and 4 s. From these 70 words an experimental database was established: five signers were selected, and each signer's performance of the 70 words was collected with a Kinect 2.0, giving 350 sets of sign language data.
Findings
The paper studies the matching algorithm for dynamic sign language, the DTW algorithm. First, DTW is optimized with an endpoint-relaxation boundary constraint and early termination of matching, and the LB_BC lower-bound function is used to filter some of the templates to be tested. The paper then analyses the recognition rate of trajectory matching and key gesture matching, as well as the runtime improvement of DTW, verifying the effectiveness of the algorithm and its improvements. Experiments show that the proposed method recognizes dynamic sign language well, has good real-time performance, and improves recognition speed and accuracy compared with the traditional DTW algorithm.
2.10. Towards Multilingual Sign Language Recognition
Author: Sandrine Tornay; Marzieh Razavi
Year:2020
Doi:10.1109/ICASSP40776.2020.9054631
Problem identified
Sign language recognition involves modelling multichannel information such as hand shapes and hand movements, which requires sufficient sign-language-specific data. This is a challenge because sign languages are inherently under-resourced. It has been shown that hand shape information can be estimated by pooling resources from multiple sign languages, but such a capability does not yet exist for modelling hand movement information.
Objective
In this paper, we develop a multilingual sign language approach, where hand movement
modelling is also done with target sign language independent data by derivation of hand
movement subunits. We validate the proposed approach through an investigation on Swiss
German Sign Language, German Sign Language and Turkish Sign Language, and
demonstrate that sign language recognition systems can be effectively developed by using
multilingual sign language resources.
Methodology
A sign language processing approach using Kullback-Leibler divergence HMM (KL-HMM) was developed. Briefly, in KL-HMM [16, 17] the feature observations are probabilistic (posterior distributions), and the Kullback-Leibler (KL) divergence [18] is computed between the feature observations and the state categorical distribution. The decoding step is the same as in the standard HMM-based approach, except that the log-likelihood of a state is replaced by the KL divergence between the feature observations and the state categorical distribution. An HMM-based approach was developed [15] in which signer-independent hand movement subunits are derived based on light supervision. To validate the proposed approach, language-independent hand movement subunits were derived from three different languages: Swiss German Sign Language (DSGS) from the SMILE database, Turkish Sign Language (TSL) from the HospiSign database, and German Sign Language (DGS) from the DGS database.
Dataset
The authors used the DeepHand network, trained on the one-million-hands dataset, for hand shape posterior estimation. The one-million-hands dataset is a composition of three different sign languages: Danish Sign Language, New Zealand Sign Language, and German Sign Language. The hand shape observations are the hand shape class-conditional posterior probabilities, where the classes consist of a transition shape and the 60 hand shapes. Hand-shape-based KL-HMM systems were built for the three sign languages (DSGS, TSL, and DGS).
Findings
The investigations showed that there is a performance gap between modelling hand movement information in a language-independent manner and in a language-dependent manner. However, this gap is significantly reduced when hand movement information is combined with hand shape information, yielding competitive systems. These findings are promising and pave the way for the development of sign language processing systems that share multiple sign language resources.
CHAPTER 3
SYSTEM ANALYSIS
3.1. Existing System
3.1.1. Region Based Method - The region-based classification method divides an image into
related areas by applying homogeneity criteria to the collection of pixels. Region based
methods are classified into: region growing, region splitting and merging, thresholding,
watershed and clustering.
Thresholding method
Thresholding is a straightforward and effective method for image segmentation. It is used to transform a multilevel image into a binary image. To segment image pixels into different regions, a suitable threshold is selected. The thresholding approach has two limitations: it generates only two classes and cannot be extended to multichannel images. Furthermore, thresholding does not take into account an image's spatial characteristics. As a result, it is susceptible to noise.
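As a concrete illustration of these limitations, the following is a minimal sketch of global and Otsu thresholding with OpenCV-Python (the library used later in this report); the file name is a placeholder, not part of the proposed system.

```python
import cv2

# Placeholder input: any grayscale hand-sign image.
gray = cv2.imread("hand_sign.png", cv2.IMREAD_GRAYSCALE)

# Fixed global threshold: pixels above 127 become 255, all others become 0,
# so only two classes (foreground/background) are produced.
_, binary_fixed = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# Otsu's method picks the threshold automatically from the histogram, but the
# result is still a two-class image with no use of spatial information.
otsu_value, binary_otsu = cv2.threshold(gray, 0, 255,
                                        cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print("Otsu threshold chosen:", otsu_value)
```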
Region growing method
This is a traditional approach in which segmentation begins with the manual selection of seeds from the image of interest. The manual effort needed to obtain the seed point is the restriction of region growing. However, split-and-merge is a region-growing algorithm that does not need a seed point. Region growing is also vulnerable to noise, resulting in gaps in partitioned areas. This problem is solved by the hemitropic region-growing algorithm; however, this technique requires user intervention for seed selection.
Watershed algorithm
Watershed is a segmentation technique based on gradients, where similar gradient values are treated as heights. When a hole is drilled into each local minimum and the surface is immersed in water, the water rises until it reaches the local maxima. When two bodies of water meet, a barrier is constructed between them, and the water level increases until the points are combined. The picture is segmented by these dams, which are referred to as watersheds.
Clustering method
Clustering is an unsupervised learning task that groups items based on a similarity criterion. There are two types of clustering algorithms: hard clustering and fuzzy clustering. Hard clustering assigns each pixel to exactly one cluster; because of partial volume effects in MRI, this cannot be used for MRI segmentation. In fuzzy clustering, a single pixel may be allocated to several clusters. Fuzzy C-Means (FCM) is a common fuzzy clustering tool. Although the FCM algorithm performs very quick and simple segmentation, it does not guarantee good accuracy for noisy or irregular images.
3.1.2. Edge-based method—changes in intensity between neighbouring pixels are used as the boundaries of the signs.
Prewitt
The "Prewitt Mask" is one of the unmistakable separation activities. As needs be,
approximated subsidiary qualities in both the headings, with the end goal that even and
vertical, are determined utilizing two 3 × 3 veils. Prewitt veils provide an approximation to
both flat subsidiary and the vertical subordinate.
Robert
Using the Roberts edge detection operator, the image gradient is estimated by means of discrete differentiation. The Roberts mask is a matrix, and regions of high spatial frequency, which often correspond to edges in the image, are highlighted.
Sobel
The "Sobel Mask" generally function as the "Prewitt veil". It must be taken into account as
the Sobel administrator has values; '2' and '- 2' which are assigned to the focal point of first
and the third segments of the flat veil and first and third columns of the vertical cover.
Henceforth it gives high edge intensity.
Canny Edge Detector
A large gradient at sign borders is the reason to use an edge detection method to track the sign contours. An edge in an image represents a strong local variation in pixel intensity, usually arising at the boundary between two different regions within an image. Edge detection is the process of detecting object boundaries within an image by finding discontinuities in intensity.
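To make the gradient-based operators above concrete, here is a hedged OpenCV sketch combining a Sobel gradient with the Canny detector; the kernel size and the 50/150 hysteresis thresholds are illustrative values, not settings taken from this project.

```python
import cv2

gray = cv2.imread("hand_sign.png", cv2.IMREAD_GRAYSCALE)   # placeholder image
blur = cv2.GaussianBlur(gray, (5, 5), 0)                   # suppress noise before differentiation

# Sobel: first-order derivatives in the horizontal (x) and vertical (y) directions.
grad_x = cv2.Sobel(blur, cv2.CV_64F, 1, 0, ksize=3)
grad_y = cv2.Sobel(blur, cv2.CV_64F, 0, 1, ksize=3)

# Canny: gradient computation, non-maximum suppression, and hysteresis thresholding.
edges = cv2.Canny(blur, 50, 150)
```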
Histogram of Oriented Gradient
The Histogram of Oriented Gradients (HOG) extraction technique proceeds as follows: first, the pre-processed cell image is divided into 32 x 32 pixel cells; the intensity of each pixel is 1 or 0, and the result is accumulated into the HOG descriptor.
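A small sketch of HOG extraction with scikit-image is given below; the 32 x 32 resize mirrors the cell image size mentioned above, while the orientation and block parameters are assumed defaults rather than values from this report.

```python
import cv2
from skimage.feature import hog

gray = cv2.imread("hand_sign.png", cv2.IMREAD_GRAYSCALE)   # placeholder image
patch = cv2.resize(gray, (32, 32))

# Returns a fixed-length gradient-orientation descriptor for the patch.
descriptor = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), feature_vector=True)
print(descriptor.shape)
```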
3.1.3. Model Based Method
Artifacts and noise can cause image boundaries to become indistinct and disconnected. Model-based approaches have been used successfully to address these challenges. Deformable models are the most popular model-based approaches; they are curves that deform when subjected to forces and can evaluate boundaries with smooth curves that span boundary gaps. Active contour and geometric deformable models are the two types of deformable models.
SVM
SVMs are maximal-margin hyperplane classifiers that exhibit high classification accuracy for small training sets and good generalization performance on very variable and difficult-to-separate data. The trained SVM model is subsequently used to dynamically classify unseen feature displacements, and the result is then returned to the user. Classification thus matches the feature displacement pattern of an unseen example with the characteristic feature displacement pattern of the most similar expression (e.g. the basic emotions of 'anger', 'disgust', 'fear', 'joy', 'sorrow' or 'surprise') supplied during training.
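A minimal scikit-learn sketch of this idea is shown below, with synthetic two-class feature vectors standing in for the feature-displacement patterns described above; the RBF kernel and the C value are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for two classes of extracted feature vectors.
X = np.vstack([rng.normal(0.0, 1.0, (100, 16)), rng.normal(3.0, 1.0, (100, 16))])
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)   # maximal-margin classifier
print("test accuracy:", clf.score(X_test, y_test))
```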
LDA
LDA is easy to implement, with no tuning parameters or adjustments required. LDA returns the prior probability of each expression class, the group means for each covariate, the coefficients for each linear discriminant (for six classes, there are five linear discriminants), and the singular values that give the ratio of the between-class and within-class standard deviations on the first two linear discriminants, which in turn give the proportions of variance explained for the Stirling and Bosphorus data.
GMM
A Gaussian Mixture Model (GMM) is trained for each of the emotional states examined: angry (ANG), happy (HAP), neutral (NEU), and sad (SAD). The GMMs for each separate modality give only a limited picture of the overall emotional impression of a sentence, so a GMM for the total face (TOTAL) is trained using marker data from all facial regions. Neighbouring markers are averaged in order to reduce the total number of markers; the choice of which markers to average is ad hoc, but the averaged markers belong to the same facial muscles and their movements are correlated.
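The per-class GMM idea can be sketched with scikit-learn as follows; the feature arrays, class names, and the choice of four mixture components are purely illustrative, not values from the cited work.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical per-class feature arrays of shape (n_samples, n_features).
features_by_class = {
    "ANG": rng.normal(0.0, 1.0, (200, 8)),
    "HAP": rng.normal(2.0, 1.0, (200, 8)),
}

# One GMM is fitted per class.
models = {label: GaussianMixture(n_components=4, random_state=0).fit(data)
          for label, data in features_by_class.items()}

def classify(sample):
    # Choose the class whose GMM gives the highest log-likelihood for the sample.
    scores = {label: gmm.score(sample.reshape(1, -1)) for label, gmm in models.items()}
    return max(scores, key=scores.get)

print(classify(rng.normal(2.0, 1.0, 8)))
```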
ANN
An artificial neural network is a nonlinear and adaptive mathematical model inspired by biological neural networks. The brain consists of a large number of interconnected nerve cells called neurons; an artificial neural network consists of a smaller number of interconnected, very simple processors, also called neurons, which are analogous to biological neurons. It is an interconnected group of neurons operating in parallel and communicating with each other through weighted interconnections. An artificial neural network changes its structure during a learning phase because in most cases it is an adaptive system. It is used to model complex relationships between inputs and outputs or to find patterns in data. Here, an ANN is proposed to detect faces with the purpose of decreasing the processing time while still achieving the desired face detection rate.
3.2. Disadvantages
- Manual segmentation depends on human experience and is laborious and time-consuming.
- Handcrafted features.
- It is difficult to segment the sign regions automatically.
- High computational cost.
- Misclassification due to improper segmentation.
- Performance in sign detection was not satisfactory.
- Computational complexity is severely increased.
- Time spent on feature extraction.
- False prediction of sign words.
3.3. Proposed System
The main goal behind the development of our proposed model is to automatically recognize sign language while reducing the time required for classification and improving accuracy. We propose a novel and robust deep learning CNN framework for detecting signs using the MNIST sign language dataset. The proposed model is a four-step process: 1) pre-processing, 2) feature extraction, 3) feature reduction, and 4) classification. A median filter, one of the best algorithms for this purpose, is used in the pre-processing step to remove noise such as salt-and-pepper noise and other unwanted components. During this stage, the images are also converted from colour to greyscale for further processing. In the second step, the Grey Level Co-occurrence Matrix (GLCM) technique is used to extract different features from the images. In the third step, Colour Moments (CMs) are used to reduce the number of features and obtain an optimal set of characteristics. Images with the optimal set of features are passed to the CNN classifier for the classification of signs.
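As a sketch of the pre-processing step, median filtering with OpenCV could look like the following; the file name and the 3 x 3 kernel size are assumptions, not fixed choices of the proposed system.

```python
import cv2

img = cv2.imread("sign_sample.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name
# Each pixel is replaced by the median of its 3x3 neighbourhood, which removes
# salt-and-pepper noise while preserving edges better than a mean filter.
denoised = cv2.medianBlur(img, 3)
```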
3.3.1. Region Proposal Network
This region proposal network takes the convolutional feature map generated by the backbone layer as input and outputs the anchors generated by a sliding-window convolution applied to the input feature map.
3.3.2. Grey Level Co-occurrence Matrix
Grey Level Co-occurrence Matrix (GLCM) based texture analysis has been applied to kidney diseases to study parametric variations. The investigations were carried out on three Pyoderma variants (boil, carbuncle, and impetigo contagiosa) using GLCM. GLCM parameters (energy, correlation, contrast, and homogeneity) were extracted for each colour component of the images taken for the investigation. Contrast, correlation, energy, and homogeneity represent the coarseness, linear dependency, textural uniformity, and pixel distribution of the texture, respectively. Analysis of the GLCM parameters and their histograms showed that these textural features are disease dependent. The approach may be used for the identification of CKD with satisfactory accuracy by employing a suitable deep learning algorithm.
CNNs have shown strong performance in many competitions related to image processing due to their accurate results. A CNN is a hierarchical structure that contains several layers.
In the input layer, the pixel values can range from 0 to 1. In the convolutional layer, each neuron acts as a filter computed across the full depth of the input; these filters behave like image-processing operators that detect edges, curves, and similar structures. Each filter of the convolutional layer responds to particular image features, such as vertical edges, horizontal edges, colours, textures, and density.
All neurons contribute to the feature extractor array for the entire image. In addition, a pooling layer is sandwiched between successive convolutional layers to compress the amount of data and the number of parameters and to reduce overfitting. In short, if the input is an image, the main function of the pooling layer is to compress the image by resizing it; the information discarded during compression is largely irrelevant, so it can safely be removed.
3.4. Advantages
- A fast and accurate fully automatic method for sign classification that is competitive, both in accuracy and speed, with the state of the art.
- The method is based on deep neural networks (DNNs) and learns features that are specific to sign types.
- The segmentation technique accomplishes better segmentation results, with a maximum accuracy of 99%.
- Automatic feature extraction.
- Low computational overhead.
CHAPTER 4
SYSTEM DESIGN
4.1. System Architecture
[Figure: System architecture — a training phase and a testing phase, each with pre-processing, segmentation, and classification; classified results are stored and used for prediction.]
4.2. System Flow – Training Phase
[Figure: Training phase flow — sign image dataset → pre-processing (RGB to grey, resize, noise filter) → segmentation (grey transform, foreground extraction, binarization, background subtraction) → DCNN feature extraction on the localized sign object (size, colour, shape, texture) → classification into sign class labels → classified file.]
4.3. System Flow – Testing Phase
[Figure: Testing phase flow — input sign image → pre-processing (RGB to grey, resize, noise filter) → segmentation (grey transform, foreground extraction, binarization, background subtraction) → feature extraction (size, colour, shape, texture) → load classified file → prediction and matching (sign / no-sign) → sign class, localization, and text-to-voice output.]
4.4. Data Flow Diagram – Level 0
The proposed fine-tuned CNN architecture contains multiple convolutional layers, max-pooling, dropout, and dense (fully connected) layers. The input data need to be augmented, which involves extending the existing dataset with perturbed versions of the current images, including scaling and rotation. This exposes the neural network to a variety of variations, making it less likely to latch onto unwanted characteristics of the dataset. The architecture has three main blocks with different parameter settings. The first block has 32 filters, and ReLU is used as the activation function. In the next layer, a 2 x 2 max-pooling layer with half padding is used, which progressively reduces the spatial size of the representation and hence the number of parameters and the amount of computation in the model. The second block uses 128 filters with the ReLU activation function, followed by a max-pooling layer with half padding. The third block uses 512 filters with the ReLU activation function, again followed by a max-pooling layer with half padding. These three blocks allow the model to learn features properly. The features are then flattened using a flatten layer, which converts the data to a vector before it is connected to a group of fully connected layers. Finally, two dense layers with the ReLU activation function are used, with 1024 and 256 units respectively, followed by a dropout layer with a value of 0.5 to control overfitting. A 25-unit dense (fully connected) layer is used as the output layer with a softmax function to predict the gesture of the ASL alphabet. A sketch of this architecture is given below.
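The following Keras sketch follows the block structure described above. The 3 x 3 kernel size and the 28 x 28 x 1 input shape (Sign Language MNIST) are assumptions, since the text states only the filter counts, pooling, dense units, dropout, and the 25-unit softmax output; the exact parameter count may therefore differ from the figure reported in the next section.

```python
from tensorflow.keras import layers, models

def build_model(num_classes=25):
    return models.Sequential([
        layers.Input(shape=(28, 28, 1)),                               # assumed input shape
        # Block 1: 32 filters + ReLU, then 2x2 max-pooling with 'same' (half) padding
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2), padding="same"),
        # Block 2: 128 filters + ReLU, then max-pooling
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2), padding="same"),
        # Block 3: 512 filters + ReLU, then max-pooling
        layers.Conv2D(512, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2), padding="same"),
        # Classifier head: flatten, two dense layers, dropout, softmax output
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_model()
model.summary()   # prints per-layer output shapes and parameter counts
```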
In the first stage, the input images are split into training, validation, and testing data, and the training data are then augmented. We used an image data generator to expand the size of the training dataset and create modified versions of the images in the dataset. The augmented data are passed to train the fine-tuned CNN model. In the second stage, features are extracted by passing the data through the three blocks, as shown in Figure 4. After applying the softmax activation function, these features are used to classify the ASL alphabets in the next stage. The next stage is prediction on unseen test data, which we use to test the model's capability for ASL recognition. Finally, the data are classified and the predicted output is obtained. Figure 6 shows the parameters and output shape of each layer used in the CNN architecture; the total number of trainable parameters is 2,994,649. The proposed CNN model learns the hand gestures in the training stage and is allowed to examine all the pixels in the images. In the testing phase, we use an unseen hand gesture dataset. If any pixel contains the hand gesture, the output layer node returns the maximum response, and the model returns a "1" or "on" state. Suppose the image contains pixels p1, p2, ..., pj (where j = 1, 2, 3, ..., 9000); when a pixel is passed to the CNN, it returns the corresponding output. Algorithm 1 shows the testing phase of the proposed approach. A sketch of the augmentation step follows.
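Below is a hedged sketch of the augmentation stage using Keras' ImageDataGenerator; the rotation, zoom, and shift ranges and the placeholder arrays are illustrative, not the exact settings used in the project.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Placeholder arrays standing in for the Sign Language MNIST training split.
x_train = np.random.randint(0, 256, (256, 28, 28, 1)).astype("float32")
y_train = np.eye(25)[np.random.randint(0, 25, 256)]      # one-hot labels, 25 classes

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # scale pixel intensities into the 0-1 input range
    rotation_range=10,        # small random rotations
    zoom_range=0.1,           # small random scaling
    width_shift_range=0.1,
    height_shift_range=0.1,
)

# Yields batches of randomly perturbed images for training.
train_flow = train_datagen.flow(x_train, y_train, batch_size=32)
```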
Learning Rate: a default learning rate of 0.0002 is used to train the algorithms.
Label Map: a label map tells the trainer and the algorithms what each object in an image is, by mapping class names to class id numbers.
Batch Size: the batch size is the number of training samples used by the algorithm in one iteration. A minimal configuration sketch follows these definitions.
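The sketch below ties these settings together, reusing the model and train_flow objects from the earlier sketches; the Adam optimizer and the tiny label-map fragment are assumptions, while the 0.0002 learning rate comes from the text.

```python
from tensorflow.keras.optimizers import Adam

# Illustrative fragment of a label map: class id -> class name.
label_map = {0: "A", 1: "B", 2: "C"}

model.compile(optimizer=Adam(learning_rate=0.0002),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# The batch size is fixed by the generator (32 above); one epoch is one pass over the data.
history = model.fit(train_flow, epochs=10)
```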
5.3.1. Training Phase
After configuring the algorithms, the training process is started. Classification loss here refers to the price paid by the algorithm for its wrong detections. The algorithm validates itself automatically after each step on the test dataset and displays this classification loss.
5.3.1.1. Dataset Annotation
The process begins with the acquisition of the sign image dataset. Reading: a Python script using matplotlib is written to read all the images in the dataset. This helps verify that all the images in the dataset are loaded and readable by the algorithm.
5.3.1.2. Dataset labelling
Before starting the training process, it is necessary to label the images in the dataset. In this thesis, the LabelImg tool is used to label the dataset manually. The steps followed are:
1. Load the dataset directory into the LabelImg tool.
2. Open each image, draw bounding boxes around the sign in that image, and label it according to its class.
3. All the labels for the images are stored in XML format, with a separate XML file generated for each image in the dataset. These XML files, as shown in Figure 1.1, contain the file name of the image, the path to the image, the label for the image, and the coordinates of the bounding boxes. These files are called annotation files and are used to train the algorithms to detect the signs (a parsing sketch follows this list).
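The following is a small sketch of reading one LabelImg (Pascal VOC style) annotation file with the Python standard library; "image1.xml" is a placeholder path.

```python
import xml.etree.ElementTree as ET

root = ET.parse("image1.xml").getroot()          # one annotation file per image
filename = root.findtext("filename")

for obj in root.findall("object"):
    label = obj.findtext("name")                 # class label drawn in LabelImg
    box = obj.find("bndbox")                     # bounding-box coordinates
    xmin, ymin = int(box.findtext("xmin")), int(box.findtext("ymin"))
    xmax, ymax = int(box.findtext("xmax")), int(box.findtext("ymax"))
    print(filename, label, (xmin, ymin, xmax, ymax))
```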
5.3.2. Pre-processing
Sign image pre-processing comprises the steps taken to format images before they are used for model training and inference. The steps, sketched in code below, are:
- Read the image.
- Grey-scale conversion.
- Resize the image: all collected images are reduced to less than 200 KB in size, with a maximum resolution of 1280 x 720, because the larger the images are, the longer the algorithms take to train. Original size (360, 480, 3) — (width, height, number of RGB channels); resized to (220, 220, 3).
- Remove noise (denoise): the image is smoothed to remove unwanted noise using Gaussian blur.
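The listed steps could be implemented with OpenCV roughly as follows; the file name and the 5 x 5 Gaussian kernel are assumptions, and the resized array is single-channel because it is produced after grey-scale conversion.

```python
import cv2

img = cv2.imread("sign_sample.jpg")                 # read the image (BGR); placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)        # grey-scale conversion
resized = cv2.resize(gray, (220, 220))              # resize to 220 x 220
denoised = cv2.GaussianBlur(resized, (5, 5), 0)     # remove noise with Gaussian blur
```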
5.3.2.1. Bilateral Filter
Filtering is perhaps the most basic operation of image processing and computer vision. In the
broadest sense of "filtering", the value of the filtered image at a given location is a function of
the values of the input image in a small neighbourhood of that location. For example, in
Gaussian low-pass filtering each pixel is replaced by a weighted average of its neighbours, in
which the weights diminish with distance from the pixel of interest. Formal and quantitative
explanations of this weighting can be given; the intuition is that images vary slowly over
space, so nearby pixels are likely to have similar values, and it is therefore appropriate to
average them together. The noise values that corrupt these nearby pixels are typically less
correlated than the signal values, so noise is averaged away while the signal is preserved. The
assumption of slow spatial variation fails at edges, which are consequently blurred by linear
low-pass filtering. Bilateral filtering is a simple, non-linear, edge-preserving and noise-
reducing smoothing filter for images.
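A hedged OpenCV sketch of this step is given below; the input file name and the filter parameters (a 9-pixel diameter and sigma values of 75) are illustrative assumptions.

import cv2

img = cv2.imread("sign.jpg")                     # assumed input file name
smoothed = cv2.bilateralFilter(img, 9, 75, 75)   # diameter, sigmaColor, sigmaSpace
cv2.imwrite("sign_bilateral.jpg", smoothed)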
5.3.2.2. Binarization
Image binarization is the process of taking a grayscale image and converting it to black and
white, essentially reducing the information contained within the image from 256 shades of
grey to two, black and white, i.e. a binary image.
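The sketch below binarizes a grey-scale image with OpenCV; Otsu's threshold is an assumed choice, since the report does not name a specific thresholding method.

import cv2

grey = cv2.imread("sign.jpg", cv2.IMREAD_GRAYSCALE)                           # assumed input file name
_, binary = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # pixels become 0 or 255
cv2.imwrite("sign_binary.jpg", binary)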
Figure: MNIST dataset pre-processing pipeline (grey-scale conversion, resizing, denoising, binarization).
features. To solve this issue, region of interest (ROI) pooling is used to reduce the generated
feature maps to the same size.
Figure 2.1.
5.3.3.2. TensorFlow Object Detection API
It is an open-source framework built on top of TensorFlow that makes it easy to develop, train
and deploy object detection algorithms. The TensorFlow Object Detection API has an
extensive collection of models/algorithms called the model zoo. In this project, the
TensorFlow Object Detection API is used to implement and train the CNN and SSD algorithms.
Figure: sign segmentation pipeline (pre-processed image, background/foreground subtraction, sign segment proposal using RPN, ROI).
There exist two tasks in this approach: generating segmentation images and discriminating the
class of each pixel in each image.
5.3.4.1. GLCM Feature Extraction
GLCM (Gray Level Co-occurrence Matrix) is used to extract statistical texture features. A
histogram only gives details about the overall texture, whereas the GLCM captures the relative
positions of the pixels within that texture. This statistical approach therefore provides a great
deal of information about the relative positions of neighbouring pixels in an image. GLCM is a
spatial-domain technique that tabulates how combinations of pixel brightness occur in the
image. Feature estimation is divided into a few steps: first, four co-occurrence matrices are
calculated from the grey-scale image, taking the distance between pixels to be 1 and the four
directions to be 0°, 45°, 90° and 135° (one matrix per direction). Four features, namely
correlation, contrast, energy and homogeneity, are computed from every matrix, so the
resulting feature vector is of size 16.
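A hedged sketch of this computation using scikit-image (which spells the functions graycomatrix and graycoprops in recent versions) is shown below; it yields the 16-element feature vector described above.

import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(grey_image):
    # grey_image: 2-D uint8 array (a segmented grey-scale sign image)
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]           # 0, 45, 90 and 135 degrees
    glcm = graycomatrix(grey_image, distances=[1], angles=angles,
                        levels=256, symmetric=True, normed=True)
    features = []
    for prop in ("contrast", "correlation", "energy", "homogeneity"):
        features.extend(graycoprops(glcm, prop).ravel())        # one value per direction
    return np.array(features)                                   # length 16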
Figure: GLCM-based feature extraction from the segmented image (shape, size, pattern and colour features).
humans. These CNNs are given preference over other neural network architectures for the
following reasons. A CNN method is used to predict the bounding boxes and the classification
probabilities. For sign detection, the targets are difficult to identify against the background. In
order to improve the detection accuracy, the whole image information is used to predict the
bounding boxes of the targets and to classify the objects at the same time; through this
proposal, end-to-end real-time target detection can be realized.
The corresponding ground-truth segmentation images encompass the following structures
(and the background) as classes: numbers 0-1 and alphabets A-D.
Figure: classification pipeline (CNN, statistical feature extraction, extracted feature data, classifier, storage).
5.4. CNN Algorithm
1. Input: the MNIST sign dataset in PNG format, described in RGB and modelled as a 3-D matrix.
2. FRConv(x, y, z, u, v, h, i, j):
   x = number of expected input channels in the given image (3 for RGB)
   y = number of output channels after the convolution (FRConv) phase
   z = kernel width of the convolution
   u = kernel height of the convolution
   v = step (stride) of the convolution in the width dimension
   h = step (stride) of the convolution in the height dimension
   i = additional zeros added to the input plane on both sides of the width axis
   j = additional zeros added to the input plane on both sides of the height axis
3. ReLU(x): a rectified linear unit (activation function) that outputs 0 if the input is less than 0
   and the raw input otherwise; x = True or False.
4. MaxPooling(x, y, z, u): a max-pooling operation that looks at x-by-y windows and finds the
   maximum using a z-by-u stride:
   x = filter width of the pooling
   y = filter height of the pooling
   z = stride of the pooling in the width dimension
   u = stride of the pooling in the height dimension
5. FullyConnected(x, y):
   x = input size, e.g. 3*256*256 (3 color channels, 256 pixels of height and 256 pixels of width)
   y = number of output classes (less than the input size)
6. Loss(x, y): the loss function takes the predicted labels and the true labels and computes a
   value that quantifies the model's performance:
   x = the predicted labels
   y = the true labels
Output: segmentation and classification, e.g. sign type (numbers 0-1, alphabets A-D). An
illustrative Keras sketch of these layer types follows.
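In this sketch the input size, number of filters and kernel sizes are illustrative assumptions rather than the exact architecture used in the project.

from tensorflow.keras import layers, models

NUM_CLASSES = 6                                   # e.g. numbers 0-1 plus alphabets A-D, as above

model = models.Sequential([
    layers.Conv2D(32, (3, 3), strides=(1, 1), padding="same",
                  activation="relu", input_shape=(256, 256, 3)),   # FRConv + ReLU on an RGB input
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),         # MaxPooling
    layers.Flatten(),
    layers.Dense(NUM_CLASSES, activation="softmax"),               # FullyConnected output layer
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",                     # Loss(x, y)
              metrics=["accuracy"])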
CHAPTER 6
UML DIAGRAMS
6.1. Use case
6.2. Class Diagram
6.3. Activity Diagram
6.4. Sequence Diagram
6.5. Collaboration Diagram
CHAPTER 7
SYSTEM SPECIFICATION
7.1 Hardware specification
Processors: Intel® Core™ i5 processor 4300M at 2.60 GHz or 2.59 GHz (1
socket, 2 cores, 2 threads per core), 8 GB of DRAM
Disk space: 320 GB
Operating systems: Windows® 10, macOS*, and Linux*
7.2 Software specification
Server Side : Python 3.7.4 (64-bit or 32-bit)
Client Side : HTML, CSS, Bootstrap
Framework : Flask 1.1.1
Back end : MySQL 5
Server : WampServer 2i
OS : Windows 10 64-bit or Ubuntu 18.04 LTS "Bionic Beaver"
CHAPTER 8
SOFTWARE DESCRIPTION
8.1. Python 3.7.4
Python is a general-purpose interpreted, interactive, object-oriented, high-level programming
language. It was created by Guido van Rossum in the late 1980s and first released in 1991, and
its source code is available under an open-source license administered by the Python Software
Foundation. In this project, Python 3.7.4 is used as the server-side language.
TensorFlow
TensorFlow is an end-to-end open-source platform for machine learning. It has a
comprehensive, flexible ecosystem of tools, libraries, and community resources that lets
researchers push the state-of-the-art in ML, and gives developers the ability to easily build
and deploy ML-powered applications.
TensorFlow provides a collection of workflows with intuitive, high-level APIs for both
beginners and experts to create machine learning models in numerous languages. Developers
have the option to deploy models on a number of platforms such as on servers, in the cloud,
on mobile and edge devices, in browsers, and on many other JavaScript platforms. This
enables developers to go from model building and training to deployment much more easily.
Keras
Keras is a deep learning API written in Python, running on top of the machine learning
platform TensorFlow. It was developed with a focus on enabling fast experimentation.
Pandas
pandas is a fast, powerful, flexible and easy-to-use open-source data analysis and
manipulation tool built on top of the Python programming language. pandas is a Python
package that provides fast, flexible, and expressive data structures designed to make working
with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-
level building block for doing practical, real-world data analysis in Python.
Pandas is mainly used for data analysis and the associated manipulation of tabular data in
DataFrames. It allows importing data from various file formats such as comma-separated
values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel, and it supports
data manipulation operations such as merging, reshaping, and selecting, as well as data
cleaning and data wrangling. The development of pandas introduced into Python many
features for working with DataFrames that were established in the R programming language.
The pandas library is built upon another library, NumPy, which is oriented towards working
efficiently with arrays rather than DataFrames.
NumPy
NumPy, which stands for Numerical Python, is a library consisting of multidimensional array
objects and a collection of routines for processing those arrays. Using NumPy, mathematical
and logical operations on arrays can be performed.
Matplotlib
Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots into
applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.
Scikit Learn
scikit-learn is a Python module for machine learning built on top of SciPy and is distributed
under the 3-Clause BSD license.
Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine
learning library for the Python programming language.It features various classification,
regression and clustering algorithms including support-vector machines, random forests,
gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python
numerical and scientific libraries NumPy and SciPy.
Pillow
Pillow is the friendly PIL fork by Alex Clark and Contributors. PIL is the Python Imaging
Library by Fredrik Lundh and Contributors.
The Python Pillow library provides an Image class for working with images. The image
modules in the Pillow package include built-in functions to load existing images, create new
images, and display them.
OpenCV
OpenCV is an open-source library for computer vision. It gives a machine the ability to
recognise faces or objects. In OpenCV, "CV" is an abbreviation of computer vision, which is
defined as a field of study that helps computers understand the content of digital images such
as photographs and videos.
8.2. MySQL
MySQL is a relational database management system based on the Structured Query Language
(SQL), the popular language for accessing and managing records in a database. It is open-
source and free software under the GNU license and is supported by Oracle. A MySQL
database is managed, and its data manipulated, through SQL queries such as insert, update,
delete and select statements and through statements that create or drop tables.
MySQL is currently the most popular database management system software for managing
relational databases. It is fast, scalable, and easy to use in comparison with Microsoft SQL
Server and Oracle Database, and it is commonly used in conjunction with PHP scripts for
creating powerful and dynamic server-side or web-based enterprise applications. MySQL was
originally developed, marketed, and supported by MySQL AB, a Swedish company, and is
written in the C and C++ programming languages. The official pronunciation is "My Ess Que
Ell" rather than "My Sequel". MySQL is used by many small and large companies and
supports many operating systems, such as Windows, Linux, and macOS, with C, C++, and
Java language bindings.
8.3. WampServer
WampServer is a Windows web development environment. It allows you to create web
applications with Apache2, PHP and a MySQL database; alongside these, phpMyAdmin lets
you manage your databases easily.
WampServer is a reliable web development software package that lets you create web apps
with a MySQL database and PHP on Apache2. With an intuitive interface and numerous
features, it is a popular choice among developers. The software is free to use and does not
require a payment or subscription.
8.4. Bootstrap 4
Bootstrap is a free and open-source tool collection for creating responsive websites and web
applications. It is the most popular HTML, CSS, and JavaScript framework for developing
responsive, mobile-first websites.
It solves many problems that developers once faced, one of which is cross-browser
compatibility. Websites built with it render correctly across browsers (IE, Firefox, and
Chrome) and across screen sizes (desktops, tablets, phablets, and phones), thanks to
Bootstrap's developers, Mark Otto and Jacob Thornton of Twitter; the framework was later
declared an open-source project.
Easy to use: Anybody with just basic knowledge of HTML and CSS can start using
Bootstrap
Responsive features: Bootstrap's responsive CSS adjusts to phones, tablets, and desktops
Mobile-first approach: In Bootstrap, mobile-first styles are part of the core framework
Browser compatibility: Bootstrap 4 is compatible with all modern browsers (Chrome,
Firefox, Internet Explorer 10+, Edge, Safari, and Opera)
8.5. Flask
Flask is a web framework: it provides you with tools, libraries and technologies that allow
you to build a web application. This web application can be a few web pages, a blog, a wiki,
or something as large as a web-based calendar application or a commercial website.
Flask is often referred to as a micro-framework. It aims to keep the core of an application
simple yet extensible. Flask does not have a built-in abstraction layer for database handling,
nor does it have built-in form validation support; instead, it supports extensions that add such
functionality to the application. Although Flask is rather young compared to most Python
frameworks, it holds great promise and has already gained popularity among Python web
developers. Let's take a closer look at Flask, the so-called "micro" framework for Python.
Flask was designed to be easy to use and extend. The idea behind Flask is to build a solid
foundation for web applications of different complexity. From then on you are free to plug in
any extensions you think you need. Also you are free to build your own modules. Flask is
great for all kinds of projects. It's especially good for prototyping.
Flask belongs to the category of micro-frameworks, which are frameworks with little to no
dependency on external libraries. This has pros and cons: the pros are that the framework is
light and there are few dependencies to update and watch for security bugs; the cons are that
you may sometimes have to do more work yourself or grow the list of dependencies by adding
plugins. In the case of Flask, its dependencies are:
WSGI: the Web Server Gateway Interface (WSGI) has been adopted as a standard for Python
web application development. WSGI is a specification for a universal interface between the
web server and the web application.
Werkzeug: a WSGI toolkit that implements request and response objects and other utility
functions, enabling a web framework to be built on top of it. The Flask framework uses
Werkzeug as one of its bases.
Jinja2: a popular templating engine for Python. A web templating system combines a
template with a certain data source to render dynamic web pages. A minimal Flask sketch is
given below.
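The route and the returned text in this sketch are placeholders standing in for the project's actual web interface.

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # Placeholder landing page; the real UI renders HTML templates with the recognition features
    return "Sign language recognition web UI"

if __name__ == "__main__":
    app.run(debug=True)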
CHAPTER 9
TESTING
9.1 System Testing
In this phase of the methodology, testing was carried out on the several application modules.
Different kinds of testing were performed on the modules, as described in the following
sections. In general, tests were run against the functional and non-functional requirements of
the application following the test cases. Testing the application repeatedly helped it become a
reliable and stable system.
9.2 Unit Testing
Before you can test an entire software program, you must make sure the individual parts work
properly on their own. Unit testing validates the function of a unit, ensuring that its inputs
(one to a few) produce the lone desired output. This type of testing provides the foundation
for more complex integrated software. When done right, unit testing drives higher-quality
application code and speeds up the development process. Developers often execute unit tests
through test automation, as in the sketch below.
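The pytest-style test below checks a single unit in isolation; the preprocessing module and the preprocess() helper it imports are the hypothetical names used in the pre-processing sketch earlier in this report.

import cv2
import numpy as np

from preprocessing import preprocess      # hypothetical module and function names

def test_preprocess_output_shape(tmp_path):
    # Write a small dummy image, run the unit under test, and verify the resized shape
    dummy = (np.random.rand(360, 480, 3) * 255).astype("uint8")
    path = str(tmp_path / "dummy.jpg")
    cv2.imwrite(path, dummy)
    out = preprocess(path)
    assert out.shape == (220, 220)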
9.3 Functional Testing
Functional testing is defined as a type of testing which verifies that each function of the
software application operates in conformance with the requirement specification. It mainly
involves black-box testing and is not concerned with the source code of the application.
Functional tests were performed on the different features and modules of the application to
observe whether the features meet the actual project objectives and whether the modules are
one hundred percent functional. Functional tests, as shown in Table 1 to Table 5, were based
on use cases to determine the success or failure of the system implementation and design. For
each use case, testing measures were set, with results considered successful or unsuccessful.
The tables show some of the major test cases along with their respective test results.
9.4 Integration Testing
Integration testing is often done in concert with unit testing. Through integration testing, QA
professionals verify that individual modules of code work together properly as a group. Many
modern applications run on microservices, self-contained applications that are designed to
handle a specific task. These microservices must be able to communicate with each other, or
the application will not work as intended. Through integration testing, testers ensure these
components operate and communicate together seamlessly.
9.5 White-Box Testing
When the software's internal infrastructure, code and design are visible to the developer or
tester, this is referred to as white-box testing. The approach incorporates various functional
testing types, including unit, integration and system testing. In a white-box testing approach,
the organization tests several aspects of the software, such as predefined inputs and expected
outputs, as well as decision branches, loops and statements in the code.
CHAPTER 10
RESULTS AND CONCLUSION
The primary goal of this research is to assess the classification performance of a
proposed CNN classifier for the recognition of sign language alphabets. To
properly assess the overall effectiveness of the suggested technique, we
conducted experiments with a widely used publicly accessible sign language
dataset from the Modified National Institute of Standards and Technology
(MNIST) database, which comprises ASL alphabetic letters of hand movements.
We performed experiments using the CNN model on the Sign Language MNIST
dataset and compared the experimental results with several state-of-the-art
methodologies. We assessed the proposed model's capabilities using several
evaluation metrics; precision, recall, and F1-score are the evaluation measures
used in this project. For experimentation, the data were first divided into two
parts, 90% for training and 10% for validation; the training portion was then
further partitioned into training and test subsets, so that roughly 70% of the
data were used for training and 20% for testing. The sketch below illustrates
how these metrics are computed from the true and predicted labels.
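The label arrays in this snippet are placeholders rather than project results; it only shows how the named measures can be obtained with scikit-learn.

from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 1]                        # placeholder ground-truth class ids
y_pred = [0, 1, 2, 1, 1]                        # placeholder model predictions
print(classification_report(y_true, y_pred))    # precision, recall and F1-score per class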
Model Accuracy = 99.4004487991333%
CHAPTER 11
CONCLUSION AND FUTURE ENHANCEMENT
11.1. Conclusion
Sign language is popular among hearing-impaired individuals around the globe. However,
sign language is not widely used or mastered outside this community, which has resulted in a
real social barrier between deaf and hearing people. A Sign Language Recognition (SLR)
system is a method for recognising a collection of formed signs and translating them into text
or speech with the appropriate context. This project proposes the recognition of sign language
gestures using a powerful artificial intelligence tool, the convolutional neural network (CNN).
The proposed model is trained on alphabets from American Sign Language. We developed a
web-based user interface for ease of deployment. It is equipped with text-to-speech, speech-
to-text and auto-correct features to support communication between deaf-and-mute, hard-of-
hearing, visually impaired and non-signing users. The proposed method achieved an overall
recognition accuracy of 99.9%.
11.2. Future Enhancement
In the future, we plan to extend this work to real-time sign recognition using data provided by
a leap motion controller. We also intend to recognise sign language gestures from video
frames, which is a challenging task. A model for word- and sentence-level sign language
recognition can also be developed; this will require a system that can detect changes in the
temporal space. Ultimately, we can develop a complete product that will help speech- and
hearing-impaired people and thereby reduce the communication gap.
APPENDIX-A
CODING
Python Code
import numpy as np
from tkinter import *                      # Frame, Label, Entry, Button, Toplevel, Canvas, ...
from tkinter import filedialog, messagebox, ttk
from tkinter.filedialog import asksaveasfile
import cv2
import shutil
import tkinter as tk
import PIL.Image, PIL.ImageTk
import playsound
import imagehash
from gtts import gTTS                      # used by storedata() for text-to-speech
from PIL import Image                      # used by storedata() and split()
import mysql.connector
# Connection to the local MySQL database that stores the gesture records
mydb = mysql.connector.connect(
    host="localhost",
    user="root",
    password="",
    charset="utf8",
    database="hand_reco"
)
class Hand:
    def __init__(self, master):
        self.master = master
        self.frame = Frame(self.master)
        # Heading label of the main window
        self.a1 = Label(self.master, text='Hand Gesture Recognition', bg='lightblue',
                        fg='blue', font=("Helvetica", 16))
        self.a1.pack()
        self.a1.place(x=210, y=30)
        # Banner image shown on the main window
        mm = PIL.Image.open("hand.jpg")
        img2 = PIL.ImageTk.PhotoImage(mm)
        panel2 = Label(root, image=img2)
        panel2.image = img2  # keep a reference!
        panel2.pack()
        panel2.place(x=60, y=60)
        # Username entry of the login form
        self.E1 = Entry(self.master)
        self.E1.pack()
        self.E1.place(x=310, y=465)
        # Password entry of the login form
        self.E2 = Entry(self.master)
        self.E2.pack()
        self.E2.place(x=310, y=495)
    def hand_log(self):
        # Validate the admin credentials and open the main menu window
        un = self.E1.get()
        pw = self.E2.get()
        if un == "admin" and pw == "admin":
            self.newWindow = Toplevel(self.master)
            bb = Textlog(self.newWindow)
        else:
            messagebox.showinfo("Login", "Username/Password was wrong!")
class Textlog():
    def __init__(self, master):
        self.master = master
        self.frame = Frame(self.master)
        self.a1.pack()
        self.a1.place(x=150, y=30)
        # Buttons that open the training and testing windows
        self.b1 = Button(self.master, text="Training", command=self.text_train)
        self.b2 = Button(self.master, text="Testing", command=self.text_test)
        self.b1.pack()
        self.b1.place(x=270, y=100)
        self.b2.pack()
        self.b2.place(x=270, y=150)
        self.frame.pack()
        self.master.configure(background='lightblue')
        self.master.title('Hand Written Identification')
        self.master.geometry("600x600")
    def text_train(self):
        # Open the training window
        self.newWindow = Toplevel(self.master)
        bb = TrainClass(self.newWindow)

    def text_test(self):
        # Open the testing window
        self.newWindow = Toplevel(self.master)
        bb2 = TestClass(self.newWindow)
class TrainClass():
    def __init__(self, master):
        self.master = master
        self.frame = Frame(self.master)
        # Open the default webcam and display its frames on a canvas
        video_source = 0
        self.video_source = video_source
        self.vid = MyVideoCapture(self.video_source)
        self.canvas = Canvas(self.master, width=self.vid.width, height=self.vid.height)
        self.canvas.pack()
        self.delay = 50          # refresh interval (ms) for the video preview
        self.update()
        # Entry for the text/description associated with the captured gesture
        self.E1 = Entry(self.master)
        self.E1.pack()
        self.E1.place(x=290, y=535)
        self.frame.pack()
        self.master.configure(background='lightblue')
        self.master.title('Training Process')
        self.master.geometry("600x600")
        self.sh = 0
        self.f = 1
        self.rid = 0
        self.u1 = 1
        self.sd = ""
    def split(self, im):
        # Split the image into horizontal segments separated by background-coloured columns
        retur = []
        emptyColor = im.getpixel((0, 0))       # top-left pixel is taken as the background colour
        box = self.getbox(im, emptyColor)
        width, height = im.size
        pixels = im.getdata()
        sub_start = 0
        sub_width = 0
        offset = box[1] * width
        for x in range(width):
            if pixels[x + offset] == emptyColor:
                if sub_width > 0:
                    retur.append((sub_start, box[1], sub_width, box[3]))
                    sub_width = 0
                sub_start = x + 1
            else:
                sub_width = x + 1
        if sub_width > 0:
            retur.append((sub_start, box[1], sub_width, box[3]))
        return retur
    def storedata(self):
        # Store the captured gesture image, its description and a generated audio file
        txt = self.E1.get()
        mycursor = mydb.cursor()
        mycursor.execute("SELECT max(id)+1 FROM store_data")
        maxid = mycursor.fetchone()[0]
        if maxid is None:
            maxid = 1
        imgname = "image" + str(maxid) + ".jpg"
        afile = "a" + str(maxid) + ".mp3"
        # Convert the typed description to speech and save it as an mp3 file
        tts = gTTS(text=txt, lang='en')
        tts.save("audio/" + afile)
        shutil.copy('img.jpg', 'dataset/' + imgname)
        # Record the sample in the database
        sql = "INSERT INTO store_data(id, detail, imgname) VALUES (%s, %s, %s)"
        val = (maxid, txt, imgname)
        mycursor.execute(sql, val)
        mydb.commit()
        path = "dataset/" + imgname
        im = Image.open(path)
    def update(self):
        # Get a frame from the video source and keep it as the current capture
        ret, frame = self.vid.get_frame()
        if ret:
            img_name = "img.jpg"
            cv2.imwrite(img_name, frame)
        # Schedule the next frame grab
        self.master.after(self.delay, self.update)
References
1. T. Reddy Gadekallu, G. Srivastava, and M. Liyanage, “Hand gesture recognition
based on a Harris hawks optimized convolution neural network,” Computers &
Electrical Engineering, vol. 100, Article ID 107836, 2022.
2. G. T. R. Chiranji Lal Chowdhary and B. D. Parameshachari, Computer Vision and
Recognition Systems: Research Innovations and Trends, CRC Press, 2022.
3. M.M. Riaz and Z. Zhang, "Surface EMG Real-Time Chinese Language Recognition
Using Artificial Neural Networks" in Intelligent Life System Modelling Image
Processing and Analysis Communications in Computer and Information Science,
Springer, vol. 1467, 2021.
4. G. Halvardsson, J. Peterson, C. Soto-Valero and B. Baudry, "Interpretation of
Swedish sign language using convolutional neural networks and transfer learning",
SN Computer Science, vol. 2, no. 3, pp. 1-15, 2021.
5. P. Likhar, N. K. Bhagat and R. G N, "Deep Learning Methods for Indian Sign
Language Recognition", 2020 IEEE 10th International Conference on Consumer
Electronics (ICCE-Berlin), pp. 1-6, 2020.
6. F. Li, K. Shirahama, M. A. Nisar, X. Huang and M. Grzegorzek, "Deep Transfer
Learning for Time Series Data Based on Sensor Modality Classification", Sensors,
vol. 31, no. 20(15), pp. 4271, Jul 2020.
7. J. J. Bird, A. Ekárt and D. R. Faria, "British sign language recognition via late fusion
of computer vision and leap motion with transfer learning to american sign language",
Sensors, vol. 20, no. 18, pp. 5151, 2020.
8. S. Sharma, R. Gupta and A. Kumar, "Trbaggboost: an ensemble-based transfer
learning method applied to Indian Sign Language recognition", J Ambient Intell
Human Comput Online First, 2020, [online] Available:
https://doi.org/10.1007/s12652-020-01979-z.
9. M. Oudah, A. Al-Naji and J. Chahl, "Hand Gesture Recognition Based on Computer
Vision: A Review of Techniques", J. Imaging, vol. 6, no. 73, 2020.
10. Z. M. Shakeel, S. So, P. Lingga and J. P. Jeong, "MAST: Myo Armband Sign-
Language Translator for Human Hand Activity Classification", IEEE International
Conference on Information and Communication Technology Convergence, pp. 494-
499, 2020.
11. M. Zakariya and R. Jindal, "Arabic Sign Language Recognition System on
Smartphone", 2019 10th International Conference on Computing Communication and
Networking Technologies (ICCCNT), pp. 1-5, 2019.
12. E. Abraham, A. Nayak and A. Iqbal, "Real-Time Translation of Indian Sign Language
using LSTM", 2019 Global Conference for Advancement in Technology (GCAT), pp.
1-5, 2019.
13. O. Koller, N. C. Camgoz, H. Ney and R. Bowden, "Weakly supervised learning with
multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language
videos", IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 9, pp. 2306-20, 2019.
14. A. A. Hosain, P. S. Santhalingam, P. Pathak, J. Kosecka and H. Rangwala, "Sign
language recognition analysis using multimodal data", 2019.
15. J. Huang, W. Zhou, Q. Zhang, H. Li and W. Li, "Video-based sign language
recognition without temporal segmentation", AAAI, vol. 32, no. 1, pp. 2257-64, 2018.
16. D. Konstantinidis, K. Dimitropoulos and P. Daras, "Sign language recognition based
on hand and body skeletal data", 3DTV-CON, pp. 1-4, 2018.
17. C. Motoche and M. E. Benalcázar, "Real-Time Hand Gesture Recognition Based on
Electromyographic Signals and Artificial Neural Networks" in Artificial Neural
Networks and Machine Learning - ICANN 2018 Lecture Notes in Computer Science,
Springer, vol. 11139, 2018.
18. J. Pu, W. Zhou and H. Li, "Dilated convolutional network with iterative optimization
for continuous sign language recognition", IJCAI, vol. 3, pp. 885-91, 2018.
19. R. Cui, H. Liu and C. Zhang, "Recurrent convolutional neural networks for
continuous sign language recognition by staged optimization", CVPR, pp. 7361-9,
2017.
20. S. Y. Kim, H. G. Han, J. W. Kim, S. Lee, and T. W. Kim, “A hand gesture
recognition sensor using reflected impulses,” IEEE Sensors Journal, vol. 17, no. 10,
pp. 2975-2976, 2017.