CONVERSION OF SIGN LANGUAGE INTO SPEECH OR TEXT USING CNN
By
Jebakani C. (38110215)
Rishitha S.P. (38110461)
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
JEPPIAAR NAGAR, RAJIV GANDHI SALAI,
CHENNAI - 600 119
MARCH-2022
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of Jebakani C (REG
NO: 38110215) and Rishitha S.P. (REG NO: 38110461), who as a team carried out
the project entitled "CONVERSION OF SIGN LANGUAGE INTO SPEECH OR TEXT
USING CNN" under my supervision from November 2021 to April 2022.
Internal Guide
Ms. AISHWARYA R M.E.,
We Jebakani C (REG NO: 38110215) and Rishitha S.P. (REG NO: 38110461)
DATE:
I convey my thanks to Dr. T. Sasikala M.E., Ph.D., Dean, School of Computing, and to
Dr. L. Lakshmanan M.E., Ph.D. and Dr. S. Vigneshwari M.E., Ph.D., Heads of the
Department of Computer Science and Engineering, for providing me the necessary
support and details at the right time during the progressive reviews.
LIST OF FIGURES
TABLE OF CONTENTS
2 LITERATURE SURVEY
3 METHODOLOGY AND IMPLEMENTATION
3.1 Training Module
3.1.1 Pre-Processing
3.2 Algorithm
3.3 Segmentation
3.4 Convolution Neural Networks
3.10.1 Precision
3.10.2 Recall
3.10.3 Support
3.10.4 F1 Score
5 CONCLUSION AND FUTURE WORK
7 APPENDIX
a) Sample code
b) Screenshots
CHAPTER 1
INTRODUCTION
• image analysis.
There are two types of methods used for image processing, namely
analogue and digital image processing. Analogue image processing
can be used for hard copies like printouts and photographs. Image
analysts use various fundamentals of interpretation while using these
visual techniques. Digital image processing techniques help in the
manipulation of digital images by using computers. The three
general phases that all types of data have to undergo while using the
digital technique are pre-processing, enhancement and display, and
information extraction.
Fig 1.1: Phases of pattern recognition
The first phase includes image segmentation and object separation.
In this phase, different objects are detected and separated from the
background. The second phase is feature extraction. In this phase,
objects are measured: the measurement step quantitatively estimates
some important features of the objects, and a group of these features
is combined to make up a feature vector. The third phase is
classification. In this phase, the output is a decision that determines
which category every object belongs to. Therefore, in pattern
recognition the inputs are images and the outputs are object types
and a structural analysis of the images. The structural analysis is a
description of the images that allows the important information in
them to be correctly understood and judged.
Sign language is a language that includes gestures made with the
hands and other body parts, including facial expressions and postures
of the body. It is used primarily by people who are deaf and dumb.
There are many different sign languages, such as British, Indian and
American sign languages. British Sign Language (BSL) is not easily
intelligible to users of American Sign Language (ASL), and vice versa.
A functioning sign language recognition system could give deaf
people a chance to communicate with non-signing people without the
need for an interpreter. It could be used to generate speech or text,
making the deaf more independent. Unfortunately, no system with
these capabilities exists so far. In this project our aim is to develop a
system which can classify sign language accurately.
American Sign Language (ASL) is a complete, natural language that
has the same linguistic properties as spoken languages, with grammar
that differs from English. ASL is expressed by movements of the
hands and face. It is the primary language of many North Americans
who are deaf and hard of hearing, and is used by many hearing
people as well.
The process of converting the signs and gestures shown by the user
into text is called sign language recognition. It bridges the
communication gap between people who cannot speak and the
general public. Image processing algorithms along with neural
networks are used to map the gestures to the appropriate text in the
training data, and hence raw images/videos are converted into the
corresponding text that can be read and understood.
or the deaf community. The importance of sign language is
emphasized by the growing public approval of, and funding for,
international projects. In this age of technology, the demand for a
computer-based system is very high for the dumb community.
Researchers have been attacking the problem for quite some time
now and the results are showing some promise. Interesting
technologies are being developed for speech recognition, but no real
commercial product for sign recognition is actually there in the current
market. The idea is to make computers understand human language
and to develop user-friendly human computer interfaces (HCI).
Making a computer understand speech, facial expressions and human
gestures are some steps towards it. Gestures are non-verbally
exchanged information. A person can perform innumerable gestures
at a time. Since human gestures are perceived through vision, they
are a subject of great interest for computer vision researchers. The
project aims to determine human gestures by creating an HCI. Coding
these gestures into machine language demands a complex
programming algorithm. In our project we are focusing on image
processing and template matching for better output generation.
1.4 MOTIVATION
The 2011 Indian census cites roughly 1.3 million people with
"hearing impairment". In contrast, numbers from India's National
Association of the Deaf estimate that 18 million people, roughly 1
per cent of the Indian population, are deaf. These statistics formed the
motivation for our project. As these speech-impaired and deaf
people need a proper channel to communicate with normal people,
there is a need for such a system. Not all normal people can understand
the sign language of impaired people. Our project is hence aimed at
converting sign language gestures into text that is readable for
normal people.
Normal people face difficulty in understanding their language. Hence
there is a need for a system which recognizes the different signs and
gestures and conveys the information to normal people. It bridges
the gap between physically challenged people and normal people.
Part 1: The various technologies that are studied are introduced, and
the problem statement is stated along with the motivation for our
project.
Part 2: The Literature survey is put forth which explains the various
other works and their technologies that are used for Sign Language
Recognition.
Part 5: Provides the experimental analysis, the code involved and the results
obtained.
Part 6: Concludes the project and provides the scope to which the
project can be extended.
CHAPTER 2
LITERATURE SURVEY
The domain analysis that we did for the project mainly involved
understanding neural networks and the supporting libraries described below.
2.1.1 TensorFlow:
2.1.2 OpenCV:
then Itseez (which was later acquired by Intel[2]). The library is cross-
platform and free for use under the open-source BSD license.
Boosting
Decision tree learning
Gradient boosting trees
Expectation-maximization algorithm
k-nearest neighbor algorithm
Naive Bayes classifier
Artificial neural networks
Random forest
Support vector machine (SVM)
Deep neural networks (DNN)
AForge.NET, a computer vision library for the Common Language
Runtime (.NET Framework and Mono).
Integrating Vision Toolkit (IVT), a fast and easy-to-use C++ library with
an optional interface to OpenCV.
software packages
OpenCV Functionality
Image/video I/O, processing, display (core, imgproc, highgui)
Object/feature detection (objdetect, features2d, nonfree)
Geometry-based monocular or stereo computer vision
(calib3d, stitching, videostab)
Computational photography (photo, video, superres)
Machine learning & clustering (ml, flann)
CUDA acceleration (gpu)
Image Processing:
Digital Image:
Output in which the result can be an altered image or a report that is based
on image analysis.
Robotics Application
Localization − Determine robot location automatically
Navigation
Obstacles avoidance
Assembly (peg-in-hole, welding, painting)
Manipulation (e.g. PUMA robot manipulator)
Human Robot Interaction (HRI) − Intelligent robotics to interact
with and serve people
Medicine Application
Classification and detection (e.g. lesion or cells classification
and tumor detection)
2D/3D segmentation
3D human organ reconstruction (MRI or ultrasound)
Vision-guided robotics surgery
Industrial Automation Application
Industrial inspection (defect detection)
Assembly
Barcode and package label reading
Object sorting
Document understanding (e.g. OCR)
Security Application
Biometrics (iris, finger print, face recognition)
Surveillance − Detecting certain suspicious activities or behaviors
Transportation Application
Autonomous vehicle
Safety, e.g., driver vigilance monitoring
2.1.3 Keras:
coding necessary for writing deep neural network code. The code is
hosted on GitHub, and community support forums include the GitHub
issues page, and a Slack channel.
extraction and fine tuning. This section explains Keras
applications in detail.
Pre-trained models
ResNet
VGG16
MobileNet
InceptionResNetV2
InceptionV3
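As a rough illustration of how such a pre-trained model could be loaded through the Keras applications module for feature extraction, the following sketch uses MobileNet; the choice of MobileNet, the ImageNet weights and the 224x224 input size are assumptions for illustration, not details of our implementation.

# Hedged sketch: loading a pre-trained MobileNet as a frozen feature extractor.
# MobileNet, the ImageNet weights and the 224x224 input size are illustrative choices.
import tensorflow as tf

base = tf.keras.applications.MobileNet(
    weights="imagenet",        # reuse ImageNet weights
    include_top=False,         # drop the original classifier head
    input_shape=(224, 224, 3))
base.trainable = False         # freeze the base for pure feature extraction

features = base(tf.zeros((1, 224, 224, 3)))   # example forward pass on a dummy image
print(features.shape)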
2.1.4 NumPy:
code, mostly inner loops using NumPy.
this limitation.
into an activation function that may be nonlinear.
Areas of Application
Speech Recognition
Great progress has been made in this field; however, such systems
still face the problem of limited vocabulary or grammar, along with the
issue of retraining the system for different speakers in different
conditions. ANNs are playing a major role in this area. The following
ANNs have been used for speech recognition:
Multilayer networks
Multilayer networks with recurrent connections
Character Recognition
For this application, the first approach is to extract the features, or rather
the geometrical feature set, representing the signature. With these
feature sets, we have to train the neural networks using an efficient
neural network algorithm. This trained neural network will classify the
signature as genuine or forged during the verification stage.
output) qualifies as "deep" learning. So deep is not just a buzzword to
make algorithms seem like they read Sartre and listen to bands you
haven't heard of yet. It is a strictly defined term that means more than
one hidden layer.
For example, deep learning can take a million images, and cluster
them according to their similarities: cats in one corner, ice breakers in
another, and in a third all the photos of your grandmother. This is the
basis of so-called smart photo albums.
of small data science teams, which by their nature do not scale.
To solve this problem the computer looks for characteristics at the
base level. In human understanding such characteristics are, for example,
the trunk or large ears. For the computer, these characteristics are
boundaries or curvatures. Then, through groups of convolutional
layers, the computer constructs more abstract concepts. In more detail:
the image is passed through a series of convolutional, nonlinear,
pooling and fully connected layers, and then the output is generated.
Fig 2.1: Layers involved in CNN
2.1.6 EXISTING SYSTEM
analyze and compare the methods employed in SLR systems and the
classification methods that have been used, and suggests the most
reliable method for future research. Due to recent advancements in
classification methods, many of the recently proposed works mainly
contribute to the classification method, such as hybrid methods and
Deep Learning. Based on our review, HMM-based approaches have
been explored extensively in prior research, including their
modifications. Hybrid CNN-HMM and fully Deep Learning approaches
have shown promising results and offer opportunities for further
exploration.
In this paper we proposed some methods through which the
recognition of the signs becomes easy for people while
communicating, and the result of those sign symbols will be
converted into text. In this project, we capture hand gestures
through a webcam and convert the image into a grayscale image. The
segmentation of the grayscale image of a hand gesture is performed
using the Otsu thresholding algorithm. The total image is divided into two
classes: one is the hand and the other is the background. The optimal
threshold value is determined by computing the ratio between the class
variance and the total class variance. To find the boundary of the hand
gesture in the image, the Canny edge detection technique is used. In
Canny edge detection we used edge-based segmentation and threshold-based
segmentation. Otsu's algorithm is used because of its simple
calculation and stability. This algorithm fails when the global
distribution of the target and background vary widely.
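As a rough illustration of the pre-processing described above (grayscale conversion, Otsu thresholding and Canny edge detection), the OpenCV sketch below performs the same steps; the file name and the Canny thresholds are placeholder assumptions, not values taken from the cited work.

# Illustrative sketch of the described pipeline: grayscale -> Otsu -> Canny.
# 'hand.jpg' and the Canny thresholds (100, 200) are assumed placeholders.
import cv2

img = cv2.imread('hand.jpg')                     # captured hand-gesture frame
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # convert to grayscale

# Otsu's method picks the threshold that best separates the two classes
# (hand and background) based on between-class variance.
_, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Canny edge detection traces the boundary of the hand gesture.
edges = cv2.Canny(gray, 100, 200)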
the spoken language dynamically and can make the communication
between people with hearing impairment and normal people both
effective and efficient. The system we are implementing is for binary
sign language, but it can detect any sign language with prior image
processing.
One of the major drawbacks of our society is the barrier that is created
between disabled or handicapped persons and normal persons.
Communication is the only medium by which we can share our
thoughts or convey a message, but a person with a disability (deaf
and dumb) faces difficulty in communicating with normal persons. For
many deaf and dumb people, sign language is the basic means of
communication. Sign language recognition (SLR) aims to interpret sign
languages automatically by a computer in order to help the deaf
communicate with hearing society conveniently. Our aim is to design a
system to help the person who trains the hearing impaired to
communicate with the rest of the world using sign language or hand
gesture recognition techniques. In this system, feature detection and
feature extraction of hand gestures are done with the help of the SURF
algorithm using image processing. All this work is done using
MATLAB software. With the help of this algorithm, a person can easily
train a deaf and dumb person.
language. The application acquires image data using the webcam of
the computer, then it is preprocessed using a combinational algorithm
and recognition is done using template matching. The translation in
the form of text is then converted to audio. The database used for this
system includes 6000 images of English alphabets. We used 4800
images for training and 1200 images for testing. The system produces
88% accuracy.
represents some message or data. Gestures are a necessity for the
hearing and speech impaired; they convey their message to
others with the help of gestures alone. A gesture recognition
system is the ability of the computer interface to capture, track and
recognize gestures and produce output based on the captured signals.
It enables users to interact with machines (HMI) without any
need for mechanical devices. There are two kinds of sign recognition
methods: image-based and sensor-based strategies. An image-based
approach is utilized in this project, which deals with sign language
gestures to identify and track the signs and convert them into the
corresponding speech and text.
Fig 2.2: Architecture of the Sign Language Recognition System
CHAPTER 3
METHODOLOGY
1. model construction
2. model training
3. model testing
4. model evaluation
Before model training it is important to scale the data for their further use.
Model training:
After model construction it is time for model training. In this
phase, the model is trained using training data and the expected output for
this data. It looks this way: model.fit(training_data, expected_output).
Progress is visible on the console when the script runs. At the end it
will report the final accuracy of the model.
Model Testing:
During this phase a second set of data is loaded. This data set
has never been seen by the model and therefore its true accuracy will
be verified. After the model training is complete, and it is understood
that the model shows the right result, it can be used for model
evaluation. This means that the model can be used to evaluate new
data.
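The four phases above can be summarized in a short Keras sketch; the layer sizes, number of classes, epochs and optimizer below are illustrative assumptions, not the exact configuration of our model.

# Minimal sketch of the four phases: construction, training, testing, evaluation.
# Shapes, class count, epochs and the optimizer are illustrative assumptions.
import numpy as np
import tensorflow as tf

# 1. Model construction
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(64, 64, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(26, activation='softmax')])    # e.g. 26 gesture classes
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# 2. Model training: scale the data, then fit on training data and expected output.
x_train = np.random.rand(100, 64, 64, 1)   # stand-in, already scaled to [0, 1]
y_train = np.random.randint(0, 26, 100)    # stand-in integer labels
model.fit(x_train, y_train, epochs=5)

# 3-4. Testing on data the model has never seen, then evaluating new inputs.
x_test = np.random.rand(20, 64, 64, 1)
y_test = np.random.randint(0, 26, 20)
model.evaluate(x_test, y_test)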
500px.
• Don't scale up the longer side; this can make your image blurry.
Image scaling:
• In computer graphics and digital imaging, image scaling refers to the
resizing of a digital image. In video technology, the magnification of digital
material is known as upscaling or resolution enhancement.
• When scaling a vector graphic image, the graphic primitives that make up
the image can be scaled using geometric transformations with no loss of
image quality. When scaling a raster graphics image, a new image with
a higher or lower number of pixels must be generated. In the case of
decreasing the pixel number (scaling down) this usually results in
a visible quality loss. From the standpoint of digital signal processing,
the scaling of raster graphics is a two-dimensional example of
sample-rate conversion, the conversion of a discrete signal from one
sampling rate (in this case the local sampling rate) to another.
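A small OpenCV sketch of the down-scaling described here follows; the 500 px target, the file name and the INTER_AREA interpolation are assumed choices for illustration.

# Scale the longer side of an image down to roughly 500 px while keeping the
# aspect ratio. The 500 px target and INTER_AREA interpolation are assumptions.
import cv2

img = cv2.imread('sample.jpg')          # placeholder file name
h, w = img.shape[:2]
scale = 500.0 / max(h, w)               # only scale down, never up (avoids blur)
if scale < 1.0:
    img = cv2.resize(img, (int(w * scale), int(h * scale)),
                     interpolation=cv2.INTER_AREA)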
Fig 3.1: Sample dataset from the train set
3.2 ALGORITHM
HISTOGRAM CALCULATION:
Histograms are collected counts of data organized into a set of predefined bins.
What happens if we want to count this data in an organized way?
Since we know that the range of information values for this case is 256
values, we can segment our range into subparts (called bins) like:

[0, 255] = [0, 15] ∪ [16, 31] ∪ ... ∪ [240, 255]
range = bin_1 ∪ bin_2 ∪ ... ∪ bin_n    (n = 16 bins, each 16 values wide)

and we can keep count of the number of pixels that fall in the range of each
bin_i.
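In OpenCV the same count can be obtained with calcHist, using 16 bins of width 16 over the [0, 255] range as described above; the input file name is a placeholder.

# Counting pixel intensities in 16 bins ([0,15], [16,31], ..., [240,255]).
# 'hand.jpg' is an assumed placeholder image.
import cv2

gray = cv2.imread('hand.jpg', cv2.IMREAD_GRAYSCALE)
# arguments: images, channels, mask, histSize (number of bins), ranges
hist = cv2.calcHist([gray], [0], None, [16], [0, 256])
print(hist.ravel())   # number of pixels falling into each bin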
Note. The block before the Target block must use the activation function
Softmax.
3.3 SEGMENTATION
Image segmentation is the process of partitioning a digital image into
multiple segments (sets of pixels, also known as image objects). The
goal of segmentation is to simplify and/or change the representation of
an image into something that is more meaningful and easier to
analyse. Modern image segmentation techniques are powered by deep
learning technology, and several deep learning architectures are
used for segmentation.
Why does Image Segmentation even matter?
picture) and outputting its class or probability that the input is a
particular class. Neural networks are applied in the following steps:
1) One hot encode the data: A one-hot encoding can be applied to the
integer representation. This is where the integer encoded variable is
removed and a new binary variable is added for each unique integer
value.
2) Define the model: A model, said in a very simplified form, is nothing
but a function that is used to take in certain input, perform certain
operations on the given input to the best of its ability (learning and then
predicting/classifying) and produce the suitable output.
3) Compile the model: The optimizer controls the learning rate. We will
be using 'adam' as our optimizer. Adam is generally a good optimizer
to use for many cases. The Adam optimizer adjusts the learning rate
throughout training. The learning rate determines how fast the optimal
weights for the model are calculated. A smaller learning rate may lead
to more accurate weights (up to a certain point), but the time it takes to
compute the weights will be longer. A short sketch of these three steps
is given below.
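The sketch below illustrates the three steps just listed: one-hot encoding the integer labels, defining a model and compiling it with the Adam optimizer. The class count and layer sizes are assumptions for illustration.

# Sketch of steps 1-3: one-hot encoding, defining a model, compiling with Adam.
# The 3 classes, the 100-feature input and the layer sizes are assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras.utils import to_categorical

labels = np.array([0, 2, 1, 2])                    # integer-encoded example labels
one_hot = to_categorical(labels, num_classes=3)    # step 1: one hot encode the data

model = tf.keras.Sequential([                      # step 2: define the model
    tf.keras.layers.Dense(64, activation='relu', input_shape=(100,)),
    tf.keras.layers.Dense(3, activation='softmax')])

model.compile(optimizer='adam',                    # step 3: compile with Adam, which
              loss='categorical_crossentropy',     # adapts the learning rate during
              metrics=['accuracy'])                # training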
Here are the three elements that enter into the convolution operation:
• Input image
• Feature detector
• Feature map
Steps to apply convolution layer:
• You place it over the input image beginning from the top-left
corner within the borders you see demarcated above, and then you
count the number of cells in which the feature detector matches the
input image.
• The number of matching cells is then inserted in the top-left
cell of the feature map
• You then move the feature detector one cell to the right and
do the same thing. This movement is called a stride, and since we are
moving the feature detector one cell at a time, that would be called a
stride of one pixel.
• What you will find in this example is that the feature detector's
middle-left cell with the number 1 inside it matches the cell that it is
standing over inside the input image. That's the only matching cell, and
so you write "1" in the next cell in the feature map, and so on and so
forth.
• After you have gone through the whole first row, you can then
move it over to the next row and go through the same process.
There are several benefits that we gain from deriving a feature map.
The most important of them is reducing the size of the input
image; you should know that the larger your strides (the
movements across pixels), the smaller your feature map.
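The cell-counting procedure described above can be written out directly; the 5x5 binary input image and 3x3 feature detector below are made-up patterns used only to show how the feature map is filled with a stride of one pixel.

# Slide a 3x3 feature detector over a binary image with a stride of one pixel
# and count matching 1-cells at each position, producing the feature map.
# The input image and detector values are made-up examples.
import numpy as np

image = np.array([[0, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
detector = np.array([[0, 1, 0],
                     [1, 1, 0],
                     [0, 1, 1]])

feature_map = np.zeros((3, 3), dtype=int)
for i in range(3):                     # move down one row after finishing a row
    for j in range(3):                 # stride of one pixel to the right
        window = image[i:i+3, j:j+3]
        # count cells where the detector's 1s line up with the image's 1s
        feature_map[i, j] = int(np.sum((window == 1) & (detector == 1)))
print(feature_map)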
Relu Layer:
A rectified linear unit is used to clamp the parameters to non-negative
values. We get negative pixel values too; in this layer we set them to 0.
The purpose of applying the rectifier function is to increase the
non-linearity in our images. The reason we want to do that is that
images are naturally non-linear. The rectifier serves to break up the
linearity even further in order to make up for the linearity that we might
impose on an image when we put it through the convolution operation.
What the rectifier function does to an image like this is remove all the
black elements from it, keeping only those carrying a positive value
(the grey and white colors). The essential difference between the
non-rectified version of the image and the rectified one is the
progression of colors. After we rectify the image, you will find the
colors changing more abruptly. The gradual change is no longer there.
That indicates that the linearity has been disposed of.
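In code the rectifier is simply max(0, x); the small array below uses made-up values to show negative entries being clipped to zero.

# The rectifier sets every negative value to 0 and leaves positive values unchanged.
import numpy as np

feature_map = np.array([[-3.0, 2.0], [0.5, -1.5]])   # made-up values
rectified = np.maximum(0, feature_map)
print(rectified)    # [[0.  2. ]
                    #  [0.5 0. ]]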
Pooling Layer:
The pooling (POOL) layer reduces the height and width of the input. It
helps reduce computation, and it helps make feature detectors more
invariant to their position in the input. This process is what provides
the convolutional neural network with its "spatial variance" capability.
In addition to that, pooling serves to minimize the size of the images as
well as the number of parameters, which, in turn, prevents an issue of
"overfitting" from coming up. Overfitting, in a nutshell, is when you create
an excessively complex model in order to account for the
idiosyncrasies we just mentioned. The result of using a pooling layer
and creating down-sampled or pooled feature maps is a summarized
version of the features detected in the input. They are useful because small
changes in the location of a feature in the input detected by the
convolutional layer will result in a pooled feature map with the
feature in the same location. This capability added by pooling is called
the model's invariance to local translation.
on weight values. For now, all you need to know is that the loss
function informs us of how accurate our network is, which we then use
in optimizing our network in order to increase its effectiveness. That
requires certain things to be altered in our network. These include the
weights (the blue lines connecting the neurons, which are basically the
synapses) and the feature detector, since the network often turns out
to be looking for the wrong features and has to be reviewed multiple
times for the sake of optimization. This full connection process
practically works as follows:
• The neuron in the fully-connected layer detects a certain feature; say, a
nose.
• It preserves its value.
• It communicates this value to the classes of the trained images.
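Putting the convolution, ReLU, pooling and fully connected stages together, a stack like the one described in this section could be written as the sketch below; the filter counts, kernel sizes, 64x64 input and 26 output classes are illustrative assumptions rather than our exact architecture.

# Illustrative CNN stack mirroring the stages described above:
# convolution + ReLU -> pooling -> flatten -> fully connected -> softmax classes.
# Filter counts, kernel sizes, the 64x64 input and 26 classes are assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
                           input_shape=(64, 64, 1)),   # feature detectors + rectifier
    tf.keras.layers.MaxPool2D((2, 2)),                 # pooled feature maps
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPool2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),     # fully connected layer
    tf.keras.layers.Dense(26, activation='softmax')])  # one output per sign class
model.summary()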
3.5 TESTING
Testing Objectives:
There are several rules that can serve as testing objectives; they are:
Testing is a process of executing a program with the intent of finding an
error.
A good test case is one that has a high probability of finding an
undiscovered error.
Types of Testing:
In order to make sure that the system does not have errors,
the different levels of testing strategies that are applied at different
phases of software development are :
Unit Testing:
Unit testing is done on individual modules as they are
completed and become executable. It is confined only to the
designer's requirements. Unit testing is different from, and should be
preceded by, other techniques, including:
Informal Debugging
Code Inspection
It has been used to generate the test cases in the following cases:
Guarantee that all independent paths have been executed
Execute all loops at their boundaries and within their operational
bounds.
Execute internal data structures to ensure their validity.
Integration Testing
Integration testing ensures that software and subsystems
work together as a whole. It tests the interfaces of all the modules to make
sure that the modules behave properly when integrated together. It is
typically performed by developers, especially at the lower, module-to-module
level. Testers become involved at higher levels.
System Testing
Inclusion of changes/fixes.
Test data to use
Acceptance Testing
User Acceptance Test (UAT)
Requirements traceability:
Test case 1: Loading the trained model. Action: initialize the model and load it.
Expected output: model loaded without errors. Result: pass.
Test case 2: Converting video to frames. Action: capture the video stream and
convert it into frames. Expected output: image frames of the captured video
stream. Result: pass.
3.6 DESIGN
Dataflow Diagram
The DFD is also known as a bubble chart. It is a simple graphical
formalism that can be used to represent a system in terms of the input
data to the system, the various processing carried out on these data, and
the output data generated by the system. It maps out the flow of
information for any process or system: how data is processed in terms
of inputs and outputs. It uses defined symbols like rectangles, circles
and arrows to show data inputs, outputs, storage points and the routes
between each destination. They can be used to analyse an existing
system or to model a new one. A DFD can often visually "say" things that
would be hard to explain in words, and they work for both technical
and non-technical audiences. There are four components in a DFD:
1. External Entity
2. Process
3. Data Flow
4. Data Store
1) External Entity:
It is an outside system that sends or receives data, communicating with
the system. They are the sources and destinations of information
entering and leaving the system. They might be an outside
organization or person, a computer system or a business system. They
are known as terminators, sources and sinks, or actors. They are
typically drawn on the edges of the diagram. These are sources and
destinations of the system's input and output.
Representation:
2) Process:
It is just like a function that changes the data, producing an output. It might
perform computations, sort data based on logic, or direct the
data flow based on business rules.
Representation:
3) Data Flow:
A dataflow represents a package of information flowing between two
objects in the data-flow diagram, Data flows are used to model the
flow of information into the system, out of the system and between the
elements within the system.
Representation:
4) Data Store:
These are the files or repositories that hold information for later use,
such as a database table or a membership form. Each data store
receives a simple label.
Representation:
3.6.1 UML DIAGRAMS
UML stands for Unified Modeling Language. Taking the SRS
document of the analysis phase as input to the design phase, UML
diagrams are drawn. The UML is only a language and so is just one part
of the software development method. The UML is process independent,
although optimally it should be used in a process that is use-case
driven, architecture-centric, iterative, and incremental. The UML is a
language for visualizing, specifying, constructing and documenting the
artifacts of a software-intensive system. It is based on diagrammatic
representations of software components.
function provided by the system as a set of events that yield a visible
result for the actor.
When the initial task is complete, use case diagrams are modelled to
present the outside view.
system.
Use case diagrams are considered for high level requirement analysis
of a system. When the requirements of a system are analyzed, the
functionalities are captured in use cases.
We can say that use cases are nothing but the system functionalities
written in an organized manner. The second thing which is relevant to
use cases is the actors. Actors can be defined as something that
interacts with the system.
Functionalities to be
Actors
Fig 3.6.2: Use case diagram of the sign language recognition system
Table 3.6.3: Use case scenario for the sign language recognition system
of three things: name, attributes, and operations. Class diagrams also
display relationships such as containment, inheritance, association,
etc. The association relationship is the most common relationship in a
class diagram. The association shows the relationship between
instances of classes.
Class diagrams are the most popular UML diagrams used for the
construction of software applications. It is very important to learn the
drawing procedure of the class diagram.
Finally, before making the final version, the diagram should be drawn
on plain paper and reworked as many times as possible to make it
correct.
Fig 3.6.4: Class diagram of the sign language recognition system
3.6.4 Sequence Diagram
Logical View of the system under development. Sequence diagrams
are sometimes called event diagrams or event scenarios.
UML has introduced significant improvements to the capabilities of
sequence diagrams. Most of these improvements are based on the
idea of interaction fragments, which represent smaller pieces of an
enclosing interaction. Multiple interaction fragments are combined to
create a variety of combined fragments, which are then used to model
interactions that include parallelism, conditional branches and optional
interactions.
4.1.1 State Chart
Fig 3.6.7: State chart diagram of the sign language recognition system
There are several software requirements that must be met for software to function
on a computer, including resource requirements and prerequisites. The minimal
requirements are as follows:
Raspbian OS
Anaconda with Spyder
3.7.2 Hardware Requirements
The most common set of requirements defined by any operating system or
software application is the physical computer resources, also known as hardware.
The minimal hardware requirements are as follows,
Raspberry Pi B+
Camera Module
8 GB SD Card
3.8 PROCESSING MODULE
Raspberry Pi is a small single-board computer developed in the United Kingdom
by the Raspberry Pi Foundation. The organization behind the Raspberry Pi
consists of two arms. The first two models were developed by the Raspberry Pi
Foundation. The Raspberry Pi hardware has evolved through several versions that
feature variations in memory capacity and peripheral-device support.
The Raspberry Pi device looks like a motherboard, with the mounted chips and
ports exposed (something you'd expect to see only if you opened up your
computer and looked at its internal boards), but it has all the components you need
to connect input, output, and storage devices and start computing.
Raspberry Pi is a low-cost, basic computer that was originally intended to help
spur interest in computing among school-aged children. The Raspberry Pi is
contained on a single circuit board and features ports for:
HDMI
USB 2.0
Composite video
Analog audio
Internet
SD Card
The computer runs entirely on open-source software and gives students the ability
to mix and match software according to the work they wish to do.
The Raspberry Pi debuted in February 2012. The group behind the computer's
development - the Raspberry Pi Foundation - started the project to make
computing fun for students, while also creating interest in how computers work at a
basic level. Unlike using an encased computer from a manufacturer, the
Raspberry Pi shows the essential guts behind the plastic. Even the software, by
virtue of being open-source, offers an opportunity for students to explore the
underlying code, if they wish. The Raspberry Pi is believed to be an ideal learning
tool, in that it is cheap to make, easy to replace and needs only a keyboard and a
TV to run. These same strengths also make it an ideal product to jumpstart
computing in the developing world. The quad-core Raspberry Pi 3 is both faster
and more capable than its predecessor, the Raspberry Pi 2. For those interested in
benchmarks, the Pi 3's CPU--the board's main processor--has roughly 50-60
percent better performance in 32-bit mode than that of the Pi 2, and is 10x faster
than the original single-core Raspberry Pi.
Compared to the original PI, real-world applications will see performance increase
of between 2.5x for single threaded applications and more than 20x when video
playback is accelerated by the chip's NEON engine. Unlike its predecessor, the
new board is capable of playing 1080p MP4 video at 60 frames per second (with a
bit rate of about 5400Kbps), boosting the Pi's media centre credentials. That's not
to say, however, that all video will playback this smoothly, with performance
dependent on the source video, the player used and bitrate. The Pi 3 also supports
wireless internet out of the box, with built-in Wi-Fi and Bluetooth. The latest board
can also boot directly from a USB-attached hard drive or pen drive, as well as
supporting booting from a network-attached file system, using PXE, which is useful
for remotely updating a Pi and for sharing an operating system image between
multiple machines.
to a remote location, the video stream may be saved, viewed or sent on from there.
Unlike an IP camera (which connects using Ethernet or Wi-Fi), a webcam is
generally connected by a USB cable, or similar cable, or built into computer
hardware, such as laptops.
The term "USB camera" (a clipped compound) may also be used in its original
sense of a video camera connected to the USB port continuously for an
indefinite time, rather than for a particular session, generally supplying a view for
anyone who visits its web page over the Internet. Some of them, for example
those used as online cameras, are expensive, rugged professional video cameras.
RECALL
Recall may be defined as the proportion of actual positives that are returned by our
ML model. We can easily calculate it from the confusion matrix with the help of the
following formula:
Recall = TP / (TP + FN)
SUPPORT
Support may be defined as the number of samples of the true response that lie
in each class of target values.
F1 SCORE
This score will give us the harmonic mean of precision and recall. Mathematically,
F1 score is the weighted average of the precision and recall. The best value of F1
would be 1 and worst would be 0. We can calculate F1 score with the help of
following formula
F1 = 2 * (precision * recall) / (precision + recall)
F1 score is having equal relative contribution of precision and recall.
We can use classification_report function of sklearn.metrics to get the
classification report of our classification model.
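A brief example of producing that report with scikit-learn follows; the label vectors are made up purely to show the output format, not results from our system.

# Example of classification_report on made-up predictions; the arrays below are
# placeholders, not results from this project.
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]
print(classification_report(y_true, y_pred))
# The report lists precision, recall, f1-score and support for each class.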
CHAPTER 4
RESULTS
4.1 RESULTS
Our proposed methodology's execution has been examined on test data
which was distinct from the training data set. The testing process involves
43,200 image samples of different hand signals. All these images were fed
into our proposed model to come up with accurate results. Our expected
result is to obtain text which is the translation of the sign language given as
input. Our model will anticipate all the hand gestures of Indian Sign
Language in order to achieve an efficient result. The estimated accuracy of
our proposed system is more than 95%, even in a multiplex lighting
environment, which is considered an adequate result as of now for real-time
interpretation.
CHAPTER 5
CONCLUSIONS AND FUTURE WORK
5.1 CONCLUSIONS AND FUTURE WORK
The proposed prototype model can recognize and classify Indian Sign
Language using the deep structured learning technique called CNN. We
observe that the CNN model gives the highest accuracy due to its advanced
techniques. From the process, we can conclude that CNN is an efficient
technique to categorize hand gestures with a high degree of accuracy. In
future work, we would like to expand to a few more sign language dialects,
and our response time can also be improved.
REFERENCES
[1] Vijayalakshmi, P., & Aarthi, M. (2016, April). Sign language to speech conversion. In 2016 International Conference on
Recent Trends in Information Technology (ICRTIT) (pp. 1-6). IEEE.
[2] NB, M. K. (2018). Conversion of sign language into text. International Journal of Applied Engineering Research, 13(9),
7154-7161.
[3] Masood, S., Srivastava, A., Thuwal, H. C., & Ahmad, M. (2018). Real-time sign language gesture (word) recognition from
video sequences using CNN and RNN. In Intelligent Engineering Informatics (pp. 623-632). Springer, Singapore.
[4] Apoorv, S., Bhowmick, S. K., & Prabha, R. S. (2020, June). Indian sign language interpreter using image processing and
machine learning. In IOP Conference Series: Materials Science and Engineering (Vol. 872, No. 1, p. 012026). IOP
Publishing.
[5] Kaushik, N., Rahul, V., & Kumar, K. S. (2020). A Survey of Approaches for Sign Language Recognition
System. International Journal of Psychosocial Rehabilitation, 24(01).
[6] Kishore, P. V. V., & Kumar, P. R. (2012). A video-based Indian sign language recognition system (INSLR) using wavelet
transform and fuzzy logic. International Journal of Engineering and Technology, 4(5), 537.
[7] Dixit, K., & Jalal, A. S. (2013, February). Automatic Indian sign language recognition system. In 2013 3rd IEEE
International Advance Computing Conference (IACC) (pp. 883-887). IEEE.
[8] Das, A., Yadav, L., Singhal, M., Sachan, R., Goyal, H., Taparia, K., ... & Trivedi, G. (2016, December). Smart glove for
Sign Language communications. In 2016 International Conference on Accessibility to Digital World (ICADW) (pp. 27-31).
IEEE.
[9] Sruthi, R., Rao, B. V., Nagapravallika, P., Harikrishna, G., & Babu, K. N. (2018). Vision-Based Sign Language by Using
MATLAB. International Research Journal of Engineering and Technology (IRJET), 5(3).
[10] Kumar, A., & Kumar, R. (2021). A novel approach for ISL alphabet recognition using Extreme Learning
Machine. International Journal of Information Technology, 13(1), 349-357.
[11] Maraqa, M., & Abu-Zaiter, R. (2008, August). Recognition of Arabic Sign Language (Arce) using recurrent neural
networks. In 2008 First International Conference on the Applications of Digital Information and Web Technologies
(ICADIWT) (pp. 478-481). IEEE.
[12] Masood, S., Srivastava, A., Thuwal, H. C., & Ahmad, M. (2018). Real-time sign language gesture (word) recognition from
video sequences using CNN and RNN. In Intelligent Engineering Informatics (pp. 623-632). Springer, Singapore.
[13] Camgoz, N. C., Hadfield, S., Koller, O., Ney, H., & Bowden, R. (2018). Neural sign language translation. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7784-7793).
[14] Cheng, K. L., Yang, Z., Chen, Q., & Tai, Y. W. (2020, August). Fully convolutional networks for continuous sign language
recognition. In European Conference on Computer Vision (pp. 697-714). Springer, Cham.
[15] Dabre, K., & Dholay, S. (2014, April). The machine learning model for sign language interpretation using webcam images.
In 2014 International Conference on Circuits, Systems, Communication and Information Technology Applications
(CSCITA) (pp. 317-321). IEEE.
[16] Taskiran, M., Killioglu, M., & Kahraman, N. (2018, July). A real-time system for recognition of American sign language
by using deep learning. In 2018 41st International Conference on Telecommunications and Signal Processing (TSP) (pp. 1-
5). IEEE.
[17] Khan, S. A., Joy, A. D., Asaduzzaman, S. M., & Hossain, M. (2019, April). An efficient sign language translator device
using convolutional neural network and customized ROI segmentation. In 2019 2nd International Conference on
Communication Engineering and Technology (ICCET) (pp. 152-156). IEEE.
[18] Nair, A. V., & Bindu, V. (2013). A review on Indian sign language recognition. International journal of computer
applications, 73(22).
[19] Kumar, D. M., Bavanraj, K., Thavananthan, S., Bastiansz, G. M. A. S., Harshanath, S. M. B., & Alosious, J. (2020,
December). EasyTalk: A Translator for Sri Lankan Sign Language using Machine Learning and Artificial Intelligence.
In 2020 2nd International Conference on Advancements in Computing (ICAC) (Vol. 1, pp. 506-511). IEEE.
[20] Kumar, A., Madaan, M., Kumar, S., Saha, A., & Yadav, S. (2021, August). Indian Sign Language Gesture Recognition in
Real-Time using Convolutional Neural Networks. In 2021 8th International Conference on Signal Processing and
Integrated Networks (SPIN) (pp. 562-568). IEEE.
[21] Manikandan, K., Patidar, A., Walia, P., & Roy, A. B. (2018). Hand gesture detection and conversion to speech and text.
arXiv preprint arXiv:1811.11997.
[22] Misra, S., Singha, J., & Laskar, R. H. (2018). Vision-based hand gesture recognition of alphabets, numbers, arithmetic
operators and ASCII characters to develop a virtual text-entry interface system. Neural Computing and Applications, 29(8),
117-135.
[23] Hoste, L., Dumas, B., & Signer, B. (2012, May). SpeeG: a multimodal speech-and gesture-based text input solution. In
Proceedings of the International working conference on advanced visual interfaces (pp. 156-163).
[24] Buxton, W., Fiume, E., Hill, R., Lee, A., & Woo, C. (1983). Continuous hand-gesture driven input. In Graphics Interface
(Vol. 83, pp. 191-195).
[25] Kunjumon, J., & Megalingam, R. K. (2019, November). Hand gesture recognition system for translating indian sign
language into text and speech. In 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT)
(pp. 14-18). IEEE.
[26] Dardas, N. H., & Georganas, N. D. (2011). Real-time hand gesture detection and recognition using bag-of-features and
support vector machine techniques. IEEE Transactions on Instrumentation and measurement, 60(11), 3592-3607.
[27] Köpüklü, O., Gunduz, A., Kose, N., & Rigoll, G. (2019, May). Real-time hand gesture detection and classification using
convolutional neural networks. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition
(FG 2019) (pp. 1-8). IEEE.
[28] Francke, H., Ruiz-del-Solar, J., & Verschae, R. (2007, December). Real-time hand gesture detection and recognition using
boosted classifiers and active learning. In Pacific-Rim Symposium on Image and Video Technology (pp. 533-547).
Springer, Berlin, Heidelberg.
[29] Zhang, Q., Chen, F., & Liu, X. (2008, July). Hand gesture detection and segmentation based on difference background
image with complex background. In 2008 International Conference on Embedded Software and Systems (pp. 338-343).
IEEE.
[30] Mazhar, O., Navarro, B., Ramdani, S., Passama, R., & Cherubini, A. (2019). A real-time humanrobot interaction
framework with robust background invariant hand gesture detection. Robotics and Computer-Integrated Manufacturing, 60,
34-48.
[31] Liu, W., Li, X., Jia, Z., Yan, H., & Ma, X. (2017). A three-dimensional triangular vision-based contouring error detection
system and method for machine tools. Precision Engineering, 50, 85-98.
[32] Cohen, C. J., Beach, G., & Foulk, G. (2001, October). A basic hand gesture control system for PC applications. In
Proceedings 30th Applied Imagery Pattern Recognition Workshop (AIPR 2001). Analysis and Understanding of Time
Varying Imagery (pp. 74-79). IEEE.
[33] Reifinger, S., Wallhoff, F., Ablassmeier, M., Poitschke, T., & Rigoll, G. (2007, July). Static and dynamic hand-gesture
recognition for augmented reality applications. In International Conference on Human-Computer Interaction (pp. 728-737).
Springer, Berlin, Heidelberg.
[34] Kurakin, A., Zhang, Z., & Liu, Z. (2012, August). A real time system for dynamic hand gesture recognition with a depth
sensor. In 2012 Proceedings of the 20th European signal processing conference (EUSIPCO) (pp. 1975-1979). IEEE.
[35] Plouffe, G., & Cretu, A. M. (2015). Static and dynamic hand gesture recognition in depth data using dynamic time
warping. IEEE transactions on instrumentation and measurement, 65(2), 305- 316.
[36] Ghotkar, A. S., Khatal, R., Khupase, S., Asati, S., & Hadap, M. (2012, January). Hand gesture recognition for indian sign
language. In 2012 International Conference on Computer Communication and Informatics (pp. 1-4). IEEE.
[37] Dutta, K. K., & GS, A. K. (2015, December). Double handed Indian Sign Language to speech and text. In 2015 Third
International Conference on Image Information Processing (ICIIP) (pp. 374-377). IEEE.
[38] Dixit, K., & Jalal, A. S. (2013, February). Automatic Indian sign language recognition system. In 2013 3rd IEEE
International Advance Computing Conference (IACC) (pp. 883-887). IEEE.
[39] Nair, A. V., & Bindu, V. (2013). A review on Indian sign language recognition. International journal of computer
applications, 73(22).
[40] Rajam, P. S., & Balakrishnan, G. (2011, September). Real time Indian sign language recognition system to aid deaf-dumb
people. In 2011 IEEE 13th international conference on communication technology (pp. 737-742). IEEE.
APPENDICES
(a) SAMPLE CODE
Training and validation
import pandas as pd
import numpy as np
import tensorflow as tf
import os
import cv2
import pydot
import matplotlib.pyplot as plt                       # needed for the plots below
from sklearn.model_selection import train_test_split  # assumed for the data split below
from tensorflow.keras.utils import to_categorical     # needed for one-hot labels
def load_dataset(directory):
    # Parts of this function were lost in extraction; the directory walk and the
    # 64x64 resize below are assumed reconstructions.
    images = []
    labels = []
    for idx, label in enumerate(sorted(os.listdir(directory))):
        for file in os.listdir(os.path.join(directory, label)):
            img = cv2.imread(os.path.join(directory, label, file))
            img = cv2.resize(img, (64, 64))
            images.append(img)
            labels.append(idx)
    images = np.asarray(images)
    labels = np.asarray(labels)
    return images, labels
def display_images(x_data, y_data, title, display_label=True):
    # The function header and figure creation were lost in extraction;
    # the function name and the 4x4 grid below are assumed reconstructions.
    x, y = x_data, y_data
    fig, axes = plt.subplots(4, 4, figsize=(10, 10))
    fig.suptitle(title, fontsize=18)
    for i, ax in enumerate(axes.flat):
        ax.imshow(cv2.cvtColor(x[i], cv2.COLOR_BGR2RGB))
        if display_label:
            ax.set_xlabel(uniq_labels[y[i]])
        ax.set_xticks([])
        ax.set_yticks([])
    plt.show()
uniq_labels = sorted(os.listdir(data_dir))   # the definition of data_dir was lost in extraction
X_pre, Y_pre = load_dataset(data_dir)        # assumed reconstruction of the lost call
print(X_pre.shape, Y_pre.shape)
# splitting dataset into 80% train, 10% validation and 10% test data
# (assumed reconstruction of the lost split, using train_test_split twice)
X_train, X_rest, Y_train, Y_rest = train_test_split(X_pre, Y_pre, test_size=0.2)
X_test, X_eval, Y_test, Y_eval = train_test_split(X_rest, Y_rest, test_size=0.5)
Y_train = to_categorical(Y_train)
Y_test = to_categorical(Y_test)
Y_eval = to_categorical(Y_eval)
X_train = X_train/255        # scale pixel values to [0, 1]
X_test = X_test/255
X_eval = X_eval/255
model = tf.keras.Sequential([
    # The Conv2D layers below are assumed reconstructions of lines lost in extraction.
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=X_train.shape[1:]),
    tf.keras.layers.MaxPool2D((2,2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPool2D((2,2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    # The output layer needs one unit per class; Dense(1, ...) in the extracted
    # text cannot represent all the sign classes.
    tf.keras.layers.Dense(len(uniq_labels), activation='softmax')
])
model.summary()
# compile and train (assumed reconstruction of lines lost in extraction)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, Y_train, validation_data=(X_eval, Y_eval), epochs=10)
#testing
model.evaluate(X_test, Y_test)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.show()
main.py
import cv2
import numpy as np
import tensorflow as tf
import os
model = tf.keras.models.load_model(r'D:\final year main project\1Indian sign
Language\test_train.h5')
model.summary()
labels = sorted(os.listdir(data_dir))
labels[-1] = 'Nothing'
print(labels)
cap = cv2.VideoCapture(0)
while(True):
    _ , frame = cap.read()
    # The ROI coordinates, resize size, prediction call and text overlay below are
    # assumed reconstructions of lines lost in extraction.
    roi = frame[100:300, 100:300]                  # region of interest containing the hand
    cv2.imshow('Output', roi)
    img = cv2.resize(roi, (64, 64))
    img = img/255
    prediction = model.predict(np.expand_dims(img, axis=0))
    char_index = np.argmax(prediction)
    confidence = round(prediction[0][char_index] * 100, 1)
    predicted_char = labels[char_index]
    font = cv2.FONT_HERSHEY_TRIPLEX
    fontScale = 1
    thickness = 2
    msg = predicted_char + ', Conf: ' + str(confidence) + ' %'
    cv2.putText(frame, msg, (30, 80), font, fontScale, (0, 255, 0), thickness)
    print(predicted_char)
    cv2.imshow('Output1', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):          # press 'q' to stop
        break
cap.release()
cv2.destroyAllWindows()
(b) OUTPUT SCREENSHOTS
(b) Training accuracy
(d) Output in windows
PLAGIARISM REPORT