Implementation of Virtual Assistant with Sign Language using Deep Learning and TensorFlow
Swati Nadkarni, Associate Professor, Department of Information Technology, Shah and Anchor Kutchhi Engineering College, Chembur
Abstract— This paper describes a system and interface that allows deaf-mutes to make use of various voice-automated virtual assistants with the help of Sign Language. The majority of Virtual Assistants work on the basis of audio inputs and produce audio outputs, which makes them impossible to use for people with hearing and speaking disabilities. The project makes various voice-controlled virtual assistants respond to hand gestures and also produces results in the form of text output. It makes use of concepts such as Deep Learning, Convolutional Neural Networks, TensorFlow and Python audio modules. A webcam first captures the hand gestures; a Convolutional Neural Network then interprets the captured images and produces rational language, which is mapped to pre-defined datasets using Deep Learning. For this purpose, the Neural Network is linked with the TensorFlow library. The designed system then produces the audio input for the Digital Assistant using a Python text-to-speech module. The final audio output of the Digital Assistant is converted into text format using a Python speech-to-text module and displayed on the viewing screen.

Keywords— Deep Learning, Virtual Assistants, TensorFlow, Convolutional Neural Network, Hand Gestures, Sign Languages.

I. INTRODUCTION

Nowadays, Virtual Assistant devices have become part and parcel of our lives, but most of them are Voice Automated. The most commonly used Virtual Assistants are Alexa, Google Home, Apple Siri and Microsoft Cortana. These assistants listen to users' queries and respond accordingly, making their lives easier; thus they have become a very important part of daily life.

With increasing trends in technology, personal assistant devices are becoming more and more popular. But such devices are voice automated: they need audio inputs and provide audio outputs. So what if someone does not have their own voice, or is not in a condition to speak properly? That is where this project comes into the picture. Such people can easily communicate with these devices using an interface that takes hand gestures as input and provides audio as well as text output. This project has the capacity to bridge the gap between such impaired people and booming technology.

These assistants are also widely used for applications such as Home Automation. Since these assistants are purely Voice Automated, Deaf-Mutes find it hard to make use of such technology, as observed in [8]. The agenda of the project is to develop an interface that will help Deaf-mutes use these Virtual Assistants with ease. As of now, it might seem irrelevant to design such a system, but in the longer run it might help deaf-mutes to enjoy their social and personal life equally. Designing such an interface will help them find their freedom while using such technologies and might boost their confidence in this Digital Age. This paper focuses on research that combines two modern technologies, Hand Gesture Recognition and Virtual Voice Assistants, in order to make it possible for people with hearing/speaking difficulties to interact with Digital Gadgets and communicate with the outside world. This research work has used Alexa, an audio-based Virtual Assistant, and the proposed system has been successful in replacing the Speech Recognition technique with a Hand Gesture Recognition technique.

The proposed system makes use of the following technologies: TensorFlow, the most important library used for designing and developing the model of the system; the Convolutional Neural Network, a Deep Learning algorithm used for Image Recognition, which converts the images into a matrix form that the model can understand and makes them Classifier ready; and lastly OpenCV, which acts as the eye of the system, capturing and processing real-time Hand Gestures and predicting results with the help of the Classifier. This can be understood using the block diagram shown in Fig. 1.
The flow of the system can be understood from the flow chart shown in Fig. 2.

Fig. 2. Flow of the system

A. Training Dataset

The dataset is the most fundamental element of any Machine Learning model, as it is what is fed into the machine's memory to help it classify what it sees in future for the designed application. Since our system is an interface for real-time classification of hand gestures, our dataset purely consists of a large number of images in .jpeg or .jpg form; these are the only two extensions that our model accepts. The designed model uses a labelled-dataset method for training, so assigning labels as folder names simply causes the image sub-files to be trained under the assigned labels. Each label is trained with more than 2000 images captured at various possible angles in order to make the system learn better and classify more accurately and quickly, as observed in [2]. Once the model is completely trained for a particular set of labelled images, it is Classifier ready and can be used for testing the system's prediction rate. However, it was noticed that retraining the same set of labels tends to give better results in terms of the accuracy and speed of predicting hand gestures, as observed in [3]: basically, the more times the model is trained for the same set of labels, the higher the success rate. It must be kept in mind, however, that any change made to the labels folder before training results in a system that is being trained for the very first time. In simple terms, if any changes are made in the labels folder, that is, adding new labels or replacing existing ones, the model will need to be trained again from the beginning. It was observed that training around 15 labels on an averagely configured system takes about 12-15 hours of continuous model training the first time, whereas retraining the same set of labels requires comparatively less time.
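As an illustration of this labelled-dataset layout, the following sketch collects webcam images into one folder per label using OpenCV. The folder names, key bindings, capture region and image count are assumptions for illustration, not the authors' exact collection script.

```python
import os
import cv2

label = "weather"                        # assumed example label
out_dir = os.path.join("dataset", label) # folder name doubles as the class label
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(0)                # default webcam
count = 0
while count < 2000:                      # the paper trains >2000 images per label
    ok, frame = cap.read()
    if not ok:
        break
    roi = frame[100:400, 100:400]        # fixed capture region (assumption)
    cv2.imshow("capture", roi)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("c"):                  # press 'c' to save a sample at a new angle
        cv2.imwrite(os.path.join(out_dir, f"{label}_{count:04d}.jpg"), roi)
        count += 1
    elif key == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```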
B. TensorFlow

The best part of using the TensorFlow library is that it is an open-source library with many pre-designed models useful in Machine Learning and especially Deep Learning. To understand the conceptual use of TensorFlow, two terms must be understood: a Tensor is an N-dimensional array, and Flow refers to a graph of operations. Every mathematical computation in TensorFlow is considered a graph of operations, where the nodes of the graph are operations and the edges are nothing but tensors.

Any mathematical computation is written in the form of a data-flow diagram in a Python, C++ or Java frontend; in our case Python is used. The TensorFlow Execution Engine then comes into the picture and makes it deployable on almost any hardware or embedded system, be it a CPU, Android or iOS. TensorFlow is a Machine Learning framework that uses the dataset to train Deep Learning models, helps in prediction and also improves future results.

The biggest advantage of using TensorFlow is its feature of providing abstraction: the developer does not need to work on every small aspect of designing the model, as this is managed by the library itself, giving the developer the freedom to focus on logic building, as clearly explained in [7].

TensorFlow in our system helps us train the model using the provided dataset. TensorFlow object recognition algorithms help us classify and identify different hand gestures when combined with OpenCV. By analysing thousands of photos, TensorFlow can help classify and identify real-time hand gestures. It makes it possible to develop a model which can identify 3D images and classify them on the basis of the 2D images from its training dataset. TensorFlow is capable of processing more information and spotting more patterns.
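The "graph of operations" view described above can be made concrete with a few lines against the public TensorFlow 2 API. This toy computation is illustrative only and is not code from the paper; the 4x4 input is chosen arbitrarily.

```python
import tensorflow as tf

# Tensors (n-dimensional arrays) flow along the edges of the graph;
# each operation below is a node that produces a new tensor.
image = tf.random.uniform((1, 4, 4, 1))            # toy 4x4 single-channel "image"
kernel = tf.ones((2, 2, 1, 1)) / 4.0               # 2x2 averaging filter

conv = tf.nn.conv2d(image, kernel, strides=1, padding="VALID")   # convolution op
flat = tf.reshape(conv, (1, -1))                                  # reshape op
score = tf.reduce_sum(flat)                                       # reduction op

print(conv.shape)      # (1, 3, 3, 1)
print(float(score))    # a single scalar flowing out of the graph
```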
C. Deep Learning

Deep Learning is basically a subset of Machine Learning consisting of algorithms that make use of multi-layer neural networks. Deep Learning most often uses a Neural Network to implement its functioning. A Neural Network is a collection of layers that transforms the input in some way to produce an output.

An image can be treated as a matrix of pixel values, so it may seem that classification should be an easy task based simply on matrix comparison, but that is not the case with complex image matrices, images with similar matrix forms, or a very large dataset of images with minimal changes in the matrix. This can lead to clashes in prediction scores and thereby affect the accuracy and speed of the classifier model. This is where the Neural Network comes into the picture, and thus it is required to use deep learning over machine learning. Machine Learning works with a smaller number of layers when compared with Deep Learning, as observed from [12], and is thus not preferred for technologies like Image Recognition, which require Convolutional Neural Networks.
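As a minimal sketch of the "collection of layers transforming input to output" idea, the following tf.keras model stacks a few dense layers. The input size and layer widths are assumptions for illustration, not the configuration used in this system.

```python
import tensorflow as tf

# A small multi-layer network: each layer transforms the previous layer's output.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),             # assumed toy image size
    tf.keras.layers.Flatten(),                          # image matrix -> vector
    tf.keras.layers.Dense(128, activation="relu"),      # hidden layer 1
    tf.keras.layers.Dense(64, activation="relu"),       # hidden layer 2
    tf.keras.layers.Dense(10, activation="softmax"),    # scores for 10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```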
D. Convolutional Neural Network

A Convolutional Neural Network is a Deep Learning algorithm that is capable of assigning biases and weights to different objects in an image, and on that basis it can differentiate one image from another. It processes an image through different layers for Image Classification and is designed to mimic the functioning of neurons in the human brain, as explained in [4].

Even the most minimally pixelated image still needs a 4x4 matrix, and the same image has to be considered in different colour channels such as RGB, Greyscale, HSV, etc., so it is very difficult to process thousands of images at high pixel resolutions, for instance 1020x1980 pixels. Here comes the need for the Convolutional Neural Network, which convolutes every image into a reduced basic matrix form that remains differentiable. This increases accuracy and speed while also reducing the processing required of the Classifier model. The convolutional layer is also supported by a Pooling layer to further decrease the processing needs of the classifier model. It likewise convolutes the matrix, but on the basis of dominant features. Pooling is majorly of two types, MAX Pooling and AVG Pooling, as clearly explained in [6].

The Inception-v3 Convolutional Neural Network has been implemented while designing this system. Inception-v3 is a 48-layer deep neural network. The Inception network is better than most Convolutional Neural Networks because it does not just dig deeper and deeper into the layers like other Convolutional Neural Networks; instead it works wider on the same layer before going deeper into the next layer. This is the reason bottlenecks are used while training the model. A bottleneck in a Neural Network is simply a layer with fewer neurons than the layers above or below it. The TensorFlow bottleneck is the last step of the pre-processing phase that runs before the actual training of the dataset starts.
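A hedged sketch of this bottleneck idea, assuming the per-label image folders described earlier: the frozen Inception-v3 base reduces each image to a compact feature vector, and only a small classification head is trained on top. This uses the public tf.keras API as an approximation of the retraining workflow, not the authors' exact script.

```python
import tensorflow as tf

# Folder names under "dataset/" act as the class labels (assumed layout).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset", image_size=(299, 299), batch_size=32)
num_classes = len(train_ds.class_names)

# Frozen 48-layer Inception-v3 base: its pooled output plays the bottleneck role.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0),  # Inception expects [-1, 1]
    base,
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)                              # epoch count is illustrative
model.save("gesture_classifier.keras")
```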
E. OpenCV

OpenCV is an open-source library for Computer Vision. Once all the training and classification components are ready to be executed, the designed system needs an eye to capture real-time images of the hand gestures, which can then be sent for classification and identification. OpenCV provides this visual front end to the Deep Learning model for image processing. Here images are considered over two channels, an RGB channel and a Greyscale channel: once an image is captured by OpenCV, it is first converted into the Grey channel so that it can undergo morphological processing, as shown in [9]. OpenCV makes use of the NumPy library for numerical computation of images in the form of matrices of pixels.

A blue box of particular dimensions has been designed with the help of OpenCV in such a way that only hand gestures present inside this blue box are considered. The image is then converted over the different channels and into the convoluted matrix form, so that the Classifier model can compare it with the previously learned labelled images. It then suggests a gesture prediction on the basis of the score generated. Because OpenCV converts the real-time hand gesture continuously, it keeps suggesting predictions in response to the slightest motion of the hand. The confirmed prediction with the highest score enters the sequence, until a CALL COMMAND is executed. The entire sequence then enters the next stage of the designed interface, where it is converted into audio format, which wakes the Virtual Voice Assistant and becomes the input query.
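The capture loop below is a minimal sketch of this stage, assuming the hypothetical classifier saved above: it draws a fixed blue region of interest, converts the crop to greyscale and back, and overlays a predicted label per frame. The box coordinates, model path and label list are illustrative assumptions.

```python
import cv2
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("gesture_classifier.keras")  # assumed model file
labels = sorted(["news", "time", "weather"])                     # assumed label names

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    x1, y1, x2, y2 = 100, 100, 400, 400                        # fixed region of interest
    cv2.rectangle(frame, (x1, y1), (x2, y2), (255, 0, 0), 2)    # blue box (BGR)
    roi = frame[y1:y2, x1:x2]

    grey = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)                # greyscale pre-processing
    rgb = cv2.cvtColor(grey, cv2.COLOR_GRAY2RGB)                # back to 3 channels for the CNN
    batch = cv2.resize(rgb, (299, 299))[np.newaxis].astype("float32")

    scores = model.predict(batch, verbose=0)[0]
    cv2.putText(frame, labels[int(np.argmax(scores))], (x1, y1 - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.9, (255, 0, 0), 2)
    cv2.imshow("gesture", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```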
F. Python Text and Speech APIs

The Python text-to-speech library used here is very simple and easy to use. It makes use of the pyttsx3 module and its engine interface, which let us change properties such as the rate and intervals of the text-to-speech conversion and output.

The Python speech-to-text side makes use of the speech recognition module. It lets us adjust for ambient noise and also helps in recording the audio in the form of audio files.
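A minimal sketch of these two modules, assuming the pyttsx3 and SpeechRecognition packages (with a working microphone) are installed; the spoken query and property values are illustrative.

```python
import pyttsx3
import speech_recognition as sr

# Text-to-speech: speak the recognised gesture sequence to the voice assistant.
engine = pyttsx3.init()
engine.setProperty("rate", 150)            # speaking rate (words per minute)
engine.say("Alexa, what is the weather?")  # assumed example query
engine.runAndWait()

# Speech-to-text: capture the assistant's spoken reply and turn it into text.
recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source, duration=1)
    audio = recognizer.listen(source, phrase_time_limit=10)

try:
    reply_text = recognizer.recognize_google(audio)   # online recogniser
    print(reply_text)                                  # shown on the output frame
except sr.UnknownValueError:
    print("Could not understand the assistant's reply")
```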
IV. RESULTS
A short demonstration of our project is given below with the help of images. Here, a hand gesture representing the term "Weather" is performed; once the CALL COMMAND is received, a query asking about the weather is sent to the Virtual Voice Assistant, and the real-time output is converted into text and displayed on the output frame.
Fig. 8. Interface
Under proper conditions, the system was able to provide its most accurate and best results. However, in poor light conditions and in the absence of a proper background, the system sometimes struggled to produce the correct and expected results. A proper background is favourable for better results, as is the presence of a good amount of light while presenting the hand gestures. It is therefore required to overcome these difficulties in order to make the system perform better.