OBJECT DETECTION WITH AUDIO FEEDBACK USING TRIGGER WORD

A PROJECT REPORT

Submitted by

D. DHEERAJ [Reg No: RA1711003010040]
ANKIT KUMAR [Reg No: RA1711003010086]

BACHELOR OF TECHNOLOGY
BONAFIDE CERTIFICATE
Certified that this B.Tech project report titled “OBJECT DETECTION WITH
AUDIO FEEDBACK USING TRIGGER WORD” is the bonafide work of
“D.DHEERAJ [Reg No: RA1711003010040], ANKIT KUMAR [Reg No:
RA1711003010086]” who carried out the project work under my supervision.
Certified further that, to the best of my knowledge, the work reported herein does not form part of any other thesis or dissertation on the basis of which a degree or award was conferred on an earlier occasion for this or any other candidate.
SIGNATURE SIGNATURE
Own Work Declaration
Department of Computer Science and Engineering
SRM Institute of Science & Technology
Own Work* Declaration Form
This sheet must be filled in (each box ticked to show that the condition has been met). It must be signed and dated
along with your student registration number and included with all assignments you submit – work will not be
marked unless this is done.
Title of Work : Object Detection with Audio Feedback using Trigger Word
I / We hereby certify that this assessment complies with the University's Rules and Regulations relating to academic misconduct and plagiarism**, as listed on the University website, in the Regulations, and in the Education Committee guidelines.
I / We confirm that all the work contained in this assessment is my / our own except where indicated, and that I / We
have met the following conditions:
I understand that any false claim for this work will be penalized in accordance with the University policies and
regulations.
DECLARATION:
I am aware of and understand the University's policy on academic misconduct and plagiarism, and I certify that this assessment is my / our own work, except where indicated by referencing, and that I have followed the good academic practices noted above.
If you are working in a group, please write your registration numbers and sign with the date for every student in
your group.
ACKNOWLEDGEMENT
We express our humble gratitude to C. Muthamizhchelvan, Vice Chancellor (I/C),
SRM Institute of Science and Technology, for the facilities extended for the project
work and his continued support.
We wish to thank Dr. B.Amutha, Professor & Head, Department of Computer Science
and Engineering, SRM Institute of Science and Technology, for her valuable
suggestions and encouragement throughout the period of the project work.
We sincerely thank the staff and students of the Computer Science and Engineering Department, SRM Institute of Science and Technology, for their help during our research. Finally, we would like to thank our parents, our family members and our friends for their unconditional love, constant support and encouragement.
D. Dheeraj
Ankit Kumar
ABSTRACT
People with poor vision or blindness are unable to perceive elements of their environment in the same way as other people do. Performing daily tasks and avoiding obstacles are only a couple of the difficulties they must overcome. This application seeks to develop an assistive system for the visually disabled that will help them deal with all the obstacles in their path. It assists visually impaired people by including a framework that can tell them what object is in front of them without having to touch or smell it. The application makes use of Natural Language Processing (NLP), Digital Image Processing and Computer Vision to assist visually impaired individuals using a voice activation and object detection model with audio feedback, which allows the individual to get an audio output of any object present in front of a camera after the system recognizes a trigger word on which it was trained.
Table of Contents
ACKNOWLEDGEMENT ............................................................................................... v
ABSTRACT .................................................................................................................... vi
ABBREVIATIONS .......................................................................................................xiii
1.1 Purpose................................................................................................................ 1
3.5 Long short-term memory (LSTM).................................................................... 10
8.3.2 Object Detection Module .......................................................................... 39
REFERENCES ............................................................................................................... 47
APPENDIX A ................................................................................................................ 48
A.1.3 detect.ipynb.................................................................................................... 50
LIST OF TABLES
Table 6.1: Results for COCO val 2017 (5k images) ...................................................... 30
LIST OF FIGURES
Figure 4.5: Visual Representation of Spectrogram and Probability of final Outputs. ... 15
Figure A.2: Python Notebook to Record Audio in Real Time ....................................... 50
Figure A.3: Importing necessary packages, models, weights and dataset ..................... 51
Figure A.5: Detecting Trigger Word and Calling Object Detection Method upon trigger ....................... 53
ABBREVIATIONS
CNN Convolutional Neural Network
AI Artificial Intelligence
LIST OF SYMBOLS
w Weights for a particular layer
CHAPTER 1: INTRODUCTION
1.1 Purpose
Up until 2015, there were around 940 million people around the world who were suffering from a certain level of vision loss. Out of these 940 million people, 240 million had very low vision and 39 million were completely blind. According to the World Health Organization's (WHO) first World Vision Report, published in 2019, more than a fifth of the world's population (2.2 billion people) suffers from vision impairment that may have been avoided or left untreated.

Visually disabled individuals face specific challenges when doing activities that normal people take for granted, such as searching for keys, perceiving and locating items in the environment, and walking along a path both indoors and outdoors. With the advancement of technology, it is the duty of computer scientists and researchers to work in the field that improves the way in which every human lives. Not only should a product be economically beneficial, it should also let a person overcome his or her disability by depending on technology if needed.
1.2 Scope
● LCW Sense: Blind or visually impaired people can readily find all of the
information about a dress, including the colour, fabric, design, washing
instructions, and price. This enables visually impaired buyers to make their
own apparel choices without the help of others.
● Smart Braille: Smart Braille is a braille-integrated Android application. It allows a visually impaired person to use braille connected to the application to enter text, and also lets the application translate text into braille.
CHAPTER 2: LITERATURE SURVEY
2.1 Review
In [1], the paper proposes a way to develop an application that can detect a word on which it was trained. The idea is similar to the working of present-day virtual assistants such as Alexa and Google Home Assistant. To develop such an application, a model consisting of several layers is fitted and trained on the data provided. It uses layers such as ReLU, GRU, Dropout and Sigmoid. The training data for the model comprises three types of audio clips: positive, negative and background noise. A positive audio clip contains the main trigger word that should activate the application, whereas a negative audio clip contains words other than the trigger word. Background noise is also included so that the model works in every possible environment, including a noisy one. Once the model is trained and ready to detect the trigger word, a pre-recorded audio clip is passed through it; when the trigger word is identified, a chime sound is added just after the trigger word in the same audio clip, which is then returned.
In [2], the paper aims at building a deep convolutional neural network that categorizes 1.2 million images into 1000 different categories. The network comprises 60 million parameters and 650,000 neurons, and is made up of five convolutional layers, some of which are followed by max-pooling layers, followed by three fully connected layers and a final 1000-way softmax layer. To speed up training, non-saturating neurons and an efficient GPU implementation of the convolution operation were used. To avoid overfitting, a dropout layer is added, which reduces the overfitting of the network to a great extent.
In [3], the paper proposes a simple object recognition algorithm that increases the mAP by more than 30% compared to the previous best results. The idea is to apply a high-capacity convolutional neural network to selective region proposals in order to localize and classify the objects present, as per the dataset. This way, the number of iterations needed to search for an object is reduced.
In [4], the paper proposes a strategy for improving detection accuracy while supporting real-time operation, by modelling the bounding box of YOLOv3, one of the most representative one-stage detectors, with a Gaussian parameter and redesigning the loss function. Furthermore, the paper suggests a method for predicting the localization uncertainty, which indicates the reliability of the bounding box. The proposed scheme effectively decreases false positives and increases true positives by using the predicted localization uncertainty during the detection process, thereby improving precision. The proposed version of YOLOv3 increases the mean average precision while maintaining a real-time frame rate.
In [5], the paper proposes using an adversarial network to generate examples that are difficult for the object detector to recognize. The proposed idea suggests that the original network and the adversary are trained jointly. This idea resulted in a significant improvement in mean average precision.
In [6], the paper suggests combining Fast R-CNN and a Region Proposal Network (RPN) into a single network to speed up the search and give output labels with high accuracy. The RPN shares the image's convolutional features and returns candidate regions that may contain objects. An RPN is a fully convolutional network that simultaneously predicts object bounding boxes and objectness scores at each position.
Taking inspiration from the present systems and applications, we have come up with a proposed solution which contains the following:
Object Detection Neural Network model and will resume the process when the specific trigger word is detected by the system. For example, we use "Hey Siri" or "Ok Google" to trigger voice assistants such as Siri and Google Now. In the same way, our proposed solution will have such a trigger phrase to activate the application. An external device such as a microphone will be used for tracking the trigger word.
CHAPTER 3: FUNDAMENTALS
Social networking sites may use face recognition tools to assist users in tagging and posting photos of friends. Text images are converted into an editable form using optical character recognition (OCR) technology. Based on a user's preferences, machine learning-powered recommendation algorithms can suggest which movies or TV shows to watch next. Machine learning has also made it possible for those working in the automation field to integrate it into cars and enable self-driving features.
3.2 Computer Vision

3.3 Deep Learning

Deep learning is yet another branch of artificial intelligence that aims to mimic the functioning of the human brain in data acquisition, speech recognition, language translation, and decision-making. It helps networks learn from unstructured data, understand the meaning behind the data and build a model that has human-level intelligence and can automate and generate results according to the data on which it was trained.
Deep learning emerged alongside the digital era, which led to an explosion of data in all formats and from every corner of the globe. This data, commonly known as big data, comes from sources such as social media, e-commerce sites and search engines. This huge amount of data is readily available and can be shared through technologies like cloud computing.

Since the data generated is usually unstructured, it can take people decades to understand it and extract useful information from it. Companies are increasingly turning to AI technology for digital assistance as they see the tremendous power that can be gained by unlocking this data resource.
3.4 Natural Language Processing (NLP)

The ability to decode spoken and written human language (also known as natural language) is known as Natural Language Processing (NLP). It is a part of artificial intelligence (AI).
3.5 Long Short-Term Memory (LSTM)

Since there may be lags of uncertain length between significant events in a time series, LSTM networks are well suited to classifying, analyzing, and making predictions based on time-series data. LSTMs were created to solve the issue of vanishing gradients that can occur while training conventional RNNs.
CHAPTER 4: REAL-TIME TRIGGER WORD
DETECTION
4.1 Introduction
A trigger word, also known as a wake-up word, is a word that, when identified, activates an application such as a voice assistant device like Alexa or Google Home. Whenever these devices are switched on, they go into an alert state when they hear the trigger word on which they were trained. Hearing a trigger word indicates that the device needs to be ready to attempt the task commanded by the registered user. Various trigger words can be used to train such a model, like "Hey Siri" used by Apple devices for Siri, "Ok Google" for the Google Home assistant and "Alexa" for Amazon's voice assistant device Alexa. In our project we use the word "Activate" to trigger our application.
The most basic type of trigger word detection model is one that can detect a trigger word only in a pre-recorded audio clip. Our project aims to build trigger word detection that works in real time instead. To achieve this goal, we extend the basic approach by running two Python scripts simultaneously: one script records one second of the user's audio, stores it in the local directory and sleeps for 2.5 seconds; the other script fetches the recorded audio and checks it for the trigger word, using a pre-trained model trained on the trigger word "Activate". Once the trigger word is detected, a chime sound is played and the application hands over to the object detection module.
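A minimal sketch of the recording script is given below. It assumes the pyaudio library listed later in this report, an illustrative recordings/latest.wav output path, and mono 16-bit audio at 44.1 kHz; the exact clip length, sample rate and sleep interval used in the project may differ.

import os
import time
import wave
import pyaudio

RATE = 44100              # sampling rate in Hz (assumed)
CHUNK = 1024              # frames read from the microphone per buffer
RECORD_SECONDS = 1        # length of each clip handed to the detector script
OUT_PATH = "recordings/latest.wav"   # illustrative shared path polled by the detector

os.makedirs("recordings", exist_ok=True)
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)

while True:
    # Record one short clip from the microphone.
    frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * RECORD_SECONDS))]
    # Overwrite the shared file that the detector script reads.
    with wave.open(OUT_PATH, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(pa.get_sample_size(pyaudio.paInt16))
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))
    # Sleep while the detector script processes the clip.
    time.sleep(2.5)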
4.3 Algorithm
When the recorded module is fetched from the local directory, the audio clip is
passed through the trained model. The model is a recurrent neural network
consisting of various layers in order to achieve expected output from the model.
detection system, each layer of this network processes the input features obtained
in order to process it and give a desired output ŷ<t>, which represents the
12
probability of presence of trigger word at the timestamp t. The layers used for
each timestamp can be understood with the figure and a brief description given
below.
The network begins with a Conv1D layer, which compresses the input so that only the required features are extracted for processing. The output of the Conv1D layer is then fed into a Batch Normalization layer. Batch normalization is a method of standardising the inputs of each mini-batch to a layer while training very deep neural networks; it stabilizes the learning process and significantly reduces the number of training epochs required. The network then uses a ReLU activation function to increase the model's non-linearity, and the result is passed through a Dropout layer to prevent overfitting. It then goes through a GRU layer, which is similar to a long short-term memory cell with a forget gate but has fewer parameters than an LSTM, since it does not have an output gate.
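As an illustration, the stack of layers described above could be assembled in Keras roughly as follows. The layer sizes, dropout rates and the (5511, 101) spectrogram input shape are assumptions made for this sketch, not the exact values used in the project.

from tensorflow.keras.layers import (Input, Conv1D, BatchNormalization, Activation,
                                     Dropout, GRU, Dense, TimeDistributed)
from tensorflow.keras.models import Model

def trigger_word_model(input_shape=(5511, 101)):
    """input_shape = (spectrogram timesteps, frequency bins) -- assumed values."""
    x_in = Input(shape=input_shape)

    # Conv1D compresses the spectrogram and extracts local features.
    x = Conv1D(filters=196, kernel_size=15, strides=4)(x_in)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = Dropout(0.8)(x)

    # GRU layers model the temporal structure of the audio.
    x = GRU(128, return_sequences=True)(x)
    x = Dropout(0.8)(x)
    x = BatchNormalization()(x)
    x = GRU(128, return_sequences=True)(x)
    x = Dropout(0.8)(x)
    x = BatchNormalization()(x)

    # Sigmoid output per timestep: probability that the trigger word just ended.
    y = TimeDistributed(Dense(1, activation="sigmoid"))(x)
    return Model(inputs=x_in, outputs=y)

model = trigger_word_model()
model.summary()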
The figure below illustrates how the model works on spectrograms. Once the word "Activate" is recognized, the next 50 timesteps of the spectrogram are labelled 1, indicating that the trigger word was detected. The figure shows how the trigger word detector recognizes a trigger word from a spectrogram, and a graph is plotted to visualize it better.
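A small sketch of this labelling scheme is shown below; the number of output timesteps Ty and the 10-second clip length are assumptions used only for illustration.

import numpy as np

Ty = 1375                 # number of output timesteps of the network (assumed)

def insert_ones(y, end_ms, clip_ms=10000):
    """Set the 50 labels that follow the end of a trigger word to 1.
    end_ms: time in milliseconds at which the trigger word ends in the clip."""
    end_step = int(end_ms * Ty / clip_ms)
    y[0, end_step + 1 : end_step + 51] = 1
    return y

y = np.zeros((1, Ty))
y = insert_ones(y, end_ms=4500)   # example: the trigger word ends at 4.5 s
print(int(y.sum()))               # 50 timesteps labelled 1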
CHAPTER 5: YOLO: YOU ONLY LOOK ONCE
5.1 Introduction
“You Only Look Once” (a.k.a. YOLO) is a cutting-edge object detection model that is both efficient and precise, and one of the most powerful object detection algorithms available at the moment. Unlike previously used object detection algorithms, such as R-CNN or its modified and upgraded version Faster R-CNN, this algorithm needs the image or video to pass only once through the network. YOLOv4, as used in this project, was first introduced in the paper by Alexey Bochkovskiy, Chien-Yao Wang and Hong-Yuan Mark Liao (2020). In this chapter, we explain the concept of object detection used in this algorithm.
Before discussing the YOLO algorithm, let us discuss the concept of Object
Detection and Convolutional Neural Networks and their key observations.
5.2 Object Detection
autonomous vehicles, self-driving vehicles, and accident-avoidance
systems.
● Object Localization: Object localization helps us to find our object in
the image after it has been classified, essentially addressing the
question "Where is it in the picture?".
In the diagram below (figure 5.1), we see an RGB picture divided into three colour planes: red, green, and blue. There are several colour spaces in which images can be represented: grayscale, RGB, HSV, CMYK, and so on. Imagine how computationally intensive things become for an image as the resolution grows until it hits, say, 8K (7680x4320).
The network takes an input image as an array of pixel values (RGB values whose number depends on the image's size and resolution). Essentially, the intention is to feed the machine this array of numbers so that it can generate output numbers that represent the likelihood of the image belonging to a certain class.
The architecture also adapts to high-level features with additional layers, resulting in a network with a healthy interpretation of the images in the dataset, similar to how we would understand them.

When progressing through the convolutional layers, the output of the first convolutional layer is used as the input of the second convolutional layer; typically, this second layer is a hidden layer. Each layer's output encodes the positions of low-level features present in the image. As more filters are applied, the output takes the form of activations that represent higher-level features.
A fully connected layer is a (usually) cheap way of learning non-linear combinations of the high-level features produced by the convolutional layers; it learns a possibly non-linear function in that space. Take, for example, a picture of a cat: the activation maps would display high-level features such as paws, legs, and so on. If it is a bird, it will show the appropriate features. The main goal of this layer is to look for higher-level features that closely correlate to a given class, with weights such that when the output is calculated we get the right probabilities for the various classes present. Over a series of epochs, the model can distinguish between dominant and certain low-level features in images and classify them using the Softmax classification technique.
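The softmax step can be illustrated with a few lines of Python; the class scores below are made-up values and the three classes are only an example.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])    # raw scores for, e.g., cat, bird, dog
print(softmax(logits))                # probabilities that sum to 1, highest for "cat"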
YOLO stands for you only look once. It is a real-time object detection algorithm
that is known to be one of the most powerful and useful object detection algorithms
that can extend most of the groundbreaking ideas that come from computer vision.
A critical aspect of autonomous technology is target recognition. It is the field of
computer vision that, relative to a few years ago, is expanding and functioning in
a much better way.
In order to explain this concept, it is important to understand the
classification of images first. As we go on, the degree of complexity increases.
There are a number of algorithms that help us detect objects, and they can be classified into the following two groups.
image, which are seen inside a bounding frame.
Each bounding box has the following characteristics that can be used to identify it: bx and by, the coordinates of the box centre, and bw and bh, its width and height. We normalize each of the four b values to the range 0-1 by describing them as a ratio of the image width and height. In addition to these, one more (fifth) value needs to be calculated: pc, also called the box confidence score (BC), the probability that an object is present inside the bounding box.

This metric determines how probable the bounding box is to contain an object of any class and how precise its prediction is. We want BC = 0 if there is no object in that box, and BC = 1 when estimating the ground truth.
YOLO is trained on full images and directly optimizes detection performance. Since detection is framed as a regression problem, no complicated pipeline is needed.
Figure 5.3: YOLO Network Architecture
The above diagram illustrates the YOLO network architecture that uses the
convolutional layers of the neural network to divide the image in a grid, create
bounding boxes and classify the objects.
Faster R-CNN is a top-performing object detection algorithm, but it sometimes mistakes background patches in an image for objects because it cannot see the wider context. YOLO makes less than half the number of background errors compared to Faster R-CNN. YOLO is also much less likely to break down when introduced to a different domain or when unexpected input is fed to it, and it allows detection to run at real-time speed while keeping a high overall accuracy.
The framework divides the picture into an S x S grid, and these grid cells are responsible for detecting objects.

Each grid cell predicts several bounding boxes and produces confidence scores for those boxes. These scores reflect how sure the model is that the box contains an object and how accurately the box is predicted. If no object is present, the confidence score should be zero.
Each bounding box consists of the five predictions mentioned above. The coordinates refer to the centre of the box relative to the bounds of the grid cell, while the width and height are predicted relative to the whole image. Finally, the confidence prediction represents the agreement between the ground-truth box and the predicted box. Each grid cell also predicts conditional class probabilities, conditioned on the grid cell containing an object; only a single set of class probabilities is predicted per grid cell.

At test time, the conditional class probabilities are multiplied by the individual box confidence predictions, which gives class-specific confidence scores for each box. These scores encode both the probability of the class appearing in the box and how well the predicted box fits the object. Fast YOLO uses a neural network with fewer convolutional layers and fewer filters in those layers; apart from the scale of the network, all training and testing parameters are identical between YOLO and Fast YOLO. The model is optimized using a sum-squared error because it is efficient and simple to optimize, although it weights localization error equally with classification error, which may not be ideal.

In every image, many grid cells contain no object, which pushes the confidence scores of those cells towards zero and can make training unstable, since the gradient from the few cells that do contain objects can be overpowered. Also, the sum-squared error gives equal weight to errors in big boxes and small boxes, although these should ideally be handled differently. YOLO predicts separate bounding boxes for each grid cell.
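The test-time score combination described above can be illustrated numerically; the probabilities below are invented for the example.

import numpy as np

class_probs = np.array([0.6, 0.3, 0.1])   # conditional class probabilities for one grid cell
box_confidence = 0.8                      # confidence of one predicted box in that cell

class_scores = class_probs * box_confidence
print(class_scores)                       # [0.48 0.24 0.08] -> kept only if above the threshold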
The CNN-based region proposal network uses a selective search strategy that works well but does not fully solve the issue. We want to decrease the system's response time so that when the system is queried by the user, it responds as quickly as possible. YOLO natively gives a quicker response and can support a system that relies on position-sensitive grid feature maps that are fast to process. This can be achieved using Grid Convolutional Layers (GCL), which activate the object-specific regions of the grid-structured feature maps.
There is a probability distribution over 80 classes for each cell, but only one set of class probabilities is predicted per cell. YOLO's prediction is an S x S x (B * 5 + C) tensor, which includes B box predictions for each grid cell and C class predictions for each grid cell (C stands for the number of classes). The working of YOLO is illustrated in fig 5.4 below. The latest release, YOLOv4, is used in this project. The application uses the COCO dataset, which comes pre-loaded with 80 classes of different day-to-day objects.
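A quick arithmetic check of this output size, using the grid size, box count and class count quoted in this report, looks as follows.

S, B, C = 19, 5, 80                    # 19x19 grid, 5 boxes per cell, 80 COCO classes

values_per_cell = B * 5 + C            # 5 boxes x (bx, by, bw, bh, confidence) + class scores
total_values = S * S * values_per_cell
total_boxes = S * S * B                # the 1805 bounding boxes per image mentioned earlier

print(values_per_cell, total_values, total_boxes)   # 105 37905 1805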
Figure 5.4: YOLO Algorithm Working
CHAPTER 6: WHY CHOOSE YOLOv4?
With YOLOv4, individual users, not just large organizations, can train and evaluate CNN-based detectors using a single NVIDIA gaming graphics card with 8-16 GB of VRAM. This was not possible with the previous modern detectors. Let's take a look at the design and developments in YOLOv4.
Compared with the previous YOLOv3, YOLOv4 is considered a better version for the following reasons:
Below is a chart showing the results for COCO val 2017, on an RTX 2080 Ti with threshold conf=0.001.
Figure 6.1: Graphical representation of YOLOv4’s speed and accuracy
CHAPTER 7: DARKNET
7.1 Introduction
Darknet is easy to set up, and it works with either CPU or GPU computation. The source code for Darknet can be found on GitHub. Darknet comes with two optional dependencies: OpenCV for image processing and CUDA for GPU processing. Neither dependency is required, however; users can start by downloading the base application.
In Darknet, details are displayed while the config file and weights are loaded, followed by a report of the top-10 classes for the image. Through a feature called Nightmare, the architecture can also be used to run neural networks backwards.

Darknet also includes a neural network that determines the most likely next moves in a game of Go. Users can replay professional games to see what moves are likely to occur next, let it play against itself, or attempt to play against it.
7.2 Installation
To install Darknet on a Linux machine or through the Windows Subsystem for Linux (WSL) interface, it only requires running three commands in the terminal: clone the Darknet repository from GitHub, change into the cloned directory, and build it with make.
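The original screenshot of these commands is not reproduced here; the sketch below runs an equivalent sequence from Python using the subprocess module. The AlexeyAB/darknet repository is the commonly used YOLOv4 fork and is an assumption about the exact source that was cloned.

import subprocess

# Clone the Darknet source (assumed to be the AlexeyAB YOLOv4 fork).
subprocess.run(["git", "clone", "https://github.com/AlexeyAB/darknet"], check=True)
# Build it inside the cloned directory (equivalent to "cd darknet && make").
subprocess.run(["make"], cwd="darknet", check=True)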
These commands install the Darknet framework on the Linux machine or WSL, but additional configuration is required if the GPU is to be used to process images. Following installation, we can either use the pre-trained model provided with the Darknet platform, or we can train a new model from scratch.
The pre-trained model is based on the Microsoft COCO (Common Objects in Context) dataset, a dataset for detecting common objects in context. It has many well-segmented and labelled photographs of commonly used objects, covering 80 class categories trained on about 1.5 million object instances. The following snippet can be used for running a sample image through the pre-trained model.
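A hedged sketch of this step is given below, again driven from Python via subprocess. The cfg, weights and sample image paths follow the standard Darknet layout and are assumptions; the yolov4.weights file has to be downloaded separately.

import subprocess

subprocess.run(
    ["./darknet", "detector", "test",
     "cfg/coco.data",       # COCO class names and dataset description
     "cfg/yolov4.cfg",      # YOLOv4 network configuration
     "yolov4.weights",      # pre-trained weights (downloaded separately)
     "data/dog.jpg"],       # sample image shipped with Darknet
    cwd="darknet", check=True)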
CHAPTER 8: PROPOSED SYSTEM
The figure above represents the architecture diagram of our project. The working of the application can be explained with the following steps:
1. It starts with taking audio input with the help of a microphone installed in
the device running the application.
2. The logic unit processes the input to detect the presence of the trigger word in the recorded audio. The trigger word detector is a recurrent neural network that uses the idea of LSTM (Long Short-Term Memory) over the required number of timesteps in the fetched audio file. For each timestep, the extracted audio features are processed through a stack of layers in the network, which returns the probability that the trigger word is present in the fetched audio file.
3. Once the trigger word “Activate” is detected by the voice activation module,
it passes on the control to the Image Detection unit, in which it detects any
object present in the particular frame passed by the image processing unit.
The Object Detection unit uses the YOLO algorithm to identify the object
present in the frame captured.
3.1. Each frame of a real-time feed from an input device is analysed by
YOLO. The entire frame is divided into 'S'x'S' grids. Each grid has a
chance to contain one or more elements. These points must be linked
by a box inside each grid. As a result, each grid in the model may
have B bounding boxes and C trained class probabilities.
3.2. The bounding box of the YOLO model has five components: bx, by, bw, bh, and c. The centre of the object's bounding box is identified by (bx, by), and the width and height of the box are given by (bw, bh). The confidence that an object of a trained class is present in the box is represented by c. The bx, by, bw, and bh coordinates are normalised to fall in [0, 1].
3.3. YOLO divides the image into several cells, typically a 19x19 grid, instead of searching for an interesting ROI (Region of Interest). Each cell predicts the locations of five bounding boxes (there can be one or more objects in a cell). As a result, a single picture ends up with 19 x 19 x 5 = 1805 bounding boxes.
3.4. It is possible that the majority of the bounding boxes in the cells are empty. These bounding boxes are filtered using the probability assigned to them. Non-max suppression then removes all of the redundant bounding boxes, leaving only the ones with the highest likelihood.
3.5. To ensure the accuracy of the prediction, a confidence threshold (usually greater than 40%) is applied, and only bounding boxes whose class confidence exceeds this threshold are kept.
4. The image detection unit makes use of the dataset on which it was trained in order to detect the presence of any object in the image.
5. Once an object is detected in the image, a bounding box and a label containing the object name are returned.
6. Using the label passed by the image detection unit, voice feedback is returned with the help of the gTTS API, as sketched below.
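A minimal sketch of step 6 is shown below, using the gTTS and pydub libraries listed in the next section; the output file name and the spoken phrasing are illustrative choices, not the exact ones used in the project.

from gtts import gTTS
from pydub import AudioSegment
from pydub.playback import play

def speak_detections(labels):
    """labels: list of object names returned by the object detection unit."""
    if labels:
        sentence = "The detected objects are " + ", ".join(labels) + "."
    else:
        sentence = "No objects detected."
    gTTS(text=sentence, lang="en").save("feedback.mp3")   # synthesize the feedback
    play(AudioSegment.from_mp3("feedback.mp3"))           # play it back to the user

speak_detections(["laptop", "cell phone", "mouse"])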
audio signals in NumPy arrays.
8. os - This library is used to interact with the base Operating System.
9. cv2 - A large open-source computer vision library used for object detection, face recognition and even handwriting recognition, serving image processing and other artificial intelligence applications.
10. subprocess - You can use the subprocess module to create new processes,
connect to their input/output/error pipes, and get their return codes.
11. gTTS - The Google Text-to-Speech API, used here to create an audio file for the audio feedback.
12. AudioSegment - A wrapper class of the pydub Python library used to play, edit or merge audio files. Pydub v0.24.1 is installed for this.
13. Audio - Audio allows us to play an audio file in the IPython notebook itself.
14. pyaudio(v0.2.11) - PyAudio allows you to quickly play and record audio
on a number of devices using Python.
15. sys - The sys module in Python contains a number of functions and
variables for manipulating various aspects of the Python runtime
environment.
16. matplotlib.pyplot - This library consists of all matplotlib modules for
IPython.
8.3 System Overview
8.3.1 Voice Activation Module

The purpose of this module is to allow the system to be activated by voice. The idea behind it comes from home assistant devices such as Google Home and Alexa. The module is built on a pre-trained model based on the trigger word "Activate". However, the pre-trained model was designed to work on a static, pre-recorded audio track. To make use of it in real time, a small audio clip of 5 seconds is recorded and passed for detection, and this process repeats. The module is further divided into two sub-modules.
1. The first sub-module performs continuous audio recording and writes the final recorded audio file into the target directory.
2. The second sub-module retrieves the recorded audio track and searches it for the trigger word.
Once the word "Activate" is recognized, the module calls the image recognition module, which gives us audio feedback for any object detected.
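The sketch below outlines the second sub-module under a number of assumptions: the model path, the clip path shared with the recording sub-module, the spectrogram parameters and the 0.5 probability threshold are all illustrative, and the spectrogram shape would need to match whatever the model was trained on.

import time
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
from tensorflow.keras.models import load_model

CLIP_PATH = "recordings/latest.wav"                 # written by the recording sub-module
THRESHOLD = 0.5                                     # assumed trigger probability threshold
model = load_model("models/trigger_word_model.h5")  # assumed path of the pre-trained model

def clip_to_spectrogram(path):
    rate, data = wavfile.read(path)
    if data.ndim > 1:                               # keep a single channel
        data = data[:, 0]
    _, _, sxx = spectrogram(data, fs=rate, nperseg=200, noverlap=120)
    return sxx.T[np.newaxis, ...]                   # shape (1, timesteps, frequency bins)

while True:
    probs = model.predict(clip_to_spectrogram(CLIP_PATH))[0, :, 0]
    if probs.max() > THRESHOLD:
        print("Trigger word detected, starting object detection...")
        # detect_objects()                          # hypothetical hand-off to the YOLO module
    time.sleep(1.0)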
8.3.2 Object Detection Module

To work on the object detection aspect, we have several algorithms that can be useful, such as deep neural networks (ImageNet) and R-CNN, which provide results that are far more reliable and faster than previously developed neural networks. R-CNN is divided into three modules: the first produces all potential regions for categorization, the second extracts features by running the regions through a convolutional neural network, and the third uses an SVM to decide the output.
According to the paper, the algorithm detects an object with 66% accuracy and takes 20 seconds to process an image. CNN-based region proposal networks use the selective search approach, which works well, but it does not address the problem: we want to shorten the system's response time so that when a user queries the system, it receives a response as quickly as possible. As a result, to balance speed and accuracy, YOLO proves to be the better algorithm; it uses a regression-based classification method and applies a single neural network to the whole image.
The input frame is divided into 'S' x 'S' grids by YOLO. Each grid has a probability of enclosing one or more items, and inside each grid these points must be connected by a box. As a result, each grid can have B bounding boxes and C trained class probabilities in the model. Which predictions are valid and which are not is determined by the confidence threshold, and only valid predictions are used to build bounding boxes. The valid prediction labels are then given to the text-to-speech module. Here's an example of how YOLO creates grids and bounding boxes:
Figure 8.2: 19x19 grids in YOLO
We present an idea for an object recognition system based on position-sensitive grid feature maps that are quick to process and extremely responsive. In this system we put together position-sensitive convolutional layers, known as Grid Convolutional Layers (GCL), which activate the object's specific regions in the grid-structured feature maps. The ROI (Region of Interest) grid pooling layer, when properly trained, guides the final set of convolutional layers to learn grid-specific feature maps.

As seen in Table 8.1, the grid-based approach has a much quicker response time than the Selective Search approach, since the selective search method relies heavily on a hierarchical grouping-based segmentation algorithm.
CHAPTER 9: CHALLENGES FACED AND SOLUTIONS
9.2 Solutions
1. Making a trigger word detection model work in a real-time scenario can be achieved by letting the detection module run every second on a fresh audio clip recorded simultaneously on another thread.
2. The fastest compatible object detection algorithm was found to be YOLOv4, which also has a higher frame rate and accuracy when compared to traditional object detection models and their predecessors.
3. To solve this issue, we make use of the gTTS API to convert all detected labels into an audio file.
CHAPTER 10: LIMITATIONS
1. The implemented trigger word module works only on local systems and not on cloud platforms like Google Colab.
2. The application requires high-end processing power in order to run smoothly without any drop in frames.
3. The real-time object detection window returned is based on Python libraries; to get better output with less latency, the real-time object detection window could instead be rendered using JavaScript.
4. The system activates only on the trigger word "Activate" on which it was trained.
5. An object must be relatively large in the frame in order to be detected by the model.
CHAPTER 11: RESULTS AND DISCUSSIONS
Accordingly, as per the objects detected by the model, the names of the objects are verbally conveyed to the user by the speech module, e.g., "The detected objects are mid centre laptop, bottom right cell phone, mid right mouse".
CHAPTER 12: CONCLUSION AND FUTURE WORK
This application aims to work like a smart cane for visually impaired individuals. Although the results obtained are satisfactory, the system still needs to be enhanced to reach the end goal: the application should be able to run faster on low processing power so that the user does not have to wait too long for audio feedback.

Since the application's NLP module was inspired by home assistant devices, other features of such devices, like news reports and entertainment, can be added to this application. Along with the above-mentioned features, if a navigation system were also added to the application, the life of a visually impaired individual would become a lot easier.
REFERENCES
[1] K. Supriya, "Trigger Word Recognition using LSTM," International Journal of Engineering Research and Technology, vol. 9, no. 6, p. 8, 2020.
[2] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84-90, 2017.
[3] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich Feature Hierarchies for
Accurate Object Detection and Semantic Segmentation," 2014 IEEE Conference
on Computer Vision and Pattern Recognition, 2014.
[4] J. Choi, D. Chun, H. Kim and H.-J. Lee, "Gaussian YOLOv3: An Accurate and
Fast Object Detector Using Localization Uncertainty for Autonomous Driving,"
2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[5] X. Wang, A. Shrivastava and A. Gupta, "A-Fast-RCNN: Hard Positive Generation
via Adversary for Object Detection," 2017 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2017.
[6] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time
Object Detection with Region Proposal Networks," IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2017.
[7] B. Balasuriya, N. Lokuhettiarachchi, A. R. M. D. N. Ranasinghe, K. D. C.
Shiwantha and C. Jayawardena, "Learning platform for visually impaired children
through artificial intelligence and computer vision," 2017 11th International
Conference on Software, Knowledge, Information Management and Applications
(SKIMA), 2017.
[8] T. Giannakopoulos, N.-A. Tatlas, T. Ganchev and I. Potamitis, "A practical, real-
time speech-driven home automation front-end," IEEE Transactions on Consumer
Electronics, 2005.
[9] N. Jmour, S. Zayen and A. Abdelkrim, "Convolutional neural networks for image
classification," 2018 International Conference on Advanced Systems and Electric
Technologies (IC_ASET), 2018.
[10] K. Kumatani, S. Panchapagesan, M. Wu, M. Kim, N. Strom, G. Tiwari and A.
Mandal, "Direct modeling of raw audio with DNNs for wake word detection,"
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU),
2017.
[11] M. J. Shafiee, B. Chywl, F. Li and A. Wong, "Fast YOLO: A Fast You Only
Look Once System for Real-time Embedded Object Detection in Video," Journal
of Computational Vision and Imaging Systems, 2017.
APPENDIX A
CODE AND SAMPLE OUTPUT
A.1 Code
A.1.1 utils.py
The methods in this Python file support the main application for detection and preprocessing purposes. They are used by importing this file in the detect.ipynb notebook.
Figure A.1: Methods used in util.py
A.1.2 audio_rec.ipynb
This Python notebook runs a sequence of code to record audio input in real time and save it in the required directory for further processing in the detect.ipynb file.
A.1.3 detect.ipynb
This Jupyter notebook does the main job of the application: it detects the trigger word from the audio file saved in the required directory by audio_rec.ipynb. Once the trigger word is detected, it plays a chime sound to indicate that the trigger word was found and then calls the object detection function, which passes the detected label to the gTTS API to give audio feedback for the object present.
Figure A.3: Importing necessary packages, models, weights and dataset
Figure A.4: Object detection and output processing methods
Figure A.5: Detecting Trigger Word and Calling Object Detection Method upon trigger
Figure A.7: Trigger Word Detected on Spectrogram
Figure A.9: Label and Confidence Output of detected Object
PLAGIARISM REPORT
PUBLICATION PROOF
High Technology Letters ISSN NO : 1006-6748
Abstract– People with poor vision or blindness are unable to perceive elements of their
environment in the same way as other people do. Performing daily tasks and avoiding hurdles are
only a couple of the difficulties they must overcome. This application seeks to develop an
assistive system for the visually disabled that will help them deal with all the obstacles in their
path. This application helps to assist visually impaired people by including a framework that can
tell them what object is in front of them without having to touch or smell it. Instead, they can
simply call out a trigger word, which triggers an object recognition module, which provides an
audible output of any object associated with their location.
I. INTRODUCTION
Up until 2015, there were around 940 million people around the world who were suffering from a certain level of vision loss. Out of these 940 million people, 240 million had very low vision and 39 million were completely blind. According to the World Health Organization's (WHO) first World Vision Report, published in 2019, more than a fifth of the world's population (2.2 billion people) suffers from vision impairment that may have been avoided or left untreated.
In addition, according to a report released in The Lancet Global Health journal in 2017, the
prevalence rate of vision disorder is expected to rise by 2020. It also estimates that if not handled
correctly, the number of cases will rise to almost 115 million cases of blindness and 588 million
cases of mild to extreme vision disability by 2050.
Visually disabled individuals face specific challenges when doing activities that normal people
take for granted, such as searching for keys, perceiving and locating items in the environment,
and walking along a path both indoors and outdoors. With the advancement of technology, it is
the duty of computer scientists to work in the field that improves the way in which every human
lives. Not only should a product developed be economically beneficial but also let a person
overcome his/her disability by depending on technology if needed [11].
Deep Neural Network [2] has 60 million parameters involved, and also solves overfitting issues.
The limitation of this paper is that the proposed neural network cannot be implemented for real
time object detection. This paper [3] specifies how R-CNN is one of the real time object
detection models available that detects an object present in an image by computing on selective
regions in an image. But the drawback of this paper is that the time taken to detect an object is 20 s and the mAP is only 66%. CNN (Convolutional Neural Network) [4] focuses on training a convolutional neural network for a classification system and the transfer learning technique. The drawback of this paper is that the proposed method is only useful for implementing a pre-trained model.
YOLOv3 [5] works by identifying the centre of an object from a dataset followed by expansion of the bounding box. The limitation is that with the increase in computation speed (22 ms), the accuracy of the model has fallen (63.4%).
The paper justifies how Fast-RCNN [6] is similar to normal R-CNN, except for some initial
CNN and feature map extractions followed by RCNN. The limitation is that although it is faster
(2 seconds) and more accurate (70%) compared to RCNN, it still cannot be considered as a fast
enough model. Hence, by using the references from the papers mentioned above, we strive to
develop a better implementation.
The object in the image is detected after going through the following steps, which are acquisition
of image frame by frame with the help of webcam followed by object detection by passing the
image through YOLOv4 algorithm in which feature extraction of image followed by labelling of
the object present in image takes place. YOLO algorithm as the name suggests “You Only Look
Once”, scans through the image only once and returns the object detected. It detects the center of
every object present in the dataset and continues to expand the bounding box around it and
returns the probability of the object present.
Apart from the object detection part of the proposed system, which represents the Digital Image Processing module, the proposed system also has an NLP module that works on the voice activation part of the application. The voice activation module of the application lets a user start the object detection module upon saying a trigger word. Similar to home assistants like Google Home, Alexa etc. [9], the system triggers the application for detecting an object upon recognizing the wake-up word [10] "Activate" on which it was trained. This task is carried out with the help of a microphone that takes continuous audio input.
IV. IMPLEMENTATION
In order to begin with a voice activation system, a microphone takes an input as mentioned in the
proposed system.
This method has various layers of neural networks to identify the word that triggers the system. It detects the trigger word from a pre-recorded audio clip by passing the audio clip into a neural network with several layers like GRU, ReLU, Dropout and Sigmoid, adds a chime sound just after the word is detected, and produces the output. The network first starts with a Conv1D layer that is used to compress the input in order to extract only the required features for processing. After compressing the input, the output from the Conv1D layer is fed into the BatchNormalization layer. Batch normalization is a process of standardising each mini-batch input to a layer while training very deep neural networks.
Since there may be lags of uncertain length between significant events in a time series, LSTM networks are well-suited to classifying, analyzing, and making predictions based on time series data. LSTMs were created to solve the issue of vanishing gradients that can occur while training conventional RNNs. After passing through the GRU layer, the network proceeds with repeated combinations of the different layers mentioned so far, followed by Dense and Sigmoid layers. The sigmoid layer is another activation function that gives us an output in the range 0-1, which represents the probability of the presence of the trigger word in the input.
But since the model is suited only for pre-recorded audio clips, we further divide this module
into two sub modules. One of them keeps recording the audio input with the help of a
microphone and writes the recorded audio file into the cloud environment. The second sub
module retrieves the audio recorded from the cloud and detects for the trigger word. The audio
recorded is of 2 seconds and the above-mentioned submodules keep running simultaneously in a
loop.
And in order to work on the object detection part, we have various algorithms that can be helpful, like deep neural networks (ImageNet) [2] and R-CNN [3], which provide results in a much more accurate manner and faster than previous neural networks built. R-CNN works in 3 modules: the first module generates all possible regions available for categorization, followed by feature extraction in the second module by passing the regions through a convolutional neural network, and the final module uses an SVM to determine the final output. According to the paper, the accuracy of the algorithm to detect an object is 66% and the time taken to process is 20 s. CNN-based region proposal networks use the selective search method, which works well, but the issue is still not solved: we want the system's response time to be shortened so that if a user sends the system a question, it can get a response as soon as possible. Hence, to match up with the speed and better accuracy, YOLO [5] proves to be a better algorithm in comparison; it uses a regression-based classification approach and can be applied as a single neural network to the full image.
YOLO is also known as You Only Look Once. It is a real-time object detection algorithm that is known to be one of the most powerful and useful object detection algorithms and builds on many of the groundbreaking ideas that come from computer vision. Some deep learning algorithms, such as the Faster R-CNN [7] algorithm, are accurate but still do not satisfy the required processing time, whereas the YOLO algorithm gives a faster result when compared to Faster R-CNN.
predicting the positions of 5 bounding boxes (there can be one or more objects in a cell).
If this occurs, it will result in 1805 bounding boxes for a single image.
4) The majority of the bounding boxes present in the cells could be empty. The probability of object type is used to filter these bounding boxes. Non-max suppression comes into play, and its processes will remove the unnecessary bounding boxes, leaving only the highest probability bounding boxes.
5) To assure the predictions, a confidence score is assumed (mostly above 40%) against
which the bounding box will predict the probability of the following class of confidence.
There is a probability of 80 classes for each cell, but only one class probability is predicted per cell. YOLO's prediction is an S∗S∗(B∗5+C) vector, which includes B box predictions for each grid cell and C class predictions for each grid cell (C stands for the number of classes). The working of YOLO is illustrated in fig 3 given below. The latest release of YOLOv4 is used in this project. This application uses the COCO dataset, which is pre-loaded with 80 classes of different day-to-day objects.
V. RESULTS
This framework is designed as a proof of concept for an object detection module that has been
trained on a specific dataset. The model displays the confidence scores while recording. Here we
have taken an instance where the user is using the model and the following observations are
recorded.
Accordingly, as per the objects detected by the model, the names of the objects are verbally conveyed to the user by the speech module, e.g., "The detected objects are Laptop, Mouse, Cell Phone".
VI. CONCLUSION
This project proposes a system to assist visually challenged people by giving them an audio
feedback of any object detected with the help of an efficient algorithm that is triggered only
when the system detects a trigger word in order to activate the object detection model. This idea
is achieved by exploring a list of algorithms that can prove to be effective for object detection
modules and by getting an insight into how to implement voice activation modules. Finally, we observe
that YOLO proves to be the most effective algorithm for object detection with higher accuracy
and faster response compared to other algorithms.
References