OBJECT DETECTION WITH AUDIO FEEDBACK USING TRIGGER WORD

A PROJECT REPORT

Submitted by

D. DHEERAJ [Reg No: RA1711003010040]
ANKIT KUMAR [Reg No: RA1711003010086]

BACHELOR OF TECHNOLOGY
BONAFIDE CERTIFICATE
Certified that this B.Tech project report titled “OBJECT DETECTION WITH
AUDIO FEEDBACK USING TRIGGER WORD” is the bonafide work of
“D.DHEERAJ [Reg No: RA1711003010040], ANKIT KUMAR [Reg No:
RA1711003010086]” who carried out the project work under my supervision.
Certified further that, to the best of my knowledge, the work reported herein does not form part of any other thesis or dissertation on the basis of which a degree or award was conferred on an earlier occasion for this or any other candidate.
SIGNATURE SIGNATURE
Own Work Declaration
Department of Computer Science and Engineering
SRM Institute of Science & Technology
Own Work* Declaration Form
This sheet must be filled in (each box ticked to show that the condition has been met). It must be signed and dated
along with your student registration number and included with all assignments you submit – work will not be
marked unless this is done.
Title of Work : Object Detection with Audio Feedback using Trigger Word
I / We hereby certify that this assessment complies with the University's Rules and Regulations relating to academic misconduct and plagiarism**, as listed on the University website, in the Regulations, and in the Education Committee guidelines.
I / We confirm that all the work contained in this assessment is my / our own except where indicated, and that I / We
have met the following conditions:
I understand that any false claim for this work will be penalized in accordance with the University policies and
regulations.
DECLARATION:
I am aware of and understand the University's policy on academic misconduct and plagiarism, and I certify that this assessment is my / our own work, except where indicated by referencing, and that I have followed the good academic practices noted above.
If you are working in a group, please write your registration numbers and sign with the date for every student in
your group.
ACKNOWLEDGEMENT
We express our humble gratitude to C. Muthamizhchelvan, Vice Chancellor (I/C),
SRM Institute of Science and Technology, for the facilities extended for the project
work and his continued support.
We wish to thank Dr. B.Amutha, Professor & Head, Department of Computer Science
and Engineering, SRM Institute of Science and Technology, for her valuable
suggestions and encouragement throughout the period of the project work.
We sincerely thank the staff and students of the Computer Science and Engineering Department, SRM Institute of Science and Technology, for their help during our research. Finally, we would like to thank our parents, our family members and our friends for their unconditional love, constant support and encouragement.
D. Dheeraj
Ankit Kumar
ABSTRACT
People with poor vision or blindness are unable to perceive elements of their environment in the same way as other people do. Performing daily tasks and avoiding obstacles are only a couple of the difficulties they must overcome. This application seeks to develop an assistive system for the visually disabled that will help them deal with all the obstacles in their path. It assists visually impaired people by including a framework that can tell them what object is in front of them without having to touch or smell it. The application makes use of Natural Language Processing (NLP), Digital Image Processing and Computer Vision to assist visually impaired individuals using a voice activation and object detection model with audio feedback, which allows the individual to get an audio output of any object present in front of a camera after the system recognizes a trigger word on which it was trained.
Table of Contents
ACKNOWLEDGEMENT ............................................................................................... v
ABSTRACT .................................................................................................................... vi
ABBREVIATIONS .......................................................................................................xiii
1.1 Purpose................................................................................................................ 1
3.5 Long short-term memory (LSTM).................................................................... 10
8.3.2 Object Detection Module .......................................................................... 39
REFERENCES ............................................................................................................... 47
APPENDIX A ................................................................................................................ 48
A.1.3 detect.ipynb.................................................................................................... 50
LIST OF TABLES
Table 6.1: Results for COCO val 2017 (5k images) ...................................................... 30
LIST OF FIGURES
Figure 4.5: Visual Representation of Spectrogram and Probability of final Outputs. ... 15
Figure A.2: Python Notebook to Record Audio in Real Time ....................................... 50
Figure A.3: Importing necessary packages, models, weights and dataset ..................... 51
Figure A.5: Detecting Trigger Word and Calling Object Detection Method upon trigger ....................... 53
ABBREVIATIONS
CNN Convolutional Neural Network
AI Artificial Intelligence
LIST OF SYMBOLS
w Weights for a particular layer
CHAPTER 1: INTRODUCTION
1.1 Purpose
Up until 2015, there were around 940 million people around the world who were suffering from a certain level of vision loss. Out of these 940 million people, 240 million had very low vision and 39 million were completely blind. According to the World Health Organization's (WHO) first World Vision Report, published in 2019, more than a fifth of the world's population (2.2 billion people) suffers from vision impairment that may have been avoided or left untreated.

Visually disabled individuals face specific challenges when doing activities that normal people take for granted, such as searching for keys, perceiving and locating items in the environment, and walking along a path both indoors and outdoors. With the advancement of technology, it is the duty of computer scientists and researchers to work in the field that improves the way in which every human lives. Not only should a product be economically beneficial, it should also let a person overcome his or her disability by depending on technology if needed.
1.2 Scope
● LCW Sense: Blind or visually impaired people can readily find all of the
information about a dress, including the colour, fabric, design, washing
instructions, and price. This enables visually impaired buyers to make their
own apparel choices without the help of others.
● Smart Braille: Smart Braille is a braille-integrated Android application. It allows a visually impaired person to use braille connected to the application to enter text, and also lets the application translate text into braille.
CHAPTER 2: LITERATURE SURVEY
2.1 Review
In [1], the paper proposes a way to develop an application that can detect a word on which it was trained. The idea is similar to the working of present-day virtual assistants such as Alexa and Google Home Assistant. To develop such an application, a model consisting of several layers is fitted and trained on the data provided. It uses layers such as ReLU, GRU, Dropout and Sigmoid. The training data for the model comprises three types of audio clips: positive, negative and background noise. A positive audio clip contains the main trigger word that should activate the application, whereas a negative audio clip contains words other than the trigger word. Background noise is also included so that the model works in every possible environment, including a noisy one. Once the model is trained and ready to detect the trigger word, a pre-recorded audio clip is passed through it; when the trigger word is identified, a chime sound is added just after the trigger word in the same audio clip, which is then returned.
In [2], the paper aims at building a deep convolutional neural network that categorizes 1.2 million images into 1000 different categories. The network comprises 60 million parameters and 650,000 neurons, and is made up of five convolutional layers, some of which are followed by max-pooling layers, followed by three fully connected layers and a final 1000-way softmax layer. To speed up training, non-saturating neurons and an efficient GPU implementation of the convolution operation were used. To avoid overfitting, a dropout layer is added, which reduces the overfitting of the network to a great extent.
In [3], the paper proposes a simple object recognition algorithm that increases the mAP by more than 30% compared to the previous best results. The idea is to apply a high-capacity convolutional neural network to selective region proposals in order to localize and classify the objects present, as per the dataset. This way, the number of iterations needed to search for an object is reduced.
In [4], the paper proposes a strategy for improving detection accuracy while supporting real-time operation, by modelling the bounding box of YOLOv3, one of the most representative one-stage detectors, with a Gaussian parameter and redesigning the loss function. Furthermore, the paper suggests a method for predicting the localization uncertainty, which indicates the reliability of the bounding box. The proposed scheme effectively decreases false positives and increases true positives by using the predicted localization uncertainty during the detection process, thereby improving precision. The proposed version of YOLOv3 increases the mean average precision while maintaining a real-time frame rate.
In [5], the paper proposes using an adversarial network to generate examples that are difficult for the object detector to recognize. The proposed idea suggests that the original network and the adversary are trained jointly. This idea resulted in a significant improvement in mean average precision.
In [6], the paper suggests combining Fast R-CNN and a Region Proposal Network (RPN) into a single network to speed up the search and give output labels with high accuracy. The RPN shares the image's convolutional features and returns candidate regions that may contain objects. An RPN is a fully convolutional network that simultaneously predicts object bounding boxes and objectness scores at each position.
Taking inspiration from the present systems and applications, we have come up with a proposed solution which contains the following:
Object Detection Neural Network model and will resume the process when the specific trigger word is detected by the system. For example, we use "Hey Siri" or "Ok Google" to trigger voice assistants such as Siri and Google Now. In the same way, our proposed solution will have such a trigger phrase to activate the application. An external device such as a microphone will be used for tracking the trigger word.
CHAPTER 3: FUNDAMENTALS
Social networking sites may use face recognition tools to assist users in tagging and posting photos of friends. Text images are converted into an editable form using optical character recognition (OCR) technology. Based on a user's preferences, machine learning-powered recommendation algorithms can suggest which movies or TV shows to watch next. Machine learning has also made it possible for those working in the automation field to integrate it into cars and enable self-driving features.
3.2 Computer Vision

3.3 Deep Learning

Deep learning is yet another branch of artificial intelligence that aims to mimic the functioning of the human brain in data acquisition, speech recognition, language translation, and decision-making. It helps networks learn from unstructured data, understand the meaning behind the data and build a model that has human-level intelligence and can automate and generate results according to the data on which it was trained.
Deep learning emerged alongside the digital era, which led to an explosion of data in all formats and from every corner of the globe. This data, commonly known as big data, comes from sources such as social media, e-commerce sites and search engines. This huge amount of data is readily available and can be shared through technologies like cloud computing.

Since the data generated is usually unstructured, it can take people decades to understand it and extract useful information from it. Companies are increasingly turning to AI technology for digital assistance as they see the tremendous power that can be gained by unlocking this data resource.
3.4 Natural Language Processing (NLP)

The ability to decode spoken and written human language (also known as natural language) is known as Natural Language Processing (NLP). It is a part of artificial intelligence (AI).
3.5 Long Short-Term Memory (LSTM)

Since there may be lags of uncertain length between significant events in a time series, LSTM networks are well suited to classifying, analyzing, and making predictions based on time-series data. LSTMs were created to solve the issue of vanishing gradients that can occur while training conventional RNNs.
CHAPTER 4: REAL-TIME TRIGGER WORD
DETECTION
4.1 Introduction
A trigger word, also known as a wake-up word, is a word that, when identified, activates an application such as a voice assistant device like Alexa or Google Home. Whenever these devices are switched on, they go into an alert state when they hear the trigger word on which they were trained. Hearing a trigger word indicates that the device needs to be ready to attempt the task commanded by the registered user. Various trigger words can be used to train such a model, like "Hey Siri" used by Apple devices for Siri, "Ok Google" for the Google Home assistant and "Alexa" for Amazon's voice assistant device Alexa. In our project we use the word "Activate" to trigger our application.
The most basic type of trigger word detection model is one that can detect a trigger word only in a pre-recorded audio clip. Our project aims to build trigger word detection that works in real time instead. To achieve this goal, we extend the basic approach by running two Python scripts simultaneously: one script records one second of the user's audio, stores it in the local directory and sleeps for 2.5 seconds; the other script fetches the recorded audio and checks it for the trigger word, using a pre-trained model trained on the trigger word "Activate". Once the trigger word is detected, a chime sound is played and the application hands over to the object detection module.
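A minimal sketch of the recording script is given below. It assumes the pyaudio library listed later in this report, an illustrative recordings/latest.wav output path, and mono 16-bit audio at 44.1 kHz; the exact clip length, sample rate and sleep interval used in the project may differ.

import os
import time
import wave
import pyaudio

RATE = 44100              # sampling rate in Hz (assumed)
CHUNK = 1024              # frames read from the microphone per buffer
RECORD_SECONDS = 1        # length of each clip handed to the detector script
OUT_PATH = "recordings/latest.wav"   # illustrative shared path polled by the detector

os.makedirs("recordings", exist_ok=True)
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)

while True:
    # Record one short clip from the microphone.
    frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * RECORD_SECONDS))]
    # Overwrite the shared file that the detector script reads.
    with wave.open(OUT_PATH, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(pa.get_sample_size(pyaudio.paInt16))
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))
    # Sleep while the detector script processes the clip.
    time.sleep(2.5)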
4.3 Algorithm
When the recorded module is fetched from the local directory, the audio clip is
passed through the trained model. The model is a recurrent neural network
consisting of various layers in order to achieve expected output from the model.
detection system, each layer of this network processes the input features obtained
in order to process it and give a desired output ŷ<t>, which represents the
12
probability of presence of trigger word at the timestamp t. The layers used for
each timestamp can be understood with the figure and a brief description given
below.
The network begins with a Conv1D layer, which compresses the input so that only the required features are extracted for processing. The output of the Conv1D layer is then fed into a Batch Normalization layer. Batch normalization is a method of standardising the inputs of each mini-batch to a layer while training very deep neural networks; it stabilizes the learning process and significantly reduces the number of training epochs required. The network then uses a ReLU activation function to increase the model's non-linearity, and the result is passed through a Dropout layer to prevent overfitting. It then goes through a GRU layer, which is similar to a long short-term memory cell with a forget gate but has fewer parameters than an LSTM, since it does not have an output gate.
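As an illustration, the stack of layers described above could be assembled in Keras roughly as follows. The layer sizes, dropout rates and the (5511, 101) spectrogram input shape are assumptions made for this sketch, not the exact values used in the project.

from tensorflow.keras.layers import (Input, Conv1D, BatchNormalization, Activation,
                                     Dropout, GRU, Dense, TimeDistributed)
from tensorflow.keras.models import Model

def trigger_word_model(input_shape=(5511, 101)):
    """input_shape = (spectrogram timesteps, frequency bins) -- assumed values."""
    x_in = Input(shape=input_shape)

    # Conv1D compresses the spectrogram and extracts local features.
    x = Conv1D(filters=196, kernel_size=15, strides=4)(x_in)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = Dropout(0.8)(x)

    # GRU layers model the temporal structure of the audio.
    x = GRU(128, return_sequences=True)(x)
    x = Dropout(0.8)(x)
    x = BatchNormalization()(x)
    x = GRU(128, return_sequences=True)(x)
    x = Dropout(0.8)(x)
    x = BatchNormalization()(x)

    # Sigmoid output per timestep: probability that the trigger word just ended.
    y = TimeDistributed(Dense(1, activation="sigmoid"))(x)
    return Model(inputs=x_in, outputs=y)

model = trigger_word_model()
model.summary()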
The figure below illustrates how the model works on spectrograms. Once the word "Activate" is recognized, the next 50 timesteps of the spectrogram are labelled 1, indicating that the trigger word was detected. The figure shows how the trigger word detector recognizes a trigger word from a spectrogram, and a graph is plotted to visualize it better.
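A small sketch of this labelling scheme is shown below; the number of output timesteps Ty and the 10-second clip length are assumptions used only for illustration.

import numpy as np

Ty = 1375                 # number of output timesteps of the network (assumed)

def insert_ones(y, end_ms, clip_ms=10000):
    """Set the 50 labels that follow the end of a trigger word to 1.
    end_ms: time in milliseconds at which the trigger word ends in the clip."""
    end_step = int(end_ms * Ty / clip_ms)
    y[0, end_step + 1 : end_step + 51] = 1
    return y

y = np.zeros((1, Ty))
y = insert_ones(y, end_ms=4500)   # example: the trigger word ends at 4.5 s
print(int(y.sum()))               # 50 timesteps labelled 1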
CHAPTER 5: YOLO: YOU ONLY LOOK ONCE
5.1 Introduction
“You Only Look Once” (a.k.a. YOLO) is a cutting-edge object detection model that is both efficient and precise, and one of the most powerful object detection algorithms available at the moment. Unlike previously used object detection algorithms, such as R-CNN or its modified and upgraded version Faster R-CNN, this algorithm needs the image or video to pass only once through the network. YOLOv4, as used in this project, was first introduced in the paper by Alexey Bochkovskiy, Chien-Yao Wang and Hong-Yuan Mark Liao (2020). In this chapter, we explain the concept of object detection used in this algorithm.
Before discussing the YOLO algorithm, let us discuss the concept of Object
Detection and Convolutional Neural Networks and their key observations.
5.2 Object Detection
autonomous vehicles, self-driving vehicles, and accident-avoidance
systems.
● Object Localization: Object localization helps us to find our object in
the image after it has been classified, essentially addressing the
question "Where is it in the picture?".
In the diagram below (figure 5.1), we see an RGB picture divided into three colour planes: red, green, and blue. There are several colour spaces in which images can be represented: grayscale, RGB, HSV, CMYK, and so on. Imagine how computationally intensive things become for an image as the resolution grows until it hits, say, 8K (7680x4320).
The network takes an input image as an array of pixel values (RGB values whose number depends on the image's size and resolution). Essentially, the intention is to feed the machine this array of numbers so that it can generate output numbers that represent the likelihood of the image belonging to a certain class.
The architecture also adapts to high-level features with additional layers, resulting in a network with a healthy interpretation of the images in the dataset, similar to how we would understand them.

When progressing through the convolutional layers, the output of the first convolutional layer is used as the input of the second convolutional layer; typically, this second layer is a hidden layer. Each layer's output encodes the positions of low-level features present in the image. As more filters are applied, the output takes the form of activations that represent higher-level features.
A fully connected layer is a (usually) cheap way of learning non-linear combinations of the high-level features produced by the convolutional layers; it learns a possibly non-linear function in that space. Take, for example, a picture of a cat: the activation maps would display high-level features such as paws, legs, and so on. If it is a bird, it will show the appropriate features. The main goal of this layer is to look for higher-level features that closely correlate to a given class, with weights such that when the output is calculated we get the right probabilities for the various classes present. Over a series of epochs, the model can distinguish between dominant and certain low-level features in images and classify them using the Softmax classification technique.
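The softmax step can be illustrated with a few lines of Python; the class scores below are made-up values and the three classes are only an example.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])    # raw scores for, e.g., cat, bird, dog
print(softmax(logits))                # probabilities that sum to 1, highest for "cat"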
YOLO stands for you only look once. It is a real-time object detection algorithm
that is known to be one of the most powerful and useful object detection algorithms
that can extend most of the groundbreaking ideas that come from computer vision.
A critical aspect of autonomous technology is target recognition. It is the field of
computer vision that, relative to a few years ago, is expanding and functioning in
a much better way.
In order to explain this concept, it is important to understand the
classification of images first. As we go on, the degree of complexity increases.
There are a number of algorithms that help us detect objects, and they can be classified into the following two groups.
image, which are seen inside a bounding frame.
Each bounding box has the following characteristics that can be used to identify it: bx and by, the coordinates of the box centre, and bw and bh, its width and height. We normalize each of the four b values to the range 0-1 by describing them as a ratio of the image width and height. In addition to these, one more (fifth) value needs to be calculated: pc, also called the box confidence score (BC), the probability that an object is present inside the bounding box.

This metric determines how probable the bounding box is to contain an object of any class and how precise its prediction is. We want BC = 0 if there is no object in that box, and BC = 1 when estimating the ground truth.
YOLO is trained on full images and directly optimizes detection performance. Since detection is framed as a regression problem, no complicated pipeline is needed.
Figure 5.3: YOLO Network Architecture
The above diagram illustrates the YOLO network architecture that uses the
convolutional layers of the neural network to divide the image in a grid, create
bounding boxes and classify the objects.
Faster R-CNN is a top-performing object detection algorithm, but it sometimes mistakes background patches in an image for objects because it cannot see the wider context. YOLO makes less than half the number of background errors compared to Faster R-CNN. YOLO is also much less likely to break down when introduced to a different domain or when unexpected input is fed to it, and it allows detection to run at real-time speed while keeping a high overall accuracy.
The framework divides the picture into an S x S grid, and these grid cells are responsible for detecting objects.

Each grid cell predicts several bounding boxes and produces confidence scores for those boxes. These scores reflect how sure the model is that the box contains an object and how accurately the box is predicted. If no object is present, the confidence score should be zero.
Each bounding box consists of the five predictions mentioned above. The coordinates refer to the centre of the box relative to the bounds of the grid cell, while the width and height are predicted relative to the whole image. Finally, the confidence prediction represents the agreement between the ground-truth box and the predicted box. Each grid cell also predicts conditional class probabilities, conditioned on the grid cell containing an object; only a single set of class probabilities is predicted per grid cell.

At test time, the conditional class probabilities are multiplied by the individual box confidence predictions, which gives class-specific confidence scores for each box. These scores encode both the probability of the class appearing in the box and how well the predicted box fits the object. Fast YOLO uses a neural network with fewer convolutional layers and fewer filters in those layers; apart from the scale of the network, all training and testing parameters are identical between YOLO and Fast YOLO. The model is optimized using a sum-squared error because it is efficient and simple to optimize, although it weights localization error equally with classification error, which may not be ideal.

In every image, many grid cells contain no object, which pushes the confidence scores of those cells towards zero and can make training unstable, since the gradient from the few cells that do contain objects can be overpowered. Also, the sum-squared error gives equal weight to errors in big boxes and small boxes, although these should ideally be handled differently. YOLO predicts separate bounding boxes for each grid cell.
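The test-time score combination described above can be illustrated numerically; the probabilities below are invented for the example.

import numpy as np

class_probs = np.array([0.6, 0.3, 0.1])   # conditional class probabilities for one grid cell
box_confidence = 0.8                      # confidence of one predicted box in that cell

class_scores = class_probs * box_confidence
print(class_scores)                       # [0.48 0.24 0.08] -> kept only if above the threshold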
The CNN-based region proposal network uses a selective search strategy that works well but does not fully solve the issue. We want to decrease the system's response time so that when the system is queried by the user, it responds as quickly as possible. YOLO natively gives a quicker response and can support a system that relies on position-sensitive grid feature maps that are fast to process. This can be achieved using Grid Convolutional Layers (GCL), which activate the object-specific regions of the grid-structured feature maps.
There is a probability distribution over 80 classes for each cell, but only one set of class probabilities is predicted per cell. YOLO's prediction is an S x S x (B * 5 + C) tensor, which includes B box predictions for each grid cell and C class predictions for each grid cell (C stands for the number of classes). The working of YOLO is illustrated in fig 5.4 below. The latest release, YOLOv4, is used in this project. The application uses the COCO dataset, which comes pre-loaded with 80 classes of different day-to-day objects.
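A quick arithmetic check of this output size, using the grid size, box count and class count quoted in this report, looks as follows.

S, B, C = 19, 5, 80                    # 19x19 grid, 5 boxes per cell, 80 COCO classes

values_per_cell = B * 5 + C            # 5 boxes x (bx, by, bw, bh, confidence) + class scores
total_values = S * S * values_per_cell
total_boxes = S * S * B                # the 1805 bounding boxes per image mentioned earlier

print(values_per_cell, total_values, total_boxes)   # 105 37905 1805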
Figure 5.4: YOLO Algorithm Working
CHAPTER 6: WHY CHOOSE YOLOv4?
With YOLOv4, individual users, not just large organizations, can train and evaluate CNN-based detectors using a single NVIDIA gaming graphics card with 8-16 GB of VRAM. This was not possible with the previous modern detectors. Let's take a look at the design and developments in YOLOv4.
Compared with the previous YOLOv3, YOLOv4 is considered a better version for the following reasons:
Below is a chart showing the results for COCO val 2017, on an RTX 2080 Ti with threshold conf=0.001.
Figure 6.1: Graphical representation of YOLOv4’s speed and accuracy
CHAPTER 7: DARKNET
7.1 Introduction
Darknet is easy to set up, and it works with either CPU or GPU computation. The source code for Darknet can be found on GitHub. Darknet comes with two optional dependencies: OpenCV for image processing and CUDA for GPU processing. Neither dependency is required, however; users can start by downloading the base application.
In Darknet, details are displayed while the config file and weights are loaded, followed by a report of the top-10 classes for the image. Through a feature called Nightmare, the architecture can also be used to run neural networks backwards.

Darknet also includes a neural network that determines the most likely next moves in a game of Go. Users can replay professional games to see what moves are likely to occur next, let it play against itself, or attempt to play against it.
7.2 Installation
To install Darknet on a Linux machine or through the Windows Subsystem for Linux (WSL) interface, it only requires running three commands in the terminal: clone the Darknet repository from GitHub, change into the cloned directory, and build it with make.
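The original screenshot of these commands is not reproduced here; the sketch below runs an equivalent sequence from Python using the subprocess module. The AlexeyAB/darknet repository is the commonly used YOLOv4 fork and is an assumption about the exact source that was cloned.

import subprocess

# Clone the Darknet source (assumed to be the AlexeyAB YOLOv4 fork).
subprocess.run(["git", "clone", "https://github.com/AlexeyAB/darknet"], check=True)
# Build it inside the cloned directory (equivalent to "cd darknet && make").
subprocess.run(["make"], cwd="darknet", check=True)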
These commands install the Darknet framework on the Linux machine or WSL, but additional configuration is required if the GPU is to be used to process images. Following installation, we can either use the pre-trained model provided with the Darknet platform, or we can train a new model from scratch.
The pre-trained model is based on the Microsoft COCO (Common Objects in Context) dataset, a dataset for detecting common objects in context. It has many well-segmented and labelled photographs of commonly used objects, covering 80 class categories trained on about 1.5 million object instances. The following snippet can be used for running a sample image through the pre-trained model.
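A hedged sketch of this step is given below, again driven from Python via subprocess. The cfg, weights and sample image paths follow the standard Darknet layout and are assumptions; the yolov4.weights file has to be downloaded separately.

import subprocess

subprocess.run(
    ["./darknet", "detector", "test",
     "cfg/coco.data",       # COCO class names and dataset description
     "cfg/yolov4.cfg",      # YOLOv4 network configuration
     "yolov4.weights",      # pre-trained weights (downloaded separately)
     "data/dog.jpg"],       # sample image shipped with Darknet
    cwd="darknet", check=True)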
CHAPTER 8: PROPOSED SYSTEM
The figure above represents the architecture diagram of our project. The working of the application can be explained with the following steps:
1. It starts with taking audio input with the help of a microphone installed in
the device running the application.
2. The logic unit processes the input to detect the presence of the trigger word in the recorded audio. The trigger word detector is a recurrent neural network that uses the idea of LSTM (Long Short-Term Memory) over the required number of timesteps in the fetched audio file. For each timestep, the extracted audio features are processed through a stack of layers in the network, which returns the probability that the trigger word is present in the fetched audio file.
3. Once the trigger word “Activate” is detected by the voice activation module,
it passes on the control to the Image Detection unit, in which it detects any
object present in the particular frame passed by the image processing unit.
The Object Detection unit uses the YOLO algorithm to identify the object
present in the frame captured.
3.1. Each frame of a real-time feed from an input device is analysed by
YOLO. The entire frame is divided into 'S'x'S' grids. Each grid has a
chance to contain one or more elements. These points must be linked
by a box inside each grid. As a result, each grid in the model may
have B bounding boxes and C trained class probabilities.
3.2. The bounding box of the YOLO model has five components: bx, by, bw, bh, and c. The centre of the object's bounding box is identified by (bx, by), and the width and height of the box are given by (bw, bh). The confidence that an object of a trained class is present in the box is represented by c. The bx, by, bw, and bh coordinates are normalised to fall in [0, 1].
3.3. YOLO divides the image into several cells, typically a 19x19 grid, instead of searching for an interesting ROI (Region of Interest). Each cell predicts the locations of five bounding boxes (there can be one or more objects in a cell). As a result, a single picture ends up with 19 x 19 x 5 = 1805 bounding boxes.
3.4. It is possible that the majority of the bounding boxes in the cells are empty. These bounding boxes are filtered using the probability assigned to them. Non-max suppression then removes all of the redundant bounding boxes, leaving only the ones with the highest likelihood.
3.5. To ensure the accuracy of the prediction, a confidence threshold (usually greater than 40%) is applied, and only bounding boxes whose class confidence exceeds this threshold are kept.
4. The image detection unit makes use of the dataset on which it was trained in order to detect the presence of any object in the image.
5. Once an object is detected in the image, a bounding box and a label containing the object name are returned.
6. Using the label passed by the image detection unit, voice feedback is returned with the help of the gTTS API, as sketched below.
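A minimal sketch of step 6 is shown below, using the gTTS and pydub libraries listed in the next section; the output file name and the spoken phrasing are illustrative choices, not the exact ones used in the project.

from gtts import gTTS
from pydub import AudioSegment
from pydub.playback import play

def speak_detections(labels):
    """labels: list of object names returned by the object detection unit."""
    if labels:
        sentence = "The detected objects are " + ", ".join(labels) + "."
    else:
        sentence = "No objects detected."
    gTTS(text=sentence, lang="en").save("feedback.mp3")   # synthesize the feedback
    play(AudioSegment.from_mp3("feedback.mp3"))           # play it back to the user

speak_detections(["laptop", "cell phone", "mouse"])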
audio signals in NumPy arrays.
8. os - This library is used to interact with the base Operating System.
9. cv2 - A large open-source computer vision library used for object detection, face recognition and even handwriting recognition, serving image processing and other artificial intelligence applications.
10. subprocess - You can use the subprocess module to create new processes,
connect to their input/output/error pipes, and get their return codes.
11. gTTS - The Google Text-to-Speech API, used here to create an audio file for the audio feedback.
12. AudioSegment - A wrapper class of the pydub Python library used to play, edit or merge audio files. Pydub v0.24.1 is installed for this.
13. Audio - Audio allows us to play an audio file in the IPython notebook itself.
14. pyaudio(v0.2.11) - PyAudio allows you to quickly play and record audio
on a number of devices using Python.
15. sys - The sys module in Python contains a number of functions and
variables for manipulating various aspects of the Python runtime
environment.
16. matplotlib.pyplot - This library consists of all matplotlib modules for
IPython.
8.3 System Overview
8.3.1 Voice Activation Module

The purpose of this module is to allow the system to be activated by voice. The idea behind it comes from home assistant devices such as Google Home and Alexa. The module is built on a pre-trained model based on the trigger word "Activate". However, the pre-trained model was designed to work on a static, pre-recorded audio track. To make use of it in real time, a small audio clip of 5 seconds is recorded and passed for detection, and this process repeats. The module is further divided into two sub-modules.
1. The first sub-module performs continuous audio recording and writes the final recorded audio file into the target directory.
2. The second sub-module retrieves the recorded audio track and searches it for the trigger word.
Once the word "Activate" is recognized, the module calls the image recognition module, which gives us audio feedback for any object detected.
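The sketch below outlines the second sub-module under a number of assumptions: the model path, the clip path shared with the recording sub-module, the spectrogram parameters and the 0.5 probability threshold are all illustrative, and the spectrogram shape would need to match whatever the model was trained on.

import time
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
from tensorflow.keras.models import load_model

CLIP_PATH = "recordings/latest.wav"                 # written by the recording sub-module
THRESHOLD = 0.5                                     # assumed trigger probability threshold
model = load_model("models/trigger_word_model.h5")  # assumed path of the pre-trained model

def clip_to_spectrogram(path):
    rate, data = wavfile.read(path)
    if data.ndim > 1:                               # keep a single channel
        data = data[:, 0]
    _, _, sxx = spectrogram(data, fs=rate, nperseg=200, noverlap=120)
    return sxx.T[np.newaxis, ...]                   # shape (1, timesteps, frequency bins)

while True:
    probs = model.predict(clip_to_spectrogram(CLIP_PATH))[0, :, 0]
    if probs.max() > THRESHOLD:
        print("Trigger word detected, starting object detection...")
        # detect_objects()                          # hypothetical hand-off to the YOLO module
    time.sleep(1.0)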
8.3.2 Object Detection Module

To work on the object detection aspect, we have several algorithms that can be useful, such as deep neural networks (ImageNet) and R-CNN, which provide results that are far more reliable and faster than previously developed neural networks. R-CNN is divided into three modules: the first produces all potential regions for categorization, the second extracts features by running the regions through a convolutional neural network, and the third uses an SVM to decide the output.
According to the paper, the algorithm detects an object with 66% accuracy and takes 20 seconds to process an image. CNN-based region proposal networks use the selective search approach, which works well, but it does not address the problem: we want to shorten the system's response time so that when a user queries the system, it receives a response as quickly as possible. As a result, to balance speed and accuracy, YOLO proves to be the better algorithm; it uses a regression-based classification method and applies a single neural network to the whole image.
The input frame is divided into 'S' x 'S' grids by YOLO. Each grid has a probability of enclosing one or more items, and inside each grid these points must be connected by a box. As a result, each grid can have B bounding boxes and C trained class probabilities in the model. Which predictions are valid and which are not is determined by the confidence threshold, and only valid predictions are used to build bounding boxes. The valid prediction labels are then given to the text-to-speech module. Here's an example of how YOLO creates grids and bounding boxes:
Figure 8.2: 19x19 grids in YOLO
We present an idea for an object recognition system based on position-sensitive grid feature maps that are quick to process and extremely responsive. In this system we put together position-sensitive convolutional layers, known as Grid Convolutional Layers (GCL), which activate the object's specific regions in the grid-structured feature maps. The ROI (Region of Interest) grid pooling layer, when properly trained, guides the final set of convolutional layers to learn grid-specific feature maps.

As seen in Table 8.1, the grid-based approach has a much quicker response time than the Selective Search approach, since the selective search method relies heavily on a hierarchical grouping-based segmentation algorithm.
CHAPTER 9: CHALLENGES FACED AND SOLUTIONS
9.2 Solutions
1. Making a trigger word detection model work in a real-time scenario can be achieved by letting the detection module run every second on a fresh audio clip recorded simultaneously on another thread.
2. The fastest compatible object detection algorithm was found to be YOLOv4, which also has a higher frame rate and accuracy when compared to traditional object detection models and their predecessors.
3. To solve this issue, we make use of the gTTS API to convert all detected labels into an audio file.
CHAPTER 10: LIMITATIONS
1. The implemented trigger word module works only on local systems and not on cloud platforms like Google Colab.
2. The application requires high-end processing power in order to run smoothly without any drop in frames.
3. The real-time object detection window returned is based on Python libraries; to get better output with less latency, the real-time object detection window could instead be rendered using JavaScript.
4. The system activates only on the trigger word "Activate" on which it was trained.
5. An object must be relatively large in the frame in order to be detected by the model.
CHAPTER 11: RESULTS AND DISCUSSIONS
Accordingly, as per the objects detected by the model, the names of the objects are verbally conveyed to the user by the speech module, e.g., "The detected objects are mid centre laptop, bottom right cell phone, mid right mouse".
CHAPTER 12: CONCLUSION AND FUTURE WORK
This application aims to work like a smart cane for visually impaired individuals. Although the results obtained are satisfactory, the system still needs to be enhanced to reach the end goal: the application should be able to run faster on low processing power so that the user does not have to wait too long for audio feedback.

Since the application's NLP module was inspired by home assistant devices, other features of such devices, like news reports and entertainment, can be added to this application. Along with the above-mentioned features, if a navigation system were also added to the application, the life of a visually impaired individual would become a lot easier.
REFERENCES
[1] K. Supriya, "Trigger Word Recognition using LSTM," International Journal of Engineering Research and Technology, vol. 9, no. 6, p. 8, 2020.
[2] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84-90, 2017.
[3] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich Feature Hierarchies for
Accurate Object Detection and Semantic Segmentation," 2014 IEEE Conference
on Computer Vision and Pattern Recognition, 2014.
[4] J. Choi, D. Chun, H. Kim and H.-J. Lee, "Gaussian YOLOv3: An Accurate and
Fast Object Detector Using Localization Uncertainty for Autonomous Driving,"
2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[5] X. Wang, A. Shrivastava and A. Gupta, "A-Fast-RCNN: Hard Positive Generation
via Adversary for Object Detection," 2017 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2017.
[6] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time
Object Detection with Region Proposal Networks," IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2017.
[7] B. Balasuriya, N. Lokuhettiarachchi, A. R. M. D. N. Ranasinghe, K. D. C.
Shiwantha and C. Jayawardena, "Learning platform for visually impaired children
through artificial intelligence and computer vision," 2017 11th International
Conference on Software, Knowledge, Information Management and Applications
(SKIMA), 2017.
[8] T. Giannakopoulos, N.-A. Tatlas, T. Ganchev and I. Potamitis, "A practical, real-
time speech-driven home automation front-end," IEEE Transactions on Consumer
Electronics, 2005.
[9] N. Jmour, S. Zayen and A. Abdelkrim, "Convolutional neural networks for image
classification," 2018 International Conference on Advanced Systems and Electric
Technologies (IC_ASET), 2018.
[10] K. Kumatani, S. Panchapagesan, M. Wu, M. Kim, N. Strom, G. Tiwari and A.
Mandal, "Direct modeling of raw audio with DNNs for wake word detection,"
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU),
2017.
[11] M. J. Shafiee, B. Chywl, F. Li and A. Wong, "Fast YOLO: A Fast You Only
Look Once System for Real-time Embedded Object Detection in Video," Journal
of Computational Vision and Imaging Systems, 2017.
APPENDIX A
CODE AND SAMPLE OUTPUT
A.1 Code
A.1.1 utils.py
The methods in this Python file support the main application for detection and preprocessing purposes. They are used by importing this file in the detect.ipynb notebook.
Figure A.1: Methods used in util.py
A.1.2 audio_rec.ipynb
This Python notebook runs a sequence of code to record audio input in real time and save it in the required directory for further processing in the detect.ipynb file.
A.1.3 detect.ipynb
This Jupyter notebook does the main job of the application: it detects the trigger word from the audio file saved in the required directory by audio_rec.ipynb. Once the trigger word is detected, it plays a chime sound to indicate that the trigger word was found and then calls the object detection function, which passes the detected label to the gTTS API to give audio feedback for the object present.
Figure A.3: Importing necessary packages, models, weights and dataset
Figure A.4: Object detection and output processing methods
Figure A.5: Detecting Trigger Word and Calling Object Detection Method upon trigger
Figure A.7: Trigger Word Detected on Spectrogram
Figure A.9: Label and Confidence Output of detected Object
PLAGIARISM REPORT
PUBLICATION PROOF
High Technology Letters ISSN NO : 1006-6748
Abstract– People with poor vision or blindness are unable to perceive elements of their
environment in the same way as other people do. Performing daily tasks and avoiding hurdles are
only a couple of the difficulties they must overcome. This application seeks to develop an
assistive system for the visually disabled that will help them deal with all the obstacles in their
path. This application helps to assist visually impaired people by including a framework that can
tell them what object is in front of them without having to touch or smell it. Instead, they can
simply call out a trigger word, which triggers an object recognition module, which provides an
audible output of any object associated with their location.
I. INTRODUCTION
Up until 2015, there were around 940 million people around the world who were suffering from a certain level of vision loss. Out of these 940 million people, 240 million had very low vision and 39 million were completely blind. According to the World Health Organization's (WHO) first World Vision Report, published in 2019, more than a fifth of the world's population (2.2 billion people) suffers from vision impairment that may have been avoided or left untreated.
In addition, according to a report released in The Lancet Global Health journal in 2017, the
prevalence rate of vision disorder is expected to rise by 2020. It also estimates that if not handled
correctly, the number of cases will rise to almost 115 million cases of blindness and 588 million
cases of mild to extreme vision disability by 2050.
Visually disabled individuals face specific challenges when doing activities that normal people
take for granted, such as searching for keys, perceiving and locating items in the environment,
and walking along a path both indoors and outdoors. With the advancement of technology, it is
the duty of computer scientists to work in the field that improves the way in which every human
lives. Not only should a product developed be economically beneficial but also let a person
overcome his/her disability by depending on technology if needed [11].
Deep Neural Network [2] has 60 million parameters involved, and also solves overfitting issues.
The limitation of this paper is that the proposed neural network cannot be implemented for real
time object detection. This paper [3] specifies how R-CNN is one of the real time object
detection models available that detects an object present in an image by computing on selective
regions in an image. But the drawback of this paper is that the time taken to detect an object is 20 s and the mAP is only 66%. CNN (Convolutional Neural Network) [4] focuses on training a convolutional neural network for a classification system and the transfer learning technique. The drawback of this paper is that the proposed method is only useful for implementing a pre-trained model.
YOLOv3 [5] works by identifying the centre of an object from a dataset followed by expansion of the bounding box. The limitation is that with the increase in computation speed (22 ms), the accuracy of the model has fallen (63.4%).
The paper justifies how Fast-RCNN [6] is similar to normal R-CNN, except for some initial
CNN and feature map extractions followed by RCNN. The limitation is that although it is faster
(2 seconds) and more accurate (70%) compared to RCNN, it still cannot be considered as a fast
enough model. Hence, by using the references from the papers mentioned above, we strive to
develop a better implementation.
The object in the image is detected after going through the following steps, which are acquisition
of image frame by frame with the help of webcam followed by object detection by passing the
image through YOLOv4 algorithm in which feature extraction of image followed by labelling of
the object present in image takes place. YOLO algorithm as the name suggests “You Only Look
Once”, scans through the image only once and returns the object detected. It detects the center of
every object present in the dataset and continues to expand the bounding box around it and
returns the probability of the object present.
Apart from the object detection part of the proposed system, which represents the Digital Image Processing module, the proposed system also has an NLP module that works on the voice activation part of the application. The voice activation module of the application lets a user start the object detection module upon saying a trigger word. Similar to home assistants like Google Home, Alexa etc. [9], the system triggers the application for detecting an object upon recognizing the wake-up word [10] "Activate" on which it was trained. This task is carried out with the help of a microphone that takes continuous audio input.
IV. IMPLEMENTATION
In order to begin with a voice activation system, a microphone takes an input as mentioned in the
proposed system.
This method has various layers of neural networks to identify the word that triggers the system. It detects the trigger word from a pre-recorded audio clip by passing the audio clip into a neural network with several layers like GRU, ReLU, Dropout and Sigmoid, adds a chime sound just after the word is detected, and produces the output. The network first starts with a Conv1D layer that is used to compress the input in order to extract only the required features for processing. After compressing the input, the output from the Conv1D layer is fed into the BatchNormalization layer. Batch normalization is a process of standardising each mini-batch input to a layer while training very deep neural networks.
Since there may be lags of uncertain length between significant events in a time series, LSTM networks are well-suited to classifying, analyzing, and making predictions based on time series data. LSTMs were created to solve the issue of vanishing gradients that can occur while training conventional RNNs. After passing through the GRU layer, the network proceeds with repeated combinations of the different layers mentioned so far, followed by Dense and Sigmoid layers. The sigmoid layer is another activation function that gives us an output in the range 0-1, which represents the probability of the presence of the trigger word in the input.
But since the model is suited only for pre-recorded audio clips, we further divide this module
into two sub modules. One of them keeps recording the audio input with the help of a
microphone and writes the recorded audio file into the cloud environment. The second sub
module retrieves the audio recorded from the cloud and detects for the trigger word. The audio
recorded is of 2 seconds and the above-mentioned submodules keep running simultaneously in a
loop.
And in order to work on the object detection part, we have various algorithms that can be helpful, like deep neural networks (ImageNet) [2] and R-CNN [3], which provide results in a much more accurate manner and faster than previous neural networks built. R-CNN works in 3 modules: the first module generates all possible regions available for categorization, followed by feature extraction in the second module by passing the regions through a convolutional neural network, and the final module uses an SVM to determine the final output. According to the paper, the accuracy of the algorithm to detect an object is 66% and the time taken to process is 20 s. CNN-based region proposal networks use the selective search method, which works well, but the issue is still not solved: we want the system's response time to be shortened so that if a user sends the system a question, it can get a response as soon as possible. Hence, to match up with the speed and better accuracy, YOLO [5] proves to be a better algorithm in comparison; it uses a regression-based classification approach and can be applied as a single neural network to the full image.
YOLO is also known as You Only Look Once. It is a real-time object detection algorithm that is known to be one of the most powerful and useful object detection algorithms and builds on many of the groundbreaking ideas that come from computer vision. Some deep learning algorithms, such as the Faster R-CNN [7] algorithm, are accurate but still do not satisfy the required processing time, whereas the YOLO algorithm gives a faster result when compared to Faster R-CNN.
predicting the positions of 5 bounding boxes (there can be one or more objects in a cell).
If this occurs, it will result in 1805 bounding boxes for a single image.
4) The majority of the bounding boxes present in the cells could be empty. The probability of object type is used to filter these bounding boxes. Non-max suppression comes into play, and its processes will remove the unnecessary bounding boxes, leaving only the highest probability bounding boxes.
5) To assure the predictions, a confidence score is assumed (mostly above 40%) against
which the bounding box will predict the probability of the following class of confidence.
There is a probability of 80 classes for each cell, but only one class probability is predicted per cell. YOLO's prediction is an S∗S∗(B∗5+C) vector, which includes B box predictions for each grid cell and C class predictions for each grid cell (C stands for the number of classes). The working of YOLO is illustrated in fig 3 given below. The latest release of YOLOv4 is used in this project. This application uses the COCO dataset, which is pre-loaded with 80 classes of different day-to-day objects.
V. RESULTS
This framework is designed as a proof of concept for an object detection module that has been
trained on a specific dataset. The model displays the confidence scores while recording. Here we
have taken an instance where the user is using the model and the following observations are
recorded.
Accordingly, as per the objects detected by the model, the names of the objects are verbally conveyed to the user by the speech module, e.g., "The detected objects are Laptop, Mouse, Cell Phone".
VI. CONCLUSION
This project proposes a system to assist visually challenged people by giving them an audio
feedback of any object detected with the help of an efficient algorithm that is triggered only
when the system detects a trigger word in order to activate the object detection model. This idea
is achieved by exploring a list of algorithms that can prove to be effective for object detection
modules and by getting an insight into how to implement voice activation modules. Finally, we observe
that YOLO proves to be the most effective algorithm for object detection with higher accuracy
and faster response compared to other algorithms.
References