2023 3rd Asian Conference on Innovation in Technology (ASIANCON)

Pune, India. Aug 25-27, 2023

Real-Time Object Detection and Audio Feedback for the Visually Impaired
Ayan Ravindra Jambhulkar
Department of Electronics and Telecommunication Engineering
K. J. Somaiya College of Engineering
Mumbai, India.
[email protected]

Akshay Rameshbhai Gajera
Department of Electronics and Telecommunication Engineering
K. J. Somaiya College of Engineering
Mumbai, India.
[email protected]

Chirag Manoj Bhavsar
Department of Electronics and Telecommunication Engineering
K. J. Somaiya College of Engineering
Mumbai, India.
[email protected]

Shilpa Vatkar
Department of Electronics and Telecommunication Engineering
K. J. Somaiya College of Engineering
Mumbai, India.
[email protected]
979-8-3503-0228-8/23/$31.00 ©2023 IEEE | DOI: 10.1109/ASIANCON58793.2023.10269899

Abstract— Visually impaired individuals face numerous challenges in their daily lives, including the ability to identify and navigate through their surroundings independently. Object detection techniques based on computer vision have shown promising results in helping the visually impaired by detecting and classifying objects in real-time. In this paper, we present a real-time object detection and audio feedback system that provides audio feedback to the visually impaired for identifying and navigating their surroundings. The proposed system uses the YOLO_v3 algorithm with the MS COCO dataset to detect and classify objects in real-time and provide corresponding audio feedback, which is generated with the gTTS (Google Text to Speech) API using audio processing techniques and deep learning algorithms. We evaluated the system on a dataset and achieved an average detection accuracy of 90%. The proposed system provides a practical and effective solution for enhancing accessibility and independence for visually impaired individuals, and demonstrates the potential of using advanced deep learning algorithms and datasets for real-time object detection and audio feedback systems.

Keywords— Real-time object detection, Audio feedback system, YOLO_v3 algorithm, MS COCO dataset, gTTS (Google Text to Speech) API, Deep learning

I. INTRODUCTION

From an early age, humans are taught by their parents to distinguish between different things, including themselves as individuals. Our visual system as humans is remarkably precise and can handle multiple tasks even when we are not consciously aware of it. However, when dealing with large amounts of data, we require a more accurate system to correctly identify and locate multiple objects at the same time. This is where machines come into play. By training our computers with improved algorithms, we can enable them to detect multiple objects within an image with a high level of accuracy and precision. Object detection is a particularly challenging task in computer vision because it involves fully understanding images. In simpler terms, an object tracker attempts to determine if an object is present in multiple frames and assigns labels to each identified object [1]. This process encounters various challenges, such as complex images, loss of information, and the transformation of a three-dimensional world into a two-dimensional image. To achieve accurate object detection, our focus should not only be on classifying objects but also on accurately determining the positions of different objects, which can vary from one image to another [2]. This research proposes a system that can help people who are visually impaired detect and identify objects in their environment in real-time. To achieve this, we use an object detection algorithm called YOLO_v3 and a dataset called MS COCO. Our system generates an audio description of each object, including its location and category, and plays it through a speaker or headphones using the gTTS (Google Text to Speech) API. By providing audio feedback, this system aims to give visually impaired individuals an additional way to detect and identify objects in their environment.

II. LITERATURE REVIEW

Object detection and recognition have been important topics of research for many years. With the advancement of deep learning techniques, object detection has become more accurate and efficient. The YOLO (You Only Look Once) algorithm has emerged as a popular method for real-time object detection due to its speed and accuracy [3]. There has been increasing interest in developing assistive technologies for visually impaired individuals. These technologies aim to enhance their independence and mobility by providing them with additional means of detecting and identifying objects in their environment. Deep learning-based object detection systems have shown promising results in this regard. The Microsoft Common Objects in Context (MS COCO) dataset is widely used in deep learning-based object detection research. It is a large-scale dataset that contains over 330,000 images with more than 2.5 million object instances labelled in 80 different categories [4]. Ramesh et al. proposed a real-time object detection system for visually impaired individuals using deep learning. Their system uses a YOLO-based object detection algorithm and provides audio feedback to the user in real-time [5]. Saha et al. proposed an object detection and audio feedback system for visually impaired individuals that uses deep learning techniques. They used the YOLO algorithm for object detection and gTTS (Google Text-to-Speech) for audio feedback [6]. Li et al. proposed a deep reinforcement learning-based object detection and obstacle avoidance system for visually impaired individuals. Their system uses a combination of object detection and obstacle avoidance techniques to enable visually impaired individuals to navigate through complex environments [7]. One of the most commonly used object detection algorithms for real-time

applications is YOLO (You Only Look Once) [8]. YOLO is an end-to-end neural network that processes images in real-time and outputs bounding boxes and class probabilities for detected objects. YOLO has been used in several studies on object detection for visually impaired individuals [9] [10]. Another important aspect of real-time object detection for the visually impaired is the provision of audio feedback. Text-to-speech (TTS) technology is commonly used for generating audio feedback in object detection systems. In a study conducted by Shin and Kwon [11], a real-time object detection system was developed using the YOLO algorithm and TTS technology to provide audio feedback to visually impaired individuals. In addition to YOLO, other object detection algorithms have also been used in real-time applications for the visually impaired. For example, the Faster R-CNN (Region-based Convolutional Neural Network) algorithm was used in a study by Ghosal et al. [12] to develop a real-time object detection system with audio feedback for the visually impaired. Several datasets have been used for training and testing real-time object detection systems for the visually impaired. One of the most commonly used is the MS COCO (Common Objects in Context) dataset, which contains over 330,000 images and more than 2.5 million object instances [4]. The MS COCO dataset has been used in several studies on real-time object detection for visually impaired individuals [9] [11] [12].

III. RELATED WORK
For instance, a study by Saha et al. (2019) proposed a real-time object detection and audio feedback system that uses a Raspberry Pi and a camera module to detect objects and provide audio feedback. The system uses the TensorFlow object detection API and the COCO dataset for object detection, and the audio feedback is provided through a speaker or headphones. The study showed promising results in detecting objects in real-time and providing audio feedback [15]. Another study, by Noh et al. (2018), proposed a similar system for object detection and identification. It uses the Faster R-CNN algorithm for object detection and a Raspberry Pi for audio feedback. The study showed that the system was able to detect and identify objects in real-time, and that the audio feedback was effective in assisting visually impaired individuals in navigating environments [14]. In addition, a study by Bhuyan et al. (2019) proposed a system for text detection and audio feedback, using the EAST text detection algorithm and the Google Text-to-Speech API. The study showed that the system was able to detect text in real-time and provide accurate audio feedback to visually impaired individuals [13]. "Real-time Object Detection and Recognition for Visually Impaired People using Deep Learning" by D. Karimi and H. R. Rabiee presents a real-time object detection and recognition system for visually impaired people based on deep learning techniques; the system uses a convolutional neural network (CNN) to detect and classify objects and provides audio feedback to the user using text-to-speech technology. "Real-time Object Detection and Classification for the Visually Impaired using Wearable Cameras" by S. S. Saini and R. Singh proposes a wearable camera-based object detection and classification system for the visually impaired; it uses the YOLO algorithm to detect objects in real-time and provides audio feedback to the user through a speaker or headphones.

DATASET

When we develop an object detection algorithm, there are two primary aspects to focus on: detection and localization. Detection involves determining whether an object belongs to a specific category or not; localization refers to establishing the boundaries of a bounding box around each object, taking into account that the position of objects may differ across images. To evaluate and compare the effectiveness of various algorithms on the same application, it is beneficial to use challenging datasets that establish a standard for performance assessment. In the context of our problem statement, we have used the Microsoft Common Objects in Context (MS COCO) dataset to test the algorithm's performance [4]. COCO, as its name implies, comprises images collected from everyday scenes depicting common objects, gathered in a way that reflects their natural context; the dataset can be downloaded from the official COCO website [16]. It consists of a total of 330,000 images divided into 91 different categories, of which 82 have been assigned labels. Although COCO has fewer categories than some other datasets, it compensates with a larger number of instances of each specific object, which enables machines to learn more accurately. Additionally, the COCO dataset excels at representing small objects, providing valuable training examples for machine learning algorithms.
IV. METHODOLOGY

YOLO utilizes a single neural network to process the entire image. It divides the image into a grid of equally-sized cells, usually represented as S×S. For each object present in the image, YOLO creates a bounding box and labels it with a confidence score and a class label; the confidence score indicates how precisely the object is contained within the bounding box. Within each grid cell, YOLO predicts four values, (x, y, w, h), representing the coordinates and dimensions of the bounding box, with all values normalized between 0 and 1, as well as a confidence score for every object detected within the cell. The prediction output of YOLO has the shape (S, S, B×5+C) [17]: for each cell in the S×S grid, YOLO predicts B bounding boxes and their corresponding confidence scores for a total of B×5 values, and the additional C represents the number of class labels the algorithm can detect.
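To make this output layout concrete, the short sketch below computes the prediction-tensor shape both for the formulation cited above and for the per-box layout that YOLO_v3's detection head uses (each of the B boxes carries its own 5 box values plus C class scores, which is what the 1×1×255 kernel size discussed in the next section reflects). The grid size S = 13 and box count B = 3 are illustrative values assumed here; C = 80 matches MS COCO.

```python
# Shape of the YOLO prediction tensor for one detection scale.
S, B, C = 13, 3, 80   # grid size and boxes per cell are assumed; C = 80 for COCO

# Formulation cited above: B boxes of (x, y, w, h, confidence) plus C
# shared class probabilities per cell -> (S, S, B*5 + C).
depth_shared = B * 5 + C          # 95

# YOLO_v3 head: every box carries its own 5 values plus C class scores
# -> (S, S, B * (5 + C)).
depth_v3 = B * (5 + C)            # 255, i.e. the 1x1x255 kernel below

print((S, S, depth_shared))       # (13, 13, 95)
print((S, S, depth_v3))           # (13, 13, 255)
```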
The system consists of two main components: object detection using the YOLO_v3 algorithm and the generation of audio feedback using the gTTS API. It operates by taking input from a camera and performing real-time detection and classification of objects.

A. Object Detection:

In order to detect objects within images, we utilized the YOLO_v3 algorithm. This algorithm is favored for its impressive combination of speed and accuracy. YOLO_v3 follows a unique approach where it divides the image into a grid-like structure. Within each grid cell, the algorithm predicts bounding boxes (which indicate the location and size of the objects) and class probabilities (which determine the type of object present).
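The paper does not spell out how YOLO_v3 is invoked, so the sketch below shows one common way to run the pretrained network with OpenCV's DNN module. The yolov3.cfg, yolov3.weights, and coco.names file names are the standard Darknet release files, and the 0.5 confidence threshold is an assumption.

```python
import cv2
import numpy as np

# Load the pretrained Darknet YOLO_v3 network and the COCO class names.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
classes = open("coco.names").read().strip().split("\n")

def detect(frame, conf_threshold=0.5):
    """Return (label, confidence, (cx, cy)) tuples for one frame."""
    h, w = frame.shape[:2]
    # YOLO_v3 expects a square, normalized input blob (416x416 is standard).
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    detections = []
    for output in outputs:          # one output per detection scale
        for row in output:          # row = [x, y, w, h, objectness, class scores...]
            scores = row[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence >= conf_threshold:
                # Box center, rescaled from normalized coordinates.
                cx, cy = int(row[0] * w), int(row[1] * h)
                detections.append((classes[class_id], confidence, (cx, cy)))
    return detections
```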

Fig. 1. Architecture of YOLO_v3

B. Audio Feedback Generation:


We used the gTTS (Google Text to Speech) API to generate audio feedback for detected objects. gTTS is Google's Text to Speech API for generating speech from text. We passed the object description generated by the YOLO_v3 algorithm to gTTS, which produced an audio file of the description in MP3 format, and we played the audio file through a speaker or headphones to provide the user with audio feedback.
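As an illustration, the snippet below mirrors this step with the gTTS Python package; the playsound package used for playback and the output file name are our assumptions, since the paper only says the MP3 file is played through a speaker or headphones.

```python
from gtts import gTTS
from playsound import playsound  # assumed playback library

def speak(description: str) -> None:
    """Convert an object description (e.g. 'person on the left') to speech."""
    tts = gTTS(text=description, lang="en")
    tts.save("description.mp3")   # gTTS writes an MP3 file, as in the paper
    playsound("description.mp3")

speak("bottle detected in the center of the frame")
```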
C. Overall System:

YOLO_V3 utilizes the original Darknet architecture with 53 layers, but for the detection process an additional 53 layers are added, resulting in a total of 106 layers. What makes YOLO_V3 particularly interesting is its approach of making detections at three different scales, sizes, and locations within the network. The detection kernel's shape is represented as 1×1×(B×(5+C)), where C represents the total number of classes (e.g., 80 for the COCO dataset) and B denotes the number of bounding boxes around the objects; consequently, the kernel size of YOLO_V3 is 1×1×255. Underpinning the YOLO_V3 algorithm is the more extensive Darknet-53 network, consisting of 53 convolutional layers. Once an input image is fed into the YOLO_V3 architecture, multiple objects within the image are classified and assigned class labels. The resulting output is then processed by a Python module called gTTS, which converts the text into speech. The system described utilizes a camera to capture input, performs real-time object detection using the YOLO_V3 algorithm, generates an audio description of the detected objects using the gTTS API, and delivers the audio feedback to the user through a speaker or headphones. This system can be implemented on a computer or laptop equipped with a GPU to ensure real-time performance.
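Putting the pieces together, a minimal sketch of the described capture-detect-announce loop might look as follows. detect() and speak() are the illustrative helpers sketched in the earlier subsections, not functions named in the paper, and the left/center/right phrasing is one assumed way to turn a box position into a spoken location.

```python
import cv2

# End-to-end loop: capture a frame, run YOLO_v3, announce the result.
cap = cv2.VideoCapture(0)               # default laptop webcam
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        width = frame.shape[1]
        for label, confidence, (cx, cy) in detect(frame):
            # Map the box center to a coarse spoken position.
            if cx < width // 3:
                side = "left"
            elif cx > 2 * width // 3:
                side = "right"
            else:
                side = "center"
            speak(f"{label} on the {side}")
finally:
    cap.release()
```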
Fig. 2. Workflow of YOLO_V3 with Audio Feedback

Fig. 3. Flowchart

We evaluated the performance of our system on a dataset of images containing various objects. We measured the accuracy and speed of our system and found that it performed very well in terms of speed while maintaining high accuracy. This system has the potential to assist visually impaired individuals in detecting and identifying objects in their environment.
V. RESULTS

In this section, various assessment measures were employed to gauge how well the algorithm performed and how easily it could adapt. Precision, recall, and inference time were utilized as performance indicators. Using a specified threshold value, true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) were taken into account when calculating the precision and recall values. As a criterion, an IoU value of 0.5 was used, meaning that a detection is considered accurate if the IoU value is greater than or equal to 0.5; if not, it is regarded as false.

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

Fig. 4. Precision and Recall
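These metrics can be computed directly from the matched detections. The helpers below are a small sketch, including the IoU test used to decide whether a detection counts as a true positive; the box format and the example counts are illustrative, not values from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positive and false negative counts."""
    return tp / (tp + fp), tp / (tp + fn)

# A detection is a true positive when it overlaps a ground-truth box
# with IoU >= 0.5, the threshold used in the paper.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25/175 ~= 0.14 -> not a match
print(precision_recall(tp=18, fp=2, fn=4))   # illustrative counts -> (0.9, ~0.82)
```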
In addition to measuring the precision and recall values, the time it takes for the algorithm to detect objects is also considered in order to evaluate its speed. To assess the speed of detection, experiments were conducted in different scenarios: detecting a single object, detecting multiple objects, and detecting objects at a distance. All of these experiments were conducted in real-time using a webcam connected to a laptop.
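Inference time can be measured per frame around the detector call, as in this brief sketch; detect() is the illustrative helper from Section IV, not a function named in the paper.

```python
import time

def timed_detect(frame):
    """Run the detector on one frame and report the elapsed time in ms."""
    start = time.perf_counter()
    detections = detect(frame)            # illustrative helper from Section IV
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return detections, elapsed_ms
```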
Single Object:

With single object detection, the system gives confidence scores between 1 and 0.9, i.e. 100%–90% accuracy.

Fig. 5. Video Frame Output

Fig. 6. Terminal Output

Fig. 7. Video Frame Output

Fig. 8. Terminal Output

Fig. 9. Video Frame Output

Fig. 10. Terminal Output

Fig. 11. Video Frame Output

Fig. 12. Terminal Output

Multiple Object:

With multiple object detection, the system gives confidence scores between 1 and 0.78, i.e. 100%–78% accuracy.

Fig. 13. Video Frame Output

Fig. 14. Terminal Output

Fig. 15. Video Frame Output

Fig. 16. Terminal Output

Distant Object:

With distant object detection, the system gives confidence scores between 0.9 and 0.64, i.e. 90%–64% accuracy.

Fig. 17. Video Frame Output

Fig. 18. Terminal Output

Fig. 19. Precision Curve of YOLO_v3

Fig. 20. Recall Curve of YOLO_v3

We tested the system on an MS COCO dataset of images containing various objects and measured its accuracy and speed. The results showed that the system achieved high accuracy in object detection, with confidence scores between 1 and 0.64 (100%–64%), and was able to detect and classify objects in real-time on a laptop with a GPU. The audio feedback generated by the gTTS API was clear and understandable, providing visually impaired individuals with a reliable means of detecting and identifying objects in their environment. Overall, this real-time object detection and audio feedback system showed high accuracy and speed, making it a good tool for assisting visually impaired individuals in navigating their environment.

VI. CONCLUSION AND FUTURE SCOPE

In conclusion, our research has shown the effectiveness of utilizing deep learning techniques, specifically CNNs and YOLO_v3, to develop an object detection system for visually impaired individuals. The system achieved excellent accuracy in identifying and categorizing single, multiple, and distant objects using a laptop webcam in a short amount of time. It can detect multiple objects in a frame and accurately determine their positions, and it was evaluated on the MS COCO dataset. We also successfully combined our object detection system with the gTTS API to provide real-time audio feedback to visually impaired individuals, enhancing their ability to navigate and interact with their environment. While this work has shown the benefits of using deep learning and audio feedback for object detection, there are still areas for improvement. For example, the system currently relies on a camera as the input device, limiting its use in low-light environments. The detection model's precision could be improved by expanding the dataset to include more images in different lighting conditions and orientations, and extra features such as color recognition and distance measurement could be added to the object detection technique.

REFERENCES

[1] S. Cherian and C. Singh, "Real time implementation of object tracking through webcam," International Journal of Research in Engineering and Technology, pp. 128-132, 2014.
[2] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212-3232, 2019.
[3] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[4] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision, Springer, Cham, 2014, pp. 740-755.
[5] N. Ramesh, V. R. Anand, and R. V. Babu, "Real-time object detection for visually impaired using deep learning," in 2018 International Conference on Communication and Signal Processing (ICCSP), IEEE, 2018, pp. 0214-0218.
[6] S. Saha, A. Nag, and P. P. Roy, "Object detection and audio feedback system for the visually impaired using deep learning," International Journal of Computer Vision and Image Processing, vol. 9, no. 3, pp. 1-14, 2019.
[7] H. Li, X. Chen, X. Liang, Z. Li, and S. Liu, "Deep reinforcement learning-based object detection and obstacle avoidance for visually impaired," Sensors, vol. 19, no. 20, p. 4483, 2019.
[8] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[9] Y. Gao and W. Wu, "Real-time object detection for visually impaired people with YOLO," in Proceedings of the 2nd International Conference on Control Science and Systems Engineering, 2021.
[10] N. R. Kuncham and K. H. Prasad, "Real-time object detection for visually impaired people using YOLOv3," in Proceedings of the 6th International Conference on Inventive Computation Technologies, 2021.
[11] S. Shin and S. Kwon, "Real-time object detection with audio feedback for the visually impaired using YOLOv3," in Proceedings of the 15th International Conference on Advanced Technologies, 2020.
[12] S. Ghosal, P. Banerjee, and S. Chakraborty, "Real-time object detection and audio feedback system for the visually impaired using Faster R-CNN," in Proceedings of the International Conference on Computer Vision and Image Processing, 2019.
[13] M. S. Bhuyan, S. Chakravarty, S. Das, and P. K. Bora, "Real-time text detection and audio feedback system for the visually impaired," Multimedia Tools and Applications, vol. 78, no. 17, pp. 24479-24499, 2019.
[14] Y. Noh, C. Kim, and I. Hwang, "Object detection and identification for visually impaired using deep learning and audio feedback system," 2018.
[15] S. Saha, S. Pal, and J. Mukherjee, "An assistive device for visually impaired people for object detection and audio feedback," 2019.
[16] COCO dataset, http://cocodataset.org/#home
[17] J. Du, "Understanding of object detection based on CNN family and YOLO," in Journal of Physics: Conference Series, vol. 1004, no. 1, p. 012029, IOP Publishing, 2018.
