
VOICE ASSISTED OBJECT DETECTION FOR VISUALLY IMPAIRED

2023 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT) | 979-8-3503-3439-5/23/$31.00 ©2023 IEEE | DOI: 10.1109/CONECCT57959.2023.10234781

HITAISH KG, School of CSE, REVA University, Bengaluru, India ([email protected])
DR. VANI KRISHNASWAMY, School of CSE, REVA University, Bengaluru, India ([email protected])
VISHNU M, School of CSE, REVA University, Bengaluru, India ([email protected])
B MAHIMA, School of CSE, REVA University, Bengaluru, India ([email protected])

Abstract— Visual impairment can have a severe negative effect on a person's freedom, employment, and daily life. Globally, at least 2.2 billion individuals experience some form of visual impairment. Reading, writing, and navigating the environment are challenging for people with visual impairment, which can lower quality of life and lead to feelings of loneliness and sadness. A variety of deep learning object detection models based on computer vision are available. In this paper, we integrate the MobileNet SSD architecture to develop a navigation system that provides fast and accurate object detection to assist the visually impaired. The system is a lightweight, mobile-friendly architecture optimized for embedded and mobile devices with constrained CPU power. We also design algorithms that provide the direction and distance of detected objects along with audio output. In future work, the project can be extended to various domains.

Keywords— Visual impairment, object detection, navigation, deep learning, computer vision.
1. INTRODUCTION

Visual impairment is a widespread condition that affects many individuals around the world. It ranges from minor to major and can considerably affect a person's quality of life. Individuals who are blind or partially sighted face numerous difficulties every day, such as navigating unfamiliar environments, recognizing objects, and identifying obstacles. Several technologies, including white canes, guide dogs, and assistive devices, have been developed to address these difficulties. These technologies, however, have inherent restrictions and might not always deliver precise or timely information.

Object detection has the potential to address some of these limitations. By using computer vision algorithms and machine learning techniques, object detection systems can help visually impaired individuals better understand their surroundings and make informed decisions. For instance, an object detection system can identify obstacles such as poles, curbs, or other hazards and warn the user to stay clear of them.

In this paper, we review the current technologies in object detection for visual impairment and highlight the most promising approaches and techniques. We propose an offline navigation system that identifies objects and tracks their direction and distance in real time. The system uses little processing power, enabling it to run on a variety of mobile devices. We also make recommendations for future research directions that can enhance the efficiency and usability of object recognition systems.

The paper is structured as follows. Section 2 presents a literature review of existing object detection systems for the visually impaired. Section 3 describes the proposed system for detecting objects using convolutional neural networks. Section 4 offers an evaluation of the proposed system's performance. Finally, the paper concludes by presenting the findings and discussing future developments.

2. RELATED WORKS

The rapid progression of technology has given rise to a multitude of systems such as electronic travel devices (ETDs), ultrasonic sensors, and RFID tags [1, 2, 3]. Although these devices can discern the existence of objects in their surroundings, they cannot identify the specific nature of those objects. Furthermore, they may exhibit a considerable degree of inaccuracy and imprecision, compromising their overall effectiveness and practicality.

The authors in [4] used convolutional neural networks, which have proved very useful and given promising results for object detection.

The authors in [5] used deep learning algorithms such as YOLOv3 (You Only Look Once) for object detection, applying a logistic regression method to predict the classes of the bounding boxes that enclose the objects.

The authors in [6] recognized objects using region-based convolutional neural network (R-CNN) algorithms. Guidance is provided through an application that delivers the object detection results as audio output. To train the neural network model, the authors used TensorFlow Lite and XML; the application is built in Java using Android Studio.



A novel deep architecture for the visually impaired was proposed by the authors in [7] using deep convolutional neural networks (deep CNNs). They built a framework based on RetinaNet and improved the detection accuracy and computational performance through deep learning architectures.

The authors in [8] used Single-Shot Detection (SSD) for object recognition and classification. This model uses the Inception v3 network to recognize human faces and currency notes.

The authors in [9] developed a method to identify objects using deep learning, based on the pre-trained MobileNet SSD model. The system performs real-time object detection on a webcam feed. The paper also examines the accuracy of object recognition with the SSD model and the significance of MobileNet.

The authors in [10] proposed a model using image segmentation and a deep neural network that performs real-time object identification. To build a compact, portable device with minimal response time, they combined a MobileNet model with a single-shot multibox detection system.
3. PROPOSED SYSTEM

What accounts for the popularity of deep learning? Even when trained with enormous amounts of data, it dominates earlier systems in terms of exactness and optimization. The architecture of Voice Assisted Object Detection (VAOD) is shown in figure 1.

Fig 1. Proposed System model

The camera takes video input, which is converted into multiple image frames. These individual frames are pre-processed for better results. An image processing algorithm for object detection is applied to every individual frame and the objects are detected [11]. Once an object is detected and its boundaries are drawn, the distance between the camera and the object is given by the distance formula [12]. The object's direction with respect to the camera/user is then determined using a classification algorithm explained in the following sections. Finally, the information about the objects, their distance from the user, and their direction is given to notify the user of the objects' locations. This is done through audio output using a text-to-speech engine.

The complete explanation of each block of the VAOD model is given from section 3.1 onwards.
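The pipeline just described (frames, detection, distance, direction, speech) can be sketched as a simple loop. This is an illustrative skeleton only, not the authors' code: every helper function here is a hypothetical placeholder standing in for a component detailed in the subsections below.

```python
# Illustrative VAOD loop; each helper is a placeholder for a component
# described in sections 3.1 to 3.5, with hard-coded example values.

def detect_objects(frame):
    # Placeholder: would run the MobileNet-SSD model on the frame.
    return [{"label": "chair", "box": (120, 60, 260, 200), "confidence": 0.9}]

def estimate_distance(box):
    # Placeholder: would apply equation (1) to the bounding-box area.
    return 1.5

def classify_direction(box, frame_size=(400, 225)):
    # Placeholder: would compare the box centre to the frame centre.
    return "top-right"

def process_frame(frame):
    # Build one spoken message per detection in the frame.
    messages = []
    for det in detect_objects(frame):
        d = estimate_distance(det["box"])
        direction = classify_direction(det["box"])
        messages.append(f"{det['label']} at {d:.1f} meters, {direction}")
    return messages

print(process_frame(frame=None))  # each message would be sent to the TTS engine
```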

3.1 Video input
Continuous video input is taken through the camera and converted into multiple frames using the imutils package, which is available for Python 3.
3.2 Object Detection

3.2.1 Feature extraction using CNN layers
Convolutional neural networks (CNNs) are extensively applied in image recognition [13]. CNNs use convolutional layers to extract features from the input images. Each layer is made up of learnable filters that construct feature maps highlighting structures and patterns in the input image. As the image moves through the layers, the extracted features grow more sophisticated and abstract, enabling precise object recognition and classification. These learnt features can then be applied to object detection.
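To illustrate what a single convolutional filter does, here is a small NumPy-only example (ours, not the authors') applying a hand-crafted vertical-edge filter to a tiny image; a CNN learns many such filters from data instead of hand-crafting them.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 2-D cross-correlation, the core operation of a CNN layer."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 5x5 image: dark left half, bright right half.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
# A vertical-edge filter: responds where brightness rises left to right.
kernel = np.array([[-1.0, 1.0]])

feature_map = convolve2d(image, kernel)
print(feature_map[0])  # peak of 1.0 exactly at the dark-to-bright edge
```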
3.2.2 Object probability prediction using intersection over union
In the object probability prediction step, the Intersection over Union (IoU) score is used to gauge the likelihood that a predicted bounding box correctly locates an object. The IoU metric, which accounts for both the overlap and the shape of the object, is used by several object detection algorithms. The IoU-based probability is calculated by taking the overlap area between the predicted and ground truth bounding boxes and dividing it by the union area of the two boxes. The likelihood of a successful prediction is then determined from the IoU score and a predetermined threshold value.

3.2.3 Removal of redundant detections using Non-Maximum Suppression (NMS)
Non-Maximum Suppression (NMS) is an object detection post-processing method that removes redundant detections and chooses the most precise bounding box covering the ground truth box. This procedure increases the accuracy of the object detection system and decreases false hits. To suppress the bounding boxes with lower scores, NMS compares the overlap and confidence scores of neighboring boxes.

3.3 Object distance prediction
The formula for calculating distance from the object size, the focal length of the camera, and the bounding box area is given in equation (1).

Distance = (object_size * focal_length) / sqrt(bounding_box_area) ----------(1)

In equation (1), object_size is the actual size of the object (in meters); to obtain it, VAOD uses the average dimensions of each object class, in meters. focal_length is the focal length of the camera, a physical parameter that can be obtained from the manufacturer's specifications. bounding_box_area is the area of the detected object in the image, converted from square pixels to square meters.
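A small sketch of equation (1) in Python (our illustration; the object size, focal length, and box area below are made-up placeholder values, not calibrated parameters from the paper):

```python
import math

def estimate_distance(object_size_m, focal_length_m, bbox_area_m2):
    """Equation (1): distance = (object_size * focal_length) / sqrt(bbox_area)."""
    return (object_size_m * focal_length_m) / math.sqrt(bbox_area_m2)

# Hypothetical numbers: a 0.5 m object, a 0.004 m focal length,
# and a bounding box covering 1e-6 square meters on the sensor.
d = estimate_distance(0.5, 0.004, 1e-6)
print(round(d, 2))  # 2.0 meters
```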
3.4 Algorithm to find the position of the object
Once the objects are detected and the distance has been determined, each frame is run through the location detection technique. The identified objects in the frame are categorized into top-left, top-right, bottom-left, and bottom-right orientations. Section 4 provides more information on the method.

3.5 Text to speech
The pyttsx3 package is the primary package utilized in this conversion. Python's pyttsx3 module converts text to voice, and it is the basis for the engine's ability to provide audio output to the user.

4. IMPLEMENTATION

Figure 2 shows the implementation of our model. Once the live video is converted into frames, we load the MobileNet-SSD model. The MobileNet-SSD model uses a feed-forward convolutional network to create a fixed-size set of bounding boxes and scores for the presence of object class instances in those boxes. A non-maximum suppression step is then used to produce the final detections [11].

Fig 2. Flowchart for object detection

4.1 Image pre-processing
First, the video is converted to frames (using Python's imutils library) and loaded into the MobileNet-SSD model at 300x300 pixels. Convolutional neural networks are then used to pre-process the image and extract features. In our study, we compute these frames at a speed of 49 fps. A series of convolutional, pooling, and nonlinear transformation operations turn a picture into a set of features: an input image is run through the MobileNet base network, and a number of additional layers then perform the object detection. The MobileNet-SSD caffe model is trained on the MS COCO object detection dataset and further fine-tuned on VOC0712. These datasets, which contain many annotated photos and diverse object classes, are frequently used to test and improve object detection models [9]. Some of the objects detected by VAOD are bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train and tv monitor.

The output of the MobileNet base network is a collection of feature maps representing the identified features of the input image. The location and class of objects in the input image are predicted by a set of extra layers applied to these feature maps: convolutional layers that pull features from the input feature maps are followed by object detection layers. The MobileNet-SSD head consists of two kinds of layers, the prior box layer and the detection output layer. The 8732 anchor boxes (per class) generated by the prior box layer are used to forecast the bounding boxes for the objects in the input image. Based on the information retrieved by the convolutional layers and the anchor boxes, the detection output layer predicts the location and class of objects in the input image [13].

The anchor box formula determines the expected bounding box dimensions for each anchor box:

a_k = [s_k * sqrt(ar_k), s_k / sqrt(ar_k)], for k in [0, ..., n] ----------(2)

In equation (2), a_k is an anchor box with aspect ratio ar_k and scale s_k, n is the number of aspect ratios, and k is a counter that takes integer values from 0 to n.
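Equation (2) can be illustrated in a few lines of Python (our sketch; the scale and aspect-ratio values are illustrative, not the ones used in the trained model). Note that all anchors at a given scale s have the same area s^2, since the sqrt(ar) factors cancel:

```python
import math

def anchor_box(scale, aspect_ratio):
    """Equation (2): width = s * sqrt(ar), height = s / sqrt(ar)."""
    return (scale * math.sqrt(aspect_ratio), scale / math.sqrt(aspect_ratio))

# Illustrative scale of 0.2 with three aspect ratios.
for ar in (0.5, 1.0, 2.0):
    w, h = anchor_box(0.2, ar)
    print(f"ar={ar}: w={w:.3f}, h={h:.3f}")  # area w*h is 0.04 in every case
```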

4.2 Object detection
MobileNet-SSD and other object detection algorithms use IoU as a crucial parameter to assess the precision of the predicted bounding boxes. IoU is defined as the ratio of the intersection area to the union area of the predicted and ground truth bounding boxes. It is often used as a criterion for matching predicted boxes to ground-truth objects during training and evaluation, and it measures the amount of overlap between the two boxes.

intersection over union = (area of intersection) / (area of union) ----------(3)

Equation (3) gives the formula of intersection over union. A predicted box is regarded as a valid detection if its IoU with a ground-truth object is above a predetermined threshold (usually 0.5), and as a false detection if its IoU is below the threshold [3].

The predicted bounding boxes are then subjected to a post-processing step using the NMS algorithm. NMS compares the overlap (intersection over union) between pairs of bounding boxes and suppresses the ones with lower confidence ratings. It sorts the predicted bounding boxes in descending order of confidence score, selects the box with the highest confidence score for the output, and computes the IoU of this box with the remaining boxes, discarding those that overlap it heavily [14]. This confidence score, i.e. the accuracy of the detected object, is presented as a percentage value in the output.

4.3 Finding distance from camera to object

Equation (1) is derived from the principle of triangulation. Triangulation determines the distance to an object by measuring the angles between a baseline and the object from two different positions. In VAOD, a formula for measuring distance from the size of the object, the camera's focal length, and the area of the bounding box is developed. The object is the apex of the triangle and the camera is its base. The actual size of the object serves as one side of the first triangle and the distance serves as the other. The size of the object in the image serves as one side of the second triangle and the camera's focal length serves as the other. The size of the object in the image is equal to the square root of the bounding box area. Thus, by the principle of similar triangles, equation (1) is derived. Hence VAOD is able to find the distance between the camera and the object in meters.

This technique offers a precise distance estimate and can be applied in various applications, including robotics, surveillance, and self-driving cars.

4.4 Algorithm to find the position of the object

The top-left and bottom-right corners of the predicted object's bounding box are obtained once the object has been detected. The centre of the object is determined from those coordinates by averaging the x- and y-coordinates of the top-left and bottom-right corners of the bounding box.

x_avg = (start_X + end_X) / 2 ----------(4)

y_avg = (start_Y + end_Y) / 2 ----------(5)

In equation (4), x_avg is the centroid of the object with respect to the x axis, and start_X and start_Y represent the x- and y-coordinates of the top-left corner of the bounding box. In equation (5), y_avg is the centroid of the object with respect to the y axis, and end_X and end_Y represent the x- and y-coordinates of the bottom-right corner of the bounding box.

The imutils.resize function scales each frame down to 400x225, so the frame's centre is taken to be at (200, 112.5). The direction is determined by where the object's centre lies relative to the frame's centre. For instance, the direction is set to "top-right" if the object's centre is above and to the right of the frame centre, and to "bottom-left" if it is below and to the left.
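Equations (4) and (5) and the quadrant test can be sketched as follows (our illustration of the described logic, not the authors' code; recall that image y-coordinates grow downwards):

```python
def object_direction(start_x, start_y, end_x, end_y, frame_w=400, frame_h=225):
    """Classify a bounding box into a quadrant relative to the frame centre."""
    x_avg = (start_x + end_x) / 2      # equation (4)
    y_avg = (start_y + end_y) / 2      # equation (5)
    cx, cy = frame_w / 2, frame_h / 2  # (200, 112.5) for a 400x225 frame
    vertical = "top" if y_avg < cy else "bottom"  # smaller y means higher up
    horizontal = "right" if x_avg > cx else "left"
    return f"{vertical}-{horizontal}"

print(object_direction(250, 20, 350, 80))   # top-right
print(object_direction(10, 150, 90, 210))   # bottom-left
```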
4.5 Text to speech conversion
The pyttsx3 package is used to transform the text output into speech. The pyttsx3 engine must first be initialised, the speaking rate configured appropriately, and the text then passed to the engine. This library was chosen mainly because it is an offline text-to-speech library that can convert text to speech without the need for an internet connection, which is helpful in situations where connectivity is erratic or non-existent. It also enables flexibility in the text-to-speech output: the engine's speech rate, volume, and voice type may all be altered by the user.
Fig 3. Detection of bus with distance and direction.

Figure 3 illustrates the detection of buses with 65.01% accuracy and 89.69% accuracy respectively.

5. RESULTS

VAOD is useful for visually impaired people: it helps them recognize objects and acts as a guidance system. VAOD is built on MobileNet-SSD, which is faster and more accurate than the other object detection models compared below.

[Bar chart: accuracy (%) of YOLO, R-CNN, SSD, faster R-CNN, and VAOD]

Fig 4. Accuracy of detected object vs Deep learning models.

Figure 4 illustrates the accuracy of different deep learning models.

[Bar chart: speed (fps) of YOLO, SSD, R-CNN, R-FCN, and VAOD]

Fig 5. Speed vs Deep learning models.
Figure 5 illustrates the processing speeds (fps) of different deep learning architectures.

VAOD is lightweight and hence can be used in IoT devices or any mobile application with low processing power. VAOD not only identifies objects but also guides the visually impaired by providing directions and distance.

The following images show some of the results.

Fig 6. Detection of object.

Figure 6 illustrates the detection of an object (sofa) with 52.86% accuracy towards the bottom-right at a distance of 0.41 m.

Fig 7. Detection of person.

Figure 7 illustrates the detection of people with 75.30% accuracy at a distance of 0.11 m and 99.27% accuracy at a distance of 0.12 m respectively.

Fig 8. Detection of motorbike and person.

Figure 8 illustrates the detection of a motorbike with 71.67% accuracy towards the bottom-left at a distance of 0.51 m, and of two persons with 59.42% (left) and 83.50% (right) accuracy respectively.
6. CONCLUSION AND FUTURE WORKS

A novel framework employing object detection, object classification, and direction and distance prediction has been presented to assist visually impaired people. Future work can increase the number of object classes detected, and the model can be further employed in fields such as robotics, medical automation and the automobile industry.

7. REFERENCES

[1] Cardillo, E. and Caddemi, A., 2019. Insight on electronic travel aids for visually impaired people: A review on the electromagnetic technology. Electronics, 8(11), p.1281.
[2] Gbenga, D.E., Shani, A.I. and Adekunle, A.L., 2017. Smart walking stick for visually impaired people using ultrasonic sensors and Arduino. International Journal of Engineering and Technology, 9(5), pp.3435-3447.
[3] Real, S. and Araujo, A., 2019. Navigation systems for the blind and visually impaired: Past work, challenges, and open problems. Sensors, 19(15), p.3404.
[4] S. Shah, J. Bandariya, G. Jain, M. Ghevariya and S. Dastoor, "CNN based Auto-Assistance System as a Boon for Directing Visually Impaired Person," 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 2019, pp. 235-240, doi: 10.1109/ICOEI.2019.8862699.
[5] Wong, Y.C., Lai, J.A., Ranjit, S.S.S., Syafeeza, A.R. and Hamid, N.A., 2019. Convolutional neural network for object detection system for blind people. Journal of Telecommunication, Electronic and Computer Engineering (JTEC), 11(2), pp.1-6.
[6] Afif, M., Ayachi, R., Said, Y., Pissaloux, E. and Atri, M., 2020. An evaluation of RetinaNet on indoor object detection for blind and visually impaired persons assistance navigation. Neural Processing Letters, 51, pp.2265-2279.
[7] Yee, L.R., Kamaludin, H., Safar, N.Z.M., Wahid, N., Abdullah, N. and Meidelfi, D., 2021. Intelligence Eye for Blinds and Visually Impaired by Using Region-Based Convolutional Neural Network (R-CNN). JOIV: International Journal on Informatics Visualization, 5(4), pp.409-414.
[8] S. Bhole and A. Dhok, "Deep Learning based Object Detection and Recognition Framework for the Visually-Impaired," 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 2020, pp. 725-728, doi: 10.1109/ICCMC48092.2020.ICCMC-000135.
[9] Younis, A., Shixin, L., Jn, S. and Hai, Z., 2020, January. Real-time object detection using pre-trained deep learning models MobileNet-SSD. In Proceedings of 2020 the 6th International Conference on Computing and Data Engineering (pp. 44-48).
[10] Arora, A., Grover, A., Chugh, R. et al., 2019. Real Time Multi Object Detection for Blind Using Single Shot Multibox Detector. Wireless Personal Communications, 107, pp.651-661. https://doi.org/10.1007/s11277-019-06294-1.
[11] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y. and Berg, A.C., 2016. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I (pp. 21-37). Springer International Publishing.
[12] Rosebrock, A., 2015, January 19. Find Distance from Camera to Object/Marker Using Python and OpenCV. Retrieved May 10, 2023, from https://pyimagesearch.com/2015/01/19/find-distance-camera-objectmarker-using-python-opencv/.
[13] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M. and Adam, H., 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[14] Kim, K. and Lee, H.S., 2020. Probabilistic anchor assignment with IoU prediction for object detection. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV (pp. 355-371). Springer International Publishing.

