
ISSN 2278-3091
Volume 10, No. 3, May - June 2021

International Journal of Advanced Trends in Computer Science and Engineering
Available Online at http://www.warse.org/IJATCSE/static/pdf/file/ijatcse181032021.pdf
https://doi.org/10.30534/ijatcse/2021/181032021

Object Detection and Identification

Prinsi Patel 1, Barkha Bhavsar 2
1 Researcher, LDRP Institute of Technology & Research, Gandhinagar-382015, Gujarat, India
2 Assistant Professor, LDRP Institute of Technology & Research, Gandhinagar-382015, Gujarat, India

ABSTRACT

Object detection systems have been growing in the last few years for various applications, yet the hardware often cannot detect the smallest objects. Many algorithms are used for object detection, such as YOLO, R-CNN, Fast R-CNN, and Faster R-CNN. Object detection using YOLO is faster than the other algorithms because YOLO scans the whole image in a single pass. Object detection based on Convolutional Neural Networks (CNNs) rests on classification and localization: an object is detected by extracting its features, such as color, texture, or shape, and based on these features objects are classified into many classes, each assigned a label. When we subsequently provide an image to the model, it outputs the objects it detects, the location of a bounding box containing each object, its label, and a score indicating the confidence. Text-To-Speech (TTS) conversion is a computer-based system that converts the detected labels from text to speech. The main motive is that even the smallest objects can be detected and labeled with voice in real-time object detection. The proposed model architecture is more accurate and provides faster object detection with voice compared to previous research.

Key words: Object Detection, Object Recognition, Text-to-Speech Conversion, You Only Look Once (YOLO), CNN, R-CNN.

1. INTRODUCTION

In previous research there are various algorithms for detecting objects and labeling them. Object detection is the combination of image classification and object localization. Image classification is used to classify, or predict the class of, the object in an image; its main goal is to accurately identify the features of the image. Object localization locates the object in an image with a bounding box. Object detection is thus highly capable of dealing with multi-class classification and localization together.

Object detection is a technology that detects the semantic objects of a digital image. It is a computer vision technique that allows us to identify and locate objects in an image and to accurately label them with voice and text [1]. With this kind of image identification and object localization, object detection can be used to count objects in a scene and to determine and track their precise locations. We can then convert the annotated text into voice responses that give the basic positions of the objects.

Object detection can be broken down into machine learning-based and deep learning-based approaches for object detection and recognition [1], such as the Support Vector Machine (SVM), Convolutional Neural Networks (CNNs), Region-based Convolutional Neural Networks (R-CNNs), and the You Only Look Once (YOLO) model. Since machines cannot detect the objects in an image instantly the way humans do, the algorithms must be fast and accurate in order to detect objects in real time.

In the real world there are many object detection systems available, and they provide accurate results. In this survey we try to detect even the smallest objects accurately and label them with voice. Text-To-Speech (TTS) conversion is a computer-based system divided into two modules: an image processing module and a voice processing module. In the image processing module, Optical Character Recognition (OCR) converts the .jpg image to .txt format, recognizing the characters automatically. The voice processing module then converts the .txt file to speech.

2. RELATED WORK

Table 1 summarizes the related work.

Table 1: Related work

Paper: Real-Time Object Detection with Yolo [2]
Method / Architecture / Datasets: You Only Look Once (YOLO)
Limitation: The algorithm is simple to build and can be trained directly on a complete image; region proposal strategies limit the classifier to a particular region.
Output: YOLO accesses the entire image when predicting boundaries and predicts fewer false positives in background areas. Compared to other classifier algorithms, this algorithm is much more efficient and the fastest to use in real time.

Paper: You Only Look Once: Unified, Real-Time Object Detection [3]
Method / Architecture / Datasets: You Only Look Once (YOLO), Convolutional Neural Network (CNN), PASCAL VOC
Limitation: YOLO imposes strong spatial constraints on bounding box predictions, since each grid cell predicts only two boxes and can have only one class. This spatial constraint limits the number of nearby objects the model can predict.
Output: A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation, and it can be optimized end-to-end directly on detection performance.

Paper: Object Detection and Tracking using Tensor Flow [4]
Method / Architecture / Datasets: OpenCV, TensorFlow, CNN, Common Objects in Context (COCO)
Limitation: The method uses the TensorFlow and OpenCV libraries together with a CNN algorithm, labelling the detected objects with their accuracy while they are being checked at the same time.
Output: It utilizes TensorFlow to join information from different sources and a joint improvement strategy to train on COCO, narrowing the gap between recognition and characterization.

Paper: Implementation of Text to Speech Conversion [5]
Method / Architecture / Datasets: Text-To-Speech (TTS), Optical Character Recognition (OCR)
Limitation: The OCR system is implemented for the recognition of the capital English characters A to Z and the digits 0 to 9; each character is recognized one at a time.
Output: The recognized characters are saved as text in a notepad file. The text-to-speech conversion system gets the text from an image, inputs it directly to the computer, and then produces speech from that text using MATLAB. It is a cost-effective, user-friendly image-to-speech conversion system.

3. PROPOSED METHOD

First the datasets used are described, and then the workflow of the proposed system is introduced.

3.1 Datasets

Common Objects in Context (COCO) is a large-scale object detection, segmentation, and captioning dataset. COCO has several features: object segmentation, recognition in context, 330K images (>200K labeled), 1.5 million object instances, and 80 object categories.

This dataset is used for multiple challenges: caption generation, object detection, key point detection, and object segmentation. We focus on the COCO object detection challenge, which consists in localizing the objects in an image with bounding boxes and categorizing each of them among the 80 categories. The dataset changes each year but is usually composed of more than 120,000 images for training and validation and more than 40,000 images for testing.
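As an aside, the 80 COCO category labels can be loaded as in the following minimal sketch; it assumes the standard Darknet coco.names file (one label per line), which is our assumption rather than something this paper specifies.

with open("coco.names") as f:          # assumed label file, one class per line
    classes = [line.strip() for line in f if line.strip()]

print(len(classes))   # 80 categories
print(classes[:3])    # e.g. ['person', 'bicycle', 'car']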
3.2 Proposed system

Figure 1: Workflow of proposed system

Figure 1 shows the proposed system workflow.

Input data: We use our webcam to feed images at 30 frames per second to the trained model, and we can set it to process only every other frame to speed things up.
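A minimal sketch of this input stage, assuming OpenCV (cv2); webcam index 0 and the every-other-frame rule follow the description above.

import cv2

cap = cv2.VideoCapture(0)              # default webcam
frame_id = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_id += 1
    if frame_id % 2 == 0:              # process only every other frame
        continue
    # ... hand `frame` to the trained model here ...
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()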

Model: The model here is the You Only Look Once (YOLO) algorithm, which runs through a variation of an extremely complex convolutional neural network architecture called Darknet. Even though we use the more enhanced and complex YOLO v3 model, we explain the original YOLO algorithm. Also, the Python cv2 package has a method to set up Darknet from our configuration in the yolov3.cfg file.
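A sketch of that setup using OpenCV's dnn module; readNetFromDarknet is the cv2 method referred to above, while the yolov3.weights file name is our assumption (the paper only mentions yolov3.cfg).

import cv2

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

def detect(frame):
    # YOLO v3 expects a square, normalized input blob (416x416 is typical)
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    # raw predictions from the network's unconnected output layers
    return net.forward(net.getUnconnectedOutLayersNames())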

Training data: The model is trained with the Common Objects in Context (COCO) dataset; the labeled images can be explored on the COCO website.

API: The class prediction of the objects detected in every frame is a string, e.g. "cat". We also obtain the coordinates of the objects in the image and append the position "top"/"mid"/"bottom" and "left"/"center"/"right" to the class prediction, e.g. "cat top left". We can then send the text description to the Google Text-to-Speech API using the gTTS package (a sketch follows below).

Output: We also obtain the coordinates of the bounding box of every object detected in our frames and overlay the boxes on the detected objects with a label and voice.
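A minimal sketch of this API step, assuming the gTTS package; the helper name describe and the thirds-based rule for choosing the position words are our assumptions.

from gtts import gTTS

def describe(label, box, frame_w, frame_h):
    x, y, w, h = box                       # top-left corner plus width/height
    cx, cy = x + w / 2, y + h / 2          # box center
    vert = "top" if cy < frame_h / 3 else ("bottom" if cy > 2 * frame_h / 3 else "mid")
    horiz = "left" if cx < frame_w / 3 else ("right" if cx > 2 * frame_w / 3 else "center")
    return f"{label} {vert} {horiz}"       # e.g. "cat top left"

text = describe("cat", (50, 30, 120, 90), 416, 416)
gTTS(text=text, lang="en").save("position.mp3")   # play with any audio player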
Figure 2: Proposed system

Figure 2 shows the steps followed by the proposed system:
1. First, we use our webcam or an image to capture the input data.
2. The input image is resized according to the network architecture.
3. The YOLO algorithm is applied to detect the multiple objects, using the model trained on the COCO dataset.
4. The output is the set of detected objects with labels and confidence scores.
5. The detected label is treated as the input for the text-to-speech device; the labels are passed through the text-to-speech device.

Figure 3 shows the convolutional neural network, which scales the spatial dimensions back to 7x7 with 1024 output channels at every location. The network has 24 convolutional layers followed by 2 fully connected layers [2]. Reduction layers with 1x1 filters followed by 3x3 convolutional layers replace the initial inception modules. Most of the convolutional layers are pretrained on the ImageNet dataset. Using the two fully connected layers, the network performs a linear regression to create a (7, 7, 2) bounding box prediction. Finally, a prediction is made by taking the boxes with high confidence scores. The network checks every grid cell, marks the label of any cell that contains an object, and also marks its boundary boxes.

Figure 3: Convolution neural network [2]

Figure 4 shows that the TTS module contains two parts: an image processing module (OCR) and a voice processing module (TTS). The image processing module uses OCR to convert .jpg to .txt form; the voice processing module then converts the .txt to speech, as sketched below.
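A minimal sketch of the two module parts, assuming pytesseract for the OCR step (the paper does not name a specific OCR library) and gTTS for the voice step; the file names are placeholders.

from PIL import Image
import pytesseract
from gtts import gTTS

# image processing module: recognize the characters in the image (.jpg -> .txt)
text = pytesseract.image_to_string(Image.open("label.jpg"))
with open("label.txt", "w") as f:
    f.write(text)

# voice processing module: convert the recognized text to speech (.txt -> .mp3)
with open("label.txt") as f:
    gTTS(text=f.read(), lang="en").save("label.mp3")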


Figure 4: Block diagram of TTS [5]

6. The final output shows the detected objects with their label and confidence score, along with voice.

3.3 Confidence Score

To evaluate the performance of object detection we normally focus on the accuracy (mean average precision, mAP) and on the average processing rate over consecutive images (frames per second, FPS). In terms of accuracy there are many different approaches to evaluating a model or an algorithm for object detection, but mAP is the primary one; in terms of speed, FPS is the standard one. Before analyzing them, we first have to understand some basic concepts such as the confidence score, IoU, precision, and recall for accuracy, and FPS for the speed of performance.

Basic concepts behind mAP for evaluating an object detector's accuracy:

Confidence score: reflects the probability that an anchor box contains an object; it is usually predicted by a classifier. Each grid cell predicts its boundary boxes with a confidence score that shows how accurate the predicted object is. We define the confidence score as

C = Pr(Object) * IoU(truth, pred)

If there is no object in the grid cell, the confidence score should be 0; if there is an object, the confidence score should equal the IoU between the ground-truth bounding box and the predicted bounding box.

Ground-truth bounding box (B_t): represents the desired output of an algorithm on an input, for example the hand-labeled bounding box from the testing set that specifies where the objects are in the image.

Predicted bounding box (B_p): represents the rectangular region generated by the model detector that indicates the predicted location of the object.

Intersection over Union (IoU): an evaluation metric that measures the area encompassed by both the ground-truth bounding box (B_t) and the predicted bounding box (B_p). Figure 5 shows the IoU equation.

Figure 5: Understanding the IoU equation [13]

Figure 6: Example of a ground-truth box and a predicted box [13]

Figure 6 shows an example of detecting a person in an image; based on this image we can get a basic understanding of IoU, as computed in the sketch below.
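A small sketch of the IoU computation just defined, assuming boxes are given as (x1, y1, x2, y2) corner coordinates.

def iou(box_a, box_b):
    # corners of the intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)   # intersection / union

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))          # 25 / 175, about 0.14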
Threshold: we predefine an IoU threshold (for instance, 0.5) for classifying whether a prediction is a true positive or a false positive.

True Positive (TP): a test result that detects the condition when the condition is present.
True Negative (TN): a test result that does not detect the condition when the condition is absent.
False Positive (FP): a test result that detects the condition when the condition is absent.
False Negative (FN): a test result that does not detect the condition when the condition is present.

4. IMPLEMENTATION

The main aim of the proposed system is that even the smallest objects are detected and the detected object labels are converted from text to speech. The whole implementation is done in the Python programming language.


1. Input data: an image or the webcam. The camera starts capturing frames at a rate of 45 frames per second and feeds them to the algorithm.

Figure 7: Input data (a) single object (b) multiple objects

2. Resize the image according to the network architecture and apply the YOLO algorithm for object detection, using the OpenCV Python library (the box-and-label overlay is sketched after the figures below).

Figure 8: (a) Objects are detected (b) Console result of the detected objects

3. Once an object is detected in the image, apply text-to-speech conversion:
- If one object is detected, its label is directly converted from text to speech.

Figure 9: Detected object is converted to text-to-speech (TTS)

- If there are multiple objects in an image: select a particular object.

Figure 10: Select the particular object to be converted to TTS

- The object is cropped and assigned its label.

Figure 11: Particular object detected with its label

- The label is converted from text to speech.

Figure 12: Particular object converted to text-to-speech (TTS)

In real-time object detection:

Figure 13: Real-time capture; the object "cell phone" is detected with its label and accuracy, along with voice
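For reference, a minimal sketch of the overlay drawn in the figures above, assuming detections holds (label, score, box) tuples with (x, y, w, h) boxes; the helper name overlay is ours.

import cv2

def overlay(frame, detections):
    for label, score, (x, y, w, h) in detections:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, f"{label} {score:.2f}", (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    return frame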

5. COMPARISON AND DISCUSSION

5.1 Comparison of OpenCV-YOLO and TensorFlow-YOLO

Figure 14: (a) Proposed system output using OpenCV-YOLO (b) Graph of the detected objects


Figure 15: (a) Proposed system output using TensorFlow-YOLO (b) Graph of the detected objects

Figure 14 shows 13 objects detected with labels and accuracy in the image, while Figure 15 shows only 9 objects detected in the same image.

5.2 Loss function in YOLO

The loss function measures the correctness of the center and the boundary box of each prediction. YOLO predicts multiple bounding boxes per grid cell. To compute the loss for the true positive we only want one of them to be responsible for the object, so we select the one with the highest IoU with the ground truth. The loss function is defined as:

Loss = Classification Loss + Localization Loss + Confidence Loss

Classification Loss: If an object is detected in the image, the classification loss at each cell is the squared error of the class conditional probabilities for each class [14]:

\sum_{i=0}^{S^2} 1_i^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2

where 1_i^{obj} = 1 if an object appears in cell i, otherwise 0, and \hat{p}_i(c) denotes the conditional class probability for class c in cell i.

Localization Loss: It measures the errors in the locations and sizes of the predicted boundary boxes. We only count the box responsible for detecting the object [14]:

\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} [ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 ] + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} [ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 ]

where 1_{ij}^{obj} = 1 if the j-th boundary box in cell i is responsible for detecting the object, otherwise 0, and \lambda_{coord} increases the weight for the loss in the boundary box coordinates.

Confidence Loss: If an object is detected in the box, the confidence loss (measuring the objectness of the box) is [14]:

\sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} (C_i - \hat{C}_i)^2

where \hat{C}_i is the box confidence score of box j in cell i and 1_{ij}^{obj} = 1 if the j-th boundary box in cell i is responsible for detecting the object, otherwise 0.

If an object is not detected in the box, the confidence loss is [14]:

\lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{noobj} (C_i - \hat{C}_i)^2

where 1_{ij}^{noobj} is the complement of 1_{ij}^{obj}, \hat{C}_i is the box confidence score of box j in cell i, and \lambda_{noobj} weights down the loss when detecting background.
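To make the three terms concrete, the following NumPy sketch evaluates the loss for one image under stated assumptions: boxes and targets are arrays shaped (S, S, B, 5) holding (x, y, w, h, confidence) with non-negative sizes, class probabilities are shaped (S, S, C), and obj is the (S, S, B) responsibility indicator; the helper name yolo_loss is ours.

import numpy as np

def yolo_loss(pred_box, true_box, pred_cls, true_cls, obj,
              lam_coord=5.0, lam_noobj=0.5):
    noobj = 1.0 - obj
    # localization loss: center error plus square-rooted size error
    xy = np.sum(obj[..., None] * (pred_box[..., :2] - true_box[..., :2]) ** 2)
    wh = np.sum(obj[..., None] * (np.sqrt(pred_box[..., 2:4])
                                  - np.sqrt(true_box[..., 2:4])) ** 2)
    loc = lam_coord * (xy + wh)
    # confidence loss: background boxes are weighted down by lam_noobj
    conf_err = (pred_box[..., 4] - true_box[..., 4]) ** 2
    conf = np.sum(obj * conf_err) + lam_noobj * np.sum(noobj * conf_err)
    # classification loss: only cells that contain an object contribute
    cell_obj = obj.max(axis=-1)                    # (S, S) cell-level indicator
    cls = np.sum(cell_obj[..., None] * (pred_cls - true_cls) ** 2)
    return cls + loc + conf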


6. CONCLUSION

In this paper we discussed the YOLO algorithm for detecting objects using convolutional network layers. We conclude that the accuracy of object detection using YOLO is high compared to other approaches. Object detection with the YOLO library takes less time and is highly accurate, and the conversion of labels with text-to-speech (TTS) is fast. Section 2 shows that the YOLO algorithm is the best fit for object detection: YOLO takes in the entire image in a single instance and predicts the bounding box coordinates and class probabilities of the boxes well. The comparison of OpenCV-YOLO and TensorFlow-YOLO shows that the proposed system is better than the existing system.

ACKNOWLEDGEMENT

We would like to thank the anonymous reviewers for their valuable and insightful comments. We believe their comments significantly improved the quality of this manuscript.

The research activities described in this paper were funded by LDRP Institute of Technology and Research, Gandhinagar, and Carloman Systems, Ahmedabad, Gujarat, India.

REFERENCES
1. https://www.researchgate.net/publication/337464355_OBJECT_DETECTION_AND_IDENTIFICATION_A_Project_Report
2. Geethapriya. S, N. Duraimurugan, S.P. Chokkalingam, Real-Time Object Detection with Yolo, International Journal of Engineering and Advanced Technology (IJEAT), Volume-8, Issue-3S, February 2019.
3. Joseph Redmon, Santosh Divvala, Ross Girshick, You Only Look Once: Unified, Real-Time Object Detection, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788.
4. R. Sujeetha, Vaibhav Mishra, Object Detection and Tracking using Tensor Flow, International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-8, Issue-1, May 2019.
5. Chaw Su Thu Thu, Theingi Zin, Implementation of Text to Speech Conversion, International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, Vol. 3, Issue 3, March 2014.
6. S. Venkateswarlu, D. B. K. Kamesh, J. K. R. Sastry and Radhika Rani, Text to Speech Conversion, Indian Journal of Science and Technology, Vol 9(38), DOI: 10.17485/ijst/2016/v9i38/102967, October 2016.
7. Moonsik Kang, Object Detection System for the Blind with Voice Command and Guidance, IEIE Transactions on Smart Processing and Computing, vol. 8, no. 5, October 2019.
8. Juan Du, Understanding of Object Detection Based on CNN Family and YOLO, New Research and Development Center of Hisense, Qingdao 266071, China.
9. Fushikida, Katsunobu; Mitome, Yukio; Inoue, Yuji, A Text to Speech Synthesizer for the Personal Computer, IEEE Transactions on Consumer Electronics, vol. CE-28, no. 3, pp. 250-256, Aug. 1982.
10. https://towardsdatascience.com/getting-started-with-coco-dataset-82def99fa0b8
11. https://medium.com/zylapp/review-of-deep-learning-algorithms-for-object-detection-c1f3d437b852
12. https://www.ijeat.org/wp-content/uploads/papers/v8i3S/C11240283S19.pdf
13. https://manalelaidouni.github.io/Evaluating-Object-Detection-Models-Guide-to-Performance-Metrics.html
14. https://jonathan-hui.medium.com/real-time-object-detection-with-yolo-yolov2-28b1b93e2088
