Object Detection and Identification
Prinsi Patel et al., International Journal of Advanced Trends in Computer Science and Engineering, 10(3), May - June 2021, 1611 – 1618
Model: The model here is the You Only Look Once (YOLO) algorithm, which runs through a variation of a complex Convolutional Neural Network architecture called Darknet. Even though we are using the more enhanced and complex YOLO v3 model, we will explain the original YOLO algorithm. The Python cv2 package also has a method to set up Darknet from our configuration in the yolov3.cfg file.
API: The class prediction of the objects detected in every frame will be a string, e.g. “cat”. We will also obtain the coordinates of the objects in the image and append the position “top”/“mid”/“bottom” & “left”/“center”/“right” to the class prediction “cat”. We can then send the text description to the Google Text-to-Speech API using the gTTS package.

Output: We will also obtain the coordinates of the bounding box of every object detected in our frames, and overlay the boxes on the detected objects with a label and voice.

Figure 3 shows the convolutional neural network, which scales the spatial dimension back to 7x7 with 1024 output channels at every location. The convolutional neural network has 24 convolutional layers followed by 2 fully connected layers[2]. Reduction layers with 1x1 filters followed by 3x3 convolutional layers replace the initial inception modules. Most of the convolutional layers are pretrained on the ImageNet dataset. Using the two fully connected layers, it performs a linear regression to create a (7, 7, 2) bounding box prediction. Finally, a prediction is made by taking the box with the highest confidence score. The convolutional network checks every grid cell separately, marks the label of any object in it, and also marks its boundary boxes.

Figure 2 shows the steps followed by the proposed system:
1. First of all, we use our webcam or an image to capture the input data.
2. The input image is resized according to the network architecture.
3. Apply the YOLO algorithm to detect multiple objects using the trained COCO dataset.
4. The output is the detected objects with a label and confidence score.
5. The detected label is taken as the input text for the text-to-speech device; the label is passed through the text-to-speech device.

Figure 4 shows that the TTS module contains two parts: first an image processing module (OCR) and second a voice processing module (TTS). The image processing module, OCR, converts .jpg to .txt form; the voice processing module then converts .txt to speech.
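The positional description mentioned in the API paragraph, appending “top”/“mid”/“bottom” and “left”/“center”/“right” to the class prediction, can be sketched by splitting the frame into thirds. The helper name describe_position is our own:

```python
def describe_position(cx, cy, frame_w, frame_h):
    """Map a box centre (cx, cy) to a coarse spoken position by
    splitting the frame into a 3x3 grid of thirds."""
    vertical = "top" if cy < frame_h / 3 else ("mid" if cy < 2 * frame_h / 3 else "bottom")
    horizontal = "left" if cx < frame_w / 3 else ("center" if cx < 2 * frame_w / 3 else "right")
    return vertical + " " + horizontal

# e.g. a detected "cat" whose box centre lies at (100, 80) in a 640x480 frame
description = "cat at " + describe_position(100, 80, 640, 480)  # "cat at top left"
```

The resulting string is what would then be handed to the text-to-speech stage.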
Figure 4: Block diagram of TTS[5]

6. The final output shows the detected objects with their label and confidence score, along with voice.

Figure 6 is an example of detecting a person in an image. Based on this image we can get a basic understanding of IoU.

Ground truth bounding box (t): represents the desired output of an algorithm on an input, for example the hand-labeled bounding box from the testing set that specifies where the objects are in the image.
Threshold: we predefine a threshold on IoU (for instance, 0.5) for classifying whether a prediction is a true positive or a false positive.
True Positive (TP): a true positive test result is one that detects the condition when the condition is present.
True Negative (TN): a true negative test result is one that does not detect the condition when the condition is absent.
False Positive (FP): a false positive test result is one that detects the condition when the condition is absent.
False Negative (FN): a false negative test result is one that does not detect the condition when the condition is present.

4. IMPLEMENTATION

The main aim of the proposed system is that even the smallest objects are detected and the detected objects are converted to speech via text. The whole implementation is done in the Python programming language.
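The IoU threshold rule above can be sketched in a few lines of Python. The (x1, y1, x2, y2) box format and the function names are our own choices:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def classify_prediction(pred, truth, threshold=0.5):
    """A prediction is a true positive when its IoU with the ground
    truth reaches the threshold, otherwise a false positive."""
    return "TP" if iou(pred, truth) >= threshold else "FP"
```

For example, a predicted box covering exactly the top half of the ground truth box has IoU 0.5 and just clears the default threshold.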
Figure 7: Input data (a) single object (b) multiple objects

2. Resize the image according to the network architecture and apply the YOLO algorithm for object detection, using the OpenCV Python library.

Figure 8: (a) objects detected (b) console result of the detected objects

3. Once an object is detected in the image, apply text-to-speech conversion. If one object is detected, its label is directly converted to speech.

Figure 9: Detected object converted to text-to-speech (TTS)

If there are multiple objects in an image:
- Select a particular object.

Figure 10: Selected particular object, which is converted to TTS
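Selecting a particular object among multiple detections and handing its label to the text-to-speech stage could look like the sketch below. The (label, confidence) detection format and all names are our assumptions; the gTTS import is kept inside the helper so the selection logic stands alone.

```python
def select_object(detections, wanted_label):
    """From a list of (label, confidence) detections, pick the most
    confident one whose label matches the user's request."""
    matches = [d for d in detections if d[0] == wanted_label]
    return max(matches, key=lambda d: d[1]) if matches else None

def speak_label(text, out_path="label.mp3"):
    """Hand the selected label to gTTS, saving the speech as an MP3.
    Assumes the gTTS package is installed; synthesis needs a network connection."""
    from gtts import gTTS
    gTTS(text=text, lang="en").save(out_path)
    return out_path

detections = [("person", 0.91), ("cell phone", 0.74), ("person", 0.62)]
picked = select_object(detections, "cell phone")  # -> ("cell phone", 0.74)
```

If no detection matches the requested label, select_object returns None and nothing is spoken.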
Figure 13: Real-time capture; the object "cell phone" is detected with its label and accuracy, along with voice
Figure 14: (a) proposed system output using OpenCV-YOLO (b) graph of the detected objects
Figure 15: (a) proposed system output using TensorFlow-YOLO (b) graph of the detected objects
Figure 14 shows 13 objects detected with labels and accuracy in the image, while Figure 15 shows only 9 objects detected in the same image.

5.2 Loss function in YOLO

The loss function measures the correctness of the center and the boundary box of each prediction. YOLO predicts multiple bounding boxes per grid cell; to compute the loss for the true positive, we want only one of them to be responsible for the object. For this purpose, we select the one with the highest IoU with the ground truth. The loss function is defined as follows.

Classification Loss: If an object is detected in the image, the classification loss at each cell is the squared error of the class conditional probabilities for each class[14]:

$$\sum_{i=0}^{S^2} \mathbb{1}_i^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2$$

where $\mathbb{1}_i^{\text{obj}} = 1$ if an object appears in cell $i$, otherwise 0, and $\hat{p}_i(c)$ is the conditional class probability for class $c$ in cell $i$.

Localization Loss: The localization loss measures the errors in the predicted boundary box locations and sizes, counting only the box responsible for the object[14]:

$$\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]$$

where $\mathbb{1}_{ij}^{\text{obj}} = 1$ if the $j$-th boundary box in cell $i$ is responsible for detecting the object, otherwise 0, and $\lambda_{\text{coord}}$ increases the weight for the loss in the boundary box coordinates.

Confidence Loss: If an object is detected in the box, the confidence loss (measuring the objectness of the box) is[14]:

$$\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2$$

where $\hat{C}_i$ is the box confidence score of box $j$ in cell $i$, and $\mathbb{1}_{ij}^{\text{obj}} = 1$ if the $j$-th boundary box in cell $i$ is responsible for detecting the object, otherwise 0.
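The squared-error terms above can be checked numerically for a single cell with one responsible box; all numbers below are invented for illustration:

```python
def classification_loss(p, p_hat):
    """Squared error of the class conditional probabilities for one
    cell that contains an object."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, p_hat))

def confidence_loss(c, c_hat):
    """Squared error of the box confidence score for the responsible box."""
    return (c - c_hat) ** 2

# One cell, three classes: true distribution vs. predicted probabilities.
loss_cls = classification_loss([0.0, 1.0, 0.0], [0.1, 0.7, 0.2])  # 0.01 + 0.09 + 0.04
loss_conf = confidence_loss(1.0, 0.8)                             # 0.2 ** 2
```

The full YOLO loss sums these terms (plus the localization term) over all cells and boxes.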
REFERENCES
1. https://www.researchgate.net/publication/337464355_OBJECT_DETECTION_AND_IDENTIFICATION_A_Project_Report
2. Geethapriya S., N. Duraimurugan, S. P. Chokkalingam, Real-Time Object Detection with Yolo, International Journal of Engineering and Advanced Technology (IJEAT), Volume-8, Issue-3S, February 2019.
3. Joseph Redmon, Santosh Divvala, Ross Girshick, You Only Look Once: Unified, Real-Time Object Detection, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788.
4. R. Sujeetha, Vaibhav Mishra, Object Detection and Tracking using TensorFlow, International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-8, Issue-1, May 2019.
5. Chaw Su Thu Thu, Theingi Zin, Implementation of Text to Speech Conversion, International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, Vol. 3, Issue 3, March 2014.
6. S. Venkateswarlu, D. B. K. Kamesh, J. K. R. Sastry, Radhika Rani, Text to Speech Conversion, Indian Journal of Science and Technology, Vol 9(38), DOI: 10.17485/ijst/2016/v9i38/102967, October 2016.
7. Moonsik Kang, Object Detection System for the Blind with Voice Command and Guidance, IEIE Transactions on Smart Processing and Computing, vol. 8, no. 5, October 2019.
8. Juan Du, Understanding of Object Detection Based on CNN Family and YOLO, New Research, and