
YOLO:

You Only Look Once


Unified Real-Time Object Detection
PROBLEM STATEMENT
Humans glance at an image and instantly know what objects are in it, but this is difficult for computers.
We perform object detection using YOLO, which improves both accuracy and speed: it uses a single neural network to predict bounding boxes and class probabilities directly from the full image in one evaluation.
Human Vision vs. Computer Vision

(Figure: "What we see" vs. "What a computer sees".)
WHAT IS OBJECT DETECTION?
Object detection is the task of locating objects in an image and classifying them: predicting a bounding box and a class label for each object present.
LITERATURE SURVEY

Paper and author: P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
Outcomes: Deformable parts models (DPM) use a sliding-window approach to object detection. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high-scoring regions, etc. Our system replaces all of these disparate parts with a single convolutional neural network.

Paper and author: J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
Outcomes: R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections.
LITERATURE SURVEY

Paper and author: D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2155–2162. IEEE, 2014.
Outcomes: Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest instead of using Selective Search. MultiBox can also perform single-object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification.

Paper and author: P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
Outcomes: Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection. OverFeat efficiently performs sliding-window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance.

Paper and author: J. Redmon and A. Angelova. Real-time grasp detection using convolutional neural networks. CoRR, abs/1412.3128, 2014.
Outcomes: YOLO is similar in design to work on grasp detection by Redmon et al. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection.
OBJECTIVES
• 15 million people in India are blind; a smartphone with this technology can help them navigate the world.

• Underwater robots for navigation.

• Drone photography to track and find objects on the battlefield.
Detection Procedure
We split the image into an S×S grid (e.g., a 7×7 grid).
Each cell predicts B boxes (x, y, w, h) and a confidence for each box: P(Object).
Each cell also predicts class probabilities for the object.
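Concretely, the network output is one fixed-size tensor per image. A minimal sketch of this encoding, assuming the paper's PASCAL VOC setting of S = 7, B = 2, and C = 20 (so each cell emits B·5 + C = 30 numbers); the indexing is illustrative, not the exact layout of any particular implementation:

import numpy as np

S, B, C = 7, 2, 20           # grid size, boxes per cell, classes (PASCAL VOC)
DEPTH = B * 5 + C            # (x, y, w, h, confidence) per box + class probs

# Stand-in for the network output on one image: 7 x 7 x 30 for YOLOv1.
pred = np.random.rand(S, S, DEPTH)

row, col = 3, 4                                 # inspect one grid cell
boxes = pred[row, col, :B * 5].reshape(B, 5)    # rows of (x, y, w, h, P(Object))
class_probs = pred[row, col, B * 5:]            # P(Class_i | Object), length C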

These class probabilities are conditioned on an object being present: P(Class | Object), e.g. P(Car | Object).
(Figure: class probability map over Bicycle, Car, Dog, and Dining Table.)
For example, the cell over the dog predicts:
Dog = 0.8
Cat = 0
Bike = 0
Then we combine the box and class predictions:
P(Class | Object) × P(Object) = P(Class)
Finally we threshold the detections and apply non-max suppression (NMS), as in the sketch below.
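A minimal sketch of this final scoring step in Python (the helper names and the 0.2/0.4 thresholds are illustrative assumptions, not values from the paper; boxes are in corner format (x1, y1, x2, y2)):

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def detect(boxes, p_object, class_probs, score_thresh=0.2, iou_thresh=0.4):
    """boxes: list of (x1, y1, x2, y2); p_object and class_probs aligned per box.
    Class-specific score = P(Class | Object) * P(Object) = P(Class)."""
    scored = []
    for box, conf, probs in zip(boxes, p_object, class_probs):
        cls = max(range(len(probs)), key=lambda i: probs[i])
        score = probs[cls] * conf                # P(Class) for the best class
        if score >= score_thresh:                # threshold detections
            scored.append((score, box, cls))
    # Non-max suppression: keep the strongest box, drop same-class overlaps.
    scored.sort(key=lambda t: t[0], reverse=True)
    kept = []
    for score, box, cls in scored:
        if all(c != cls or iou(box, k) < iou_thresh for _, k, c in kept):
            kept.append((score, box, cls))
    return kept

NMS keeps the highest-scoring box for each object and suppresses any same-class box that overlaps it too much, which removes duplicate detections of a single object.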
During training, we match each example to the right cell.
Adjust that cell's class prediction:
Dog = 1
Cat = 0
Bike = 0
...
Look at that cell's predicted boxes; find the best one, adjust it, and increase its confidence.
Decrease the confidence of the other box.
Some cells don't have any ground-truth detections! Decrease the confidence of these cells' boxes as well. A sketch of the matching rule follows.
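A hedged sketch of this matching rule (illustrative helper names; the full YOLO loss also weights the coordinate and no-object terms, which is omitted here):

def iou_xywh(a, b):
    """IoU of two boxes given as (center x, center y, width, height)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2]/2, a[1] - a[3]/2, a[0] + a[2]/2, a[1] + a[3]/2
    bx1, by1, bx2, by2 = b[0] - b[2]/2, b[1] - b[3]/2, b[0] + b[2]/2, b[1] + b[3]/2
    inter = (max(0.0, min(ax2, bx2) - max(ax1, bx1))
             * max(0.0, min(ay2, by2) - max(ay1, by1)))
    return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-9)

def assign_targets(gt_box, gt_class, cell_boxes, S=7):
    """Match one ground-truth box (x, y, w, h, normalized to [0, 1)) to the
    grid cell containing its center and the best-matching predicted box there.
    cell_boxes: S x S grid of lists of predicted (x, y, w, h) boxes."""
    x, y, w, h = gt_box
    row, col = min(int(y * S), S - 1), min(int(x * S), S - 1)
    best = max(range(len(cell_boxes[row][col])),
               key=lambda b: iou_xywh(cell_boxes[row][col][b], gt_box))
    # Targets: the responsible box is pulled toward the ground truth with
    # confidence 1; the cell's other boxes have their confidence pushed to 0,
    # as do all boxes in cells with no ground-truth object.
    return row, col, best, gt_class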
YOLO generalizes well to new domains (like art). It outperforms methods like DPM and R-CNN when generalizing to person detection in artwork.

S. Ginosar, D. Haas, T. Brown, and J. Malik. Detecting people in cubist art. In Computer Vision – ECCV 2014 Workshops, pages 101–116. Springer, 2014.

H. Cai, Q. Wu, T. Corradi, and P. Hall. The cross-depiction problem: Computer vision algorithms for recognising objects in artwork and in photographs. 2015.
Strengths and Weaknesses
● Strengths:
○ Fast: 45 FPS (155 FPS for the smaller version)
○ End-to-end training
○ Low background error

● Weaknesses:
○ Performance is lower than the state of the art
○ Makes more localization errors
APPLICATIONS
• Event detection
• Industrial automation
• Medical image processing
• Self-driving vehicles
• Military applications
CONCLUSION
YOLO is a unified model for object detection. It is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance, and the entire model is trained jointly.
REFERENCES

[1] M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regression. In Computer Vision – ECCV 2008, pages 2–15. Springer, 2008.
[2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D human pose annotations. In International Conference on Computer Vision (ICCV), 2009.
[3] H. Cai, Q. Wu, T. Corradi, and P. Hall. The cross-depiction problem: Computer vision algorithms for recognising objects in artwork and in photographs. 2015.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
[5] T. Dean, M. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, J. Yagnik, et al. Fast, accurate detection of 100,000 object classes on a single machine. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1814–1821. IEEE, 2013.
[6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
