
A Review of YOLO: Real-Time Object Detection

João António


Data Science Master Degree
Instituto Politécnico de Leiria
Leiria, Portugal
[email protected] / ORCID 0000-0001-6306-1992

Abstract—While object detection has been a widely documented branch of computer vision, such applications mainly consist of re-utilizing region proposal classification networks (R-CNNs) to perform detection on many proposed regions and, as a result, predicting multiple times for any given image. You Only Look Once (YOLO) aims to provide a new approach to object detection by having a single neural network perform bounding box proposal and class prediction directly on the input images, acting closer to what is referred to as a Fully Connected Neural Network. This review paper aims to cover the main features of the YOLO network by providing the reader with a comprehensive look into the model's architecture, performance and real-life application scenarios throughout its multiple iterations in recent years.

Index Terms—YOLO, Object Detection, Real-time, FCNN, YOLO9000

I. INTRODUCTION
With the human eye being the epitome of object detection, it is only natural for science to seek the development of strategies that aim to replicate this function and apply it to day-to-day living. Thoughts of self-driving cars that can scan the road ahead and make decisions based solely on their immediate surroundings were once far-fetched ideas that, nowadays, are an ever-evolving reality that may very well become the new norm. In 2016, amid the light-speed evolution of neural network research, the authors of the YOLO network characterized typical object detection systems as re-purposed classifiers [1] which would evaluate a test image for different objects at variable scales and locations. The deformable parts model (DPM) documented in 2011 by P. Felzenszwalb outlines an approach which uses a sliding-window technique, where a filter of a specified size is run at evenly spaced locations over the target image, essentially treating object detection as a binary classification problem [2]. Another well-documented approach to object detection is the R-CNN (Region-Based Convolutional Network), which utilizes region proposal methods on a given image and performs object detection via a two-step function as depicted in Figure 1. Firstly, potential bounding boxes are generated around likely objects, and afterwards, a classifier is used on each region of interest (ROI), ultimately predicting whether a region is an object or not [4]. While these examples demonstrate a vast understanding of computer vision and, overall, are very valid approaches to object detection in a timely fashion, the authors of the YOLO network present a straightforward, 3-step mechanism to identify objects in an image that originated the model name: they claim the model only needs to look once at an image to predict what objects are present and where they are [1].

Fig. 1. Example of a typical R-CNN architecture. [3]

The main premise behind the YOLO networks is the use of a single convolutional neural network capable of working on full images and predicting bounding boxes as well as class probabilities for those boxes. This unified approach to object detection brings several benefits, as described by the authors in the first paper, but also introduces a series of limitations involving spatial constraints, where smaller objects that appear in clusters, such as bird flocks, are often missed entirely. Other limitations presented in the original paper include sensitivity to image aspect ratios and incorrect localizations [1]. Months later, in December of 2016, two of the four original authors presented a new iteration of the YOLO network, aptly named YOLO9000 due to its ability to detect over 9000 object categories. This new publication aims not only to tackle the main difficulties encountered with the first model, but also to massively increase performance and accuracy when compared to existing models at the time [5]. The ambitious YOLO9000 was a huge step toward making object detection comparable in scale to object classification, as object detection datasets were typically limited to fewer than a few hundred possible tags, such as Microsoft's COCO challenge [7] or the PASCAL Visual Object Classes (VOC) challenge [6], while classification datasets commonly reached upwards of 100,000 possible categories spanning millions upon millions of entries. One notable example of the sheer scale of classification datasets is the largest multimedia collection currently available, YFCC100M, containing around 99 million images and 1 million videos [8]. This bibliographic review article aims to introduce the reader to both iterations of the YOLO network, focusing on each version's features, limitations and performance when put to the test against common challenge datasets and other networks designed for object detection.

II. YOLOv1
The first iteration of the YOLO network, referred to in this paper as YOLOv1, is heavily focused on a "unified" way of performing detection on real-time images, using the entire image frame, rather than smaller-sized filters [9], as a means to obtain contextual information about the objects in any given image. It works by dividing the input frame into a grid of size S × S, making each resulting grid cell responsible for identifying whatever object falls within it. Each cell is also responsible for returning 6 prediction values [1], namely x, y, w, h, confidence, and C, respectively associated with the position (x, y) of the bounding box in relation to the bounds of the grid cell, the dimensions (w, h) when put into perspective with the whole frame, the confidence value (Intersection over Union) between the predicted box and the ground-truth box, and finally, the conditional class probability vector (C), Pr(Class_i | Object). The formula used to obtain class-specific confidence scores for each bounding box is given by:

Pr(Class_i | Object) ∗ Pr(Object) ∗ IOU_pred^truth = Pr(Class_i) ∗ IOU_pred^truth    (1)

Fig. 2. YOLOv1 model functionality. The original image is split into an S × S grid of individually capable cells which predict B bounding boxes and C probabilities. [1]
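As a minimal illustration of equation (1), the sketch below combines a cell's conditional class probabilities with a box confidence to produce class-specific confidence scores. It is plain Python; the iou helper, box coordinates and probability values are illustrative assumptions rather than values from the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Pr(Class_i | Object) for one grid cell (3 classes, made-up values).
cond_class_probs = [0.7, 0.2, 0.1]

# Box confidence = Pr(Object) * IOU(predicted box, ground-truth box).
box_confidence = 0.9 * iou((10, 10, 50, 50), (12, 8, 48, 52))

# Equation (1): class-specific score = Pr(Class_i) * IOU_pred^truth.
class_scores = [p * box_confidence for p in cond_class_probs]
```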

A representation of the YOLO workflow can be consulted in figure 2, where the steps mentioned earlier can be easily identified. The input image, in this scenario, was divided into a grid of size 7 × 7 for ease of interpretation. Each of the 49 resulting cells is fully capable of predicting bounding boxes for all classes that the network is trained to identify, and does so with an associated confidence level, which in turn enables class probability mapping based on which bounding boxes are the most repeated and carry the most confidence. The end result is a reduced number of bounding boxes, this time labelled with a specific class, encoded as an S × S × (B ∗ 5 + C) tensor. While the method for selecting values of S and B is mostly arbitrary, C is directly associated with the number of labelled classes the target dataset contains. For instance, a dataset with 100 labelled classes where B = 2 and S = 7 will output a 7 × 7 × 110 tensor, as sketched below.
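To make the encoding concrete, the short sketch below (plain Python with NumPy; the random tensor merely stands in for a real network output) builds the S × S × (B ∗ 5 + C) tensor for this example and slices one cell's predictions apart:

```python
import numpy as np

S, B, C = 7, 2, 100                    # grid size, boxes per cell, class count
depth = B * 5 + C                      # 5 values (x, y, w, h, confidence) per box
output = np.random.rand(S, S, depth)   # placeholder for the network's output

assert output.shape == (7, 7, 110)     # the 7 x 7 x 110 tensor from the text

cell = output[3, 4]                    # all predictions of one grid cell
boxes = cell[:B * 5].reshape(B, 5)     # B rows of (x, y, w, h, confidence)
class_probs = cell[B * 5:]             # the C conditional class probabilities
```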
A. Design
The original YOLO network was implemented as a convolutional neural network (CNN) and evaluated on the PASCAL VOC detection dataset [10]. Large inspiration was taken from another popular image classification model, GoogLeNet, which introduces a concept referred to as Inception Modules, essentially allowing the network to choose between multiple convolutional filters in each block and thus improving adaptability to the dataset it is being tested on [11]. The authors of the YOLO network opted instead for running data through 1 × 1 reduction layers as a means to reduce dimensionality before the more expensive 3 × 3 convolutions, an approach documented by M. Lin, Q. Chen and S. Yan in the 2013 paper "Network in Network" [12]. A graphic depiction of the full network architecture may be consulted in figure 3.

Fig. 3. The YOLOv1 detection network containing 24 convolutional layers and 2 fully connected layers. Note that some convolutional layers contain 1 × 1 filters used to reduce feature dimensionality. [1]
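A rough back-of-the-envelope comparison shows why such reduction layers pay off. The channel counts below are illustrative assumptions, not the exact YOLOv1 layer sizes; costs are counted as multiply-accumulate operations per spatial position.

```python
# 3 x 3 convolution applied directly to a 512-channel feature map,
# producing 512 output channels.
direct = 3 * 3 * 512 * 512                            # ~2.36M MACs

# Same 512-channel result, but squeezing to 256 channels with a
# 1 x 1 reduction layer before the 3 x 3 convolution.
reduced = (1 * 1 * 512 * 256) + (3 * 3 * 256 * 512)   # ~1.31M MACs

print(direct / reduced)                               # ~1.8x cheaper
```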
B. Training Process

C. Limitations

D. Performance

III. YOLO9000

A. Better, Faster, Stronger

B. Design

C. Wordtrees and Combination

D. Training

E. Performance

IV. CONCLUSION

REFERENCES

[1] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection", University of Washington, Allen Institute for AI, May 2016
[2] P. F. Felzenszwalb, "Object Detection with Deformable Part Models (DPM)", School of Engineering and Department of Computer Science, Brown University, Rhode Island, December 2011
[3] Investigations of Object Detection in Images/Videos Using Various Deep Learning Techniques and Embedded Platforms—A Comprehensive Review, Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/RCNN-architecture-17fig4341099304 [accessed 17 Apr. 2022]
[4] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation", 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580-587, doi: 10.1109/CVPR.2014.81.
[5] J. Redmon, A. Farhadi, "YOLO9000: Better, Faster, Stronger", University of Washington, Allen Institute for AI, December 2016
[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes (VOC) Challenge", International Journal of Computer Vision, 88(2):303–338, 2010
[7] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context", in European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[8] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, "YFCC100M: The New Data in Multimedia Research", Communications of the ACM, 59(2):64–73, 2016
[9] I. Goodfellow, Y. Bengio, A. Courville, "Deep Learning (Adaptive Computation and Machine Learning Series)", 2015, pp. 330-348, ISBN-13: 978-0262035613
[10] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge: A Retrospective", International Journal of Computer Vision, 111(1):98–136, Jan. 2015.
[11] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, "Going Deeper with Convolutions", arXiv:1409.4842v1, Sep. 2014.
[12] M. Lin, Q. Chen, and S. Yan, "Network in Network", CoRR, abs/1312.4400, 2013
