Deep Learning YOLOv2
Abstract—While object detection has been a widely documented branch of computer vision, such applications mainly consist of re-utilizing region proposal classification networks (R-CNNs) to perform detection on many proposed regions and, as a result, predicting multiple times for any given image. You Only Look Once (YOLO) aims to provide a new approach to object detection by having a single neural network perform bounding box proposal and class prediction directly on the input images, acting closer to what is referred to as a Fully Connected Neural Network. This review paper aims to cover the main features of the YOLO network by providing the reader with a comprehensive look into the model's architecture, performance and real-life application scenarios throughout its multiple iterations in recent years.
Index Terms—YOLO, Object Detection, Real-time, FCNN, YOLO9000
I. INTRODUCTION
With the human eye being the epitome of object detection, it is only natural for science to seek the development of strategies that aim to replicate this function and apply it to day-to-day living. Thoughts of self-driving cars that can scan the road ahead and make decisions based solely on their immediate surroundings were once far-fetched ideas that, nowadays, are an ever evolving reality that may very well become the new norm. In 2016, amid the light-speed evolution of neural network research, the authors of the YOLO network characterized typical object detection systems as re-purposed classifiers [1] which would evaluate a test image for different objects at variable scales and locations. The deformable parts model (DPM), documented in 2011 by P. Felzenszwalb, outlines an approach which uses a sliding window technique, where a filter of a specified size is run at evenly spaced locations over the target image, essentially treating object detection as a binary classification problem [2]. Another well documented approach to object detection is the R-CNN (Region-Based Convolutional Network), which utilizes region proposal methods on a given image and performs object detection via a two-step process, as depicted in Figure 1. Firstly, potential bounding boxes are generated around likely objects, and afterwards, a classifier is run on each region of interest (ROI), ultimately predicting whether a region is an object or not [4]. While these examples demonstrate a vast understanding of computer vision and, overall, are very valid approaches to object detection in a timely fashion, the authors of the YOLO network present a straightforward, 3-step mechanism to identify objects in an image that originated the model name. They claim the model only needs to look once at an image to predict what objects are present and what they are [1].

Fig. 1. Example of a typical R-CNN Architecture. [3]

The main premise behind the YOLO networks is the use of a single convolutional neural network capable enough to work on full images and predict bounding boxes as well as class probabilities for those boxes. This unified approach to object detection brings several benefits, as described by the authors in the first paper, but also introduces a series of limitations involving spatial constraints, where smaller objects that appear in clusters, such as bird flocks, are often missed entirely. Other limitations presented in the original paper include sensitivity to image aspect ratios and incorrect localizations [1]. Months later, in December of 2016, two of the four original authors presented a new iteration of the YOLO network, aptly named YOLO9000 due to its ability to detect over 9000 object categories. This new publication aims not only to tackle the main difficulties encountered with the first model, but also to massively increase performance and accuracy when compared to existing models at the time [5]. The ambitious YOLO9000 was a huge step toward making object detection comparable in scale to object classification, as object detection datasets were typically limited to less than a few hundred possible tags, such as Microsoft's COCO challenge [7] or the Pascal Visual Object Classes (VOC) challenge [6], while classification datasets commonly reached upwards of 100,000 possible categories spanning millions upon millions of entries. One notable example of the sheer scale of classification datasets is the largest multimedia collection currently available, YFCC100M, containing around 99 million images and 1 million videos [8]. This bibliographic review article aims to introduce the reader to both iterations of the YOLO network, focusing on each version's features, limitations and performance when put to the test against common challenge datasets and other networks designed for object detection.
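As a concrete illustration of the two-step R-CNN pipeline described above, the following Python sketch separates the proposal and classification stages; propose_regions and classify_region are hypothetical placeholder stubs, not functions from any actual R-CNN implementation:

# Sketch of the two-step detection pipeline depicted in Figure 1.
# Both helpers are hypothetical stubs: a real system would use a
# region proposal method (e.g. selective search) and a CNN classifier.

def propose_regions(image):
    # Placeholder: would return candidate boxes (x, y, w, h)
    # drawn around likely objects in the image.
    return []

def classify_region(image, box):
    # Placeholder: would crop the region of interest (ROI) and run
    # a classifier on it, returning a label and a confidence score.
    return "background", 0.0

def two_step_detect(image):
    detections = []
    for box in propose_regions(image):              # step 1: propose ROIs
        label, score = classify_region(image, box)  # step 2: classify each ROI
        if label != "background":
            detections.append((box, label, score))
    return detections

Note that the classifier runs once per proposed region, which is precisely the repeated per-image prediction cost that YOLO's single-pass design avoids.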
II. YOLOV1
The first iteration of the YOLO network, referred to in this paper as YOLOv1, is heavily focused on a "unified" way of performing detection on real-time images, using the entire image frame, rather than smaller-sized filters [9], as a means to obtain contextual information about the objects in any given image. It works by dividing the input frame into a grid of size S × S, making each resulting grid cell responsible for identifying whatever object falls within it. Each cell is also responsible for returning 6 prediction values [1], namely x, y, w, h, confidence, and C, respectively associated with the position (x, y) of the bounding box in relation to the bounds of the grid cell, the dimensions (w, h) relative to the whole frame, the confidence value (Intersection over Union) between the predicted box and the ground truth box, and finally, the conditional class probability vector (C), Pr(Class_i|Object). The formula used to obtain class-specific confidence scores for each bounding box is given by:

$$\Pr(\mathrm{Class}_i \mid \mathrm{Object}) \ast \Pr(\mathrm{Object}) \ast \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}} = \Pr(\mathrm{Class}_i) \ast \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}} \tag{1}$$

Fig. 2. YOLOv1 Model Functionality. The original image is split into a S × S grid of individually capable cells which predict B bounding boxes and C probabilities. [1]

The architecture employs 1 × 1 reduction layers as a means to reduce dimensionality before the more expensive 3 × 3 convolutions, an approach documented by M. Lin, Q. Chen and S. Yan in the 2013 paper 'Network in Network' [12]. A graphic depiction of the full network architecture may be consulted in Figure 3.
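To illustrate the reduction pattern just described, here is a minimal PyTorch sketch of a 1 × 1 reduction layer followed by a 3 × 3 convolution; the channel counts and the LeakyReLU activations are illustrative assumptions rather than the exact YOLOv1 layer configuration:

import torch.nn as nn

# The 1x1 convolution first shrinks the channel dimension, so the
# more expensive 3x3 convolution operates on fewer input channels,
# following the 'Network in Network' reduction idea [12].
reduction_block = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=1),             # 1x1 reduction layer
    nn.LeakyReLU(0.1),
    nn.Conv2d(256, 512, kernel_size=3, padding=1),  # 3x3 convolution
    nn.LeakyReLU(0.1),
)

Since the cost of the 3 × 3 convolution scales with its number of input channels, halving the channels with the 1 × 1 layer roughly halves the cost of the spatial filtering that follows.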
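Similarly, the per-cell prediction scheme and Equation 1 can be made concrete with a short NumPy sketch that decodes one grid cell's raw output into bounding boxes and class-specific confidence scores. The flat vector layout and all names below are our own assumptions for illustration, following the description above and Figure 2:

import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (illustrative values)

def decode_cell(cell_output, row, col):
    # cell_output is assumed to hold B blocks of [x, y, w, h, confidence]
    # followed by the C conditional class probabilities Pr(Class_i|Object).
    class_probs = cell_output[B * 5:]
    boxes = []
    for b in range(B):
        x, y, w, h, conf = cell_output[b * 5:(b + 1) * 5]
        # (x, y) are offsets within the grid cell; map them to image
        # coordinates, while (w, h) are already relative to the frame.
        cx, cy = (col + x) / S, (row + y) / S
        # Equation 1: Pr(Class_i|Object) * Pr(Object) * IOU = Pr(Class_i) * IOU,
        # where conf stands in for Pr(Object) * IOU.
        class_scores = class_probs * conf
        boxes.append(((cx, cy, w, h), class_scores))
    return boxes

# Example: decode random predictions for the cell at row 3, column 4.
cell = np.random.rand(B * 5 + C)
for box, scores in decode_cell(cell, row=3, col=4):
    print(box, scores.argmax(), scores.max())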