Real Time Object Detection Using YOLO
Abstract—With the availability of enormous amounts of data and the need to computerize visual-based systems, research on object detection has been a focus for the past decade. This need has been accelerated by increasing computational power and Convolutional Neural Network (CNN) advancements since 2012. Among the various CNN architectures available, the You Only Look Once (YOLO) network is popular for many reasons, mainly its speed of identification, which makes it applicable to real-time object identification. Following a general introduction to the background and to CNNs, this paper reviews the innovative yet comparatively simple approach YOLO takes to object detection.

Keywords—YOLO, CNN, object detection, image classification

I. INTRODUCTION

Although the human eye is capable of instantly and precisely identifying a given visual, including its content, location, and the visuals close by interacting with it, human-made, computer vision-enabled systems are relatively low in accuracy and speed. Any advancements leading to improvements in efficiency and performance in this field could pave paths to creating more intelligent systems, much like humans. These advancements, in turn, would ease human life through systems such as assistive technologies that allow humans to complete tasks with little to no conscious thought. For instance, a car equipped with a computer vision-enabled assistive technology could predict a crash and notify the driver prior to the incident, even if the driver is not conscious of their actions. Therefore, real-time object detection has become a highly required subject in continuing the automation or replacement of human tasks. Computer vision and object detection are prominent fields under machine learning and are eventually expected to aid in unlocking the potential of general-purpose, responsive robotic systems.

With current technological advancements, making data open and attainable to and from everyone connected has become an easy task. Most human lives revolve around mainstream personal computers (PCs), and smartphones have made this process even more accessible. Along with this process, the volume of information and images added to the internet/cloud has grown to millions per day. Using computerized systems to utilize this information and perform the necessary recognition and processing is vital because it is impractical for humans to perform the same iterative tasks. The initial step of most such processes may include recognizing a specific object or area in an image. Due to the unpredictability of the presence, location, size, or shape of an item in each image, the recognition process is extremely hard to perform through a traditionally programmed computer algorithm. Factors such as the complexity of the background and light intensity also contribute to this.

Different strategies have been proposed to solve the problem of object identification throughout the years. These techniques approach the solution through multiple stages, namely recognition, classification, localization, and object detection. Along with technological progression over the years, these techniques have faced challenges such as output accuracy, resource cost, processing speed and complexity. With the invention of the first Convolutional Neural Network (CNN) algorithm in the 1990s by Yann LeCun et al. [1], inspired by the Neocognitron, and significant inventions like AlexNet [2], which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 (hereafter referred to as ImageNet), CNN algorithms have been capable of providing solutions for the object detection problem in various approaches. With the purpose of improving accuracy and speed of recognition, optimization-focused algorithms such as VGGNet [3], GoogLeNet [4] and Deep Residual Learning (ResNet) [5] have been invented over the years.

Although these algorithms improved over time, window selection, or identifying multiple objects in a single image, was still an issue. To address this issue, algorithms with region proposals, crop/warp features, SVM classification and bounding box regression, such as Regions with CNN (R-CNN), were introduced. Although R-CNN was comparatively high in accuracy relative to the previous inventions, its high usage of space and time later led to the invention of the Spatial Pyramid Pooling Network (SPPNet) [6]. Despite SPPNet's speed, it shared similar drawbacks with R-CNN, and Fast R-CNN was introduced to reduce them. Although Fast R-CNN could reach near real-time speeds using very deep networks, region proposal remained a computational bottleneck. Later, Faster R-CNN, which generates its proposals with a region proposal network, was introduced. Because Faster R-CNN still fell short of real-time performance, YOLO was introduced. This paper reviews the dominant real-time object detection algorithm, You Only Look Once (YOLO).

This paper reviews the layers of the basic CNN architecture and of YOLO networks, each layer's characteristics, and two versions of YOLO: YOLO-V1 and YOLO-V2. The strengths and weaknesses of YOLO are then discussed, followed by a summarized conclusion.

II. CONVOLUTIONAL NEURAL NETWORK (CNN)

A Convolutional Neural Network (CNN) can be regarded as a subcategory of Deep Neural Networks specifically invented for image processing and object detection. CNN algorithms can be used without requiring an enormous number of predefined parameters for the provided image. This ease of training a model and the vast amount of information available through the internet have made CNN algorithms possible. The mechanism CNN algorithms follow to express and extract features of the input data is entirely mathematical. This mechanism involves a weight-sharing process that recognizes and identifies information that holds similar features. This process enables networks to analyze high-dimensional data and achieve excellent classification as the final output. One of the apparent obstacles to getting better results with CNN models is the processing capability of available hardware and the scope of parameters in datasets.

The CNN [7] was introduced in 1998 with LeNet, and it bloomed in 2012 when AlexNet reached an error rate of 15.3%, followed by ZF-net. GoogLeNet and VGGNet then lowered the error rate over time. An exceptional milestone in this timeline came in 2015, when ResNet achieved an error rate of 3.6%, lower than that of the human eye (5.1%), proving that deep learning models could surpass human capabilities.

A. Structure of CNN

A typical CNN is structured with multiple layers: an input layer, a convolutional layer, an active layer, a pooling layer, a fully connected layer and finally, an output layer. Some types of CNN models might include other layers for different purposes too. Figure 1 shows the basic structure of a CNN architecture.

Figure 1: The typical CNN structure with seven layers
Source: https://www.researchgate.net/publication/340102110_Hierarchical_Multi-View_Semi-Supervised_Learning_for_Very_High-Resolution_Remote_Sensing_Image_Classification

This multi-layered architecture uses forward-pass and error backpropagation calculations to reach the target performance. Training this architecture to become a model is a supervised procedure that requires a collection of image data and their labels. At the end of the training process, the most suitable weights are obtained for use in the testing phase. The layers mentioned above are explained as follows.

1) Input Layer
The input layer is used to initialize the input image data and make all the available dimensions zero-centered. This layer is also responsible for normalizing the scale of all input data to a range between 0 and 1, which helps accelerate convergence. Whitening the data also helps reduce redundancy. Principal Component Analysis (PCA) is used to reduce and decorrelate the dimensions of the extracted data while focusing on key dimensions [8].

2) Convolutional Layer
As the layer from which the CNN takes its name, the convolutional layer is the most critical layer in a CNN structure. It comprises multiple feature maps with many neurons inside them, and each of these neurons is designed to extract local features at various positions in the previous layer [9]. Local connections and shared weights are realized by a filter called the CONV kernel, which slides over the image given to it as input. The CONV kernel computes the image's feature representation by multiplying the value of each pixel in the local region it covers by the corresponding kernel weight and summing the products into the convolutional result. This so-called rule of convolution enables the features of the image to be extracted using the CONV kernel. Filtering the various parts of an image with the same CONV kernel is what is referred to as shared weights. Shared weights enable neural cells with the same features to be recognized and classified into the same object type. Parameters such as kernel size, depth, stride, zero-padding, and filter quantity can be configured for this layer.

3) Active Layer
The active layer addresses underfitting on nonlinear problems, which arises because the preceding convolutional layer is a linear operation. An activation function such as Sigmoid, Tanh, the Rectified Linear Unit (ReLU), the Exponential Linear Unit (ELU), Leaky ReLU, or Maxout can be used for this purpose [10]; the choice of function also affects how severely gradients vanish during training. Considering convergence speed, the ReLU function has been the most popular, although the Sigmoid and Tanh functions are still commonly used due to their simplicity.

4) Pooling Layer
The pooling layer's job is to efficiently reduce the dimensions of the results sent from the convolutional layer. This is achieved by combining the outputs of a group of neurons at one layer into a single neuron in the following layer, thus reducing the size of the feature maps and increasing the robustness of the extracted features. Pooling layers are usually situated between two convolutional layers and can be categorized into three distinct types based on their width: general pooling, overlapping pooling and Spatial Pyramid Pooling (SPP). A pooling layer is called a general pooling layer when its width is equal to its stride. General pooling includes max pooling and average pooling. When the maximum value of each neuron group from the previous layer is used, it is called max pooling; when the average value is used, it is called average pooling. Overlapping pooling is when the width is larger than the stride. High-level features of the input can therefore be extracted by stacking a few convolutional layers followed by a final pooling layer.

5) Fully Connected Layer
Often the last layer before the output layer, the fully connected layer transmits data to the output layer. It connects every neuron in the previous layer to each of its own neurons, which simplifies the data calculation process. Because it is fully connected, it retains no spatial information and is always followed by an output layer.
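As an illustration of the convolutional, active and pooling layers described above, the following minimal sketch applies one shared kernel, a ReLU activation and max pooling (with width equal to stride, i.e. general pooling) to a toy image. The function names, the 5x5 image and the vertical-edge kernel are assumptions made for this example only; real CNN frameworks use optimized tensor operations and learned kernels.

```python
def conv2d(image, kernel, stride=1):
    """Slide one shared kernel over the image (no zero-padding)."""
    k = len(kernel)
    out = []
    for r in range(0, len(image) - k + 1, stride):
        row = []
        for c in range(0, len(image[0]) - k + 1, stride):
            acc = 0.0
            # Multiply each pixel in the local window by the kernel weight and sum.
            for i in range(k):
                for j in range(k):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def relu(feature_map):
    """Active layer: keep positive responses, set negative ones to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, width=2):
    """General max pooling: the stride equals the window width."""
    out = []
    for r in range(0, len(feature_map) - width + 1, width):
        row = []
        for c in range(0, len(feature_map[0]) - width + 1, width):
            window = [feature_map[r + i][c + j]
                      for i in range(width) for j in range(width)]
            row.append(max(window))
        out.append(row)
    return out


# Toy 5x5 image with a bright vertical stripe (hypothetical data).
image = [[0, 0, 1, 1, 0] for _ in range(5)]
# Hypothetical 3x3 kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

features = conv2d(image, kernel)   # 3x3 feature map
activated = relu(features)         # negative responses removed
pooled = max_pool(activated, 2)    # reduced to a single value
```

Because the same kernel weights are reused at every position, the number of parameters does not depend on the image size, which is the weight-sharing property discussed above.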

6) Other Layers
Apart from the layers mentioned above, some CNN models need additional layers to achieve the expected output, such as dropout layers and regression layers. Dropout layers are often used to address overfitting: they avoid overly dominant weights by updating the weights of each neural cell node only with a certain probability (which is decided by the stochastic policy). A regression layer, in turn, is used to classify features using a method such as logistic regression (LR), Bayesian Linear Regression (BLR) or Gaussian Processes for Regression (GPR). The output of a regression layer is the probabilities of all the possible object types.

III. TYPES OF OBJECT DETECTION ALGORITHMS

Algorithms available for object detection can be divided into two categories: classification-based algorithms and regression-based algorithms.

1) Classification-based algorithms
Classification-based algorithms are implemented in two stages. The initial stage is the selection of regions of interest (RoI) in the image. These regions are then classified using a convolutional neural network. This approach of performing one stage prior to the other can be slow because the prediction algorithm must be run on each region selected in the first stage. Common examples of this type of algorithm are Region-based CNN (R-CNN), Fast R-CNN, Faster R-CNN and Mask R-CNN (which is known as a state-of-the-art region-based CNN algorithm).

2) Regression-based algorithms
Regression-based algorithms do not select and single out regions of interest in an image; instead, they predict classes and their bounding boxes for the whole image in one run through the model. Since detection is framed as a regression problem, a complex pipeline is not necessary for regression-based algorithms. Famous examples of this type of algorithm are the Single Shot Multibox Detector (SSD), RetinaNet and the YOLO algorithms. Because detection happens simultaneously and at high speed (achieved at some cost in accuracy), these algorithms are commonly used for real-time object detection. Understanding the more popular YOLO algorithms first requires establishing what will be predicted. The prediction results in a bounding box (specifying the object's location) along with the class that has the highest probability among the established set of classes.

IV. YOU ONLY LOOK ONCE (YOLO) ALGORITHM

YOLO is a novel approach to detecting multiple objects present in an image in real time while drawing bounding boxes around them. It passes the image through the CNN only once to get the output, hence the name. Although comparatively similar to R-CNN, YOLO in practice runs much faster than Faster R-CNN because of its simpler architecture. Unlike Faster R-CNN, YOLO can classify and perform bounding box regression at the same time. With YOLO, the class labels of the objects and their locations can be predicted in one glance. Deviating entirely from the typical CNN pipeline, YOLO treats object detection as a regression problem to spatially separated bounding boxes and their associated class probabilities, which are predicted using a single neural network. Performing both bounding box prediction and class probability calculation in one unified network architecture was first introduced by YOLO.

YOLO builds on the GoogLeNet architecture for its forward computation, which is presumably the reason behind the speed and accuracy of YOLO's real-time object detection. In contrast to R-CNN architectures, which run a classifier on each potential bounding box and then re-evaluate the probability scores, YOLO predicts bounding boxes and the class probabilities for those boxes simultaneously. This is one of the significant reasons why YOLO is fast and accurate enough to be usable for real-time object prediction.

YOLO's architecture is similar to a typical convolutional neural network and is inspired by the GoogLeNet model for image classification. The network's initial convolutional layers extract the image's features, and the fully connected layers predict the output probabilities and coordinates. The full YOLO network model has 24 convolutional layers followed by two fully connected layers, with 1x1 reduction layers alternating with 3x3 convolutional layers [12].

A. Unified Detection of YOLO

YOLO is described as a unified algorithm because the separate components of object detection merge into a single neural network as the final pipeline. To predict all bounding boxes in parallel, the network reasons globally over the features of the entire image. YOLO is designed to allow end-to-end training and real-time speed while maintaining high average precision. To achieve unified detection, YOLO first divides the input image into an S x S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object. Thus, every grid cell estimates bounding boxes and their confidence scores across all the classes it is trained to predict. The predicted confidence scores reflect how confident the model is in the label and bounding box it assigns to each object. Formally, the confidence score is defined as $\Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$. If an object has been found inside the cell, the confidence score equals the intersection over union (IOU) between the ground truth and the predicted box. If not, the confidence score is zero. Each bounding box consists of five parameters: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the grid cell's boundaries. As mentioned above, if the box's center does not fall inside a grid cell, that cell is not responsible for its prediction. Each coordinate is normalized to the range 0 to 1, and the estimated object's height and width are calculated relative to the entire image. According to Mauricio Menegaz's article [11], the prediction consists of a few steps.
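
The grid assignment, box encoding and confidence score described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation: boxes are assumed to be given as (x_min, y_min, x_max, y_max) in pixels on a square image, and the helper names `iou`, `responsible_cell`, `encode_box` and `confidence` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def responsible_cell(box, image_size, s=7):
    """Return (row, col) of the S x S grid cell that contains the box center."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    col = min(int(cx / image_size * s), s - 1)
    row = min(int(cy / image_size * s), s - 1)
    return row, col


def encode_box(box, image_size, s=7):
    """Encode a box as (x, y, w, h): center relative to its grid cell,
    width and height relative to the whole image, all within 0..1."""
    row, col = responsible_cell(box, image_size, s)
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    x = cx / image_size * s - col
    y = cy / image_size * s - row
    w = (box[2] - box[0]) / image_size
    h = (box[3] - box[1]) / image_size
    return x, y, w, h


def confidence(object_present, predicted_box, truth_box):
    """Confidence = Pr(Object) * IOU between predicted and ground-truth box."""
    pr_object = 1.0 if object_present else 0.0
    return pr_object * iou(predicted_box, truth_box)


# Hypothetical ground-truth and predicted boxes on a 448x448 image.
truth = (100, 100, 300, 300)
pred = (150, 150, 350, 350)

cell = responsible_cell(truth, image_size=448)   # grid cell holding the center
target = encode_box(truth, image_size=448)       # (x, y, w, h) in 0..1
score = confidence(True, pred, truth)            # equals the IOU of the boxes
empty = confidence(False, pred, truth)           # no object: confidence is 0
```

With these example boxes the intersection is 150 x 150 pixels and the union is 57,500 square pixels, so the confidence score is about 0.39.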

The values for each grid cell are pc, bx, by, bw, bh, c1, c2, c3. pc indicates whether a particular grid cell contains an object: it is assigned 1 if an object is present and 0 otherwise. bx, by, bh, bw are the bounding box parameters of a grid cell and are only defined if an object is present in that cell. c1, c2, c3 are classes. If the object is a car, then the values of c1, c2, c3 are 0, 1, 0 respectively [11].

A prediction is considered good if its computed IOU is greater than a threshold value (chosen to increase the accuracy of the detected object), here 0.5 [11].

In non-max suppression, the next step, the boxes with the highest probability are kept and the boxes that have a high IOU with them are suppressed [11]. This process is repeated until a single box remains as the bounding box for each object. Each grid cell also predicts 'c' conditional class probabilities for the object in that grid cell. These probabilities are conditioned on the grid cell containing an object. Only one set of class probabilities is predicted per grid cell, regardless of the number of bounding boxes for that grid cell [12].

Figure 6: Complete process of object detection by YOLO
Source: https://jespublication.com/upload/2020-110682.pdf

YOLOv1's network has 24 convolutional layers, as opposed to YOLOv2, which has 19 [10]. For evaluating the YOLO model on the PASCAL VOC detection dataset, these values are used: S=7, giving a 7x7 grid, and N=2 bounding boxes per grid cell. The PASCAL VOC dataset has 20 labelled classes, so c=20. Therefore YOLOv1's final prediction is a 7x7x(5x2+20) = 7x7x30 tensor. Only 98 bounding boxes per image are predicted [10], [12].

• YOLOv2
YOLO Version 2 is an improved version of the existing YOLO algorithm. Detection speed remains the same while the mAP value increases compared to YOLOv1's mAP of 63.4. A new multi-scale training method allows YOLOv2 to run at various input sizes, offering a trade-off between accuracy and speed of prediction. YOLOv2 adds a list of significant solutions to increase mAP. Batch normalization normalizes the inputs to the layers. The high-resolution classifier, which raises the input resolution from YOLOv1's 224x224 to 448x448, raises the mAP by 4%. Its neural network has 19 convolutional layers compared to YOLOv1's 24. YOLOv2 adopts convolutional prediction with anchor boxes and increases the grid resolution from YOLOv1's 7x7 to 13x13. With anchor boxes, each grid cell predicts several bounding boxes, and class probabilities are predicted per box rather than per cell. Finally, YOLOv2 adds a pass-through layer that takes the features extracted by an earlier layer and combines them with the final output features, which improves the detection of small objects. This raises the mAP by 1% [10].
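
To make the YOLOv1 numbers and the non-max suppression step from this section concrete, the following sketch first reproduces the prediction-tensor arithmetic (S=7, N=2, c=20) and then applies a simple greedy non-max suppression to a few candidate boxes. It is a minimal illustration under stated assumptions: the (x_min, y_min, x_max, y_max) box format, the example detections and the suppression threshold of 0.5 are hypothetical choices for this example, and a full detector would typically run the suppression separately for each class.

```python
# YOLOv1 on PASCAL VOC: grid size, boxes per cell, number of classes.
S, N, C = 7, 2, 20
tensor_shape = (S, S, 5 * N + C)   # 5 values (x, y, w, h, confidence) per box
candidate_boxes = S * S * N        # boxes predicted per image


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (score, box) pairs.

    Repeatedly keep the highest-scoring remaining box and drop every
    other box whose IOU with it reaches the threshold.
    """
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining
                     if iou(best[1], d[1]) < iou_threshold]
    return kept


# Hypothetical detections: two overlapping boxes on one object, one elsewhere.
detections = [
    (0.8, (110, 110, 310, 310)),
    (0.9, (100, 100, 300, 300)),
    (0.7, (400, 400, 500, 500)),
]
kept = non_max_suppression(detections)
kept_scores = [score for score, _ in kept]
```

In this example the 0.8 box overlaps the 0.9 box with an IOU of about 0.82, so it is suppressed, while the non-overlapping 0.7 box is kept as a separate object.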