
Real Time Object Detection Using YOLO

The document provides an in-depth review of real-time object detection using YOLO. It discusses background information on object detection and convolutional neural networks. The review then examines the YOLO network architecture and its two versions, highlighting strengths and weaknesses.

Uploaded by

abdul rahman
Copyright
© © All Rights Reserved

Real-Time Object Detection using YOLO: A review

Upulie H.D.I
IT18107074
Sri Lanka Institute of Information Technology
Malabe, Sri Lanka
[email protected]

Lakshini Kuganandamurthy
IT17073592
Sri Lanka Institute of Information Technology
Malabe, Sri Lanka
[email protected]

Abstract—With the availability of enormous amounts of data and the need to computerize visual-based systems, research on object detection has been a focus for the past decade. This need has been accelerated by increasing computational power and by Convolutional Neural Network (CNN) advancements since 2012. Among the various CNN architectures available, the You Only Look Once (YOLO) network is popular for many reasons, mainly its speed of identification, which makes it applicable to real-time object detection. Following a general introduction to the background and to CNNs, this paper reviews the innovative yet comparatively simple approach YOLO takes to object detection.

Keywords—YOLO, CNN, object detection, image classification

I. INTRODUCTION

Although the human eye is capable of instantly and precisely identifying a given visual, including its content, location, and nearby visuals interacting with it, human-made, computer vision-enabled systems are relatively low in accuracy and speed. Any advancement leading to improvements in efficiency and performance in this field could pave the way to more intelligent systems, much like humans. These advancements, in turn, would ease human life through systems such as assistive technologies that allow humans to complete tasks with little to no conscious thought. For instance, a car equipped with computer vision-enabled assistive technology could predict a crash and notify the driver before the incident, even if the driver is not conscious of their actions. Therefore, real-time object detection has become a highly sought-after subject in continuing the automation or replacement of human tasks. Computer vision and object detection are prominent fields under machine learning and are eventually expected to help unlock the potential of general-purpose, responsive robotic systems.

With the current technological advancements, making data open and attainable to and from everyone connected to it has become an easy task. Most human lives revolve around mainstream personal computers (PCs), and smartphones have made this process even more accessible. Along with this process, the amount of information and images made available on the internet/cloud has grown to the point of millions per day. Using computerized systems to utilize this information and perform the necessary recognition and processing is vital, because it is impractical for humans to perform the same iterative tasks. The initial step of most such processes may include recognizing a specific object or area in an image. Due to the unpredictability of the availability, location, size, or shape of an item in each image, the recognition process is extremely hard to perform through a traditionally programmed computer algorithm. Factors such as the complexity of the background and light intensity also contribute to this.

Different strategies have been proposed to solve the problem of object identification throughout the years. These techniques approach the solution through multiple stages, namely recognition, classification, localization, and object detection. Along with the technological progression over the years, these techniques have faced challenges such as output accuracy, resource cost, processing speed and complexity. With the invention of the first Convolutional Neural Network (CNN) algorithm in the 1990s, inspired by the Neocognitron, by Yann LeCun et al. [1] and significant inventions like AlexNet [2], which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 (later referred to as ImageNet), CNN algorithms have become capable of providing solutions for the object detection problem through various approaches. With the purpose of improving accuracy and speed of recognition, optimization-focused algorithms such as VGGNet [3], GoogLeNet [4] and Deep Residual Learning (ResNet) [5] have been invented over the years.

Although these algorithms improved over time, window selection, or identifying multiple objects in a single image, was still an issue. To address it, algorithms with region proposals, crop/warp features, SVM classification and bounding box regression, such as Regions with CNN (R-CNN), were introduced. Although R-CNN was comparatively high in accuracy compared with the previous inventions, its high usage of space and time later led to the invention of the Spatial Pyramid Pooling Network (SPPNet) [6]. Despite SPPNet's speed, Fast R-CNN was introduced to reduce the drawbacks SPPNet shared with R-CNN. Although Fast R-CNN could reach real-time speeds using very deep networks, it held a computational bottleneck. Later Faster R-CNN, an algorithm based on ResNet, was introduced. Because Faster R-CNN was not yet capable of surpassing state-of-the-art detection systems, YOLO was introduced. This paper reviews the dominating real-time object detection algorithm You Only Look Once (YOLO).

This paper reviews the layers of the basic CNN architecture and of YOLO networks, each layer's characteristics, and two versions of YOLO: YOLOv1 and YOLOv2. The strengths and weaknesses of YOLO are then discussed, followed by a summarized conclusion.

II. CONVOLUTIONAL NEURAL NETWORK (CNN)

A Convolutional Neural Network (CNN) can be regarded as a subcategory of Deep Neural Networks specifically invented for image processing and object detection. CNN algorithms can be utilized without requiring an enormous amount of predefined substantial parameters for the provided image. This ease of training a model and the vast amount of
information available through the internet have made CNN algorithms possible. The mechanism CNN algorithms follow to express and extract features of the input data is entirely mathematical. This mechanism involves a weight-sharing process that recognizes and identifies information that holds similar features. This process enables networks to analyze high-dimensional data and achieve excellent classification as the final output. One of the apparent obstacles to getting better results with CNN models is the processing capability of available hardware and the scope of parameters in datasets.

The CNN was introduced in 1998 with LeNet [7] and came to prominence in 2012 with AlexNet, which achieved an error rate of 15.3%, followed by ZF-Net. The inventions of GoogLeNet and VGGNet lowered the error rate over time. An exceptional milestone in this timeline came in 2015, when ResNet reached an error rate of 3.6%, lower than that of the human eye (5.1%), proving that deep learning models could surpass human capabilities.

A. Structure of CNN

A typical CNN is structured with multiple layers: an input layer, a convolutional layer, an active layer, a pooling layer, a fully connected layer and finally, an output layer. Some types of CNN models might include other layers for different purposes too. Figure 1 shows the basic structure of a CNN architecture.

Figure 1: The typical CNN structure with seven layers
Source: https://www.researchgate.net/publication/340102110_Hierarchical_Multi-View_Semi-Supervised_Learning_for_Very_High-Resolution_Remote_Sensing_Image_Classification

This multi-layered architecture uses forward pass and error backpropagation calculations to reach the target performance. Training this architecture to become a model is a supervised procedure that requires a collection of image data and their labels. At the end of the training process, the most suitable weights are obtained for use in the testing phase. The layers mentioned above are explained as follows.

1) Input Layer
The input layer is used to initialize the input image data and make all the available dimensions zero-centered. This layer is also responsible for normalizing the scale of all input data to a range between 0 and 1, which helps accelerate convergence. This normalization is also helpful in reducing redundancy by whitening the data. Principal Component Analysis (PCA) is used to reduce and decorrelate the dimensions of the extracted data while focusing on the key dimensions [8].

2) Convolutional Layer
The convolutional layer, from which the CNN takes its name, is the most critical layer in a CNN structure. It comprises multiple feature maps with many neurons inside them, and each of these neurons is designed to extract local features at various positions in the previous layer [9]. Local connections and shared weights are realized through a filter called the CONV kernel, which slides over the image input to it. The CONV kernel computes a feature representation of the image by multiplying and adding the values of each pixel of the locally correlated data within it, and the sum is added to the convolutional result. This so-called rule of convolution enables the features of the image to be extracted using the CONV kernel. Filtering the various parts of an image with the same CONV kernel is what is referred to as shared weights. This usage of shared weights enables neural cells with the same features to be recognized and classified into the same object type. Parameters such as kernel size, depth, stride, zero-padding, and number of filters can be set for this layer.

3) Active Layer
The active layer is used to solve the problem of the vanishing gradient due to underfitting. This underfitting, a nonlinearity problem, is caused by the previous convolutional layer. One of the active layer functions such as Sigmoid, Tanh, the Rectified Linear Unit (ReLU), the Exponential Linear Unit (ELU), Leaky ReLU, or Maxout can be used to address underfitting, depending on their usage [10]. Considering convergence speed, the ReLU function has been the most popular, although Sigmoid and Tanh functions are still commonly used due to their simplicity and efficiency.

4) Pooling Layer
The pooling layer's job is to efficiently reduce the dimensions of the results sent from the convolutional layer. This is achieved by combining the outputs of a group of neurons at one layer into a single neuron in the following layer, thus reducing the size of the feature maps and increasing the robustness of the extracted features. Pooling layers are usually situated between two convolutional layers and can be categorized into three types based on their width: general pooling, overlapping pooling and Spatial Pyramid Pooling (SPP). A pooling layer is called a general pooling layer when its width is equal to its stride. General pooling includes max pooling and average pooling. When the maximum value of each neuron group from the previous layer is used, it is called max pooling. When the average value is used, it is referred to as average pooling. Overlapping pooling is when the width is larger than the stride. High-level features of the input can therefore be extracted by stacking a few convolutional layers followed by a final pooling layer.

5) Fully Connected Layer
Often the last layer before the output layer, the fully connected layer transmits data to the output layer and is the one layer of the CNN in which all neurons are connected. By connecting each neuron in the previous layer to each of its own neurons, it simplifies and speeds up the data calculation process. Being fully connected, this layer retains no spatial information and is always followed by an output layer.
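
As an illustration of the convolution, activation and pooling steps described in this section, a minimal NumPy sketch is given below. It is not taken from any of the reviewed works: the function names, the 6 x 6 input and the edge-detecting kernel are assumptions chosen for illustration, and the explicit loops favour readability over speed.

```python
import numpy as np


def conv2d(image, kernel, stride=1, padding=0):
    """Slide one shared kernel over a 2-D image (cross-correlation form)."""
    if padding > 0:
        image = np.pad(image, padding, mode="constant")  # zero-padding
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # multiply and add
    return out


def relu(x):
    """Rectified Linear Unit: negative responses are set to zero."""
    return np.maximum(x, 0.0)


def max_pool(x, size=2, stride=2):
    """General max pooling when size == stride, overlapping when size > stride."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return out


# Hypothetical input: a 6 x 6 ramp image and a simple horizontal-gradient kernel.
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

features = max_pool(relu(conv2d(image, kernel)))
print(features.shape)  # (2, 2)
```

Because the same kernel is reused at every position, the number of learned weights is independent of the image size, which is the weight sharing discussed above.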
6) Other Layers
Apart from the layers mentioned above, some CNN models need additional layers to achieve the expected output. Dropout layers and regression layers are examples. Dropout layers are often used to address overfitting by avoiding majorly subjective weights: the weights of each neural node are updated only with a certain probability (decided by a stochastic policy). A regression layer, in contrast, is used to classify features using a method such as Logistic Regression (LR), Bayesian Linear Regression (BLR) or Gaussian Processes for Regression (GPR). The output of a regression layer is the probabilities of all the possible object types.

III. TYPES OF OBJECT DETECTION ALGORITHMS

Algorithms available for object detection can be divided into two categories: classification-based algorithms and regression-based algorithms.

1) Classification-based algorithms
Classification-based algorithms are implemented in two stages. The initial stage is the selection of regions of interest (RoI) in the image. These regions are then classified using a convolutional neural network. This approach of performing one stage before the other can be slow because the prediction algorithm must be run on each region selected in the first stage. A few common examples of this type of algorithm are RetinaNet, Region-based CNN (R-CNN), Fast R-CNN, Faster R-CNN and Mask R-CNN (which is known to be the state of the art among region-based CNN algorithms).

2) Regression-based algorithms
Regression-based algorithms are implemented so that, instead of selecting and singling out regions of interest in an image, they predict classes and their relevant bounding boxes for the whole image in one run through the model. Since frame detection is treated as a regression problem, a complex pipeline is not necessary for regression-based algorithms. Well-known examples of this type of algorithm are the Single Shot Multibox Detector (SSD) and the YOLO algorithms. Due to the simultaneity of the detection and its high speed (achieved with a tradeoff in accuracy), these are commonly used for real-time object detection. Understanding the more popular YOLO algorithms requires first establishing what will be predicted before the models are used. The prediction results in a bounding box (specifying the object's location) along with the class that has the highest probability among the established set of classes.

IV. YOU ONLY LOOK ONCE (YOLO) ALGORITHM

YOLO is a novel approach to detecting multiple objects present in an image in real time while drawing bounding boxes around them. It passes the image through the CNN only once to get the output, hence the name. Although comparatively similar to R-CNN, YOLO in practice runs a lot faster than Faster R-CNN because of its simpler architecture. Unlike Faster R-CNN, YOLO can classify and perform bounding box regression at the same time. With YOLO, the class labels of the objects and their locations can be predicted in one glance. Entirely deviating from the typical CNN pipeline, YOLO treats object detection as a regression problem to spatially separated bounding boxes and their associated class probabilities, which are predicted using a single neural network. This process of performing both bounding box prediction and class probability calculation in one network is the unified architecture that YOLO introduced.

The YOLO algorithm builds on GoogLeNet for its base forward computation, presumably the reason behind the speed and accuracy of YOLO's real-time object detection. Unlike R-CNN architectures, which run a classifier on each potential bounding box and then re-evaluate the probability scores, YOLO predicts bounding boxes and the class probabilities for those boxes simultaneously. This is one of the significant reasons why YOLO is fast enough, and sufficiently unlikely to make errors, to be usable for real-time object detection.

YOLO's architecture is similar to a typical convolutional neural network and is inspired by the GoogLeNet model for image classification. The network's initial convolutional layers extract the image's features, and the fully connected layers predict the output probabilities and coordinates. The full YOLO network model has 24 convolutional layers followed by two fully connected layers, with 1x1 reduction layers alternating with 3x3 convolutional layers [12].

A. Unified Detection of YOLO

YOLO is introduced as a unified algorithm because the separate components of object detection are merged into a single neural network as the final pipeline. To predict each bounding box, the network uses the features of the entire image and reasons about it globally; all bounding boxes are predicted in parallel. YOLO's design enables end-to-end training and real-time speed while maintaining high average precision. To achieve unified detection, YOLO first divides the input image into an S x S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object. Thus, every grid cell estimates bounding boxes and their confidence scores across all the classes it is trained to predict. The predicted confidence scores reflect how confident the model is that a box contains an object and how accurate it believes the box to be. Formally, the confidence score is defined as $\Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$. If an object is found inside the cell, this confidence score equals the intersection over union (IOU) between the ground truth and the predicted box. If not, the confidence score is zero. Each predicted bounding box consists of five parameters: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the grid cell's boundaries. As mentioned above, if the box's center does not fall inside the grid cell, then the cell is not responsible for its prediction. Each coordinate is normalized to lie between 0 and 1, and the estimated object's height and width are calculated with respect to the entire image. According to Mauricio Menegaz in his article [11], the prediction consists of a few steps.
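
The five values predicted per box (x, y, w, h and confidence) can be sketched in a few lines of Python. This is an illustrative sketch rather than the reference implementation: the function names are hypothetical, boxes passed to `iou` are assumed to be in corner format (x1, y1, x2, y2), and an integer cell size of 448 // 3 = 149 pixels is assumed so that the numbers agree with the rounded values of the Figure 2 example.

```python
def encode_box(cx, cy, box_w, box_h, image_size=448, S=3):
    """Return the responsible grid cell and the normalized (x, y, w, h)."""
    cell = image_size // S                 # assumed integer cell size (149)
    col = min(int(cx // cell), S - 1)      # cell containing the box center
    row = min(int(cy // cell), S - 1)
    x = (cx - col * cell) / cell           # center offset relative to the cell
    y = (cy - row * cell) / cell
    w = box_w / image_size                 # size relative to the whole image
    h = box_h / image_size
    return (row, col), (x, y, w, h)


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence(pr_object, predicted, truth):
    """Confidence target: Pr(Object) times the IOU of prediction and truth."""
    return pr_object * iou(predicted, truth)


cell, (x, y, w, h) = encode_box(220, 190, 224, 143)
print(cell, round(x, 2), round(y, 2), round(w, 2), round(h, 2))
```

With Pr(Object) = 0 the confidence is zero regardless of the overlap, which matches the case of a cell that contains no object.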
Figure 2: Example of how the coordinate parameters are calculated in a 448 x 448 image with S = 3
Source: https://hackernoon.com/understanding-yolo-f5a74bbc7967

Figure 2 depicts how the x coordinate is normalized as (220 - 149)/149 = 0.48 and the y coordinate as (190 - 149)/149 = 0.28, relative to the grid cell. The width (w) of 224 is calculated as 224/448 = 0.50 and the height (h) of 143 as 143/448 = 0.32, both with respect to the entire image.

The confidence score predicts the IOU between the predicted box and the ground truth box along with these parameters. This confidence score reflects the presence or absence of an object of any class inside the bounding box. Along with these calculations, every grid cell containing an object also estimates the conditional class probabilities, given as $\Pr(\text{Class}_i \mid \text{Object})$. This probability is conditioned on the grid cell containing an object. Therefore, if no object is present in the grid cell, the loss function does not penalize it for a wrong class prediction. Since the network predicts only one set of class probabilities per cell regardless of the number of boxes (B), the total number of class probabilities is S x S x C. At test time, when the confidence score for each box is individually calculated, the conditional class probabilities and the individual box confidence predictions are multiplied as

$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}}.$$

The resulting scores for each box reflect the probability of the class appearing in the box and how well the estimated box fits the object.

B. How YOLO Algorithm works?

The YOLO algorithm is based on regression. It predicts the class probabilities of the objects and the bounding boxes specifying the objects' locations for the entire image. The bounding box of an object is described by bx, by, bw, bh and c: bx and by are the x, y coordinates of the center of the box relative to the bounds of the grid cell, bw and bh are the width and height predicted relative to the whole image, and c represents the class of the object. YOLO takes the image as input and divides it into an S x S grid (3 x 3 in this example). Image classification and object localization techniques are then applied to each grid cell, and each cell is given a label. The YOLO algorithm then checks every grid cell for an object and identifies its label and bounding boxes. The label of a grid cell that does not contain an object is indicated as zero. Every labelled grid cell of the S x S grid has 8 values, namely pc, bx, by, bw, bh, c1, c2, c3. pc shows whether a particular grid cell contains an object: if an object is present, pc is assigned 1, otherwise 0. bx, by, bw, bh are the bounding box parameters of a grid cell and are only defined if an object is present in that cell. c1, c2, c3 are the classes. If the object is a car, then the values of c1, c2, c3 are 0, 1, 0 respectively [11].

Figure 3: Example image with 3x3 grids.
Source: https://jespublication.com/upload/2020-110682.pdf

In the example grid, no object can be identified in the first grid cell. Therefore, its pc value is 0 and the bounding box parameters need not be assigned, as there is no defined object. The class probabilities cannot be identified either, as there is no object (Figure 4). The 6th grid cell contains an object, so its pc value is assigned 1 and the bounding box of the object is given by bx, by, bw and bh. Since the object is a car, the classes for the grid cell are 0, 1, 0 (Figure 5) [11].

Figure 4: Bounding box and class values for grid 1
Source: https://jespublication.com/upload/2020-110682.pdf

Figure 5: Bounding box and class values for grid 6
Source: https://jespublication.com/upload/2020-110682.pdf
The matrix is defined as S x S x 8, where S x S represents the grid into which the image is divided and 8 is the total count of the pc, bx, by, bw, bh, c1, c2, c3 values. Bounding boxes differ for each grid cell depending on the position of objects relative to that cell. If two or more grid cells contain the same object, the grid cell that contains the center of the object is used to detect it. For a precise identification of the object, two methods can be used: 1. Intersection over Union (IOU) 2. Non-Max Suppression [11]. In IOU, the actual and estimated bounding boxes are used, and the IOU of the two is computed using the following formula.

$$\text{IOU} = \frac{\text{Intersection Area}}{\text{Union Area}}$$

The result is considered better if the computed IOU is greater than a threshold value of 0.5 (an assumed value for increasing the accuracy of the detected object) [11].
In Non-Max Suppression, the next method, the box with the highest probability is kept and the boxes that have a high IOU with it are suppressed [11]. This process is repeated until a single box is taken as the bounding box for the object. Each grid cell also predicts c conditional class probabilities for the object in that cell. These probabilities are conditioned on the grid cell containing an object. Only one set of class probabilities is predicted for a grid cell, regardless of the number of bounding boxes for that grid cell [12].

Figure 6: Complete process of Object detection by YOLO
Source: https://jespublication.com/upload/2020-110682.pdf

V. YOLO VERSIONS

• YOLOv1
Base YOLO is also called YOLO Version 1 [10]. It treats object detection as a regression problem. A single convolutional network predicts multiple bounding boxes and class probabilities for all the grid cells simultaneously. The input image is divided into an S x S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts N bounding boxes, confidence scores for those boxes, and c class probabilities. At test time, the class probabilities and the individual box confidence predictions are multiplied, resulting in class-specific confidence scores for each predicted box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object (Figure 6). These predictions are encoded as an S x S x (N x 5 + c) tensor. The width (bw) and the height (bh) are predicted relative to the whole image. Each box contributes five values, which is why YOLOv1 uses N x 5 when calculating the tensor size [10].

$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$

These confidence scores reflect how confident the model is that the predicted box contains an object and how accurate the model thinks the box around the object is. The confidence score is defined as:

$$\Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$

YOLOv1's network has 24 convolutional layers, as opposed to YOLOv2, which has 19 [10]. For evaluating the YOLO model on the PASCAL VOC detection dataset, these values are used: S = 7, giving a 7x7 grid, and N = 2 bounding boxes per cell. The PASCAL VOC dataset has 20 labelled classes, so c = 20. Therefore YOLOv1's final prediction is a 7x7x(5x2+20) = 7x7x30 tensor. Only 98 bounding boxes per image are used here [10], [12].

• YOLOv2
YOLO Version 2 is an improved version of the existing YOLO algorithm. Detection speed remains the same while the mAP increases compared with YOLOv1's mAP of 63.4. A new multi-scale training method allows YOLOv2 to run at various input sizes, offering improvements in prediction accuracy and speed. YOLOv2 adds a list of significant solutions to increase mAP. Batch Normalization preprocesses the input data. The High Resolution Classifier, going from YOLOv1's 224x224 to 448x448, raises the mAP by 4%. Its neural network has 19 convolutional layers compared with YOLOv1's 24. YOLOv2 adopts convolution with anchor boxes and increases the grid resolution from YOLOv1's 7x7 to 13x13. It also has only one bounding box for each grid cell. Finally, YOLOv2 adds a pass-through layer that takes the extracted features from an earlier layer and combines them with the original final output features, so that the ability to detect small objects is enhanced. In this way, YOLOv2 raises the mAP by 1% [10].

VI. STRENGTHS AND WEAKNESSES OF YOLO

YOLO is the state-of-the-art real-time object detection algorithm that surpasses the previous CNN detection speed limits while maintaining a good balance between speed and accuracy. YOLOv2, the latest version of YOLO, achieves a mean Average Precision (mAP) of 76.8 at 67 Frames per Second (FPS) and 78.6 mAP at 40 FPS, outperforming region-based algorithms such as Faster R-CNN in both speed and accuracy. Another great strength of YOLO is its global reasoning, which encodes contextual information about the whole image rather than a specific region. With this global reasoning, YOLO predicts fewer false positives in the background, improving the algorithm's reasoning as a whole. Lastly, with YOLO's
ability to learn generalizable representations of the labelled objects, it has outperformed other detection methods, including the Deformable Part Model (DPM) and R-CNN, when generalizing from natural images to other domains such as artwork. Due to YOLO's generalizability and applicability to new domains and unexpected inputs, it is considered one of the best object detection algorithms in the domain.

Although YOLO has many unique strengths, it also has weaknesses. One of the notable weaknesses of YOLO is its spatial constraints on bounding boxes. These constraints arise because each grid cell can predict only two boxes and one class. This limits the number of nearby objects that can be predicted when they appear in groups (such as a flock of birds or a basket of similar fruits). Because it learns to predict bounding boxes only from the input data, YOLO also has a weakness in generalizing to objects in unusual or new aspect ratios. However, this could be considered more of a general problem in the object detection domain. Because the architecture has multiple downsampling layers from the input image, the model uses relatively coarse features for prediction, which can be mentioned as a general weakness of YOLO. Compared with an ideal algorithm, YOLO's loss function (which approximates the detection performance) treats errors the same regardless of the size of the bounding box. This is not advantageous, since a small error in a large box is generally benign, whereas a small error in a small box has a much more significant effect on the Intersection over Union (IOU), leading to incorrect localizations in the end. These could therefore be taken as some of the areas in which the YOLO algorithm could improve.

VII. CONCLUSION AND FUTURE SCOPE

This paper reviews the fundamental structure of CNN algorithms and gives an overview of YOLO's real-time object detection algorithm. CNN architecture models can extract features and discover objects in any given image. When properly used, CNN models can solve problems such as deformity identification and the creation of instructive/learning applications. In comparison with other CNN algorithms, YOLO has many advantages in practice. Being a unified object detection model that is simple to construct and train, in keeping with its simple loss function, YOLO can train the entire model in parallel. The second major version of YOLO, YOLOv2, provides the state-of-the-art, best tradeoff between speed and accuracy for object detection. YOLO is also better at generalizing object representations than other object detection models and can be recommended for real-time object detection as the state-of-the-art algorithm in object detection. With these achievements, it can be acknowledged that the field of object detection has an expanding, great future ahead.

ACKNOWLEDGEMENT

The authors would like to thank Dr Dharshana Kasthurirathna of the Faculty of Computing, SLIIT, for the opportunity to write this review paper as an assignment. The expertise of everyone who improved this study in numerous ways is also gratefully appreciated.

REFERENCES

[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning Applied to Document Recognition,” Proc. IEEE, 1998, [Online]. Available: http://ieeexplore.ieee.org/document/726791/#full-text-section.
[2] T. F. Gonzalez, “Handbook of approximation algorithms and metaheuristics,” Handb. Approx. Algorithms Metaheuristics, pp. 1–1432, 2007, doi: 10.1201/9781420010749.
[3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 3rd Int. Conf. Learn. Represent. ICLR 2015 - Conf. Track Proc., pp. 1–14, 2015.
[4] C. Szegedy et al., “Going Deeper with Convolutions,” 2015, doi: 10.1002/jctb.4820.
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” doi: 10.1002/chin.200650130.
[6] J. Hosang, R. Benenson, P. Dollar, and B. Schiele, “What Makes for Effective Detection Proposals?,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 4, pp. 814–830, 2016, doi: 10.1109/TPAMI.2015.2465908.
[7] A. S. R. H. A. J. S. S. Carlsson, “CNN Features off-the-shelf: an Astounding Baseline for Recognition,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Work., vol. 7389, pp. 806–813, 2014, doi: 10.1117/12.827526.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” [Online]. Available: https://www.cv-foundation.org//openaccess/content_cvpr_workshops_2014/W15/papers/Razavian_CNN_Features_Off-the-Shelf_2014_CVPR_paper.pdf.
[9] T. Guo, J. Dong, H. Li, and Y. Gao, “Simple convolutional neural network on image classification,” 2017 IEEE 2nd Int. Conf. Big Data Anal. ICBDA 2017, pp. 721–724, 2017, doi: 10.1109/ICBDA.2017.8078730.
[10] J. Du, “Understanding of Object Detection Based on CNN Family and YOLO,” J. Phys. Conf. Ser., vol. 1004, no. 1, 2018, doi: 10.1088/1742-6596/1004/1/012029.
[11] M. Menegaz, “Understanding YOLO – Hacker Noon,” Hackernoon, 2018.
[12] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” 2016, doi: 10.1109/CVPR.2016.91.
