
2018 IEEE International Conference on Big Data (Big Data)

YOLO-LITE: A Real-Time Object Detection Algorithm Optimized for Non-GPU Computers
Rachel Huang*, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, United States, [email protected]
Jonathan Pedoeem*, Electrical Engineering, The Cooper Union, New York, United States, [email protected]
Cuixian Chen, Mathematics and Statistics, UNC Wilmington, North Carolina, United States, [email protected]
*Equal authorship

Abstract—This paper focuses on YOLO-LITE, a real-time object detection model developed to run on portable devices such as a laptop or cellphone lacking a Graphics Processing Unit (GPU). The model was first trained on the PASCAL VOC dataset and then on the COCO dataset, achieving a mAP of 33.81% and 12.26%, respectively. YOLO-LITE runs at about 21 FPS on a non-GPU computer and 10 FPS after being implemented onto a website, with only 7 layers and 482 million FLOPS. This speed is 3.8× faster than the fastest state-of-the-art model, SSD MobileNet V1. Based on the original object detection algorithm YOLOv2, YOLO-LITE was designed to create a smaller, faster, and more efficient model, increasing the accessibility of real-time object detection to a variety of devices.
Index Terms—object detection; YOLO; neural networks; deep learning; non-GPU; mobile

I. INTRODUCTION
In recent years, object detection has become a significant field of computer vision. The goal of object detection is to detect and classify objects, leading to many specialized fields and applications such as face detection and face recognition. Vision is not only the ability to see a picture in one's head but also the ability to understand and infer from the image that is seen. The ability to replicate vision in computers is necessary to advance day-to-day technology. Object detection addresses this by predicting the location of objects through bounding boxes while simultaneously classifying each object in a given image [1], [2], [3].
In addition, with recent developments in technology such as autonomous vehicles, precision and accuracy are no longer the only relevant factors. A model's ability to perform object detection in real time is necessary in order to accommodate a vehicle's real-time environment. An efficient and fast object detection algorithm is key to the success of autonomous vehicles [4], augmented reality devices [5], and other intelligent systems. A lightweight algorithm can be applied to many everyday devices, such as an Internet-connected doorbell or thermostat. Currently, the state-of-the-art object detection algorithms used in cars rely heavily on sensor output from expensive radars and depth sensors. Other techniques that are solely computer based require an immense amount of GPU power, and even then are not always real-time, making them impractical for everyday applications. The general trend in computer vision is to make larger and deeper networks to achieve higher accuracy [6], [7], [8], [9]. However, such improvement in accuracy at heavy computational cost may not be helpful in facing the challenge of many real-world applications that require real-time performance on a computationally limited platform.

Fig. 1. Example images passed through our YOLO-LITE COCO model.

Previous methods, such as You-Only-Look-Once (YOLO) [10] and Region-based Convolutional Neural Networks (R-CNN) [11], have successfully achieved efficient and accurate models with high mean average precision (mAP); however, their frames per second (FPS) on non-GPU computers render them useless for real-time use. In this paper, YOLO-LITE is presented to address this problem. Using the You Only Look Once (YOLO) [10] algorithm as a starting point, YOLO-LITE is an attempt to obtain a real-time object detection algorithm on a standard non-GPU computer.
II. RELATED WORK
There has been much work in developing object detection algorithms using a standard camera with no additional sensors. State-of-the-art object detection algorithms use deep neural networks.
Convolutional Neural Networks (CNNs) are the main architecture used for computer vision. Instead of having only fully-connected layers, a CNN has convolution layers, where a filter is convolved with different parts of the input to create the output. The use of a convolution layer allows relational patterns to be drawn from an input. In addition, a convolution layer tends to have fewer weights that need to be learned than a fully-connected layer, as filters do not need an assigned weight from every input to every output.

A. R-CNN
Region-based convolutional neural networks (R-CNN) [11] consider region proposals for object detection in images. From each region proposal, a feature vector is extracted and fed into a convolutional neural network. For each class, the feature vectors are evaluated with Support Vector Machines (SVM). Although R-CNN results in high accuracy, the model is not able to achieve real-time speed, even with Fast R-CNN [12] and Faster R-CNN [13], due to the expensive training process and the inefficiency of region proposition.

B. YOLO
You Only Look Once (YOLO) [10] was developed to create a one-step process involving detection and classification. Bounding box and class predictions are made after a single evaluation of the input image. The fastest architecture of YOLO is able to achieve 45 FPS, and a smaller version, Tiny-YOLO, achieves up to 244 FPS (Tiny-YOLOv2) on a computer with a GPU.
The idea of YOLO differs from other traditional systems in that bounding box predictions and class predictions are made simultaneously. The input image is first divided into an S × S grid. Next, B bounding boxes are defined in every grid cell, each with a confidence score. Confidence here refers to the probability that an object exists in each bounding box and is defined as:

C = Pr(Object) \cdot IOU^{truth}_{pred},   (1)

where IOU, intersection over union, represents a fraction between 0 and 1. Intersection is the overlapping area between the predicted bounding box and the ground truth, and union is the total area covered by both the predicted and ground-truth boxes, as illustrated in Figure 2. Ideally, the IOU should be close to 1, indicating that the predicted bounding box is close to the ground truth.

Fig. 2. Illustration depicting the definitions of intersection and union.
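As a concrete reading of the IOU term in Equation 1, the short Python sketch below computes intersection over union for two axis-aligned boxes given as (x_min, y_min, x_max, y_max); the box format and function name are illustrative choices, not taken from the Darknet source.

def iou(box_a, box_b):
    """Intersection over union of two boxes in (x_min, y_min, x_max, y_max) format."""
    # Coordinates of the overlapping (intersection) rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    # Union = sum of the two areas minus the overlap counted twice.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a prediction that partially overlaps the ground truth.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143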
Simultaneously, while the bounding boxes are made, each grid cell also predicts C conditional class probabilities. The class-specific probability for each grid cell [10] is defined as:

Pr(Class_i | Object) \cdot Pr(Object) \cdot IOU^{truth}_{pred} = Pr(Class_i) \cdot IOU^{truth}_{pred}.   (2)

YOLO uses the following loss function to ultimately optimize confidence:

Loss = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{A} 1^{obj}_{ij} [(b_{x_i} - \hat{b}_{x_i})^2 + (b_{y_i} - \hat{b}_{y_i})^2]
     + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{A} 1^{obj}_{ij} [(\sqrt{b_{w_i}} - \sqrt{\hat{b}_{w_i}})^2 + (\sqrt{b_{h_i}} - \sqrt{\hat{b}_{h_i}})^2]
     + \sum_{i=0}^{S^2} \sum_{j=0}^{A} 1^{obj}_{ij} (C_i - \hat{C}_i)^2
     + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{A} 1^{noobj}_{ij} (C_i - \hat{C}_i)^2
     + \sum_{i=0}^{S^2} 1^{obj}_{i} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2.   (3)

The loss function is used to correct the center and the bounding box of each prediction. Each image is divided into an S × S grid, with A bounding boxes for each grid cell. The b_x and b_y variables refer to the center of each prediction, while b_w and b_h refer to the bounding box dimensions. The \lambda_{coord} and \lambda_{noobj} variables are used to increase the emphasis on boxes that contain objects and lower the emphasis on boxes with no objects. C refers to the confidence, and p(c) refers to the classification prediction. The indicator 1^{obj}_{ij} is 1 if the j-th bounding box in the i-th cell is responsible for the prediction of the object, and 0 otherwise; 1^{obj}_{i} is 1 if an object is in cell i and 0 otherwise. The loss indicates the performance of the model, with a lower loss indicating higher performance.
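The NumPy sketch below evaluates Equation 3 for a single image, assuming the ground truth has already been matched to grid cells and boxes (the responsible-box mask 1^{obj}_{ij} is taken as given). The tensor shapes and the default weights \lambda_{coord} = 5 and \lambda_{noobj} = 0.5 follow the original YOLO paper rather than anything stated here, so treat it as an illustration of the equation, not the exact Darknet implementation.

import numpy as np

def yolo_loss(pred_box, pred_cls, true_box, true_cls, obj_mask,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Sketch of Equation 3 for one image.

    pred_box, true_box: (S*S, A, 5) arrays of (x, y, w, h, confidence).
    pred_cls, true_cls: (S*S, C) per-cell class probabilities.
    obj_mask:           (S*S, A) indicator 1^{obj}_{ij}.
    """
    noobj_mask = 1.0 - obj_mask
    cell_obj = obj_mask.max(axis=1)                      # 1^{obj}_{i}

    # Localization: squared error on centers and on square-rooted widths/heights.
    xy = ((pred_box[..., 0:2] - true_box[..., 0:2]) ** 2).sum(-1)
    wh = ((np.sqrt(pred_box[..., 2:4]) - np.sqrt(true_box[..., 2:4])) ** 2).sum(-1)
    loc = lambda_coord * (obj_mask * (xy + wh)).sum()

    # Confidence: full weight for responsible boxes, down-weighted otherwise.
    c_err = (pred_box[..., 4] - true_box[..., 4]) ** 2
    conf = (obj_mask * c_err).sum() + lambda_noobj * (noobj_mask * c_err).sum()

    # Classification: one term per cell that contains an object.
    cls = (cell_obj * ((pred_cls - true_cls) ** 2).sum(-1)).sum()

    return loc + conf + cls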

While loss is used to gauge the performance of a model, the accuracy of the predictions made by object detection models is calculated through the average precision equation shown below:

avgPrecision = \sum_{k=1}^{n} P(k) \Delta r(k).   (4)

P(k) here refers to the precision at threshold k, while \Delta r(k) refers to the change in recall.
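A minimal sketch of Equation 4 for one class is shown below; detections are assumed to be sorted by descending confidence and pre-matched against ground truth, which is a simplification of the full PASCAL VOC evaluation protocol used to score the models.

def average_precision(detections, num_ground_truth):
    """Equation 4: sum of precision times the change in recall.

    detections: booleans, one per detection sorted by descending confidence;
                True if the detection matched a ground-truth box.
    """
    tp, fp, ap, prev_recall = 0, 0, 0.0, 0.0
    for is_match in detections:
        tp += is_match
        fp += not is_match
        precision = tp / (tp + fp)                  # P(k)
        recall = tp / num_ground_truth
        ap += precision * (recall - prev_recall)    # P(k) * delta r(k)
        prev_recall = recall
    return ap

# Three detections, two of them correct, against two ground-truth boxes.
print(average_precision([True, False, True], num_ground_truth=2))  # ≈ 0.833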
The neural network architecture of YOLO contains 24 convolutional layers and 2 fully connected layers. YOLO was later improved with different versions, such as YOLOv2 and YOLOv3, in order to minimize localization errors and increase mAP. As seen in Table I, a condensed version of YOLOv2, Tiny-YOLOv2 [14], has a mAP of 23.7% and the lowest floating point operation (FLOPS) count, at 5.41 billion. When Tiny-YOLOv2 runs on a non-GPU laptop (Dell XPS 13), the model speed decreases from 244 FPS to about 2.4 FPS. With this constraint, real-time object detection is not easily accessible on many devices without a GPU, such as most cellphones or laptops.

TABLE I
PERFORMANCE OF EACH VERSION OF YOLO
Model        Layers  FLOPS (B)     FPS  mAP   Dataset
YOLOv1       26      not reported  45   63.4  VOC
YOLOv1-Tiny  9       not reported  155  52.7  VOC
YOLOv2       32      62.94         40   48.1  COCO
YOLOv2-Tiny  16      5.41          244  23.7  COCO
YOLOv3       106     140.69        20   57.9  COCO
YOLOv3-Tiny  24      5.56          220  33.1  COCO

III. YOLO-LITE ARCHITECTURE
Our goal with YOLO-LITE is to develop an architecture that can run at a minimum of ~10 frames per second (FPS) on a non-GPU powered computer with a mAP of 30% on PASCAL VOC. This goal was determined by looking at the state of the art and setting a reasonable benchmark to reach. YOLO-LITE offers two main contributions to the field of object detection:
1) It demonstrates the capability of shallow networks for fast non-GPU object detection applications.
2) It suggests that batch normalization is not necessary for shallow networks and, in fact, slows down the overall speed of the network.
While some works [15], [16], [17] focused on creating an original convolution layer or on pruning methods in order to shrink the size of the network, YOLO-LITE focuses on taking what already exists and pushing it to its limits of accuracy and speed. Additionally, YOLO-LITE focuses on speed rather than the overall physical size of the network and weights.
Experimentation was done with an agile mindset. Using Tiny-YOLOv2 as a starting point, different layers were removed and added and then trained on PASCAL VOC 2007 & 2012 for about 10-12 hours. All of the iterations used the same last layer as Tiny-YOLOv2; this layer is responsible for splitting the feature map into the S×S grid for predicting the bounding boxes. The trials were then tested on the validation set of PASCAL VOC 2007 to calculate mAP. PASCAL VOC was used for the development of the architecture, since its small size allows for quicker training. The best performing model was used as a platform for the next round of iterations.
While there was a focus on trying to intuit what would improve mAP and FPS, it was hard to find good indicators. From the beginning, it was assumed that the FLOPS count would correlate with FPS; this proved to be true. However, adding more filters, making filters bigger, and adding more layers did not easily translate to an improved mAP.

A. Setup
Darknet, the framework created to develop YOLO, was used to train and test the models. Training was done on an Alienware Aurora R7 with an Intel i7 CPU and an Nvidia 1070 GPU. Frames-per-second testing was done on a Dell XPS 13 laptop using Darkflow's live demo example script.
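The FPS figures in this paper come from Darkflow's live demo script; a generic way to take an equivalent measurement is to time the per-frame forward pass over a webcam stream, as in the sketch below. The run_inference callable is a placeholder for whichever detector is being benchmarked and is not part of the paper's tooling.

import time
import cv2  # OpenCV, used here only for webcam capture

def measure_fps(run_inference, num_frames=200, camera_index=0):
    """Average end-to-end FPS of `run_inference` on live webcam frames."""
    cap = cv2.VideoCapture(camera_index)
    start, processed = time.time(), 0
    while processed < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        run_inference(frame)      # placeholder: forward pass of the detector under test
        processed += 1
    cap.release()
    elapsed = time.time() - start
    return processed / elapsed if elapsed > 0 else 0.0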
B. PASCAL VOC and COCO Datasets
YOLO-LITE was trained on two datasets. The model was first trained on a combination of PASCAL VOC 2007 and 2012 [18], which contains 20 classes and approximately 5,000 training images.
The highest performing model trained on PASCAL VOC was then retrained on the second dataset, COCO 2014 [19], containing 80 classes and approximately 40,000 training images. Figure 3 shows some example images with object segmentation taken from the COCO dataset.

Fig. 3. Example images with image segmentation from the COCO dataset [19].

TABLE II
PASCAL VOC AND COCO DATASETS
Dataset                  Training Images  Number of Classes
PASCAL VOC 2007 + 2012   5,011            20
COCO 2014                40,775           80

C. Indicators for Speed and Precision
Table III reveals what was successful and what was not when developing YOLO-LITE. The loss reported in Table III was not a good indicator of mAP: while a high loss indicates a low mAP, there is no exact relationship between the two. This is because the terms in Equation 3 are not defined directly by the mAP but rather by a combination of different features. The training time, taken in conjunction with the number of epochs, was a very good indicator of FPS, as seen from Trials 3, 6, etc. The FLOPS count was also a good indicator, but given that it does not take into consideration the calculations and time required for batch normalization, it was not as good as considering the epoch rate.
Trials 4, 5, 8, and 10 showed that there was no clear relationship between adding more layers and filters and improving accuracy.

TABLE III
RESULTS FOR EACH TRIAL RUN ON PASCAL VOC
Model                             Layers  mAP     FPS   FLOPS    Time      Loss
Tiny-YOLOv2 (TY)                  9       40.48%  2.4   6.97 B   12 hours  1.26
TY-No Batch Normalization (NB)    9       35.83%  3     6.97 B   12 hours  0.85
Trial 1                           7       12.64%  1.56  28.69 B  10 hours  1.91
Trial 2                           9       30.24%  6.94  1.55 B   5 hours   1.37
Trial 2 (NB)                      9       23.49%  12.5  1.55 B   6 hours   1.56
Trial 3                           7       34.59%  9.5   482 M    10 hours  1.68
Trial 3 (NB)                      7       33.57%  21    482 M    7 hours   1.64
Trial 4                           8       2.35%   5.2   1.03 B   10 hours  1.93
Trial 5                           7       0.55%   3.5   426 M    7 hours   2.4
Trial 6                           7       29.33%  9.7   618 M    11 hours  1.91
Trial 7                           8       16.84%  5.7   482 M    7 hours   2.3
Trial 8                           8       24.22%  7.8   490 M    13 hours  1.3
Trial 9                           7       28.64%  21    846 M    12 hours  1.5
Trial 10                          7       23.44%  8.2   1.661 B  10 hours  1.55
Trial 11                          7       15.91%  21    118 M    12 hours  1.35
Trial 12                          8       26.90%  6.9   71 M     9 hours   1.74
Trial 12 (NB)                     8       25.16%  15.6  71 M     12 hours  1.35
Trial 13                          8       39.04%  5.8   1.083 B  11 hours  1.42
Trial 13 (NB)                     8       33.03%  10.5  1.083 B  16 hours  0.77

TABLE IV
ARCHITECTURE OF EACH TRIAL ON PASCAL VOC
Trial          Architecture Description
TY-NB          Same architecture as TY, but with no batch normalization.
Trial 1        First 3 layers same as TY. Layer 4 has 512 3x3 filters. Layer 5 has 1024 3x3 filters. Layers 6 & 7 same as the last 2 layers of TY.
Trial 2        Same architecture as TY, but input image size shrunk to 208x208.
Trial 2 (NB)   Same architecture as Trial 2, but no batch normalization.
Trial 3        First 4 layers same as TY. Layer 5 has 128 3x3 filters. Layer 6 has 128 3x3 filters. Layer 7 has 256 3x3 filters. Layer 8 has 125 1x1 filters.
Trial 3 (NB)   Same architecture as Trial 3, but no batch normalization.
Trial 4        Layer 1: 5 3x3 filters. Layer 2: 5 3x3 filters. Layer 3: 16 3x3 filters. Layer 4: 64 2x2 filters. Layer 5: 256 2x2 filters. Layer 6: 128 2x2 filters. Layer 7: 512 1x1 filters. Layer 8: 125 1x1 filters.
Trial 5        L1: 8 3x3 filters. L2: 16 3x3 filters. L3: 32 1x1 filters. L4: 64 1x1 filters. L5: 64 1x1 filters. L6: 125 1x1 filters.
Trial 6        Same as Trial 3, but the activation functions were changed to ReLU instead of Leaky ReLU.
Trial 7        L1: 32 3x3 filters. L2: 34 3x3 filters. L3: 64 1x1 filters. L4: 128 3x3 filters. L5: 256 3x3 filters. L6: 1024 1x1 filters. L7: 125 1x1 filters.
Trial 8        Same as Trial 3, but with one additional layer before L7 with 256 3x3 filters.
Trial 9        Same as Trial 3 (NB), but with the input raised to 300x300.
Trial 10       Same as Trial 3 (NB), but with the input raised to 416x416.
Trial 11       Same as Trial 3 (NB), but with the input lowered to 112x112.
Trial 12
Trial 12 (NB)  Same architecture as Trial 12, but no batch normalization.
Trial 13       Same architecture as TY but with one less layer; it does not have layer 8 of TY.
Trial 13 (NB)  Same architecture as Trial 13, but no batch normalization.

D. Image Size
Comparing Trial 2 to Tiny-YOLOv2 showed that reducing the input image size by half can more than double the speed of the network (6.94 FPS vs. 2.4 FPS) but will also affect the mAP (30.24% vs. 40.48%). Reducing the input image size means that less of the image is passed through the network. This allows the network to be leaner, but it also means that some data is lost. We determined that, for our purposes, it was better to take the speedup over the mAP.
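The reason a smaller input helps is that a convolution's work scales with the spatial area of its input. The back-of-the-envelope check below, using a single hypothetical 3×3, 16-filter first layer, shows that halving each image dimension cuts that layer's multiply-adds roughly fourfold; the gap between that factor and the measured ~2.9× FPS gain is presumably overhead that does not shrink with the image.

def first_layer_macs(size, filters=16, channels=3, k=3):
    # Multiply-accumulates of one k x k convolution with `filters` output channels
    # applied densely over a size x size input with `channels` input channels.
    return size * size * channels * filters * k * k

print(first_layer_macs(416) / first_layer_macs(208))   # 4.0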
E. Batch Normalization
Batch normalization [20] offers many different improvements, most notably a faster training time. Trials have shown that batch normalization improves accuracy over the same network without it, and YOLOv2 and v3 have also seen improvements in training and mAP by implementing batch normalization [14], [21]. While there is much empirical evidence showing the benefits of using batch normalization, we found it not to be necessary while developing YOLO-LITE. To understand why that is the case, it is necessary to understand what batch normalization is.
Batch normalization entails taking the output of one layer and transforming it to have a mean of zero and a standard deviation of one before it is input to the next layer. The idea behind batch normalization is that, during training with mini-batches, it is hard for the network to learn the true ground-truth distribution of the data, as each mini-batch may have a different mean and variance; this issue is described as covariate shift. Covariate shift makes it difficult to properly train the model, as certain features may be disproportionately saturated by activation functions, which is what is referred to as the vanishing gradient problem. By keeping the inputs on the same scale, batch normalization stabilizes the network, which in turn allows the network to train more quickly. At test time, estimated values from training are used to batch normalize the test image.
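A minimal NumPy sketch of that transformation is shown below, including the learned scale and shift (gamma, beta) and the epsilon term from Ioffe and Szegedy [20]; at test time the running statistics collected during training stand in for the per-batch ones. The momentum and epsilon values are common defaults, not values taken from this paper.

import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training, eps=1e-5, momentum=0.1):
    """Normalize activations x of shape (batch, features) to zero mean, unit variance."""
    if training:
        mean, var = x.mean(axis=0), x.var(axis=0)
        # Keep running estimates for use at test time.
        running_mean[:] = (1 - momentum) * running_mean + momentum * mean
        running_var[:] = (1 - momentum) * running_var + momentum * var
    else:
        mean, var = running_mean, running_var        # estimated values from training
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta                      # learned scale and shift

Every activation passes through this extra arithmetic on each forward pass, which is the per-layer cost the next paragraph argues a seven-layer network does not earn back.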
As YOLO-LITE is a small network, it does not suffer greatly from covariate shift and, in turn, from the vanishing gradient problem. Therefore, the assumption made by Ioffe et al. that batch normalization is necessary does not apply. In addition, the batch normalization calculation that takes place between each layer holds up the network and slows down the whole feedforward process. During the feedforward pass, every input value has to be updated; for example, the initial input layer of 224 × 224 × 3 alone has over 150k values to normalize. This calculation, happening at every layer, builds up the time necessary for the forward pass. This led us to do away with batch normalization.

F. Pruning
Pruning is the idea of cutting certain weights based on their importance. It has been shown that a simple pruning method of removing anything less than a certain threshold can reduce the number of parameters in AlexNet by 9× and in VGGNet by 13× with little effect on accuracy [22]. It has also been suggested [22] that pruning along with quantization of the weights and Huffman coding can greatly shrink the network size and also speed up the network by 3 to 4 times (AlexNet, VGGNet).
Pruning YOLO-LITE showed no improvement in accuracy or speed. This is not surprising, as the results reported in [22] are on networks that have many fully-connected layers, whereas YOLO-LITE consists mainly of convolutional layers; this explains the lack of results. A method such as the one suggested by Li et al. for pruning convolutional neural networks may be more promising [23]: they suggest pruning whole filters instead of individual weights.
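A sketch of the simple threshold pruning described above [22] is given below; zeroing the smallest-magnitude fraction of each weight array is the general idea, while the percentile-based cutoff and the 50% sparsity are illustrative choices rather than settings reported in the paper.

import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction `sparsity` of each weight array."""
    pruned = []
    for w in weights:
        threshold = np.percentile(np.abs(w), sparsity * 100)
        mask = np.abs(w) >= threshold      # keep only weights at or above the cutoff
        pruned.append(w * mask)
    return pruned

# Example: prune half of a random 3x3x16x32 convolution kernel.
kernel = np.random.randn(3, 3, 16, 32)
sparse_kernel = magnitude_prune([kernel], sparsity=0.5)[0]
print((sparse_kernel == 0).mean())  # ≈ 0.5

Because the zeroed weights still sit inside dense convolution kernels, a standard forward pass does the same amount of work, which is consistent with the lack of speedup observed here.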

IV. RESULTS
There were 18 different trials attempted during the experimentation phase. Figure 4 shows the mAP and the FPS for each of these trials and for Tiny-YOLOv2.

Fig. 4. Comparison of trials attempted while developing YOLO-LITE.

While all of the development of YOLO-LITE was done on PASCAL VOC, the best trial was also run on COCO. Table V shows the top results achieved on both datasets.

TABLE V
RESULTS FROM TRIAL 3, NO BATCH NORMALIZATION (NB)
Dataset      mAP     FPS
PASCAL VOC   33.77%  21
COCO         12.26%  21

A. Architecture
Table VI and Table VII show the architectures of Tiny-YOLOv2 and of the best performing trial of YOLO-LITE, Trial 3-no batch. Tiny-YOLOv2 is composed of 9 convolutional layers with a total of 3,181 filters and 6.97 billion FLOPS. In contrast, Trial 3-no batch of YOLO-LITE consists of only 7 layers with a total of 749 filters and 482 million FLOPS. Comparing the two models, Tiny-YOLOv2 has about 14× more FLOPS than YOLO-LITE Trial 3-no batch. A lighter model with a reduced number of layers enables the faster performance of YOLO-LITE.

TABLE VI
TINY-YOLOV2-VOC ARCHITECTURE
Layer          Filters  Size  Stride
Conv1 (C1)     16       3×3   1
Max Pool (MP)           2×2   2
C2             32       3×3   1
MP                      2×2   2
C3             64       3×3   1
MP                      2×2   2
C4             128      3×3   1
MP                      2×2   2
C5             256      3×3   1
MP                      2×2   2
C6             512      3×3   1
MP                      2×2   2
C7             1024     3×3   1
C8             1024     3×3   1
C9             125      1×1   1
Region

TABLE VII
YOLO-LITE: TRIAL 3 ARCHITECTURE
Layer  Filters  Size  Stride
C1     16       3×3   1
MP              2×2   2
C2     32       3×3   1
MP              2×2   2
C3     64       3×3   1
MP              2×2   2
C4     128      3×3   1
MP              2×2   2
C5     128      3×3   1
MP              2×2   2
C6     256      3×3   1
C7     125      1×1   1
Region
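For readers who want to experiment with the Trial 3-no batch stack outside of Darknet, a hedged Keras sketch of Table VII is given below. The 224×224 input is taken from the batch-normalization discussion above, while the "same" padding, the leaky-ReLU slope of 0.1, the linear final 1×1 layer, and the omission of the Region (box-decoding) layer are assumptions based on the usual Tiny-YOLOv2 configuration rather than details stated in the paper.

import tensorflow as tf
from tensorflow.keras import layers

def yolo_lite_trial3(input_size=224, num_outputs=125):
    """Trial 3-no batch convolutional stack from Table VII (Region layer omitted)."""
    def block(filters, pool=True):
        steps = [layers.Conv2D(filters, 3, strides=1, padding="same"),
                 layers.LeakyReLU(alpha=0.1)]
        if pool:
            steps.append(layers.MaxPooling2D(pool_size=2, strides=2))
        return steps

    stack = [layers.InputLayer(input_shape=(input_size, input_size, 3))]
    stack += block(16)                 # C1 + MP
    stack += block(32)                 # C2 + MP
    stack += block(64)                 # C3 + MP
    stack += block(128)                # C4 + MP
    stack += block(128)                # C5 + MP
    stack += block(256, pool=False)    # C6
    stack += [layers.Conv2D(num_outputs, 1, strides=1, padding="same")]  # C7: 125 1x1 filters
    return tf.keras.Sequential(stack)

model = yolo_lite_trial3()
model.summary()   # layer shapes and parameter counts

With a 224×224 input and five stride-2 pools, the final feature map is 7×7, i.e., the S×S grid, and the 125 output channels correspond to 5 boxes × (20 VOC classes + 5 box terms).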
V. COMPARISON WITH OTHER FAST OBJECT DETECTION NETWORKS
When it comes to real-time object detection algorithms for non-GPU devices, the competition for YOLO-LITE is rather slim. YOLO's tiny architectures, which were the starting point for YOLO-LITE, are among the quickest object detection algorithms. While they are much quicker than the larger YOLO architectures, they are hardly real-time on non-GPU computers (~2.4 FPS).
Google has an object detection API with a model zoo containing several lightweight architectures [24]. The most impressive was SSD MobileNet V1, which clocks in at 5.8 FPS on a non-GPU laptop with an mAP of 21%. MobileNet [25] uses depthwise separable convolutions, as opposed to YOLO's method, to lighten a model for real-time object detection. A depthwise separable convolution combines a depthwise convolution and a pointwise convolution: the depthwise convolution applies one filter to each channel, and the pointwise convolution then applies a 1×1 convolution across channels [25]. This technique aims to lighten a model while maintaining the same amount of information learned in each convolution. The depthwise convolutions in MobileNet potentially explain the higher mAP results of SSD MobileNet on COCO.
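The cost argument behind depthwise separable convolutions can be seen with a few lines of arithmetic; the layer dimensions below (3×3 kernels, 128 input and 256 output channels on a 52×52 feature map) are an arbitrary example, not a layer taken from either network.

# Cost of a standard 3x3 convolution vs. a depthwise separable one.
h, w = 52, 52          # spatial size of the feature map
c_in, c_out, k = 128, 256, 3

standard = h * w * c_in * c_out * k * k                 # one dense filter bank
depthwise = h * w * c_in * k * k                        # one kxk filter per input channel
pointwise = h * w * c_in * c_out                        # 1x1 convolution across channels
separable = depthwise + pointwise

print(f"standard:  {standard / 1e6:.1f} M multiply-adds")
print(f"separable: {separable / 1e6:.1f} M multiply-adds")
print(f"reduction: {standard / separable:.1f}x")        # ≈ 8.7x for these sizes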
TABLE VIII
COMPARISON OF STATE OF THE ART ON THE COCO DATASET
Model             mAP     FPS
Tiny-YOLOv2       23.7%   2.4
SSD MobileNet V1  21%     5.8
YOLO-LITE         12.26%  21

Table VIII shows how YOLO-LITE compares: it is 3.6× faster than SSD MobileNet V1 and 8.8× faster than Tiny-YOLOv2.

A. Web Implementation
After successfully training models for both VOC and COCO, the architectures along with their respective weights files were converted and implemented as a web-based model¹, also accessible from a cellphone. Although YOLO-LITE runs at about 21 FPS locally on a Dell XPS 13 laptop, once pushed onto the website the model runs at around 10 FPS. The FPS may differ depending on the device.
¹ https://reu2018dl.github.io/

VI. CONCLUSION
YOLO-LITE achieved its goal of bringing object detection to non-GPU computers. In addition, YOLO-LITE offers several contributions to the field of object detection. First, YOLO-LITE shows that shallow networks have immense potential for lightweight, real-time object detection. Running at 21 FPS on a non-GPU computer is very promising for such a small system. Second, YOLO-LITE shows that the use of batch normalization should be questioned when it comes to smaller, shallow networks. Movement in this area of lightweight real-time object detection is the last frontier in making object detection usable in everyday instances.

VII. FUTURE WORK
Lightweight architectures in general have a significant drop-off in accuracy compared to the original YOLO architecture. YOLOv2 has an mAP of 48.1%, which decreases to 23.7% in YOLOv2-Tiny; the YOLO-LITE architecture drops further, to 12.26%. There is a constant tradeoff between the speed of lightweight models and the accuracy of larger models. Although YOLO-LITE is the fastest of the models compared here, its accuracy prevents it from succeeding in real applications such as autonomous vehicles. Future work may include techniques to increase the mAP of both the COCO and VOC models.
While the FPS of YOLO-LITE is at the necessary level for real-time use on non-GPU computers, the accuracy needs improvement in order for it to be a viable model. As mentioned in [26], [21], pre-training the network to classify on ImageNet has had good results when transferring the network back to object detection.
Redmon et al. [10] use R-CNN in combination with YOLO to increase mAP. As previously mentioned, R-CNN finds bounding boxes and classifies each bounding box separately. Once classification is complete, post-processing is used to clean up localization errors [10]. Although less efficient, an improved version of R-CNN (Fast R-CNN) yields a higher accuracy of 66.9%, while YOLO achieves 63.4%. When combining R-CNN and YOLO, the model achieves 75% mAP trained on PASCAL VOC.

Another possible improvement would be to adopt the feature in YOLOv3 where predictions are made at multiple locations. This has helped improve the mAP of YOLOv3 [21] and could potentially help improve the mAP of YOLO-LITE.
Filter pruning, as set out by Li et al. [23], could be another possible improvement. While standard pruning did not help YOLO-LITE, pruning out whole filters can potentially make the network leaner and allow for a more guided training process to learn better weights. While it is not clear whether this would greatly improve the mAP, it can potentially speed up the network, and it will also make the overall in-memory size of the network smaller.
A final improvement could come from ShuffleNet. ShuffleNet uses group convolution and channel shuffling in order to decrease computation while maintaining the same amount of information during training [15]. Implementing group convolution in YOLO-LITE may improve mAP.

VIII. RELEVANT LINKS
More information on the web implementation of YOLO-LITE can be found at https://reu2018dl.github.io/. For the cfg and weights files from training on PASCAL VOC and COCO, visit https://github.com/reu2018dl/yolo-lite.
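Assuming the released files are in the standard Darknet format, they can also be loaded outside of Darknet/Darkflow with OpenCV's DNN module, as sketched below; the file names and the 224×224 input size are placeholders, so substitute whatever the repository actually ships.

import cv2

# Hypothetical file names; use the cfg/weights downloaded from the repository.
net = cv2.dnn.readNetFromDarknet("yolo-lite-trial3-nb.cfg", "yolo-lite-trial3-nb.weights")

image = cv2.imread("example.jpg")
# Darknet models expect RGB input scaled to [0, 1] at the network's input resolution
# (assumed here to be 224x224).
blob = cv2.dnn.blobFromImage(image, scalefactor=1 / 255.0, size=(224, 224), swapRB=True)
net.setInput(blob)
predictions = net.forward()   # raw region-layer output: boxes, objectness, class scores
print(predictions.shape)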
IX. ACKNOWLEDGEMENTS
This research was done through the National Science Foundation under DMS Grant Number 1659288. A special thanks to Kha Gia Quach, Chi Nhan Duong, Khoa Luu, Yishi Wang, and Summerlin Thompson for their support.

REFERENCES
[1] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85-117, 2015.
[2] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580-587.
[3] C. Szegedy, A. Toshev, and D. Erhan, "Deep neural networks for object detection," in Advances in Neural Information Processing Systems, 2013, pp. 2553-2561.
[4] L. Fridman, D. E. Brown, M. Glazer, W. Angell, S. Dodd, B. Jenik, J. Terwilliger, J. Kindelsberger, L. Ding, S. Seaman et al., "MIT autonomous vehicle technology study: Large-scale deep learning based analysis of driver behavior and interaction with automation," arXiv preprint arXiv:1711.06976, 2017.
[5] O. Akgul, H. I. Penekli, and Y. Genc, "Applying deep learning in augmented reality tracking," in Signal-Image Technology & Internet-Based Systems (SITIS), 2016 12th International Conference on. IEEE, 2016, pp. 47-54.
[6] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818-2826.
[7] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in AAAI, vol. 4, 2017, p. 12.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[9] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[10] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779-788.
[11] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," CoRR, vol. abs/1311.2524, 2013. [Online]. Available: http://arxiv.org/abs/1311.2524
[12] R. B. Girshick, "Fast R-CNN," CoRR, vol. abs/1504.08083, 2015. [Online]. Available: http://arxiv.org/abs/1504.08083
[13] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," CoRR, vol. abs/1506.01497, 2015. [Online]. Available: http://arxiv.org/abs/1506.01497
[14] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," arXiv preprint, 2017.
[15] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," arXiv preprint arXiv:1707.01083, 2017.

[16] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.
[17] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[18] PASCAL, "The PASCAL Visual Object Classes homepage," http://host.robots.ox.ac.uk/pascal/VOC/index.html, last accessed 2018-07-18.
[19] "COCO dataset," http://cocodataset.org/#home, accessed 2018-07-23.
[20] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[21] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[22] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," arXiv preprint arXiv:1608.08710, 2016.
[24] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama et al., "Speed/accuracy trade-offs for modern convolutional object detectors," in IEEE CVPR, vol. 4, 2017.
[25] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[26] J. Redmon, "YOLO: Real-time object detection," https://pjreddie.com/darknet/yolo/, last accessed 2018-06-24.
