

An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection

Youngwan Lee* Joong-won Hwang* Sangrok Lee† Yuseok Bae


ETRI ETRI SK C&C ETRI
[email protected] [email protected] [email protected] [email protected]

Jongyoul Park
ETRI
[email protected]

Abstract

As DenseNet conserves intermediate features with diverse receptive fields by aggregating them with dense connection, it shows good performance on the object detection task. Although feature reuse enables DenseNet to produce strong features with a small number of model parameters and FLOPs, the detector with DenseNet backbone shows rather slow speed and low energy efficiency. We find that the linearly increasing input channel caused by dense connection leads to heavy memory access cost, which causes computation overhead and more energy consumption. To solve the inefficiency of DenseNet, we propose an energy and computation efficient architecture called VoVNet, comprised of One-Shot Aggregation (OSA). OSA not only adopts the strength of DenseNet, which represents diversified features with multiple receptive fields, but also overcomes the inefficiency of dense connection by aggregating all features only once in the last feature map. To validate the effectiveness of VoVNet as a backbone network, we design both lightweight and large-scale VoVNets and apply them to one-stage and two-stage object detectors. Our VoVNet based detectors outperform DenseNet based ones with 2× faster speed, and the energy consumptions are reduced by 1.6× - 4.1×. In addition to DenseNet, VoVNet also outperforms the widely used ResNet backbone with faster speed and better energy efficiency. In particular, the small object detection performance is significantly improved over DenseNet and ResNet.

[Figure 1]
Figure 1. Aggregation methods. (a) Dense aggregation of DenseNet [8] aggregates all previous features at every subsequent layer, which increases the input channel size linearly while adding only a few new outputs. (b) Our proposed One-Shot Aggregation concatenates all features only once in the last feature map, which makes the input size constant and enables enlarging the new output channels. F represents a convolution layer and ⊗ indicates concatenation.

* Equal contribution.
† This work was done while Sangrok Lee was an intern at ETRI.

1. Introduction

With the massive progress of convolutional neural networks (CNN) such as VGGNet [17], GoogLeNet [19], Inception-V4 [18], ResNet [6], and DenseNet [8], it has become mainstream in object detection to adopt modern state-of-the-art CNN models as feature extractors. As DenseNet is reported to achieve state-of-the-art performance in the classification task recently, it is natural to attempt to expand its usage to detection tasks. In our experiment (Table 4), we find that DenseNet based detectors with fewer parameters and FLOPs outperform detectors with ResNet, which is the most widely used backbone for object detection.

The main difference between ResNet and DenseNet is the way they aggregate their features: ResNet aggregates features from shallower layers by summation, while DenseNet does it by concatenation. As mentioned by Zhu et al. [26],

information carried by early feature maps would be washed out as it is summed with others. On the other hand, with concatenation, information lasts because it is preserved in its original form. Several works [19, 12, 10] demonstrate that abstracted features with multiple receptive fields can capture visual information at various scales. As the detection task requires models to recognize objects at more varied scales than classification, preserving information from various layers is especially important for detection, since each layer has a different receptive field. Therefore, by preserving and accumulating feature maps of multiple receptive fields, DenseNet has a better and more diverse feature representation than ResNet in terms of the object detection task.

However, we also find in the experiment that detectors with DenseNet, which has fewer FLOPs and model parameters, spend more energy and time than those with ResNet. This is because there are factors other than FLOPs and model size that influence energy and time consumption. First, the memory access cost (MAC) required for accessing memory for intermediate feature maps is a crucial factor of these consumptions [13, 22]. As illustrated in Figure 1(a), since all previous feature maps in DenseNet are used as input to the subsequent layer by dense connection, the memory access cost increases quadratically with network depth, which in turn leads to computation overhead and more energy consumption.

Second, with respect to GPU parallel computation, DenseNet has the limitation of a computation bottleneck. In general, GPU parallel computing utilization is maximized when the operand tensor is larger [14, 23, 10]. However, due to its linearly increasing input channel, DenseNet needs to adopt the 1×1 convolution bottleneck architecture to reduce input dimension and FLOPs, which results in a larger number of layers with smaller operand tensors. As a result, GPU computation becomes inefficient.

The goal of this paper is to improve DenseNet to be more efficient while preserving the benefit of concatenative aggregation for the object detection task. We first discuss MAC and GPU-computation efficiency and how to consider these factors at the architecture design stage. Secondly, we claim that the dense connections in intermediate layers of DenseNet induce the inefficiencies, and hypothesize that the dense connections are redundant. With these thoughts, we propose a novel One-Shot Aggregation (OSA) that aggregates intermediate features at once, as shown in Figure 1(b). This aggregation method brings great benefit to MAC and GPU computation efficiency while it preserves the strength of concatenation. With OSA modules, we build VoVNet¹, an energy efficient backbone for real-time detection. To validate the effectiveness of VoVNet as a backbone network, we apply VoVNet to various object detectors such as DSOD, RefineDet, and Mask R-CNN. The results show that VoVNet based detectors outperform DenseNet or ResNet based ones with better energy efficiency and speed.

¹ It means Variety of View Network.

2. Factors of Efficient Network Design

When designing an efficient network, many studies such as MobileNet v1 [7], MobileNet v2 [15], ShuffleNet v1 [25], ShuffleNet v2 [13], and Pelee [20] have focused mainly on reducing FLOPs and model sizes by using depthwise convolution and the 1×1 convolution bottleneck architecture. However, reducing FLOPs and model sizes does not always guarantee a reduction of GPU inference time and real energy consumption. Ma et al. [13] show in an experiment that ShuffleNet v2 with a similar number of FLOPs runs faster than MobileNet v2 on GPU. Chen et al. [2] also show that while SqueezeNet has 50× fewer weights than AlexNet, it consumes more energy than AlexNet. These phenomena imply that FLOPs and model sizes are indirect metrics for measuring practicality, and designing networks based on these metrics should be reconsidered. To build efficient network architectures that focus on more practical and valid metrics such as energy per image and frames per second (FPS), besides FLOPs and model parameters, it is important to consider the other factors that influence energy and time consumption.

2.1. Memory Access Cost

The first factor we point out is memory access cost (MAC). The main source of energy consumption in CNNs is memory access rather than computation [22]. Specifically, accessing data from DRAM (Dynamic Random Access Memory) for an operation consumes orders of magnitude higher energy than the computation itself. Moreover, the time budget for memory access accounts for a large proportion of time consumption and can even be the bottleneck of the GPU process [13]. This implies that even with the same number of computations and parameters, if the total number of memory accesses varies with model structure, the energy consumption will also differ.

One reason for the discrepancy between model size and the number of memory accesses is the intermediate activation memory footprint. As stated by Chen et al. [1], the memory footprint is attributed to both the filter parameters and the intermediate feature maps. If the intermediate feature maps are large, the cost of memory access increases even with the same model parameters. Therefore, we consider MAC, which covers the memory footprint of both filter parameters and intermediate feature map sizes, to be an important factor for network design. Specifically, we follow the method of Ma et al. [13] to calculate the MAC of each convolutional layer as below:

    MAC = hw(c_i + c_o) + k²·c_i·c_o        (1)



The notations k, h, w, c_i, and c_o denote the kernel size, the height/width of the input and output response, the channel size of the input, and that of the output response, respectively.
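Eq. (1) is simple enough to evaluate directly. The following minimal Python sketch (ours, with illustrative numbers that are not taken from the paper) computes the MAC of a single convolutional layer:

```python
def conv_mac(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """Memory access cost of one k x k convolution, per Eq. (1):
    h*w*(c_in + c_out) feature-map reads/writes plus
    k^2 * c_in * c_out filter-weight accesses."""
    return h * w * (c_in + c_out) + k * k * c_in * c_out

# Illustrative: a 3x3 layer on a 56x56 feature map with 128 -> 128 channels.
print(conv_mac(56, 56, 128, 128, 3))  # 950272
```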
2.2. GPU-Computation Efficiency

Network architectures that reduce their FLOPs for speed are based on the idea that every floating point operation is processed at the same speed on a device. However, this is incorrect when a network is deployed on GPU, because of the GPU parallel processing mechanism. As a GPU is able to process many floating point operations at a time, it is important to utilize its computational ability efficiently. We use the term GPU-computation efficiency for this concept.

GPU parallel computing power is utilized better as the computed data tensor becomes larger [23, 10]. Splitting a large convolution operation into several fragmented smaller operations makes GPU computation inefficient, as fewer computations are processed in parallel. In the context of network design, this implies that it is better to compose the network with fewer layers if the behavior function is the same. Moreover, adopting extra layers causes kernel launching and synchronization, which results in additional time overhead [13].

Accordingly, although techniques such as depthwise convolution and the 1×1 convolution bottleneck can reduce the number of FLOPs, they are harmful to GPU-computation efficiency as they adopt additional 1×1 convolutions. More generally, GPU-computation efficiency varies with the model architecture. Therefore, for validating the computation efficiency of network architectures, we introduce FLOPs per second (FLOP/s), which is computed by dividing the total FLOPs by the actual GPU inference time. A high FLOP/s implies the architecture utilizes GPU power efficiently.
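As a quick sanity check of this metric (ours, not from the paper), FLOP/s equals FLOPs-per-image times frames-per-second, so the computation-efficiency column of Table 2 in Section 4 can be recovered from its FLOPs and FPS columns; the FPS values reported there are rounded, so the products are only approximate:

```python
def gflops_per_second(gflops_per_image: float, fps: float) -> float:
    """GPU-computation efficiency: total FLOPs divided by actual
    inference time, i.e. FLOPs-per-image times frames-per-second."""
    return gflops_per_image * fps

# FLOPs (G) and FPS values from Table 2:
for name, gflops, fps in [("SSD300-MobileNet", 1.1, 37),
                          ("Pelee304", 1.2, 35),
                          ("DSOD300-DenseNet-67", 5.3, 35),
                          ("DSOD300-VoVNet-27-slim", 5.6, 71)]:
    print(f"{name}: {gflops_per_second(gflops, fps):.0f} GFLOP/s")
```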
[Figure 2]
Figure 2. The average absolute filter weights of convolutional layers in trained DenseNet [8] (top) and VoVNet (middle, bottom). The color of pixel (s, ℓ) encodes the average L1 norm of weights connecting layer s to layer ℓ. "OSA module (x/y)" indicates that the OSA modules consist of x layers with y channels.

3. Proposed Method

3.1. Rethinking Dense Connection

The dense connection that aggregates all intermediate layers induces inevitable inefficiency, which comes from the fact that the input channel size of each layer increases linearly as the layers proceed. Because of the intensive aggregation, the dense block can produce only a few features under a FLOPs or parameter constraint. In other words, DenseNet trades the quantity of features for the quality of features via the dense connection. Although the performance of DenseNet seems to prove that this trade is beneficial, there are some other drawbacks of the trade from the perspective of energy and time.

First, dense connections induce high memory access cost, which is paid in energy and time. As mentioned by Ma et al. [13], the lower bound of MAC, or the number of memory access operations, of a convolutional layer can be represented by MAC ≥ 2√(hwB/k²) + B/hw, where B = k²·hw·c_i·c_o is the number of computations. Because this lower bound has its ground in the mean value inequality, MAC can be minimized when the input and output have the same channel size under a fixed number of computations or model parameters. Dense connections increase the input channel size while the output channel size remains constant, and as a result, each layer has imbalanced input and output channel sizes. Therefore, DenseNet has high MAC among models with the same number of computations or parameters, and consumes more energy and time.
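To see this bound concretely, the short check below (ours) compares two 3×3 layers with the same number of computations B but different channel balance; the channel counts are illustrative and not from the paper:

```python
import math

def conv_mac(h, w, c_in, c_out, k):
    # Eq. (1): MAC = hw(c_i + c_o) + k^2 c_i c_o
    return h * w * (c_in + c_out) + k * k * c_in * c_out

h = w = 56
k = 3
# Two layers with the *same* computation B = k^2 hw c_i c_o
# (c_i * c_o = 4096 in both cases), but different channel balance:
balanced   = conv_mac(h, w, 64, 64, k)   # c_i == c_o
imbalanced = conv_mac(h, w, 512, 8, k)   # DenseNet-like: wide input, few new outputs

B = k * k * h * w * 64 * 64
lower_bound = 2 * math.sqrt(h * w * B / k**2) + B / (h * w)
print(balanced, imbalanced, round(lower_bound))  # 438272 1667584 438272
# The balanced layer hits the lower bound exactly; the imbalanced one
# pays roughly 4x more feature-map access for the same computation.
```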



Second, the dense connection imposes the use of the bottleneck structure, which harms the efficiency of GPU parallel computation. The linearly increasing input size is critically problematic when the model size is big, because it makes the overall computation grow quadratically with respect to depth. To suppress this growth, DenseNet adopts the bottleneck architecture, which adds 1×1 convolutional layers to keep the input size of the 3×3 convolutional layer constant. Although this solution can reduce FLOPs and parameters, it harms GPU parallel computation efficiency as discussed. The bottleneck architecture divides one 3×3 convolutional layer into two smaller layers and causes more sequential computations, which lowers the inference speed.

Because of these drawbacks, DenseNet becomes inefficient in terms of energy and time. To improve efficiency, we first investigate how dense connections actually aggregate features once the network is trained. Huang et al. [8] illustrate the connectivity of the dense connection by evaluating the normalized L1 norm of input weights to each layer. These values show the normalized influence of each preceding layer on the corresponding layers. The figures are represented in Figure 2 (top).

In Dense Block3, the red boxes near the diagonal show that aggregations on intermediate layers are active. However, in the classification layer, only a small proportion of intermediate features is used. In contrast, in Dense Block1 the transition layer aggregates most of its input features well, while the intermediate layers do not.

With these observations, we hypothesize that there is a negative relation between the strength of aggregation on intermediate layers and that on final layers. This can be true if the dense connection between intermediate layers induces correlation between features from each layer. This means that the dense connection makes later intermediate layers produce features that are better but also similar to the features from former layers. In this case, the final layer is not required to learn to aggregate both features, because they represent redundant information. As a result, the influence of the former intermediate layer on the final layer becomes small.

As all intermediate features are aggregated to produce the final feature in the final layer, it is better to produce intermediate features that complement each other, i.e., that are less correlated. Therefore, we can extend our hypothesis: the effect of dense connections on intermediate features is relatively small with respect to their cost. To verify the hypotheses, we redesign a novel module that aggregates its intermediate features only in the final layer of each block.

3.2. One-Shot Aggregation

We integrate the previously discussed thoughts into an efficient architecture, the one-shot aggregation (OSA) module, which aggregates its features in the last layer at once. Figure 1(b) illustrates the proposed OSA module. Each convolution layer has a two-way connection. One way is connected to the subsequent layer to produce a feature with a larger receptive field, while the other way is aggregated only once into the final output feature map. The difference with DenseNet is that the output of each layer is not routed to all subsequent intermediate layers, which makes the input size of the intermediate layers constant.
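To make the wiring concrete, the following is a minimal PyTorch sketch of an OSA module as just described; this is our illustrative code, not the authors' Caffe implementation, and the class and argument names are ours. Whether the module input joins the final concatenation is an implementation detail not fixed by Figure 1(b); we include it here, as public VoVNet implementations commonly do.

```python
import torch
import torch.nn as nn

class OSAModule(nn.Module):
    """One-Shot Aggregation: sequential 3x3 convs whose outputs are
    concatenated once at the end and fused by a 1x1 conv (Figure 1(b)).
    Every 3x3 layer has the same in/out channel count, keeping MAC near
    its lower bound (Section 3.1)."""
    def __init__(self, in_ch: int, stage_ch: int, out_ch: int, num_layers: int = 5):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(      # Conv-BN-ReLU, as in Table 1
                nn.Conv2d(ch, stage_ch, 3, padding=1, bias=False),
                nn.BatchNorm2d(stage_ch),
                nn.ReLU(inplace=True)))
            ch = stage_ch                          # constant input size thereafter
        self.concat_conv = nn.Sequential(          # "concat & 1x1 conv" in Table 1
            nn.Conv2d(in_ch + num_layers * stage_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, x):
        feats = [x]                  # module input joins the aggregation
        for layer in self.layers:
            x = layer(x)             # one way: feed the next layer
            feats.append(x)          # other way: keep for one-shot aggregation
        return self.concat_conv(torch.cat(feats, dim=1))

# Stage-2 block of VoVNet-39 per Table 1: five 128-channel convs, 1x1 conv to 256.
m = OSAModule(in_ch=128, stage_ch=128, out_ch=256)
print(m(torch.randn(1, 128, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```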
To verify our hypotheses, namely that there is a negative relation between the strength of aggregation on intermediate layers and that on the final layer, and that the dense connections are redundant, we conduct the same experiment as Huang et al. [8] on the OSA module. We designed OSA modules to have a similar number of parameters and computations to the dense block used in DenseNet-40. First, we investigate the result on an OSA module with the same number of layers as the dense block, which is 12 (Figure 2 (middle)). The output is bigger than that of the dense block, as the input size of each convolution layer is reduced. The network with OSA modules shows 93.6% accuracy on CIFAR-10 classification, which is a slight drop of 1.2% but still higher than ResNet with a similar model size. It can be observed that the aggregations in the final layers become more intense as the dense connections on the intermediate layers are pruned.

Moreover, the weights of the transition layer of the OSA module show a different pattern from those of DenseNet: features from shallow depth are more aggregated in the transition layer. Since the features from deep layers do not influence the transition layer strongly, we can reduce the number of layers without significant effect. Therefore, we reconfigure the OSA module to have 5 layers with 43 channels each (Figure 2 (bottom)). Surprisingly, with this module, we achieve an error rate of 5.44%, which is similar to that of DenseNet-40 (5.24%). This implies that building deep intermediate features via dense connection is less effective than expected.

Although the network with the OSA module has slightly decreased performance on CIFAR-10, which does not necessarily imply it will underperform on the detection task, it has much less MAC than that with the dense block. Following Eq. (1), it is estimated that substituting the dense block of DenseNet-40 with an OSA module of 5 layers with 43 channels reduces MAC from 3.7M to 2.5M. This is because the intermediate layers in OSA have the same input and output size, which brings MAC to the lower bound. This means that one can build a faster and more energy efficient network if MAC is the dominant factor of energy and time consumption. Specifically, as detection is performed at a higher resolution than classification, the intermediate memory footprint becomes larger, and MAC reflects the energy and time consumption more appropriately.

Also, OSA improves GPU computation efficiency. The input sizes of the intermediate layers of the OSA module are constant. Hence, it is unnecessary to adopt an additional 1×1 conv bottleneck to reduce dimension. Moreover, as the OSA module aggregates shallow features, it consists of fewer layers. As a result, the OSA module is designed to have only a few layers that can be efficiently computed on GPU.

3.3. Configuration of VoVNet

Due to the diversified feature representation and efficiency of the OSA modules, our VoVNet can be constructed by stacking only a few modules with high accuracy and fast speed. Based on the confirmation that shallow depth is more aggregated in Figure 2, we can configure the OSA module with a smaller number of convolutions with larger channels than DenseNet. There are two types of VoVNet: a lightweight network, e.g., VoVNet-27-slim, and large-scale networks, e.g., VoVNet-39/57.



Authorized licensed use limited to: Nanjing Institute Of Technology. Downloaded on December 08,2021 at 12:51:48 UTC from IEEE Xplore. Restrictions apply.
Stem (Stage 1), output stride 2 (identical for all three models):
    3×3 conv, 64, stride=2
    3×3 conv, 64, stride=1
    3×3 conv, 128, stride=1

OSA module, Stage 2, output stride 4:
    VoVNet-27-slim: [3×3 conv, 64, ×5; concat & 1×1 conv, 128] ×1
    VoVNet-39:      [3×3 conv, 128, ×5; concat & 1×1 conv, 256] ×1
    VoVNet-57:      [3×3 conv, 128, ×5; concat & 1×1 conv, 256] ×1

OSA module, Stage 3, output stride 8:
    VoVNet-27-slim: [3×3 conv, 80, ×5; concat & 1×1 conv, 256] ×1
    VoVNet-39:      [3×3 conv, 160, ×5; concat & 1×1 conv, 512] ×1
    VoVNet-57:      [3×3 conv, 160, ×5; concat & 1×1 conv, 512] ×1

OSA module, Stage 4, output stride 16:
    VoVNet-27-slim: [3×3 conv, 96, ×5; concat & 1×1 conv, 384] ×1
    VoVNet-39:      [3×3 conv, 192, ×5; concat & 1×1 conv, 768] ×2
    VoVNet-57:      [3×3 conv, 192, ×5; concat & 1×1 conv, 768] ×4

OSA module, Stage 5, output stride 32:
    VoVNet-27-slim: [3×3 conv, 112, ×5; concat & 1×1 conv, 512] ×1
    VoVNet-39:      [3×3 conv, 224, ×5; concat & 1×1 conv, 1024] ×2
    VoVNet-57:      [3×3 conv, 224, ×5; concat & 1×1 conv, 1024] ×3

Table 1. Overall architecture of VoVNet. Downsampling is done by 3×3 max pooling with a stride of 2 at the end of each stage. Note that each conv layer has the sequence Conv-BN-ReLU.
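Table 1 transcribes directly into a compact configuration; the sketch below (ours, with hypothetical field names) records the per-stage settings, from which a full network is just the stem plus stacked OSA modules with 3×3/stride-2 max pooling between stages:

```python
# Per Table 1: 3x3 conv channels per layer, stage output channels, and
# number of OSA modules, for stages 2-5; every module has 5 conv layers.
VOVNET_CONFIGS = {
    "vovnet-27-slim": dict(stage_ch=[64, 80, 96, 112],
                           out_ch=[128, 256, 384, 512],
                           num_modules=[1, 1, 1, 1]),
    "vovnet-39":      dict(stage_ch=[128, 160, 192, 224],
                           out_ch=[256, 512, 768, 1024],
                           num_modules=[1, 1, 2, 2]),
    "vovnet-57":      dict(stage_ch=[128, 160, 192, 224],
                           out_ch=[256, 512, 768, 1024],
                           num_modules=[1, 1, 4, 3]),
}
```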

VoVNet consists of a stem block including 3 convolution layers and 4 stages of OSA modules, with output stride 32. An OSA module is comprised of 5 convolution layers with the same input/output channels to minimize MAC, as discussed in Section 3.1. Whenever a stage goes up, the feature map is downsampled by 3×3 max pooling with stride 2. VoVNet-39/57 have more OSA modules at the 4th and 5th stages, where downsampling is done in the last module.

Since high-level semantic information is more important for the object detection task, we increase the proportion of high-level features relative to low-level ones by growing the output channels at different stages. Contrary to DenseNet's limitation of only a few new outputs, our strategy allows VoVNet to express a better feature representation with fewer total layers (e.g., VoVNet-57 vs. DenseNet-161). The details of the VoVNet architecture are shown in Table 1.

4. Experiments

In this section, we validate the effectiveness of the proposed VoVNet as a backbone for object detection in terms of GPU-computation and energy efficiency. At first, for comparison with lightweight DenseNet, we apply our lightweight VoVNet-27-slim to DSOD [16], which is the first detector using DenseNet. Then, we compare with state-of-the-art lightweight object detectors such as Pelee [20], which also uses a DenseNet-variant backbone, and SSD-MobileNet [7].

Furthermore, to validate the possibility of generalization to large-scale models, we extend VoVNet to a state-of-the-art one-stage detector, e.g., RefineDet [24], and a two-stage detector, e.g., Mask R-CNN [5], on the more challenging COCO [11] dataset. Since ResNet is the most widely used backbone for object detection and segmentation tasks, we compare VoVNet with ResNet as well as DenseNet. In particular, we compare the speed and accuracy of VoVNet-39/57 with DenseNet-201/161 and ResNet-50/101, as they have similar model sizes.

4.1. Experimental setup

Speed Measurement. For a fair speed comparison, we measure the inference time of all models in Tables 2 and 4 on the same GPU workstation with a TITAN X GPU (Pascal architecture), CUDA v9.2, and cuDNN v7.3. It is noted that Pelee [20] merges the batch normalization layers into convolutions to accelerate inference. As the other models also have batch normalization layers, we compare Pelee without the merge-bn trick for a fair comparison.

Energy Consumption Measurement. We measure the energy consumption of both lightweight and large-scale models during object detection evaluation on the VOC2007 test images (4952 images) and COCO minival images (5000 images), respectively. GPU power usage is measured with Nvidia's system monitor interface (nvidia-smi). We sample the power value at an interval of 100 milliseconds and compute the average of the measured power. The energy consumption per image can be calculated as below:

    Energy per image [Joule/Image] = Average Power [Joule/Second] / Inference Speed [Frame/Second]        (2)

We also measure the total memory usage, which includes not only model parameters but also intermediate activation maps. The measured energy and memory footprints are reported in Table 2.
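This procedure is straightforward to reproduce. The sketch below (ours) polls nvidia-smi at the stated 100 ms interval in a background thread and applies Eq. (2); the query flags are standard nvidia-smi options, while run_detector_on_dataset is a hypothetical placeholder for the actual evaluation loop:

```python
import subprocess
import threading
import time

class PowerSampler:
    """Sample GPU power draw (watts) via nvidia-smi every 100 ms."""
    def __init__(self, interval: float = 0.1):
        self.interval, self.samples, self._stop = interval, [], False

    def _poll(self):
        while not self._stop:
            out = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=power.draw",
                 "--format=csv,noheader,nounits"])
            self.samples.append(float(out.decode().split()[0]))
            time.sleep(self.interval)

    def __enter__(self):
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop = True
        self._thread.join()

def joules_per_image(avg_power_w: float, fps: float) -> float:
    # Eq. (2): energy per image = average power [J/s] / inference speed [frame/s]
    return avg_power_w / fps

# Usage: run detection over the evaluation set inside the sampler.
# with PowerSampler() as p:
#     fps = run_detector_on_dataset()   # hypothetical timing loop
# print(joules_per_image(sum(p.samples) / len(p.samples), fps))
```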
4.2. DSOD

To validate the effectiveness of the backbone part, except for replacing DenseNet-67 (referred to in DSOD [16] as DS-64-64-16-1) with our VoVNet-27-slim, we follow the same hyper-parameters, such as default box scale, aspect ratio, and dense prediction, and the same training protocol, such as 128 total batch size, 100k max iterations, initial learning rate, and learning rate schedule. DSOD with VoVNet is trained on the union of VOC2007 trainval and VOC2012 trainval ("07+12") following [16]. As the original DSOD with DenseNet-67 is trained from scratch, we also train our model without an ImageNet pretrained model. We implement DSOD with VoVNet-27-slim based on the original DSOD Caffe code².

² https://github.com/szq0214/DSOD



Figure 3. Comparisons of lightweight models in terms of the computation and energy efficiency. (a) shows speed vs. accuracy. (b), (c),
and (d) illustrate comparison of GPU-computation-efficiency, energy-efficiency and GPU-computation vs. energy efficiency, respectively.

Detector | Backbone          | FLOPs (G) | FPS (img/s) | #Param (M) | Memory footprint (MB) | Energy efficiency (J/img) | Computation efficiency (GFLOP/s) | mAP
SSD300   | MobileNet [7]     | 1.1       | 37          | 5.7        | 766                   | 2.3                       | 42                               | 68.0
Pelee304 | PeleeNet [20]     | 1.2       | 35          | 5.4        | 1104                  | 2.4                       | 43                               | 70.9
DSOD300  | DenseNet-67 [16]  | 5.3       | 35          | 5.9        | 1294                  | 3.7                       | 189                              | 73.6
DSOD300  | VoVNet-27-slim    | 5.6       | 71          | 5.9        | 825                   | 0.9                       | 400                              | 74.8

Table 2. Comparison with lightweight object detectors. All models are trained on the VOC 2007 and VOC 2012 trainval sets and tested on the VOC 2007 test set.

Backbone        | FLOPs (G) | GPU time (ms) | #Param (M) | Memory footprint (MB) | mAP
VoVNet-27-slim  | 5.6       | 14            | 5.9        | 825                   | 74.8
+ w/ bottleneck | 4.6       | 18            | 4.8        | 895                   | 71.1

Table 3. Ablation study on the 1×1 convolution bottleneck.

VoVNet vs. DenseNet. As shown in Table 2, the proposed VoVNet-27-slim based DSOD300 achieves 74.87% mAP, which is better than the DenseNet-67 based one, even with comparable parameters. In addition to accuracy, the inference speed of VoVNet-27-slim is also two times faster than that of the counterpart, with comparable FLOPs. Pelee [20], a DenseNet-variant backbone, is designed to decompose a dense block into smaller two-way dense blocks, which reduces FLOPs to about 5× less than DenseNet-67. However, despite the fewer FLOPs, Pelee has similar inference speed to DSOD with DenseNet-67. We conjecture that decomposing a dense block into smaller fragmented layers deteriorates GPU computing parallelism. The VoVNet-27-slim based DSOD also outperforms Pelee by a large margin of 3.97% at a much faster speed.

Ablation study on the 1×1 conv bottleneck. To check the influence of the 1×1 convolution bottleneck on model efficiency, we conduct an ablation study where we add a 1×1 convolution, with half the channels of the input, in front of every 3×3 convolution operation in the OSA module. Table 3 shows the comparison results. VoVNet with the 1×1 bottleneck reduces FLOPs and the number of model parameters, but conversely increases GPU inference time and memory footprint compared to the model without it. The accuracy also drops by 3.69% mAP. This is a problem in the same context as why Pelee is slower than DenseNet-67 despite the fewer FLOPs. As the 1×1 bottleneck decomposes a large 3×3 convolution tensor into several smaller tensors, it rather hampers GPU parallel computation. Although the 1×1 bottleneck decreases the number of parameters, it increases the total number of layers in the network, which requires more intermediate activation maps and in turn increases the overall memory footprint.

GPU-Computation Efficiency. Although SSD-MobileNet and Pelee have much fewer FLOPs compared to DSOD-DenseNet-67, DenseNet-67 shows comparable inference speed on GPU. In addition, even with similar FLOPs, VoVNet-27-slim runs twice as fast as DenseNet-67. These results suggest that FLOPs cannot sufficiently reflect inference time, as the GPU-computation efficiencies of models differ significantly. Thus, we adopt FLOP/s, which indicates how well the network utilizes GPU computing resources, as GPU-computation efficiency. On this metric, VoVNet-27-slim achieves the highest value, 400 GFLOP/s, among the compared methods, as described in Figure 3(b). The computation efficiency of VoVNet-27-slim is about 10× higher than those of MobileNet and Pelee, which demonstrates that depthwise convolution and decomposing a convolution into smaller fragmented operations are not efficient in terms of GPU computation-efficiency.




Figure 4. Comparisons of large-scale models on RefineDet320 [24] in terms of the computation and energy efficiency. (a) shows speed vs. accuracy. (b), (c), and (d) illustrate comparison of GPU-computation efficiency, energy efficiency, and GPU-computation vs. energy efficiency, respectively.

Backbone                | FLOPs (G) | FPS (img/s) | #param (M) | Memory footprint (MB) | Energy efficiency (J/img) | Computation efficiency (GFLOP/s) | COCO AP (AP/APS/APM/APL)
ResNet-50 [6]           | 25.43     | 23.2        | 63.46      | 2229                  | 5.3                       | 591.3                            | 30.3/10.2/32.8/46.9
DenseNet-201 (k=32) [8] | 24.65     | 12.0        | 56.13      | 3498                  | 9.9                       | 296.9                            | 32.5/11.3/35.4/50.1
VoVNet-39               | 32.6      | 25.0        | 56.28      | 2199                  | 4.8                       | 815.0                            | 33.5/12.8/36.8/49.2
ResNet-101 [6]          | 33.02     | 17.5        | 82.45      | 3013                  | 7.5                       | 579.2                            | 32.0/10.5/34.7/50.4
DenseNet-161 (k=48) [8] | 32.74     | 12.8        | 66.76      | 3628                  | 10.0                      | 419.7                            | 33.5/11.6/36.6/51.4
VoVNet-57               | 36.45     | 21.2        | 70.32      | 2511                  | 5.9                       | 775.5                            | 33.9/12.8/37.1/50.3

Table 4. Comparison of backbone networks on RefineDet320 [24] on the COCO test-dev set.

Given these results, it is worth noting that VoVNet makes full use of GPU computation resources most efficiently. As a result, VoVNet achieves a significantly better speed-accuracy tradeoff, as shown in Figure 3(a).

Energy Efficiency. When validating the efficiency of a network, another important thing to consider is energy efficiency (Joule/frame). The metric is the amount of energy consumed to process an image; a lower value means better energy efficiency. We measure energy consumption and obtain the energy efficiencies of VoVNet and other model based detectors. Table 2 shows a tendency between energy efficiency and memory footprint. VoVNet based DSOD consumes only 0.9J per image, which is 4.1× less than the DenseNet based one. We can note that the excessive intermediate activation maps of DenseNet increase the memory footprint, which results in more energy consumption. It is also notable that MobileNet shows worse energy efficiency than VoVNet although its memory footprint is lower. This is because depthwise convolution requires fragmented memory access and in turn increases memory access costs [9]. Figure 3(c) describes accuracy vs. energy efficiency, where with two times better energy efficiency than MobileNet and Pelee, VoVNet outperforms the counterparts by large margins of 6.87% and 3.97%, respectively. In addition, Figure 3(d) shows a tendency of efficiency with respect to both computation and energy consumption. VoVNet is located in the upper-left direction, which means it is the most efficient model in terms of both GPU-computation and energy efficiency.

4.3. RefineDet

From this section, we validate the generalization of large-scale VoVNet, e.g., VoVNet-39/57, in RefineDet [24], which is a state-of-the-art one-stage object detector. Without any bells and whistles, we simply plug VoVNet-39/57 into RefineDet, following the same hyper-parameters and training protocols for a fair comparison. We train RefineDet320 for 400k iterations with a batch size of 32 and an initial learning rate of 0.001, which is decreased by 0.1 at 280k and 360k iterations. All models are implemented on the original RefineDet Caffe code³ base. The results are summarized in Table 4.

Accuracy vs. Speed. Figure 4(a) illustrates speed vs. accuracy. VoVNet-39/57 outperform both DenseNet-201/161 and ResNet-50/101 with faster speed. While VoVNet-39 achieves an accuracy of 33.5 AP, similar to DenseNet-161, it runs about two times faster than the counterpart with much fewer parameters and a smaller memory footprint. VoVNet-39 also outperforms ResNet-50 by a large margin of 3.3% absolute AP at comparable speed. These results demonstrate that, with fewer parameters and a smaller memory footprint, the proposed VoVNet is the most efficient backbone network in terms of both accuracy and speed.

³ https://github.com/sfzhang15/RefineDet



GPU-Computation Efficiency. Figure 4(b) shows that Backbone APbbox APbbox APbbox APseg APseg seg
50 70 50 AP75 GPU time
VoVNet-39/57 outperform DenseNet and ResNet back- ResNet-50-GN 39.5 59.8 43.6 35.2 56.9 37.6 157 ms
bones with higher computation efficiency. In particular, ResNet-101-GN 41.0 61.1 44.9 36.4 58.2 38.7 185 ms
since VoVNet-39 runs faster than DenseNet-201 having VoVNet-39-GN 41.7 62.2 45.8 36.8 59.0 39.5 152 ms
fewer FLOPs, VoVNet-39 achieves about three times VoVNet-57-GN 41.9 62.1 46.0 37.0 59.3 39.7 159 ms
higher computation efficiency than DenseNet-201 with Table 5. Detection and segementation results using Mask R-CNN
better accuracy. One can note that although DenseNet-201 with Group Normalization [21] trained from scratch for 3×
(k=32) has fewer FLOPs, it runs slower than DenseNet-161 schedule and evaluted on COCO val set.
(k=48), which means lower computation efficiency. We
speculate that deeper and thinner network architecture is ResNet with GN backbone for VoVNet with GN in Mask
computationally in-efficient in terms of GPU parallelism. R-CNN, following same hyperparameters and training
protocols [3]. We train VoVNet with GN based Mask
Energy Effficiency. As illustrated in Figure 4(c), with R-CNN from scratch with batch size 16 for 3× schedule
higher or comparable accuracy, VoV-39/57 consume only in an end-to-end manner as like [21]. Meanwhile, due to
4.8J and 5.9J per image, which are less than DenseNet- extreme memory footprint of DenseNet and larger input
201/161 and ResNet-50/101, respectively. Compared to size of Mask R-CNN, we cannot train DenseNet based
DenseNet161, the energy consumption of VoVNet-39 is Mask R-CNN even on the 32GB V100 GPUs. The results
two times less with comparable accuracy. Table 4 shows are listed in Table 5.
that the positive relation between memory footprint and
energy consumption. From this observation, it can be seen Accuracy vs. Speed. For object detection task, with faster
that VoVNet with relatively fewer memory footprint is the speed, VoVNet-39 obtains 2.2%/0.9% absolute AP gains
most energy efficient. In addition, Figure 4(d) shows that compared to ResNet-50/101, respectively. The extended
our VoVNet-39/57 are located in the most efficient position version of VoVNet, VoVNet-57 also achieves state-of-the-
in terms of energy and computation. art performance compared to ResNet-101 at faster inference
speed. For instance segmentation task, VoVNet-39 also im-
Small Object Detection. In Table 4, we find that VoVNet proves 1.6%/0.4% AP from ResNet-50/101. These results
and DenseNet obtain higher AP than ResNet on small and support the fact that VoVNet can also provide better diverse
medium objects. This supports that conserving the diverse feature representation for object detection and simultane-
feature representations with multi-receptive fields by con- ously instance segmentation efficiently.
catenative aggregation has the advantage of small object de-
tection. Furthermore, VoVNet improves 1.9%/1.2% small 5. Conclusion
object AP gain from DenseNet121/161, which suggests that For real-time object detection, in this paper, we propose
generating more features by OSA is better than generating an efficient backbone network called VoVNet that makes
deep features by dense connection on small object detec- good use of the diversified feature representation with multi
tion. receptive fields and improves the inefficiency of DenseNet.
The proposed One-Shot Aggregation (OSA) addresses the
4.4. Mask R-CNN from scratch problem of linearly increasing the input channel of the
In this section, we also validate the efficiency of VoVNet dense connection by aggregating all features in the final
as a backbone for a two-stage object detector, Mask R- feature map only at once. This results in constant input
CNN. Recent works [16, 4] are studied on training with- size which reduces memory access cost and makes GPU-
out ImageNet pretraining. DSOD is the first one-stage ob- computation more efficient. Extensive experimental results
ject detector trained from scratch and achieves significant demonstrate that not only lightweight but also large-scale
performance due to the deep supervision trait of DenseNet. VoVNet based detectors outperform DenseNet based ones
He et al. [4] also prove that when trained from scratch for at much faster speed. For future works, we have plans to
longer training iterations, Mask R-CNN with Group nor- apply VoVNet to other detection meta-architecture or se-
malization (GN) [21] achieves comparable or higher accu- mantic segmentation, etc.
racy than that with ImageNet pretraining. We also already
confirmed our VoVNet with DSOD achieves good perfor- 6. Acknowledgement
mance when training from scratch in Section 4.2. This work was supported by Institute of Information &
Thus we also apply VoVNet backbone to Mask R-CNN Communications Technology Planning & Evaluation (IITP)
with GN, the state-of-the-art two-stage object detection grant funded by the Korea government (MSIT) (B0101-15-
and simultaneously instance segmentation. For fair com- 0266, Development of High Performance Visual BigData
parison, without any bells-and-whistles, we only exchange Discovery platform)



References

[1] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In ACM SIGARCH Computer Architecture News, volume 44, pages 367-379. IEEE Press, 2016.
[2] Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze. Understanding the limitations of existing energy-efficient design approaches for deep neural networks. Energy, 2(L1):L3, 2018.
[3] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.
[4] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. arXiv preprint arXiv:1811.08883, 2018.
[5] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[7] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[8] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[9] Yunho Jeon and Junmo Kim. Constructing fast network through deconstruction of convolution. In NIPS, pages 5955-5965, 2018.
[10] Youngwan Lee, Huieun Kim, Eunsoo Park, Xuenan Cui, and Hakil Kim. Wide-residual-inception networks for real-time object detection. In IV, pages 758-764. IEEE, 2017.
[11] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[12] Songtao Liu, Di Huang, et al. Receptive field block net for accurate and fast object detection. In ECCV, 2018.
[13] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet v2: Practical guidelines for efficient CNN architecture design. In ECCV, 2018.
[14] Nvidia. GPU-based deep learning inference: A performance and power analysis. Nvidia Whitepaper, 2015.
[15] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381, 2018.
[16] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. DSOD: Learning deeply supervised object detectors from scratch. In ICCV, 2017.
[17] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[18] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
[19] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[20] Robert J Wang, Xiang Li, and Charles X Ling. Pelee: A real-time object detection system on mobile devices. In NIPS, pages 1963-1972, 2018.
[21] Yuxin Wu and Kaiming He. Group normalization. arXiv preprint arXiv:1803.08494, 2018.
[22] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In CVPR, pages 5687-5695, 2017.
[23] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[24] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Single-shot refinement neural network for object detection. In CVPR, 2018.
[25] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
[26] Ligeng Zhu, Ruizhi Deng, Michael Maire, Zhiwei Deng, Greg Mori, and Ping Tan. Sparsely aggregated convolutional networks. In ECCV, 2018.



