VoVNet paper: An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection
Jongyoul Park
ETRI
[email protected]
Abstract
As DenseNet conserves intermediate features with di-
Authorized licensed use limited to: Nanjing Institute Of Technology. Downloaded on December 08,2021 at 12:51:48 UTC from IEEE Xplore. Restrictions apply.
information carried by early feature maps would be washed out as it is summed with others. With concatenation, on the other hand, the information persists because the features keep their original form. Several works [19, 12, 10] demonstrate that abstracted features with multiple receptive fields can capture visual information at various scales. Since detection requires models to recognize objects at a wider range of scales than classification does, preserving information from various layers is especially important for detection, as each layer has a different receptive field. Therefore, by preserving and accumulating feature maps of multiple receptive fields, DenseNet has a better and more diverse feature representation than ResNet for the object detection task.
However, we also find in the experiment that detectors with DenseNet, which has fewer FLOPs and model parameters, spend more energy and time than those with ResNet. This is because factors other than FLOPs and model size influence energy and time consumption. First, the memory access cost (MAC) required to access memory for intermediate feature maps is a crucial factor in this consumption [13, 22]. As illustrated in Figure 1(a), since all previous feature maps in DenseNet are used as input to the subsequent layer through the dense connection, the memory access cost increases quadratically with network depth, which in turn leads to computation overhead and more energy consumption.
Second, with respect to GPU parallel computation, DenseNet suffers from a computation bottleneck. In general, GPU parallel computing utilization is maximized when the operand tensor is larger [14, 23, 10]. However, due to the linearly increasing input channel, DenseNet needs to adopt a 1×1 convolution bottleneck architecture to reduce the input dimension and FLOPs, which in turn increases the number of layers with smaller operand tensors. As a result, GPU computation becomes inefficient.
The goal of this paper is to make DenseNet more efficient while preserving the benefit of concatenative aggregation for the object detection task. We first discuss MAC and GPU-computation efficiency and how to consider these factors at the architecture design stage. Secondly, we claim that the dense connections in the intermediate layers of DenseNet induce the inefficiencies, and we hypothesize that the dense connections are redundant. With these thoughts, we propose a novel One-Shot Aggregation (OSA) that aggregates intermediate features at once, as shown in Figure 1(b). This aggregation method brings great benefit to MAC and GPU-computation efficiency while preserving the strength of concatenation. With OSA modules, we build VoVNet¹, an energy-efficient backbone for real-time detection. To validate the effectiveness of VoVNet as a backbone network, we apply VoVNet to various object detectors such as DSOD, RefineDet, and Mask R-CNN. The results show that VoVNet based detectors outperform DenseNet or ResNet based ones with better energy efficiency and speed.
¹ It means Variety of View Network.
2. Factors of Efficient Network Design
When designing efficient networks, many studies such as MobileNet v1 [7], MobileNet v2 [15], ShuffleNet v1 [25], ShuffleNet v2 [13], and Pelee [20] have focused mainly on reducing FLOPs and model sizes by using depthwise convolution and the 1×1 convolution bottleneck architecture. However, reducing FLOPs and model sizes does not always guarantee a reduction of GPU inference time and real energy consumption. Ma et al. [13] show an experiment in which ShuffleNet v2 runs faster than MobileNet v2 on GPU with a similar number of FLOPs. Chen et al. [2] also show that while SqueezeNet has 50x fewer weights than AlexNet, it consumes more energy than AlexNet. These phenomena imply that FLOPs and model sizes are indirect metrics of practicality, and that designing networks based on these metrics should be reconsidered. To build efficient network architectures that focus on more practical and valid metrics such as energy per image and frames per second (FPS), it is important to consider, besides FLOPs and model parameters, other factors that influence energy and time consumption.
2.1. Memory Access Cost
The first factor we point out is memory access cost (MAC). The main source of energy consumption in a CNN is memory access rather than computation [22]. Specifically, accessing data from DRAM (Dynamic Random Access Memory) for an operation consumes orders of magnitude more energy than the computation itself. Moreover, the time spent on memory access accounts for a large proportion of total time consumption and can even be the bottleneck of the GPU process [13]. This implies that even with the same number of computations and parameters, the energy consumption will differ if the total number of memory accesses varies with the model structure.
One reason for the discrepancy between model size and the number of memory accesses is the intermediate activation memory footprint. As stated by Chen et al. [1], the memory footprint is attributed to both filter parameters and intermediate feature maps. If the intermediate feature maps are large, the cost of memory access increases even with the same model parameters. Therefore, we consider MAC, which covers the memory footprint of both filter parameters and intermediate feature maps, to be an important factor for network design. Specifically, we follow the method of Ma et al. [13] to calculate the MAC of each convolutional layer as below:
$$\mathrm{MAC} = hw(c_i + c_o) + k^2 c_i c_o \qquad (1)$$
The notations $k$, $h$, $w$, $c_i$, $c_o$ denote the kernel size, the height and width of the input and output response, the channel size of the input, and that of the output response, respectively.
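As a minimal sketch of Eq. (1), the per-layer MAC can be evaluated directly from the layer dimensions. The function names and the two example layers below are hypothetical and chosen only for illustration; the helper `conv_computation` assumes the computation count k²·h·w·c_i·c_o for the same layer.

```python
def conv_mac(h, w, c_in, c_out, k):
    """Memory access cost of one convolutional layer, following Eq. (1).

    h, w  : height and width of the input and output response
    c_in  : input channel size (c_i)
    c_out : output channel size (c_o)
    k     : kernel size
    """
    feature_map_access = h * w * (c_in + c_out)  # hw(c_i + c_o) term
    weight_access = k * k * c_in * c_out         # k^2 c_i c_o term
    return feature_map_access + weight_access


def conv_computation(h, w, c_in, c_out, k):
    """Computation count k^2 * h * w * c_i * c_o of the same layer (assumed form)."""
    return k * k * h * w * c_in * c_out


# Two hypothetical 3x3 layers on a 56x56 feature map with the same
# product c_in * c_out, hence the same computation count.
balanced = conv_mac(56, 56, 64, 64, 3)     # equal input/output channels
imbalanced = conv_mac(56, 56, 16, 256, 3)  # skewed input/output channels

print(balanced, imbalanced)  # 438272 889856
```

With the computation held fixed, the layer with equal input and output channel sizes has roughly half the MAC of the skewed one, because only the hw(c_i + c_o) term changes.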
2.2. GPU-Computation Efficiency
Network architectures that reduce FLOPs for speed are based on the idea that every floating point operation is processed at the same speed on a device. However, this is incorrect when a network is deployed on a GPU, because of the GPU's parallel processing mechanism. As a GPU is able to process multiple floating point operations at a time, it is important to utilize its computational ability efficiently. We use the term GPU-computation efficiency for this concept.
GPU parallel computing power is utilized better as the computed data tensor becomes larger [23, 10]. Splitting a large convolution operation into several fragmented smaller operations makes GPU computation inefficient, as fewer computations are processed in parallel. In the context of network design, this implies that it is better to compose a network with fewer layers if the behavior function is the same. Moreover, adopting extra layers causes kernel launching and synchronization, which result in additional time overhead [13].
Accordingly, although techniques such as depthwise convolution and the 1×1 convolution bottleneck can reduce the number of FLOPs, they are harmful to GPU-computation efficiency because they introduce additional 1×1 convolutions. More generally, GPU-computation efficiency varies with the model architecture. Therefore, to validate the computation efficiency of network architectures, we introduce FLOPs per second (FLOP/s), which is computed by dividing the total FLOPs by the actual GPU inference time. A high FLOP/s implies that the architecture utilizes GPU power efficiently.
3. Proposed Method
3.1. Rethinking Dense Connection
The dense connection that aggregates all intermediate layers induces inevitable inefficiency, because the input channel size of each layer increases linearly as the layers proceed. Because of this intensive aggregation, the dense block can produce only a few features under a FLOPs or parameter constraint. In other words, DenseNet trades the quantity of features for the quality of features via the dense connection. Although the performance of DenseNet seems to prove that the trade is beneficial, the trade has other drawbacks in terms of energy and time.
First, dense connections induce a high memory access cost, which is paid in energy and time. As mentioned by Ma et al. [13], the lower bound of MAC, or the number of memory access operations, of a convolutional layer can be represented by $\mathrm{MAC} \geq 2\sqrt{hwB/k^2} + B/(hw)$, where $B = k^2 hw c_i c_o$ is the number of computations. Because the lower bound is grounded in the mean value inequality, MAC is minimized when the input and output have the same channel size under a fixed number of computations or model parameters. Dense connections increase the input channel size while the output channel size remains constant; as a result, each layer has imbalanced input and output channel sizes. Therefore, DenseNet has a high MAC among models with the same number of computations or parameters, and consumes more energy and time.
Second, the dense connection imposes the use of a bottleneck structure, which harms the efficiency of GPU parallel computation. The linearly increasing input size is critically problematic when the model is big, because it makes the overall computation grow quadratically with respect to depth. To suppress this growth, DenseNet adopts the bottleneck architecture, which adds 1×1 convolutional layers to keep the input size of the 3×3 convolutional layers constant. Although this solution can reduce FLOPs and parameters, it harms GPU parallel computation efficiency, as discussed. The bottleneck architecture divides one 3×3 convolutional layer into two smaller layers and causes more sequential computation, which lowers the inference speed.
Because of these drawbacks, DenseNet becomes inefficient in terms of energy and time. To improve efficiency, we first investigate how dense connections actually aggregate the features once the network is trained. Huang et al. [8] illustrate the connectivity of the dense connection by evaluating the normalized L1 norm of the input weights to each layer. These values show the normalized influence of each preceding layer on the corresponding layers. The results are shown in Figure 2 (top).
Figure 2. The average absolute filter weights of convolutional layers in trained DenseNet [8] (top) and VoVNet (middle, bottom). The color of pixel (s, l) encodes the average L1 norm of the weights connecting layer s to layer l. OSA module (x/y) indicates that the OSA module consists of x layers with y channels.
In Dense Block3, the red boxes near the diagonal show that aggregation on intermediate layers is active. However, in the classification layer, only a small proportion of the intermediate features is used. In contrast, in Dense Block1 the transition layer aggregates most of its input features well, while the intermediate layers do not.
From these observations, we hypothesize that there is a negative relation between the strength of aggregation on intermediate layers and that on final layers. This can be true if the dense connection between intermediate layers induces correlation between the features from each layer. This means that the dense connection makes later intermediate layers produce features that are better but also similar to the features from former layers. In this case, the final layer is not required to learn to aggregate both features, because they represent redundant information. As a result, the influence of the former intermediate layers on the final layer becomes small.
As all intermediate features are aggregated to produce the final feature in the final layer, it is better to produce intermediate features that complement each other, that is, features that are less correlated. Therefore, we can extend our hypothesis: the effect of dense connections on intermediate features is relatively small with respect to their cost. To verify the hypotheses, we design a novel module that aggregates its intermediate features only at the final layer of each block.
3.2. One-Shot Aggregation
We integrate the previously discussed thoughts into an efficient architecture, the one-shot aggregation (OSA) module, which aggregates its features in the last layer at once. Figure 1(b) illustrates the proposed OSA module. Each convolution layer is connected by a two-way connection. One way is connected to the subsequent layer to produce a feature with a larger receptive field, while the other way is aggregated only once into the final output feature map. The difference from DenseNet is that the output of each layer is not routed to all subsequent intermediate layers, which makes the input size of the intermediate layers constant.
To verify our hypotheses that there is a negative relation between the strength of aggregation on intermediate layers and that on the final layer, and that the dense connections are redundant, we conduct the same experiment as Huang et al. [8] on the OSA module. We designed OSA modules to have a similar number of parameters and computations to the dense block used in DenseNet-40. First, we investigate the result for the OSA module with the same number of layers as the dense block, which is 12 (Figure 2 (middle)). The output is bigger than that of the dense block, as the input size of each convolution layer is reduced. The network with OSA modules shows 93.6% accuracy on CIFAR-10 classification, which is a slight drop of 1.2% but still higher than ResNet with a similar model size. It can be observed that the aggregation in the final layers becomes more intense as the dense connections on the intermediate layers are pruned.
Moreover, the weights of the transition layer of the OSA module show a different pattern from those of DenseNet: features from shallow depth are aggregated more on the transition layer. Since the features from deep layers do not strongly influence the transition layers, we can reduce the number of layers without significant effect. Therefore, we reconfigure the OSA module to have 5 layers with 43 channels each (Figure 2 (bottom)). Surprisingly, with this module, we achieve an error rate of 5.44%, which is similar to that of DenseNet-40 (5.24%). This implies that building deep intermediate features via dense connection is less effective than expected.
Although the network with the OSA module has slightly decreased performance on CIFAR-10, which does not necessarily imply it will underperform on the detection task, it has much less MAC than the network with the dense block. Following Eq. (1), it is estimated that substituting the dense block of DenseNet-40 with an OSA module with 5 layers of 43 channels reduces MAC from 3.7M to 2.5M. This is because the intermediate layers in OSA have the same input and output size, which brings MAC to the lower bound. This means that one can build a faster and more energy-efficient network if MAC is the dominant factor of energy and time consumption. Specifically, as detection is performed at a higher resolution than classification, the intermediate memory footprint will become larger and MAC will reflect the energy and time consumption more appropriately.
Also, OSA improves GPU-computation efficiency. The input sizes of the intermediate layers of the OSA module are constant. Hence, it is unnecessary to adopt an additional 1×1 conv bottleneck to reduce the dimension. Moreover, as the OSA module aggregates the shallow features, it consists of fewer layers. As a result, the OSA module is designed to have only a few layers that can be computed efficiently on a GPU.
3.3. Configuration of VoVNet
Due to the diversified feature representation and efficiency of the OSA modules, our VoVNet can be constructed by stacking only a few modules with high accuracy and fast speed. Based on the confirmation in Figure 2 that the shallow depth is aggregated more, we can configure the OSA module with a smaller number of convolutions with larger channels than DenseNet. There are two types of VoVNet:
Type                  Output stride  VoVNet-27-slim                VoVNet-39                      VoVNet-57
Stem (Stage 1)        2              3×3 conv, 64, stride=2        3×3 conv, 64, stride=2         3×3 conv, 64, stride=2
                                     3×3 conv, 64, stride=1        3×3 conv, 64, stride=1         3×3 conv, 64, stride=1
                                     3×3 conv, 128, stride=1       3×3 conv, 128, stride=1        3×3 conv, 128, stride=1
OSA module (Stage 2)  4              [3×3 conv, 64, ×5;            [3×3 conv, 128, ×5;            [3×3 conv, 128, ×5;
                                      concat & 1×1 conv, 128] ×1    concat & 1×1 conv, 256] ×1     concat & 1×1 conv, 256] ×1
OSA module (Stage 3)  8              [3×3 conv, 80, ×5;            [3×3 conv, 160, ×5;            [3×3 conv, 160, ×5;
                                      concat & 1×1 conv, 256] ×1    concat & 1×1 conv, 512] ×1     concat & 1×1 conv, 512] ×1
OSA module (Stage 4)  16             [3×3 conv, 96, ×5;            [3×3 conv, 192, ×5;            [3×3 conv, 192, ×5;
                                      concat & 1×1 conv, 384] ×1    concat & 1×1 conv, 768] ×2     concat & 1×1 conv, 768] ×4
OSA module (Stage 5)  32             [3×3 conv, 112, ×5;           [3×3 conv, 224, ×5;            [3×3 conv, 224, ×5;
                                      concat & 1×1 conv, 512] ×1    concat & 1×1 conv, 1024] ×2    concat & 1×1 conv, 1024] ×3
Table 1. Overall architecture of VoVNet. Downsampling is done by 3 × 3 max pooling with a stride of 2 at the end of each stage. Note
that each conv layer has the sequence Conv-BN-ReLU.
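The stage configuration in Table 1 can be written down as plain data and checked with a little arithmetic. The sketch below is only bookkeeping under stated assumptions: the dictionary layout and the function name are hypothetical, and each OSA module is assumed to contribute its five 3×3 convolutions plus the one 1×1 convolution applied after concatenation. Under that reading, the conv-layer totals come out equal to the numeric suffixes of the three model names, which is an arithmetic observation rather than a statement taken from the text.

```python
# Stage configuration transcribed from Table 1. Each tuple is
# (3x3 conv channels, 3x3 convs per OSA module, 1x1 output channels, module repeats).
STEM_CONVS = 3  # the stem (Stage 1) has three 3x3 conv layers in every variant

VOVNET_STAGES = {
    "VoVNet-27-slim": [(64, 5, 128, 1), (80, 5, 256, 1), (96, 5, 384, 1), (112, 5, 512, 1)],
    "VoVNet-39": [(128, 5, 256, 1), (160, 5, 512, 1), (192, 5, 768, 2), (224, 5, 1024, 2)],
    "VoVNet-57": [(128, 5, 256, 1), (160, 5, 512, 1), (192, 5, 768, 4), (224, 5, 1024, 3)],
}


def count_conv_layers(stages, stem_convs=STEM_CONVS):
    """Count conv layers: stem convs plus (3x3 convs + one 1x1 conv) per OSA module."""
    total = stem_convs
    for _channels, convs_per_module, _out_channels, repeats in stages:
        total += repeats * (convs_per_module + 1)  # +1 for the 1x1 conv after concat
    return total


layer_counts = {name: count_conv_layers(stages) for name, stages in VOVNET_STAGES.items()}
print(layer_counts)  # {'VoVNet-27-slim': 27, 'VoVNet-39': 39, 'VoVNet-57': 57}
```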
Figure 4. Comparisons of large-scale models on RefineDet320 [24] in terms of the computation and energy efficiency. (a) shows speed
vs. accuracy. (b), (c), and (d) illustrate comparison of GPU-computation-efficiency and energy-efficiency, respectively.
the depthwise convolution and decomposing a convolution into smaller fragmented operations are not efficient in terms of GPU-computation efficiency. Given these results, it is worth noting that VoVNet makes the most efficient use of GPU computation resources. As a result, VoVNet achieves a significantly better speed-accuracy tradeoff, as shown in Figure 3(a).
Energy Efficiency. When validating the efficiency of a network, another important thing to consider is energy efficiency (Joule/frame). The metric is the amount of energy consumed to process an image; a lower value means better energy efficiency. We measure energy consumption and obtain the energy efficiencies of detectors based on VoVNet and other models. Table 2 shows a tendency between energy efficiency and memory footprint. The VoVNet based DSOD consumes only 0.9J per image, which is 4.1× less than the DenseNet based one. We can note that the excessive intermediate activation maps of DenseNet increase the memory footprint, which results in more energy consumption. It is also notable that MobileNet shows worse energy efficiency than VoVNet although its memory footprint is lower. This is because depthwise convolution requires fragmented memory access, which in turn increases memory access costs [9].
Figure 3(c) describes accuracy vs. energy efficiency: with two times better energy efficiency than MobileNet and Pelee, VoVNet outperforms the counterparts by a large margin of 6.87% and 3.97%, respectively. In addition, Figure 3(d) shows the tendency of efficiency with respect to both computation and energy consumption. VoVNet is located in the upper-left direction, which means it is the most efficient model in terms of both GPU-computation and energy efficiency.
4.3. RefineDet
In this section, we validate the generalization to large-scale VoVNet, e.g., VoVNet-39/57, in RefineDet [24], which is the state-of-the-art one-stage object detector. Without any bells-and-whistles, we simply plug VoVNet-39/57 into RefineDet, following the same hyper-parameters and training protocols for fair comparison. We train RefineDet320 for 400k iterations with a batch size of 32 and an initial learning rate of 0.001, which is reduced by a factor of 10 at 280k and 360k iterations. All models are implemented with the original RefineDet Caffe code base³. The results are summarized in Table 4.
³ https://github.com/sfzhang15/RefineDet
Accuracy vs. Speed. Figure 4(a) illustrates speed vs. accuracy. VoVNet-39/57 outperform both DenseNet-201/161 and ResNet-50/101 with faster speed. While VoVNet-39 achieves an accuracy of 33.5 AP, similar to DenseNet-161, it runs about two times faster than the counterpart with far fewer parameters and a smaller memory footprint. VoVNet-39 also outperforms ResNet-50 by a large margin of 3.3% absolute AP at comparable speed. These results demonstrate that, with fewer parameters and a smaller memory footprint, the proposed VoVNet is the most efficient backbone network in terms of both accuracy and speed.
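The RefineDet320 training schedule described in Section 4.3 (initial learning rate 0.001, lowered at 280k and 360k iterations) can be sketched as a step function. This is a minimal illustration, not the authors' Caffe configuration: the function name is hypothetical, "decreased by 0.1" is read as multiplying the rate by 0.1 at each milestone, and the drop is assumed to take effect from the milestone iteration onward.

```python
def refinedet320_lr(iteration, base_lr=0.001, milestones=(280_000, 360_000), gamma=0.1):
    """Step learning-rate schedule: multiply base_lr by gamma at each passed milestone.

    Assumptions (not stated in the text): the decay is multiplicative and
    applies when iteration >= milestone.
    """
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** drops


# Learning rate at the start, after the first drop, and after the second drop.
schedule_samples = [refinedet320_lr(i) for i in (0, 280_000, 360_000)]
```

Under these assumptions the rate is 1e-3 for the first 280k iterations, about 1e-4 until 360k, and about 1e-5 for the remainder of the 400k-iteration run.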
GPU-Computation Efficiency. Figure 4(b) shows that VoVNet-39/57 outperform DenseNet and ResNet backbones with higher computation efficiency. In particular, since VoVNet-39 runs faster than DenseNet-201, which has fewer FLOPs, VoVNet-39 achieves about three times higher computation efficiency than DenseNet-201 with better accuracy. One can note that although DenseNet-201 (k=32) has fewer FLOPs, it runs slower than DenseNet-161 (k=48), which means lower computation efficiency. We speculate that a deeper and thinner network architecture is computationally inefficient in terms of GPU parallelism.
Energy Efficiency. As illustrated in Figure 4(c), with higher or comparable accuracy, VoVNet-39/57 consume only 4.8J and 5.9J per image, which is less than DenseNet-201/161 and ResNet-50/101, respectively. Compared to DenseNet-161, the energy consumption of VoVNet-39 is two times lower with comparable accuracy. Table 4 shows the positive relation between memory footprint and energy consumption. From this observation, it can be seen that VoVNet, with a relatively small memory footprint, is the most energy efficient. In addition, Figure 4(d) shows that our VoVNet-39/57 are located in the most efficient position in terms of energy and computation.
Small Object Detection. In Table 4, we find that VoVNet and DenseNet obtain higher AP than ResNet on small and medium objects. This supports the view that conserving diverse feature representations with multiple receptive fields by concatenative aggregation is advantageous for small object detection. Furthermore, VoVNet improves small object AP by 1.9%/1.2% over DenseNet121/161, which suggests that generating more features by OSA is better than generating deep features by dense connection for small object detection.
4.4. Mask R-CNN from scratch
In this section, we also validate the efficiency of VoVNet as a backbone for a two-stage object detector, Mask R-CNN. Recent works [16, 4] study training without ImageNet pretraining. DSOD is the first one-stage object detector trained from scratch, and it achieves significant performance due to the deep supervision trait of DenseNet. He et al. [4] also prove that when trained from scratch for longer training iterations, Mask R-CNN with Group Normalization (GN) [21] achieves comparable or higher accuracy than that with ImageNet pretraining. We have also already confirmed in Section 4.2 that our VoVNet with DSOD achieves good performance when trained from scratch.
Thus we also apply the VoVNet backbone to Mask R-CNN with GN, the state-of-the-art two-stage method for simultaneous object detection and instance segmentation. For fair comparison, without any bells-and-whistles, we only exchange the ResNet with GN backbone for VoVNet with GN in Mask R-CNN, following the same hyperparameters and training protocols [3]. We train the VoVNet with GN based Mask R-CNN from scratch with batch size 16 for the 3× schedule in an end-to-end manner, as in [21]. Meanwhile, due to the extreme memory footprint of DenseNet and the larger input size of Mask R-CNN, we cannot train a DenseNet based Mask R-CNN even on 32GB V100 GPUs. The results are listed in Table 5.
Backbone        AP^bbox  AP^bbox_50  AP^bbox_75  AP^seg  AP^seg_50  AP^seg_75  GPU time
ResNet-50-GN    39.5     59.8        43.6        35.2    56.9       37.6       157 ms
ResNet-101-GN   41.0     61.1        44.9        36.4    58.2       38.7       185 ms
VoVNet-39-GN    41.7     62.2        45.8        36.8    59.0       39.5       152 ms
VoVNet-57-GN    41.9     62.1        46.0        37.0    59.3       39.7       159 ms
Table 5. Detection and segmentation results using Mask R-CNN with Group Normalization [21] trained from scratch for the 3× schedule and evaluated on the COCO val set.
Accuracy vs. Speed. For the object detection task, with faster speed, VoVNet-39 obtains 2.2%/0.9% absolute AP gains compared to ResNet-50/101, respectively. The extended version of VoVNet, VoVNet-57, also achieves state-of-the-art performance compared to ResNet-101 at faster inference speed. For the instance segmentation task, VoVNet-39 also improves AP by 1.6%/0.4% over ResNet-50/101. These results support the claim that VoVNet can also efficiently provide a better and more diverse feature representation for simultaneous object detection and instance segmentation.
5. Conclusion
For real-time object detection, in this paper, we propose an efficient backbone network called VoVNet that makes good use of the diversified feature representation with multiple receptive fields and improves on the inefficiency of DenseNet. The proposed One-Shot Aggregation (OSA) addresses the problem of the linearly increasing input channel of the dense connection by aggregating all features in the final feature map only once. This results in a constant input size, which reduces memory access cost and makes GPU computation more efficient. Extensive experimental results demonstrate that not only lightweight but also large-scale VoVNet based detectors outperform DenseNet based ones at much faster speed. For future work, we plan to apply VoVNet to other detection meta-architectures, semantic segmentation, etc.
6. Acknowledgement
This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (B0101-15-0266, Development of High Performance Visual BigData Discovery platform)
References
[1] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In ACM SIGARCH Computer Architecture News, volume 44, pages 367–379. IEEE Press, 2016.
[2] Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze. Understanding the limitations of existing energy-efficient design approaches for deep neural networks. Energy, 2(L1):L3, 2018.
[3] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.
[4] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking imagenet pre-training. arXiv preprint arXiv:1811.08883, 2018.
[5] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[7] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[8] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[9] Yunho Jeon and Junmo Kim. Constructing fast network through deconstruction of convolution. In NIPS, pages 5955–5965, 2018.
[10] Youngwan Lee, Huieun Kim, Eunsoo Park, Xuenan Cui, and Hakil Kim. Wide-residual-inception networks for real-time object detection. In IV, pages 758–764. IEEE, 2017.
[11] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[12] Songtao Liu, Di Huang, et al. Receptive field block net for accurate and fast object detection. In ECCV, 2018.
[13] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In ECCV, 2018.
[14] Nvidia. Gpu-based deep learning inference: A performance and power analysis. Nvidia Whitepaper, 2015.
[15] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381, 2018.
[16] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. Dsod: Learning deeply supervised object detectors from scratch. In ICCV, 2017.
[17] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[18] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, 2017.
[19] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[20] Robert J Wang, Xiang Li, and Charles X Ling. Pelee: A real-time object detection system on mobile devices. In NIPS, pages 1963–1972, 2018.
[21] Yuxin Wu and Kaiming He. Group normalization. arXiv preprint arXiv:1803.08494, 2018.
[22] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In CVPR, pages 5687–5695, 2017.
[23] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[24] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Single-shot refinement neural network for object detection. In CVPR, 2018.
[25] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
[26] Ligeng Zhu, Ruizhi Deng, Michael Maire, Zhiwei Deng, Greg Mori, and Ping Tan. Sparsely aggregated convolutional networks. In ECCV, 2018.