Tiny SSD: A Tiny Single-shot Detection Deep Convolutional Neural Network for
Real-time Embedded Object Detection
Authorized licensed use limited to: ULAKBIM UASL - Uluslararasi Kibris Universitesi. Downloaded on March 22,2023 at 04:59:01 UTC from IEEE Xplore. Restrictions apply.
purpose of autonomous driving. However, SqueezeDet has
only been demonstrated for object detection with limited
object categories (only three), and thus its ability to handle a
larger number of categories has not been demonstrated.
As such, the design of highly efficient deep neural network
architectures that are well-suited for real-time embedded
object detection while achieving improved object detection
accuracy on a variety of object categories is still a challenge
worth tackling.
In an effort to achieve a fine balance between object
detection accuracy and real-time embedded requirements
(i.e., small model size and real-time embedded inference
speed), we take inspiration from both the incredible efficiency of the Fire microarchitecture introduced in SqueezeNet [5] and the powerful object detection performance demonstrated by the single-shot detection macroarchitecture introduced in SSD [9]. The resulting network architecture achieved in this paper is Tiny SSD, a single-shot detection deep convolutional neural network designed specifically for real-time embedded object detection. Tiny SSD is composed of a non-uniform, highly optimized Fire sub-network stack, which feeds into a non-uniform sub-network stack of highly optimized SSD-based auxiliary convolutional feature layers, designed specifically to minimize model size while retaining object detection performance.

This paper is organized as follows. Section 2 describes the highly optimized Fire sub-network stack leveraged in the Tiny SSD network architecture. Section 3 describes the highly optimized sub-network stack of SSD-based convolutional feature layers used in the Tiny SSD network architecture. Section 4 presents experimental results that evaluate the efficacy of Tiny SSD for real-time embedded object detection. Finally, conclusions are drawn in Section 5.

Figure 2. An illustration of the Fire microarchitecture. The output of the previous layer is squeezed by a squeeze convolutional layer of 1 × 1 filters, which reduces the number of input channels to the 3 × 3 filters. The result of the squeeze convolutional layer is passed into the expand convolutional layer, which consists of both 1 × 1 and 3 × 3 filters.

II. OPTIMIZED FIRE SUB-NETWORK STACK

The overall network architecture of the Tiny SSD network for real-time embedded object detection is composed of two main sub-network stacks: i) a non-uniform Fire sub-network stack, and ii) a non-uniform sub-network stack of highly optimized SSD-based auxiliary convolutional feature layers, with the first sub-network stack feeding into the second sub-network stack. In this section, let us first discuss in detail the design philosophy behind the first sub-network stack of the Tiny SSD network architecture: the optimized Fire sub-network stack.

A powerful approach to designing smaller deep neural network architectures for embedded inference is to take a more principled approach and leverage architectural design strategies to achieve more efficient deep neural network microarchitectures [3], [5]. A very illustrative example of such a principled approach is the SqueezeNet [5] network architecture, where three key design strategies were leveraged:

1) reduce the number of 3 × 3 filters as much as possible,
2) reduce the number of input channels to 3 × 3 filters where possible, and
3) perform downsampling at a later stage in the network.

This principled design strategy led to the design of what the authors referred to as the Fire module, which consists of a squeeze convolutional layer of 1 × 1 filters (which realizes the second design strategy by effectively reducing the number of input channels to 3 × 3 filters) that feeds into an expand convolutional layer comprised of both 1 × 1 filters and 3 × 3 filters (which realizes the first design strategy by effectively reducing the number of 3 × 3 filters). An illustration of the Fire microarchitecture is shown in Figure 2.

Inspired by the elegance and simplicity of the Fire microarchitecture design, we design the first sub-network stack of the Tiny SSD network architecture as a standard convolutional layer followed by a set of highly optimized Fire modules. One of the key challenges in designing this sub-network stack is to determine the ideal number of Fire modules, as well as the ideal microarchitecture of each of the Fire modules, to achieve a fine balance between object detection performance and model size as well as inference speed. First, it was determined empirically that 10 Fire modules in the optimized Fire sub-network stack provided strong object detection performance. In terms of the ideal microarchitecture, the key design parameters of the Fire microarchitecture are the number of filters of each size (1 × 1 or 3 × 3) that form this microarchitecture. In the SqueezeNet network architecture that first introduced the Fire microarchitecture [5], the microarchitectures of the Fire modules are largely uniform, with many of the modules sharing the same microarchitecture configuration. In an effort to achieve more optimized Fire microarchitectures on a per-module basis, the number of filters of each size in each Fire
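To make the effect of these design strategies concrete, the parameter count of a Fire module can be compared against a plain 3 × 3 convolutional layer of similar output width. The following is a minimal sketch; the filter counts used (96 input channels, 16 squeeze filters, 64 + 64 expand filters) are illustrative values in the spirit of SqueezeNet [5], not the exact Tiny SSD configuration:

```python
def conv_params(k, in_ch, out_ch):
    """Weight count of a k x k convolution (biases omitted for clarity)."""
    return k * k * in_ch * out_ch

def fire_params(in_ch, s1, e1, e3):
    """Parameters of a Fire module: 1x1 squeeze, then 1x1 + 3x3 expand."""
    squeeze = conv_params(1, in_ch, s1)  # shrinks channels fed to the 3x3 filters
    expand1 = conv_params(1, s1, e1)     # cheap 1x1 expand filters
    expand3 = conv_params(3, s1, e3)     # the only 3x3 filters in the module
    return squeeze + expand1 + expand3

# Illustrative configuration: 96 input channels, 16 squeeze, 64+64 expand.
fire = fire_params(96, 16, 64, 64)   # 1536 + 1024 + 9216 = 11776 weights
plain = conv_params(3, 96, 128)      # a plain 3x3 layer with 128 filters: 110592
print(fire, plain, round(plain / fire, 1))  # roughly a 9x reduction
```

Both layers produce 128 output channels, yet the Fire module carries roughly an order of magnitude fewer weights, which is exactly what design strategies 1) and 2) aim for.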
Table I
The optimized Fire sub-network stack of the Tiny SSD network architecture. The number of filters and input size to each layer are reported for the convolutional layers and Fire modules. Each Fire module is reported in one row for better readability. "x@S – y@E1 – z@E3" stands for x 1 × 1 filters in the squeeze convolutional layer, and y 1 × 1 filters and z 3 × 3 filters in the expand convolutional layer.
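The notation in the Table I caption can be decoded mechanically; a small sketch is shown below. The spec string "16@S – 64@E1 – 64@E3" is a hypothetical example row, not a value taken from the table:

```python
import re

def parse_fire_row(spec):
    """Parse a Table I row spec like '16@S - 64@E1 - 64@E3' into
    (squeeze 1x1 filters, expand 1x1 filters, expand 3x3 filters)."""
    m = re.match(r"(\d+)@S\s*[–-]\s*(\d+)@E1\s*[–-]\s*(\d+)@E3", spec)
    s, e1, e3 = (int(g) for g in m.groups())
    return s, e1, e3

def fire_output_channels(spec):
    """The expand 1x1 and 3x3 outputs are concatenated channel-wise,
    so the module emits e1 + e3 channels."""
    _, e1, e3 = parse_fire_row(spec)
    return e1 + e3

print(fire_output_channels("16@S – 64@E1 – 64@E3"))  # 128 for this example
```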
Table II
The optimized sub-network stack of the auxiliary convolutional feature layers within the Tiny SSD network architecture. The input sizes to each convolutional layer and kernel sizes are reported.

Type / Stride         Filter Shape    Input Size
Conv12-1 / s2         3 × 3 × 51      4 × 4
Conv12-2              3 × 3 × 46      4 × 4
Conv13-1              3 × 3 × 55      2 × 2
Conv13-2              3 × 3 × 85      2 × 2
Fire4-mbox-loc        3 × 3 × 16      37 × 37
Fire4-mbox-conf       3 × 3 × 84      37 × 37
Fire8-mbox-loc        3 × 3 × 24      18 × 18
Fire8-mbox-conf       3 × 3 × 126     18 × 18
Fire9-mbox-loc        3 × 3 × 24      9 × 9
Fire9-mbox-conf       3 × 3 × 126     9 × 9
Fire10-mbox-loc       3 × 3 × 24      4 × 4
Fire10-mbox-conf      3 × 3 × 126     4 × 4
Conv12-2-mbox-loc     3 × 3 × 24      2 × 2
Conv12-2-mbox-conf    3 × 3 × 126     2 × 2
Conv13-2-mbox-loc     3 × 3 × 16      1 × 1
Conv13-2-mbox-conf    3 × 3 × 84      1 × 1

size reductions while having a negligible effect on object detection accuracy.

V. EXPERIMENTAL RESULTS AND DISCUSSION

To study the utility of Tiny SSD for real-time embedded object detection, we examine the model size, object detection accuracies, and computational operations on the VOC2007/2012 datasets. For evaluation purposes, the Tiny YOLO network [10] was used as a baseline reference comparison given its popularity for embedded object detection; it has also been demonstrated to possess one of the smallest model sizes in the literature for object detection on the VOC2007/2012 datasets (only 60.5MB in size and requiring just 6.97 billion operations). The VOC2007/2012 datasets consist of natural images that have been annotated with 20 different types of objects, with illustrative examples shown in Figure 4. The tested deep neural networks were trained using the VOC2007/2012 training datasets, and the mean average precision (mAP) was computed on the VOC2007 test dataset to evaluate the object detection accuracy of the deep neural networks.
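The filter counts of the -mbox- layers in Table II can be read through the standard SSD head convention: each default box at a feature-map location needs 4 box-offset outputs in a -loc layer and one score per class in a -conf layer. Assuming the usual 21 classes (20 VOC categories plus background) — a convention from SSD [9] that the text does not state explicitly — the Table II counts decompose consistently:

```python
NUM_CLASSES = 21  # 20 VOC object categories + 1 background class (SSD convention)

def boxes_from_loc(loc_filters):
    """Each default box needs 4 box-offset outputs in a -mbox-loc layer."""
    assert loc_filters % 4 == 0
    return loc_filters // 4

def boxes_from_conf(conf_filters):
    """Each default box needs NUM_CLASSES score outputs in a -mbox-conf layer."""
    assert conf_filters % NUM_CLASSES == 0
    return conf_filters // NUM_CLASSES

# Cross-check two feature layers from Table II:
print(boxes_from_loc(16), boxes_from_conf(84))    # Fire4 head: 4 boxes either way
print(boxes_from_loc(24), boxes_from_conf(126))   # Fire8 head: 6 boxes either way
```

Under this reading, the paired loc/conf filter counts in Table II (16/84, 24/126) agree on 4 and 6 default boxes per location, respectively.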
98
Authorized licensed use limited to: ULAKBIM UASL - Uluslararasi Kibris Universitesi. Downloaded on March 22,2023 at 04:59:01 UTC from IEEE Xplore. Restrictions apply.
Figure 4. Example images from the Pascal VOC dataset. The ground-truth bounding boxes and object categories are shown for each image.
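The mAP metric used above averages a per-class average precision over the 20 categories; a common choice for the VOC2007 test set is the 11-point interpolated AP, sketched below. This is a simplified sketch of the scoring step only — the full protocol also involves IoU-based matching of detections to ground-truth boxes:

```python
def ap_11_point(recalls, precisions):
    """11-point interpolated average precision (VOC2007-style):
    average, over recall thresholds 0.0, 0.1, ..., 1.0, of the maximum
    precision achieved at recall >= threshold (0 if unreachable)."""
    ap = 0.0
    for t in [i / 10 for i in range(11)]:
        ap += max((p for r, p in zip(recalls, precisions) if r >= t), default=0.0)
    return ap / 11

# Example: precision 1.0 up to recall 0.5, then precision 0.5 up to recall 1.0.
print(ap_11_point([0.5, 1.0], [1.0, 0.5]))  # (6*1.0 + 5*0.5)/11, about 0.773
```

The mAP is then the mean of this quantity over all object categories.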
This significant improvement in object detection accuracy when compared to Tiny YOLO illustrates the efficacy of Tiny SSD for providing more reliable embedded object detection performance. Furthermore, as seen in Table IV, Tiny SSD requires just 571.09 million MAC operations to perform inference, making it well-suited for real-time embedded object detection. These experimental results show that very small deep neural network architectures can be designed for real-time object detection that are well-suited for embedded scenarios.

Figure 5. Example object detection results produced by the proposed Tiny SSD compared to Tiny YOLO (columns: input image, Tiny YOLO, Tiny SSD). It can be observed that Tiny SSD produces object detection results comparable to Tiny YOLO in some cases, while in other cases it outperforms Tiny YOLO by assigning more accurate category labels to detected objects. This significant improvement in object detection accuracy when compared to Tiny YOLO illustrates the efficacy of Tiny SSD for providing more reliable embedded object detection performance.

VI. CONCLUSIONS

In this paper, a single-shot detection deep convolutional neural network called Tiny SSD is introduced for real-time embedded object detection. Composed of a highly optimized, non-uniform Fire sub-network stack and a non-uniform sub-network stack of highly optimized SSD-based auxiliary convolutional feature layers designed specifically to minimize model size while maintaining object detection performance, Tiny SSD possesses a model size that is ∼26X smaller than Tiny YOLO and requires just 571.09 million MAC operations, while still achieving an mAP that is ∼4.2% higher than Tiny YOLO on the VOC2007 test dataset. These results demonstrate the efficacy of designing very small deep neural network architectures such as Tiny SSD for real-time object detection in embedded scenarios.

ACKNOWLEDGMENT

The authors thank the Natural Sciences and Engineering Research Council of Canada, the Canada Research Chairs Program, DarwinAI, and Nvidia for hardware support.

REFERENCES

[1] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.

[2] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.

[3] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[4] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE CVPR, 2017.

[5] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.

[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[7] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 2015.

[8] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.

[9] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[10] Joseph Redmon. YOLO: Real-time object detection. https://pjreddie.com/darknet/yolo/, 2016.

[11] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

[12] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242, 2016.

[13] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

[14] Mohammad Javad Shafiee, Brendan Chywl, Francis Li, and Alexander Wong. Fast YOLO: A fast you only look once system for real-time embedded object detection in video. arXiv preprint arXiv:1709.05943, 2017.

[15] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.

[16] Bichen Wu, Forrest Iandola, Peter H. Jin, and Kurt Keutzer. SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. arXiv preprint arXiv:1612.01051, 2016.