
2021 IEEE 27th Real-Time and Embedded Technology and Applications Symposium (RTAS)
DOI: 10.1109/RTAS52030.2021.00043

Brief Industry Paper: Towards Real-Time 3D Object Detection for Autonomous Vehicles with Pruning Search

1 Pu Zhao, 2 Wei Niu, 1 Geng Yuan, 1 Yuxuan Cai, 3 Hsin-Hsuan Sung, 4 Shaoshan Liu, 5 Sijia Liu, 3 Xipeng Shen, 2 Bin Ren, 1 Yanzhi Wang, 1 Xue Lin

1 Northeastern University, Boston, MA; 2 William & Mary, Williamsburg, VA; 3 North Carolina State University, NC; 4 PerceptIn, CA; 5 Michigan State University, MI

Abstract—In autonomous driving, 3D object detection is essential as it provides basic knowledge about the environment. However, as deep learning based 3D detection methods are usually computation intensive, it is challenging to support real-time 3D object detection on edge-computing devices in self-driving cars with limited computation and memory resources. To facilitate this, we propose a compiler-aware pruning search framework to achieve real-time inference of 3D object detection on resource-limited mobile devices. Specifically, a generator is applied to sample better pruning proposals in the search space based on the current proposals and their performance, and an evaluator is adopted to evaluate the performance of the sampled pruning proposals. To accelerate the search, the evaluator employs Bayesian optimization with an ensemble of neural predictors. We demonstrate in experiments that, for the first time, the pruning search framework achieves real-time 3D object detection on mobile (a Samsung Galaxy S20 phone) with state-of-the-art detection performance.

Index Terms—3D object detection, real-time, point cloud

I. INTRODUCTION

With the rapid development of autonomous vehicles that self-drive without human intervention, object detection (especially 3D detection on LiDAR data) serves as a fundamental prerequisite for autonomous navigation. 3D detection extracts the desirable knowledge about the environment from the 3D point clouds of LiDAR sensors, thus enabling high-level computations and optimizations for auto-driving.

Due to the requirement of instantaneous interaction with the environment in auto-driving, it is essential to implement real-time 3D object detection on autonomous vehicles. However, current deep neural network (DNN) based 3D object detectors usually cost tremendous memory and computation resources, leading to difficulties for real-time implementation, especially on autonomous vehicles with limited hardware resources. Though more powerful high-end GPUs can be adopted for this task, they usually result in significantly increased price and power consumption. Thus it is desirable to facilitate real-time 3D detection deployment on autonomous cars.

To reduce DNN model size and computations, DNN weight pruning [1], [2] has shown great advantages in removing redundancy in the model, therefore reducing storage/computation cost and accelerating inference. There are unstructured pruning schemes [2]-[4] to remove arbitrary weights, coarse-grained structured pruning schemes [1], [4]-[7] to eliminate whole filters/channels, and fine-grained structured pruning [8]-[10] to assign different pruning patterns to convolutional (CONV) kernels. Though unstructured pruning can achieve high accuracy, the arbitrarily pruned irregular weights limit hardware parallelism, leading to difficulties for inference acceleration. Compared with unstructured pruning, structured pruning can achieve higher hardware parallelism and mobile inference acceleration, assisted by compiler-level code generation and optimization techniques [9], with competitive classification/detection performance.

Though compiler optimization can support various structured pruning (sparsity) schemes with notable mobile acceleration performance, we found that different sparsity schemes lead to different accuracy and acceleration performance under compiler optimization. For the specific 3D detection problem, it remains unclear which sparsity scheme with which pruning rate should be adopted to satisfy the accuracy and real-time requirements. To find the pruning solution, motivated by the idea of Neural Architecture Search (NAS) [11], [12], we propose a compiler-aware pruning search framework to automatically determine the pruning scheme and pruning rate for each individual layer. The objective is to maximize accuracy under an inference speed/latency constraint on the target mobile device. Different from previous work with a fixed pruning scheme for all layers, our work can assign different pruning schemes and rates to different layers in the model. We summarize our contributions as follows:

• We incorporate the overall DNN latency constraint into the automatic pruning search process to satisfy a predefined real-time requirement.
• Our framework configures different pruning schemes and pruning rates for different layers, which is different from previous works with a fixed pruning scheme for all layers.

• We adopt an ensemble of neural predictors and Bayesian optimization (BO) to reduce the number of evaluated pruning proposals, leading to less search effort.
• We can achieve (close-to) real-time (98ms) 3D detection with PointPillars, on an off-the-shelf mobile phone with minor (or no) accuracy loss.

II. BACKGROUND AND RELATED WORK

A. 3D Object Detection

3D object detection detects objects with point clouds from LiDAR sensors. PointPillars [13] is a popular 3D detection method with three main stages: (1) a feature encoder network to convert a point cloud to a sparse pseudo-image; (2) a 2D CONV backbone to transform the pseudo-image into a high-level representation; and (3) a detection head to regress 3D boxes. Besides PointPillars, there are various 3D detection methods such as SECOND [14] and Point-GNN [15]. We mainly focus on PointPillars as we found that PointPillars is the only one that can run on mobile, while the others are not available on mobile since their special structures for dealing with sparse data are not supported by the mobile compiler. Besides, PointPillars costs less computation than the others, with faster inference speed on server GPUs (e.g., 25ms for PointPillars vs. 600ms for Point-GNN).

B. Weight Pruning Schemes

Previous weight pruning work can be categorized according to pruning scheme: unstructured pruning [2], [3], [16], coarse-grained structured pruning [1], [4]-[6], and fine-grained structured pruning including pattern [8] and block [10] pruning.

Unstructured pruning [3], [16] removes weights at arbitrary positions, leading to an irregular sparse weight matrix with indices, incurring damage to parallel implementations and acceleration performance on hardware. Different from unstructured pruning, coarse-grained structured pruning [4], [5] removes whole filters/channels to maintain a model structure with high regularity for efficient hardware parallel implementation, at the cost of obvious accuracy degradation. To overcome these disadvantages, fine-grained structured pruning [8], [9] follows a pruning pattern (chosen from a predefined library) to prune each CONV kernel, where the predefined patterns have been optimized with compiler optimizations for mobile acceleration. Fine-grained structured pruning can achieve high accuracy due to the flexibility of different patterns, and high hardware parallelism (and mobile acceleration) with compiler-based code generation and optimization.

III. AUTOMATIC NEURAL PRUNING SEARCH

We show the framework in Fig. 1, consisting of two basic components: a generator and an evaluator. Given the search space, the generator first generates or samples various pruning proposals. Then the evaluator evaluates their detection accuracy and speed performance, and feeds the results back to the generator. Next the generator samples new pruning proposals based on the existing proposals' performance. After iterations, the framework can obtain a final pruning proposal with satisfactory detection accuracy and speed performance.

Fig. 1. Automatic network pruning search framework. [The generator provides multiple pruning proposals to the evaluator; the evaluator trains an ensemble of neural predictors, selects proposals based on an acquisition function, evaluates the selected proposals, and feeds their performance back to the generator.]

In each iteration, the evaluator first trains an ensemble of neural predictors and then selects proposals based on their acquisition function values enabled by the predictor ensemble. Next the selected proposals are evaluated to obtain their performance, while the remaining unselected proposals are not evaluated, thus reducing evaluation time and effort.

After the framework finishes and outputs a final pruning proposal, we further apply ADMM pruning [17] to perform an enhanced pruning following the best proposal. Compared with the simple magnitude pruning [3] method applied during evaluation for time-saving, ADMM usually outperforms magnitude pruning in terms of accuracy at an increased complexity, which is why we only adopt it for the final proposal.

A. Generator

The generator samples pruning proposals from the search space. Each pruning proposal is a directed graph consisting of the layer-wise pruning scheme and layer-wise pruning rate. For example, it has 20 nodes for a 10-layer DNN model.

1) Proposal Formulation (Search Space): Each pruning proposal contains the pruning scheme and pruning rate for each layer of the model, as shown in Tab. I.

TABLE I
SEARCH SPACE FOR EACH DNN LAYER

Pruning scheme   {Filter [18], Pattern-based [9], Block-based [10]}
Pruning rate     {1×, 2×, 3×, 5×, 7×, 10×, 15×}

Per-layer pruning scheme: The generator can choose from filter (channel) pruning [18], pattern-based pruning [8], and block-based pruning [10] for each layer. As different layers may have different best-suited pruning schemes, the generator can choose different pruning schemes for different layers, which is also supported by our compiler code generation.

Per-layer pruning rate: The pruning rate is the ratio between the number of original parameters and the number of parameters left after pruning. We can choose from the list {1×, 2×, 3×, 5×, 7×, 10×, 15×}, where 1× means the layer is not pruned (i.e., bypassing this layer).
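For concreteness, the per-layer search space of Tab. I can be written down directly. The following is a minimal Python sketch of proposal sampling, assuming the paper's 10-layer example; the constant and function names are our own illustration, not the authors' code:

    import random

    # Per-layer search space from Tab. I.
    PRUNING_SCHEMES = ["filter", "pattern", "block"]
    PRUNING_RATES = [1, 2, 3, 5, 7, 10, 15]  # 1x means the layer is not pruned

    def random_proposal(num_layers=10):
        """Sample a proposal: one (scheme, rate) decision pair per layer."""
        return [(random.choice(PRUNING_SCHEMES), random.choice(PRUNING_RATES))
                for _ in range(num_layers)]

    proposal = random_proposal()  # 10 layers -> 20 decision nodes in the graph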
2) Proposal Updating: The generator keeps a record of all evaluated pruning proposals with their evaluation performance. To generate new pruning proposals, it mutates the proposals with the best evaluation performance in the record by randomly changing one pruning scheme or one pruning rate of one layer. More specifically, it first selects the K proposals with the highest evaluation performance, and mutates each of them iteratively until it obtains C new proposals.
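Continuing the sketch above, the mutation step can be expressed as follows; how a parent is picked for each of the C children is our own assumption:

    import copy
    import random

    def mutate(proposal):
        """Randomly change one pruning scheme or one pruning rate of one layer."""
        child = copy.deepcopy(proposal)
        layer = random.randrange(len(child))
        scheme, rate = child[layer]
        if random.random() < 0.5:
            scheme = random.choice([s for s in PRUNING_SCHEMES if s != scheme])
        else:
            rate = random.choice([r for r in PRUNING_RATES if r != rate])
        child[layer] = (scheme, rate)
        return child

    def generate_new_proposals(records, K, C):
        """Mutate the K best evaluated proposals until C new ones are obtained."""
        best = sorted(records, key=lambda rec: rec["reward"], reverse=True)[:K]
        return [mutate(best[i % K]["proposal"]) for i in range(C)]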



3) Proposal Encoding: As pruning proposals are basically graphs, special attention is required for the proposal representation. Different from traditional graph representations with an adjacency matrix, we adopt a pruning encoding that represents each proposal as a vector of binary values. There is a binary feature for each possible node in each layer, denoting whether that node (a pruning scheme or pruning rate of a certain layer) is adopted or not. To encode a proposal, we simply check which pruning scheme or rate is applied for each layer, and set the corresponding features to 1. This simple proposal encoding helps with proposal evaluation.
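A minimal sketch of this binary encoding, continuing the earlier Python sketch (the feature ordering within a layer is our own choice):

    def encode(proposal):
        """One binary feature per possible node (scheme or rate) per layer."""
        features = []
        for scheme, rate in proposal:
            features += [1.0 if s == scheme else 0.0 for s in PRUNING_SCHEMES]
            features += [1.0 if r == rate else 0.0 for r in PRUNING_RATES]
        return features  # length = num_layers * (3 schemes + 7 rates)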
B. Evaluator

The evaluator needs to evaluate the performance of pruning proposals. We define the performance measurement (reward) as

$m = V - \alpha \cdot \max(0, r - R)$,   (1)

where V is the validation mean average precision (mAP) of the model, r is the model inference latency, which is actually measured on a mobile device with compiler code generation and optimization for inference acceleration, and R is the real-time requirement threshold. Generally, satisfying the real-time requirement (r < R) with high mAP leads to a high m. Otherwise, if the real-time requirement is violated, m is small.
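The reward in Eq. (1) is straightforward to compute once mAP and latency are measured; a minimal Python sketch, using α = 0.01 as in Sec. IV and illustrative numbers from Tab. II:

    def reward(mAP, latency_ms, R_ms, alpha=0.01):
        """Eq. (1): m = V - alpha * max(0, r - R); no penalty when r < R."""
        return mAP - alpha * max(0.0, latency_ms - R_ms)

    # With R = 100 ms (the 0.24m grid-size requirement):
    reward(75.72, 98, 100)   # 98 ms meets the deadline  -> 75.72
    reward(74.09, 140, 100)  # 40 ms over the deadline   -> 73.69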
1) Fast Evaluation with BO: As it incurs a large time cost to evaluate the performance of each pruning proposal (including pruning and retraining the model for multiple epochs), we adopt Bayesian optimization (BO) [19] to accelerate evaluation. The generator provides C pruning proposals, and the evaluator first uses BO to select the B proposals with potentially better performance. Next the evaluator measures the actual accuracy and speed performance of the selected proposals, while the remaining unselected proposals are not evaluated. Thus, the number of actually evaluated proposals is reduced.

In general, there are two main components in BO: training an ensemble of neural predictors, and selecting proposals based on the acquisition function values enabled by the predictor ensemble. To make use of BO, the ensemble of neural predictors provides an accuracy prediction with its corresponding uncertainty estimate for an unseen pruning proposal. Then BO is able to choose the proposals that maximize the acquisition function. We show the full algorithm in Algorithm 1 and specify the two components in the following.

Algorithm 1 Evaluation with predictor ensemble & BO
Input: observation data D, BO batch size B, BO acquisition function φ(·)
Output: the best pruning proposal g
for steps do
    Generate a pool of candidate pruning proposals G_c;
    Train an ensemble of neural predictors with D;
    Select {ĝ^i}_{i=1}^B = arg max_{g ∈ G_c} φ(g);
    Evaluate the selected proposals and obtain rewards {r^i}_{i=1}^B of {ĝ^i}_{i=1}^B;
    D ← D ∪ ({ĝ^i}_{i=1}^B, {r^i}_{i=1}^B);
end for
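Algorithm 1 maps directly onto a short search loop; the following sketch assumes the components are supplied as callables, and all names here are placeholders rather than the paper's code:

    def bo_search(D, steps, B, generate_pool, train_ensemble, acquisition, evaluate):
        """Sketch of Algorithm 1: BO-guided selection of proposals to evaluate."""
        for _ in range(steps):
            pool = generate_pool(D)               # C candidate proposals (generator)
            predictors = train_ensemble(D)        # ensemble of neural predictors
            ranked = sorted(pool, key=lambda g: acquisition(predictors, g),
                            reverse=True)
            for g in ranked[:B]:                  # top-B by acquisition value
                D.append((g, evaluate(g)))        # prune, retrain, measure reward
        return max(D, key=lambda pair: pair[1])[0]  # best proposal observed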
2) Ensemble of Neural Predictors: We use a neural network, repeatedly trained on the current set of evaluated pruning proposals with their evaluation performance, as a neural predictor to predict the reward (incorporating the accuracy and speed performance) of unseen pruning proposals. The neural network is a sequential fully-connected network with 8 layers of width 30, trained by the Adam optimizer with a learning rate of 0.01. Note that predictor training does not cost much effort due to the simple architectures and parallel training.

For the loss function of the neural predictors, the mean absolute percentage error (MAPE) is adopted, as it gives a higher weight to pruning proposals with higher evaluation performance:

$L(m_{pred}, m_{true}) = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{m_{pred}^{(i)} - m_{UB}}{m_{true}^{(i)} - m_{UB}} - 1 \right|$,   (2)

where $m_{pred}^{(i)}$ and $m_{true}^{(i)}$ are the predicted and true values of the reward for the i-th proposal in a batch, and $m_{UB}$ is a global upper bound on the maximum true reward.

To incorporate BO, we also need an uncertainty estimate for the prediction, so we adopt an ensemble of neural predictors to provide it. More specifically, we train P neural predictors using different random weight initializations and training data orders; then for any proposal we can obtain the mean and standard deviation of the P predictions. That is, we train an ensemble of P predictive models $\{f_p\}_{p=1}^{P}$, where $f_p : A \rightarrow \mathbb{R}$ takes a pruning proposal g as input and outputs the predicted reward. The mean prediction and its deviation are given by

$\hat{f}(g) = \frac{1}{P} \sum_{p=1}^{P} f_p(g)$, and $\hat{\sigma}(g) = \sqrt{\frac{\sum_{p=1}^{P} (f_p(g) - \hat{f}(g))^2}{P - 1}}$.   (3)
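A minimal PyTorch sketch of the predictor ensemble, under the stated architecture (8 fully-connected layers of width 30) and Eqs. (2)-(3); the helper names are our own, and the Adam training loop (lr 0.01) is omitted:

    import torch
    import torch.nn as nn

    def make_predictor(in_dim, width=30, depth=8):
        """Fully-connected reward predictor (8 layers of width 30)."""
        layers, d = [], in_dim
        for _ in range(depth - 1):
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers.append(nn.Linear(d, 1))  # scalar reward output
        return nn.Sequential(*layers)

    def mape_loss(m_pred, m_true, m_ub):
        """Eq. (2): MAPE relative to the global reward upper bound m_UB."""
        return torch.mean(torch.abs((m_pred - m_ub) / (m_true - m_ub) - 1.0))

    def ensemble_stats(predictors, x):
        """Eq. (3): mean and sample standard deviation over the P predictors."""
        preds = torch.stack([f(x) for f in predictors])
        return preds.mean(dim=0), preds.std(dim=0, unbiased=True)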
3) Selection with Acquisition Function: After training the ensemble of neural predictors, we can obtain the acquisition function value for each proposal and select a small subset of proposals with the largest acquisition values. We choose the upper confidence bound (UCB) [20] as the acquisition function,

$\phi_{UCB}(g) = \hat{f}(g) + \beta \hat{\sigma}(g)$,   (4)

where the tradeoff parameter β is set to 0.5.
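Continuing the sketch above, the UCB score of Eq. (4) combines the ensemble mean and deviation; encode and ensemble_stats are the hypothetical helpers defined earlier:

    def ucb(predictors, proposal, beta=0.5):
        """Eq. (4): phi_UCB(g) = f_hat(g) + beta * sigma_hat(g)."""
        x = torch.tensor(encode(proposal))  # binary proposal encoding
        mean, std = ensemble_stats(predictors, x)
        return (mean + beta * std).item()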



4) Evaluation with Magnitude Pruning: After selecting the pruning proposals from the pool, the evaluator uses a magnitude-based pruning framework [3] (with two steps, pruning and retraining) to perform the actual pruning and obtain the evaluation performance for each proposal. Note that the proposals can be evaluated in parallel. Besides, the speed measurement on a mobile device can be performed in parallel with the accuracy measurement.
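To illustrate the magnitude criterion only — the paper applies it per the layer's structured scheme (over filters, patterns, or blocks), which this unstructured sketch deliberately omits — a hedged PyTorch example:

    import torch

    def magnitude_prune(weight, rate):
        """Keep the largest-magnitude 1/rate of the weights; zero the rest."""
        if rate <= 1:
            return weight                                # 1x: layer not pruned
        keep = weight.numel() // rate                    # number of weights kept
        cutoff = weight.abs().flatten().kthvalue(weight.numel() - keep).values
        mask = (weight.abs() > cutoff).to(weight.dtype)  # strictly above cutoff
        return weight * mask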
IV. EXPERIMENTAL RESULTS

A. Experiment Setup

We focus on 3D detection, employ PointPillars [13] as the starting point, and test on the KITTI dataset [21]. We use 40 GPUs for parallel training and pruning search, and it takes about 6 days to find the best pruning proposal in each experiment. In Eq. (1), we set α to 0.01 and measure the mobile inference time in milliseconds. The pool size C is set to 50 and the Bayesian batch size B is set to 10. We test the speed performance on the mobile GPU (Qualcomm Adreno 640) of a Samsung Galaxy S20 smartphone.

TABLE II
COMPARISON OF VARIOUS PRUNING METHODS FOR POINTPILLARS

Methods (grid size)    Para. #   Comp. # (MACs)   Speed (ms)   Car 3D detection (Easy / Moderate / Hard)
PointPillars (0.16)    5.8M      60G              553          85.16 / 74.39 / 69.42
Filter [22] (0.16)     1.1M      10.8G            178          80.63 / 67.51 / 65.28
Pattern [8] (0.16)     1.1M      10.7G            225          83.64 / 74.30 / 68.42
Block [10] (0.16)      1.1M      10.7G            268          82.86 / 75.43 / 69.71
Ours (0.16)            1.1M      10.7G            193          85.52 / 76.69 / 70.10
PointPillars (0.24)    5.4M      28G              253          84.24 / 75.28 / 68.46
Filter [22] (0.24)     0.8M      4.0G             82           81.36 / 68.06 / 65.77
Pattern [8] (0.24)     0.8M      3.9G             116          82.16 / 73.93 / 68.25
Block [10] (0.24)      0.8M      4.0G             140          83.69 / 74.09 / 68.06
Ours (0.24)            0.8M      3.9G             98           85.38 / 75.72 / 68.53

Fig. 2. Comparison with other methods. [Plot of detection accuracy (y-axis, 66-78) versus inference latency (x-axis, 0-600 ms).]
B. Performance on 3D Object Detection

As shown in Tab. II and Fig. 2, we compare the performance of the original unpruned PointPillars model with the models derived by our method and by other pruning methods, under different grid sizes (0.16m and 0.24m). We set the threshold of the real-time requirement to 200ms for 0.16m and 100ms for 0.24m. Regarding the grid size, as a larger grid size leads to a smaller pseudo-image input size for the model, the 0.24m grid size has smaller parameter and computation counts, and a faster inference speed on mobile GPUs, compared with 0.16m.

For the same grid size, compared with the original unpruned PointPillars model, we observe that our method can significantly reduce the number of parameters and computations, achieving state-of-the-art detection performance while satisfying the real-time requirement. The accuracy of our method is even higher than that of the unpruned model, demonstrating that the unpruned model may suffer from over-fitting and that removing the redundancy can help with its accuracy.

We also compare with other pruning methods under the same grid size. For the other pruning methods, the same pruning scheme is applied to all layers, and the pruning rate is set to match the overall pruning ratio of our pruned model (80% for grid size 0.16m and 86% for 0.24m). As observed, the proposed method achieves the best detection performance with the highest accuracy compared with the other methods that use the same pruning scheme for every layer, demonstrating the advantage of using different pruning schemes for different layers. We notice that although filter pruning can be faster than our method, it suffers from an obvious degradation in detection performance.

For the speed, we notice that for grid size 0.24m, the proposed method needs only 98ms to process one LiDAR image on mobile devices with the highest accuracy, demonstrating its superior capability of achieving (close-to) real-time inference on mobile with state-of-the-art detection performance.
V. CONCLUSION

We propose a pruning search framework to flexibly configure the pruning scheme and rate for each layer in the model under a real-time inference requirement. Our experiments demonstrate that the proposed method achieves (close-to) real-time (98ms) 3D object detection based on PointPillars, on an off-the-shelf mobile phone with minor (or no) accuracy loss.

VI. ACKNOWLEDGEMENTS

This project is partly supported by National Science Foundation CNS-1929300, CNS-1739748, and CNS-1909172, Army Research Office/Army Research Laboratory (ARO) W911NF-20-1-0167 (YIP) to Northeastern University, a grant from Semiconductor Research Corporation (SRC), and Jeffress Trust Awards in Interdisciplinary Research. Any opinions, findings, and conclusions in this material are those of the authors and do not necessarily reflect the views of NSF, ARO, SRC, or the Thomas F. and Kate Miller Jeffress Memorial Trust.

REFERENCES

[1] W. Wen, C. Wu et al., "Learning structured sparsity in deep neural networks," in NeurIPS, 2016, pp. 2074-2082.
[2] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient DNNs," in NeurIPS, 2016, pp. 1379-1387.
[3] S. Han, J. Pool et al., "Learning both weights and connections for efficient neural network," in NeurIPS, 2015, pp. 1135-1143.
[4] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," arXiv preprint arXiv:1810.05270, 2018.
[5] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in ICCV, 2017, pp. 1389-1397.
[6] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in ICCV, 2017, pp. 5058-5066.
[7] N. Liu, X. Ma et al., "AutoCompress: An automatic DNN structured pruning framework for ultra-high compression rates," in AAAI, 2020.
[8] X. Ma et al., "PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices," in AAAI, 2020.
[9] W. Niu et al., "PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning," arXiv preprint arXiv:2001.00138, 2020.
[10] P. Dong, S. Wang et al., "RTMobile: Beyond real-time mobile acceleration of RNNs for speech recognition," arXiv preprint arXiv:2002.11474, 2020.
[11] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in ICLR, 2017.
[12] H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," arXiv preprint arXiv:1806.09055, 2018.
[13] A. H. Lang, S. Vora et al., "PointPillars: Fast encoders for object detection from point clouds," in CVPR, 2019, pp. 12697-12705.
[14] Y. Yan, Y. Mao, and B. Li, "SECOND: Sparsely embedded convolutional detection," Sensors, vol. 18, no. 10, p. 3337, 2018.
[15] W. Shi and R. Rajkumar, "Point-GNN: Graph neural network for 3D object detection in a point cloud," in CVPR, 2020, pp. 1711-1719.
[16] H. Mao, S. Han et al., "Exploring the regularity of sparse structure in convolutional neural networks," arXiv preprint arXiv:1705.08922, 2017.
[17] T. Zhang, S. Ye et al., "Systematic weight pruning of DNNs using alternating direction method of multipliers," in ECCV, 2018.
[18] Z. Zhuang, M. Tan et al., "Discrimination-aware channel pruning for deep neural networks," in NeurIPS, 2018, pp. 875-886.
[19] Y. Chen, A. Huang et al., "Bayesian optimization in AlphaGo," arXiv preprint arXiv:1812.06855, 2018.
[20] N. Srinivas, A. Krause et al., "Gaussian process optimization in the bandit setting: No regret and experimental design," in ICML, 2010.
[21] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in CVPR, 2012.
[22] Y. He, P. Liu et al., "Filter pruning via geometric median for deep convolutional neural networks acceleration," in CVPR, 2019.



