Abstract—In autonomous driving, 3D object detection is essential as it provides basic knowledge about the environment. However, as deep learning based 3D detection methods are usually computation intensive, it is challenging to support real-time 3D object detection on edge-computing devices in self-driving cars with limited computation and memory resources. To facilitate this, we propose a compiler-aware pruning search framework to achieve real-time inference of 3D object detection on resource-limited mobile devices. Specifically, a generator is applied to sample better pruning proposals in the search space based on the current proposals and their performance, and an evaluator is adopted to evaluate the performance of the sampled pruning proposals. To accelerate the search, the evaluator employs Bayesian optimization with an ensemble of neural predictors. We demonstrate in experiments that, for the first time, the pruning search framework can achieve real-time 3D object detection on mobile (a Samsung Galaxy S20 phone) with state-of-the-art detection performance.

Index Terms—3D object detection, real-time, point cloud

I. INTRODUCTION

With the rapid development of autonomous vehicles that self-drive without human intervention, object detection (especially 3D detection on LiDAR data) serves as a fundamental prerequisite for autonomous navigation. 3D detection extracts the desirable knowledge about the environment from the 3D point clouds of LiDAR sensors, thus enabling high-level computations and optimizations for auto-driving.

Due to the requirement of instantaneous interaction with the environment in auto-driving, it is essential to implement real-time 3D object detection on autonomous vehicles. However, current deep neural network (DNN) based 3D object detectors usually cost tremendous memory and computation resources, leading to difficulties for real-time implementation, especially on autonomous vehicles with limited hardware resources. Though more powerful high-end GPUs can be adopted for this task, they usually incur significantly increased price and power consumption. Thus, it is desirable to facilitate real-time 3D detection deployment on autonomous cars.

To reduce DNN model size and computations, DNN weight pruning [1], [2] has shown great advantages in removing redundancy from the model, thereby reducing storage/computation cost and accelerating inference. There are unstructured pruning schemes [2]–[4] that remove arbitrary weights, coarse-grained structured pruning schemes [1], [4]–[7] that eliminate whole filters/channels, and fine-grained structured pruning schemes [8]–[10] that assign different pruning patterns to convolutional (CONV) kernels. Though unstructured pruning can achieve high accuracy, the arbitrarily pruned irregular weights limit hardware parallelism, leading to difficulties for inference acceleration. Compared with unstructured pruning, structured pruning can achieve higher hardware parallelism and mobile inference acceleration, assisted by compiler-level code generation and optimization techniques [9], with competitive classification/detection performance.

Though compiler optimization can support various structured pruning (sparsity) schemes with notable mobile acceleration performance, we found that different sparsity schemes lead to different accuracy and acceleration performance under compiler optimization. For the specific 3D detection problem, it remains an open question which sparsity scheme and which pruning rate to adopt to satisfy the accuracy and real-time requirements. To find the pruning solution, motivated by the idea of Neural Architecture Search (NAS) [11], [12], we propose a compiler-aware pruning search framework to automatically determine the pruning scheme and pruning rate for each individual layer. The objective is to maximize accuracy under an inference speed/latency constraint on the target mobile device. Different from previous work with a fixed pruning scheme for all layers, our framework can assign different pruning schemes and rates to different layers in the model. We summarize our contributions as follows:

• We incorporate the overall DNN latency constraint into the automatic pruning search process to satisfy a predefined real-time requirement.
• Our framework configures different pruning schemes and pruning rates for different layers, which differs from previous works with a fixed pruning scheme for all layers.
Algorithm 1 Evaluation with predictor ensemble & BO
Input: Observation data D, BO batch size B, BO acquisition function φ(·)
Output: The best pruning proposal g
for steps do
    Generate a pool of candidate pruning proposals G_c;
    Train an ensemble of neural predictors with D;
    Select {ĝ^i}_{i=1}^B = arg max_{g∈G_c} φ(g);
    Evaluate the proposals and obtain rewards {r^i}_{i=1}^B of {ĝ^i}_{i=1}^B;
    D ← D ∪ ({ĝ^i}_{i=1}^B, {r^i}_{i=1}^B);
end for
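For concreteness, the loop of Algorithm 1 can be sketched in a few lines of Python. The helper functions (generate_candidate_pool, train_predictor_ensemble, ucb_score, evaluate_proposal) are hypothetical names standing in for the components specified later in this section, not part of a released implementation.

```python
# Minimal sketch of Algorithm 1 with hypothetical helper functions.
def pruning_search(D, steps, C, B):
    """D: list of (proposal, reward) pairs observed so far."""
    for _ in range(steps):
        pool = generate_candidate_pool(D, size=C)        # candidate proposals G_c
        ensemble = train_predictor_ensemble(D)           # P neural predictors fit on D
        # Keep the B proposals that maximize the acquisition function phi(g).
        selected = sorted(pool, key=lambda g: ucb_score(ensemble, g), reverse=True)[:B]
        # Expensive step: prune, retrain, and measure on-device latency.
        rewards = [evaluate_proposal(g) for g in selected]
        D.extend(zip(selected, rewards))                 # D <- D ∪ {(g_i, r_i)}
    return max(D, key=lambda pair: pair[1])[0]           # best proposal g
```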
centage error (MAPE) is adopted as it can give a higher weight
to pruning proposals with higher evaluation performance:
(i)
n
the highest evaluation performance, and mutates each of them 1 mpred − mUB
iteratively until it gets C new proposals. L(mpred , mtrue ) = (i) − 1 , (2)
n i=1 mtrue − mUB
3) Proposal encoding: As pruning proposals are basically
(i) (i)
graphs, special attention is required for the proposal repre- where mpred and mtrue are the predicted and true values of the
sentation. Different from traditional representations with an reward for the i-th proposal in a batch, and mUB is a global
adjacency matrix for graphs, we adopt the pruning encoding to upper bound on the maximum true reward.
encode each proposal with a vector of binary values. There is To incorporate BO, it also needs an uncertainty estimate for
the prediction. So we adopt an ensemble of neural predictors to
a binary feature for each possible node in each layer, denoting provide the uncertainty estimate. More specifically, we train P
whether the node (pruning scheme or pruning rate of certain neural predictors using different random weight initializations
layer) is adopted or not. To encode a proposal, we simply and training data orders. Then for any proposal, we can obtain
check which pruning scheme or rate for each layer is applied, the mean and standard deviation of these P predictions. More
and set the corresponding features to 1s. This simple proposal specifically, we train an ensemble of P predictive models,
{fp }P
p=1 , where fp : A → R with a pruning proposal g as
encoding can help with proposal evaluation.
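As a concrete illustration, the sketch below encodes a proposal over a hypothetical per-layer search space; the scheme and rate choices here are placeholders for illustration, not the paper's actual search space.

```python
# Hypothetical per-layer choices, for illustration only.
SCHEMES = ["filter", "pattern", "block"]
RATES = [0.5, 0.75, 0.875]

def encode_proposal(proposal):
    """proposal: list of (scheme, rate) pairs, one per layer.
    Returns a binary vector with one feature per possible node."""
    vec = []
    for scheme, rate in proposal:
        vec += [1 if s == scheme else 0 for s in SCHEMES]  # scheme one-hot
        vec += [1 if r == rate else 0 for r in RATES]      # rate one-hot
    return vec

# Example: a 2-layer proposal.
print(encode_proposal([("pattern", 0.75), ("block", 0.875)]))
# [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
```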
B. Evaluator

The evaluator needs to evaluate the performance of pruning proposals. We define the performance measurement (reward) as:

\[ m = V - \alpha \cdot \max(0,\, r - R), \tag{1} \]

where V is the validation mean average precision (mAP) of the model, r is the model inference latency, which is actually measured on a mobile device with compiler code optimization and generation for inference acceleration, and R is the real-time requirement threshold. Generally, satisfying the real-time requirement (r < R) with a high mAP leads to a high m; if the real-time requirement is violated, m is small.
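Eq. (1) translates directly into code; a minimal sketch, with α and the latency unit as configured in the experiment setup:

```python
def reward(mAP, latency_ms, R_ms, alpha=0.01):
    """Eq. (1): only latency in excess of the real-time threshold R is penalized."""
    return mAP - alpha * max(0.0, latency_ms - R_ms)

# Example: at R = 200 ms, a 193 ms proposal keeps its full mAP as reward,
# while a 253 ms one loses 0.01 * 53 = 0.53 points.
```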
1) Fast Evaluation with BO: As it incurs a large time cost to evaluate the performance of each pruning proposal (including pruning and retraining the model for multiple epochs), we adopt Bayesian optimization (BO) [19] to accelerate evaluation. The generator provides C pruning proposals, and the evaluator first uses BO to select B proposals with potentially better performance. Next, the evaluator measures the actual accuracy and speed performance of the selected proposals, while the remaining unselected proposals are not evaluated. Thus, the number of actually evaluated proposals is reduced.

In general, there are two main components in BO: training an ensemble of neural predictors, and selecting proposals based on the acquisition function values enabled by the predictor ensemble. To make use of BO, the ensemble of neural predictors provides a reward prediction with a corresponding uncertainty estimate for an unseen pruning proposal. Then BO is able to choose the proposal which maximizes the acquisition function. We show the full algorithm in Algorithm 1 and specify the two components in the following.

2) Ensemble of Neural Predictors: We use a neural network, repeatedly trained on the current set of evaluated pruning proposals with their evaluation performance, as a neural predictor to predict the reward (incorporating the accuracy and speed performance) of unseen pruning proposals. The neural network is a sequential fully-connected network with 8 layers of width 30, trained by the Adam optimizer with a learning rate of 0.01. Note that predictor training does not cost much effort due to the simple architectures and parallel training.
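Under the stated configuration (a sequential fully-connected network, 8 layers of width 30, Adam with learning rate 0.01), one plausible PyTorch rendering of a single predictor is the following sketch; the ReLU activations and the separate scalar output head are our assumptions, as the paper does not spell them out:

```python
import torch.nn as nn
import torch.optim as optim

def make_predictor(in_dim, width=30, depth=8):
    # 8 fully-connected layers of width 30 (ReLU assumed), plus a scalar
    # reward head; the exact activation/head layout is an assumption.
    layers = [nn.Linear(in_dim, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)

predictor = make_predictor(in_dim=120)  # in_dim = length of the proposal encoding (example value)
optimizer = optim.Adam(predictor.parameters(), lr=0.01)
```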
For the loss function of the neural predictors, the mean absolute percentage error (MAPE) is adopted, as it gives a higher weight to pruning proposals with higher evaluation performance:

\[ \mathcal{L}(m_{\text{pred}}, m_{\text{true}}) = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{m_{\text{pred}}^{(i)} - m_{\text{UB}}}{m_{\text{true}}^{(i)} - m_{\text{UB}}} - 1 \right|, \tag{2} \]

where $m_{\text{pred}}^{(i)}$ and $m_{\text{true}}^{(i)}$ are the predicted and true values of the reward for the i-th proposal in a batch, and $m_{\text{UB}}$ is a global upper bound on the maximum true reward.
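A direct PyTorch transcription of Eq. (2) is short (a sketch; m_ub must strictly upper-bound every true reward so the denominator never vanishes):

```python
import torch

def mape_loss(m_pred, m_true, m_ub):
    # Eq. (2): errors are normalized by (m_true - m_ub), so proposals whose
    # true reward is close to the upper bound m_ub receive a larger weight.
    return torch.mean(torch.abs((m_pred - m_ub) / (m_true - m_ub) - 1.0))
```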
To incorporate BO, an uncertainty estimate for the prediction is also needed, so we adopt an ensemble of neural predictors to provide it. More specifically, we train an ensemble of P predictive models $\{f_p\}_{p=1}^{P}$ using different random weight initializations and training data orders, where $f_p: \mathcal{A} \rightarrow \mathbb{R}$ takes a pruning proposal g as input and outputs the predicted reward. Then, for any proposal, the mean of the P predictions and its deviation are given by

\[ \hat{f}(g) = \frac{1}{P} \sum_{p=1}^{P} f_p(g), \quad \hat{\sigma}(g) = \sqrt{\frac{\sum_{p=1}^{P} \big(f_p(g) - \hat{f}(g)\big)^2}{P - 1}}. \tag{3} \]

3) Selection with Acquisition Function: After training the ensemble of neural predictors, we can obtain the acquisition function value for each proposal and select the small subset of proposals with the largest acquisition values. We choose the upper confidence bound (UCB) [20] as the acquisition function,

\[ \phi_{\text{UCB}}(g) = \hat{f}(g) + \beta\, \hat{\sigma}(g), \tag{4} \]

where the tradeoff parameter β is set to 0.5.
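Eqs. (3) and (4) then amount to a mean and a standard deviation over the P predictions followed by the UCB combination; a sketch with numpy, treating each predictor as a callable that maps an encoded proposal to a scalar:

```python
import numpy as np

def ucb_score(ensemble, g_encoded, beta=0.5):
    # Eqs. (3)-(4): ensemble mean f_hat, deviation sigma_hat, then UCB.
    preds = np.array([f(g_encoded) for f in ensemble])
    f_hat = preds.mean()
    sigma_hat = preds.std(ddof=1)  # ddof=1 matches the (P - 1) denominator in Eq. (3)
    return f_hat + beta * sigma_hat
```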
4) Evaluation with Magnitude Pruning: After selecting the pruning proposals from the pool, the evaluator uses a magnitude-based pruning framework [3] (with two steps: pruning and retraining) to perform the actual pruning and obtain the evaluation performance for each proposal. Note that the proposals can be evaluated in parallel. Besides, the speed measurement on a mobile device can be performed in parallel with the accuracy measurement.
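A generic magnitude-based prune step (the first of the two steps; retraining follows) can be sketched as below; this illustrates plain magnitude pruning [3] and omits the per-layer filter/pattern/block structure that the searched schemes impose:

```python
import torch

def magnitude_prune_(weight, rate):
    # Zero out the `rate` fraction of entries with the smallest magnitude,
    # in place; a retraining phase then recovers accuracy.
    k = int(weight.numel() * rate)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.masked_fill_(weight.abs() <= threshold, 0.0)
```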
IV. EXPERIMENTAL RESULTS

A. Experiment Setup

We focus on 3D detection, employ PointPillars [13] as the starting point, and test on the KITTI dataset [21]. We use 40 GPUs for parallel training and pruning search, and it takes about 6 days to find the best pruning proposal in each experiment. In Eq. (1), we set α to 0.01, with the mobile inference time measured in milliseconds. The pool size C is set to 50 and the Bayesian batch size B is set to 10. We test the speed performance on the mobile GPU (Qualcomm Adreno 640) of a Samsung Galaxy S20 smartphone.
TABLE II
COMPARISON OF VARIOUS PRUNING METHODS FOR POINTPILLARS

Methods (grid size)  | Para. # | Comp. # (MACs) | Speed (ms) | Car 3D detection (Easy / Moderate / Hard)
PointPillars (0.16)  | 5.8M    | 60G            | 553        | 85.16 / 74.39 / 69.42
Filter [22] (0.16)   | 1.1M    | 10.8G          | 178        | 80.63 / 67.51 / 65.28
Pattern [8] (0.16)   | 1.1M    | 10.7G          | 225        | 83.64 / 74.30 / 68.42
Block [10] (0.16)    | 1.1M    | 10.7G          | 268        | 82.86 / 75.43 / 69.71
Ours (0.16)          | 1.1M    | 10.7G          | 193        | 85.52 / 76.69 / 70.10
PointPillars (0.24)  | 5.4M    | 28G            | 253        | 84.24 / 75.28 / 68.46
Filter [22] (0.24)   | 0.8M    | 4.0G           | 82         | 81.36 / 68.06 / 65.77
Pattern [8] (0.24)   | 0.8M    | 3.9G           | 116        | 82.16 / 73.93 / 68.25
Block [10] (0.24)    | 0.8M    | 4.0G           | 140        | 83.69 / 74.09 / 68.06
Ours (0.24)          | 0.8M    | 3.9G           | 98         | 85.38 / 75.72 / 68.53

Fig. 2. Comparison with other methods.
B. Performance on 3D Object Detection

As shown in Tab. II and Fig. 2, we compare the performance of the original unpruned PointPillars model with that of the models derived by our method and by other pruning methods under different grid sizes (0.16m and 0.24m). We set the real-time requirement threshold to 200ms for 0.16m and 100ms for 0.24m. Regarding the grid size, since a larger grid size leads to a smaller pseudo-image input for the model, the 0.24m grid size yields smaller parameter and computation counts and a faster inference speed on mobile GPUs compared with 0.16m.

For the same grid size, compared with the original unpruned PointPillars model, we observe that our method can significantly reduce the number of parameters and computations, achieving state-of-the-art detection performance while satisfying the real-time requirement. The accuracy of our method is even higher than that of the unpruned model, demonstrating that the unpruned model may suffer from over-fitting and that removing the redundancy can help with accuracy.

We also compare with other pruning methods for the same grid size. For the other pruning methods, the same pruning scheme is applied to all layers, and the pruning rate is set to match the overall pruning ratio of our pruned model (80% for grid size 0.16m and 86% for 0.24m). As observed, the proposed method achieves the best detection performance, with the highest accuracy compared with the other methods that use the same pruning scheme for every layer, demonstrating the advantage of assigning different pruning schemes to different layers. We notice that although filter pruning can be faster than our method, it suffers from an obvious degradation in detection performance.

For the speed, we notice that for grid size 0.24m, the proposed method needs only 98ms to process one LiDAR image on mobile devices while achieving the highest accuracy, demonstrating its superior performance in achieving (close-to) real-time inference on mobile with state-of-the-art detection performance.
V. CONCLUSION

We propose a pruning search framework to flexibly configure the pruning scheme and rate for each layer in the model under a real-time inference requirement. Our experiments demonstrate that the proposed method achieves (close-to) real-time (98ms) 3D object detection based on PointPillars, on an off-the-shelf mobile phone, with minor (or no) accuracy loss.

VI. ACKNOWLEDGEMENTS

This project is partly supported by National Science Foundation CNS-1929300, CNS-1739748, and CNS-1909172, Army Research Office/Army Research Laboratory (ARO) W911NF-20-1-0167 (YIP) to Northeastern University, a grant from Semiconductor Research Corporation (SRC), and Jeffress Trust Awards in Interdisciplinary Research. Any opinions, findings, and conclusions in this material are those of the authors and do not necessarily reflect the views of NSF, ARO, SRC, or the Thomas F. and Kate Miller Jeffress Memorial Trust.

REFERENCES

[1] W. Wen, C. Wu et al., "Learning structured sparsity in deep neural networks," in NeurIPS, 2016, pp. 2074–2082.
[2] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient DNNs," in NeurIPS, 2016, pp. 1379–1387.
[3] S. Han, J. Pool et al., "Learning both weights and connections for efficient neural network," in NeurIPS, 2015, pp. 1135–1143.
[4] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," arXiv preprint arXiv:1810.05270, 2018.
[5] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in ICCV, 2017, pp. 1389–1397.
[6] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in ICCV, 2017, pp. 5058–5066.
[7] N. Liu, X. Ma et al., "AutoCompress: An automatic DNN structured pruning framework for ultra-high compression rates," in AAAI, 2020.
[8] X. Ma et al., "PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices," in AAAI, 2020.
[9] W. Niu et al., "PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning," arXiv preprint arXiv:2001.00138, 2020.
[10] P. Dong, S. Wang et al., "RTMobile: Beyond real-time mobile acceleration of RNNs for speech recognition," arXiv preprint arXiv:2002.11474, 2020.
[11] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in ICLR, 2017.
[12] H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," arXiv preprint arXiv:1806.09055, 2018.
[13] A. H. Lang, S. Vora et al., "PointPillars: Fast encoders for object detection from point clouds," in CVPR, 2019, pp. 12697–12705.
[14] Y. Yan, Y. Mao, and B. Li, "SECOND: Sparsely embedded convolutional detection," Sensors, vol. 18, no. 10, p. 3337, 2018.
[15] W. Shi and R. Rajkumar, "Point-GNN: Graph neural network for 3D object detection in a point cloud," in CVPR, 2020, pp. 1711–1719.
[16] H. Mao, S. Han et al., "Exploring the regularity of sparse structure in convolutional neural networks," arXiv preprint arXiv:1705.08922, 2017.
[17] T. Zhang, S. Ye et al., "Systematic weight pruning of DNNs using alternating direction method of multipliers," in ECCV, 2018.
[18] Z. Zhuang, M. Tan et al., "Discrimination-aware channel pruning for deep neural networks," in NeurIPS, 2018, pp. 875–886.
[19] Y. Chen, A. Huang et al., "Bayesian optimization in AlphaGo," arXiv preprint arXiv:1812.06855, 2018.
[20] N. Srinivas, A. Krause et al., "Gaussian process optimization in the bandit setting: No regret and experimental design," in ICML, 2010.
[21] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in CVPR, 2012.
[22] Y. He, P. Liu et al., "Filter pruning via geometric median for deep convolutional neural networks acceleration," in CVPR, 2019.