
RadarPillars: Efficient Object Detection from 4D Radar Point Clouds

Alexander Musiat1, Laurenz Reichardt1, Michael Schulze1 and Oliver Wasenmüller1

1 Mannheim University of Applied Sciences, Germany
[email protected], [email protected], [email protected], [email protected]

arXiv:2408.05020v1 [cs.CV] 9 Aug 2024

Abstract— Automotive radar systems have evolved to provide not only range, azimuth and Doppler velocity, but also elevation data. This additional dimension allows for the representation of 4D radar as a 3D point cloud. As a result, existing deep learning methods for 3D object detection, which were initially developed for LiDAR data, are often applied to these radar point clouds. However, this neglects the special characteristics of 4D radar data, such as the extreme sparsity and the optimal utilization of velocity information. To address these gaps in the state-of-the-art, we present RadarPillars, a pillar-based object detection network. By decomposing radial velocity data, introducing PillarAttention for efficient feature extraction, and studying layer scaling to accommodate radar sparsity, RadarPillars significantly outperforms state-of-the-art detection results on the View-of-Delft dataset. Importantly, this comes at a significantly reduced parameter count, surpassing existing methods in terms of efficiency and enabling real-time performance on edge devices.

Fig. 1: Example of our RadarPillars detection results on 4D radar. Cars are marked in red, pedestrians in green and cyclists in blue. The radial velocities of the points are indicated by arrows.

I. INTRODUCTION

In the context of autonomy and automotive applications, radar stands out as a pivotal sensing technology, enabling vehicles to detect objects and obstacles in their surroundings. This capability is crucial for ensuring the safety and efficiency of various autonomous driving functionalities, including collision avoidance, adaptive cruise control, and lane-keeping assistance. Recent advancements in radar technology have led to the development of 4D radar, incorporating three spatial dimensions along with an additional dimension for Doppler velocity. Unlike traditional radar systems, 4D radar introduces elevation information as its third dimension. This enhancement allows for the representation of radar data in 3D point clouds, akin to those generated by LiDAR or depth-sensing cameras, thereby enabling the application of deep learning methodologies previously reserved for such sensors.

However, while deep learning techniques from the domain of LiDAR detection have been adapted to 4D radar data, they have not fully explored or adapted to its unique features. Compared to LiDAR data, 4D radar data is significantly less abundant. Regardless of this sparsity, radar uniquely provides velocity as a feature, which could help in the detection of moving objects in various scenarios, such as at long range where LiDAR traditionally struggles [1]. In the View-of-Delft dataset, an average 4D radar scan comprises only 216 points, while a LiDAR scan within the same field of view contains 21,344 points [2]. In response, we propose our RadarPillars, a novel 3D detection network tailored specifically for 4D radar data. Through RadarPillars we address gaps in the current state-of-the-art with the following contributions, significantly improving performance while maintaining real-time capabilities:

• Enhancement of velocity information utilization: We decompose radial velocity data, providing additional features to significantly enhance network performance.
• Adapting to radar sparsity: RadarPillars leverages the pillar representation [3] for efficient real-time processing. We capitalize on the sparsity inherent in 4D radar data and introduce PillarAttention, a novel self-attention layer treating each pillar as a token, while maintaining both efficiency and real-time performance.
• Scaling for sparse radar data: We demonstrate that the sparsity of radar data can lead to less informative features in the detection network. Through uniform network scaling, we not only improve performance but also significantly reduce parameter count, enhancing runtime efficiency.

II. RELATED WORK

A. 4D Radar Object Detection

Point clouds can be processed in various ways: as an unordered set of points, ordered by graphs, within a discrete voxel grid, or as range projections. Among these representations, pillars stand out as a distinct type, where each voxel is defined as a vertical column, enabling the reduction of the height dimension. This allows for pillar features to be cast into a 2D pseudo-image, with its height and width defined by the grid size used for the base of the pillars. This dimensionality reduction facilitates the application of 2D network architectures for bird's-eye-view processing.
PointPillars-based [3] networks have proven particularly effective for LiDAR data, balancing performance and runtime efficiently. Consequently, researchers have begun applying the pillar representation to 4D radar data. Currently, further exploration of alternative representation methods besides pillars for 4D radar data remains limited.

Palffy et al. [2] established a baseline by benchmarking PointPillars on their View-of-Delft dataset, adapting only the parameters of the pillar grid to match radar sensor specifications. Recognizing the sparsity inherent in 4D radar data, subsequent work aims to maximize information utilization through parallel branches or multi-scale fusion techniques. SMURF [4] introduces a parallel learnable branch to the pillar representation, integrating kernel density estimation. MVFAN [5] employs two parallel branches, one for cylindrical projection and the other for the pillar representation, merging features prior to passing them through an encoder-decoder network for detection. SRFF [6] does not use a parallel branch, instead incorporating an attention-based neck to fuse encoder-stage features, arguing that multi-scale fusion improves information extraction from sparse radar data. Further approaches like RC-Fusion [7], LXL [8] and GRC-Net [9] opt to fuse both camera and 4D radar data, taking a dual-modality approach to object detection. CM-FA [10] uses LiDAR data during training, but not during inference.

It is worth noting that the modifications introduced by these methods come at the cost of increased computational load and memory requirements, compromising the real-time advantage associated with the pillar representation. Furthermore, none of these methods fully explore the optimal utilization of radar features themselves. Herein lies untapped potential.

B. Transformers in Point Cloud Perception

The self-attention mechanism [11] dynamically weighs input elements in relation to each other, capturing long-range dependencies and allowing for a global receptive field for feature extraction. Self-attention incorporated in the transformer layer has benefited tasks like natural language processing, computer vision, and speech recognition, achieving state-of-the-art performance across domains. However, applying self-attention to point clouds poses distinct challenges. The computational cost is quadratic, limiting the amount of tokens (context window) and hindering long-range processing compared to convolutional methods. Additionally, the inherent sparsity and varied point distributions complicate logical and geometric ordering, thus impeding the adoption of transformer-based architectures in point cloud processing.

Various strategies have been proposed to address these challenges. Point Transformer [12] utilizes k-nearest-neighbors (KNN) to group points before applying vector attention. However, the neighborhood size is limited, as KNN grouping is also quadratic in terms of memory requirements and complexity. On top of grouping, some approaches reduce the point cloud through pooling [13] or farthest-point-sampling [14], leading to information loss. Others partition the point cloud into groups of equal geometric shape, employing window-based attention [15], [16], [17] or the octree representation [18]. The downside of geometric partitioning is that groups of equal shape will each contain a different amount of points. This is detrimental to parallelization, meaning that such methods are not real-time capable. Despite these efforts, partition-based attention is limited to the local context, with various techniques to facilitate information transfer between these groups, such as changing neighborhood size, downsampling, or shifting windows. The addition of constant shifting and reordering of data leads to further memory inefficiencies and increased latency. In response to these challenges, Flatformer [19] opts for computational efficiency by forming groups of equal size rather than equal geometric shape, sacrificing spatial proximity for better parallelization and memory efficiency. Similarly, SphereFormer [20] voxelizes point clouds based on exponential distance in the spherical coordinate system to achieve higher-density voxel grids. Point Transformer v3 [21] first embeds voxels through sparse convolution and pooling, then orders and partitions the resulting tokens using space-filling curves. Through this, only the last group along the curve needs padding, thereby prioritizing efficiency through pattern-based ordering over spatial ordering or geometric partitioning. These methods often require specialized attention libraries that do not leverage the efficient attention implementations available in standard frameworks.

III. METHOD

The current state-of-the-art in 4D radar object detection predominantly relies on LiDAR-based methods. As a result, there is a noticeable gap in research regarding the comprehensive utilization of velocity information to enhance detection performance. Despite incremental advancements in related works, these improvements often sacrifice efficiency and real-time usability. To address these issues, we delve into optimizing radar features to improve network performance through enhanced input data quality.

While various self-attention variants have been explored in point cloud perception, their restricted receptive fields, in conjunction with the sparsity and irregularity of point clouds, lead to computationally intensive layers. Leveraging the sparsity inherent in 4D radar data, we introduce PillarAttention, a novel self-attention layer providing a global receptive field by treating each pillar as a token. Contrary to existing layers, PillarAttention does not reduce features through tokenization or need complex ordering algorithms. Additionally, we investigate network scaling techniques to further enhance both runtime efficiency and performance in light of the radar data sparsity.

A. 4D Radar Features
The individual points within 4D radar point clouds are characterized by various parameters including range (r), azimuth (α), elevation (θ), RCS reflectivity, and relative radial velocity (vrel). The determination of radial velocity relies on the Doppler effect, reflecting the speed of an object in relation to the sensor's position. When dealing with a non-stationary radar sensor (e.g. mounted on a car), compensating vrel with the ego-motion yields the absolute radial velocity vr. The spherical coordinates (r, α, θ) can be converted into Cartesian coordinates (x, y, z). While these features are akin to LiDAR data, radar's unique capability lies in providing velocity information. Despite the commonality in coordinate systems between radar and LiDAR, radar's inclusion of velocity remains unique and underutilized. Current practices often incorporate velocity information merely as an additional feature within networks. Therefore, our investigation delves into the impact of both relative and absolute radial velocities. Through this analysis, we advocate for the creation of supplementary features derived from radial velocity, enriching the original data points.

Fig. 2: Absolute radial velocity vr compensated with ego motion of 4D radar. As an object moves, vr changes depending on its heading angle to the sensor. The car's actual velocity v remains unknown, as its heading cannot be determined. However, vr can be decomposed into its x and y components to provide additional features. The coordinate system and nomenclature follow the View-of-Delft dataset [2].

First, we explore decomposing vr into its x and y components, resulting in vectors vr,x and vr,y, respectively. This approach similarly applies to vrel. This concept is visualized in Figure 2. The velocity vectors of each point can be decomposed through the following equations. Note that Equation (1) and Equation (2) apply to both vr and vrel in the Cartesian coordinate system, in which arctan(y/x) = β:

v_{r,x} = \cos\left(\arctan\frac{y}{x}\right) \cdot v_r    (1)

v_{r,y} = \sin\left(\arctan\frac{y}{x}\right) \cdot v_r    (2)

Secondly, we construct new features by calculating the offset velocities inside a pillar. For this, we first average the velocities inside a pillar and then subtract this average from the velocity of each point to form an additional offset feature. These new features can be calculated for both radial velocities vrel, vr and their decomposed x, y variants. In later experiments we denote the use of these new offset features with subscript m, for example vr,m when using the offset velocities for vr.

The construction of these additional point features is intended to make it easier for the model to learn dependencies from the data in order to increase performance, in a way which does not influence the runtime of the model beyond its input layer.
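To make the feature construction concrete, the sketch below decomposes the compensated radial velocity of each point into its x and y components and builds the per-pillar offset features described above. It is a minimal NumPy illustration of Equations (1) and (2) under our own naming (decompose_radial_velocity, pillar_offset_velocity, the toy arrays), not the authors' implementation.

```python
import numpy as np

def decompose_radial_velocity(points, v_r):
    """Split each point's radial velocity into x and y components, Eq. (1) and (2).

    points: (N, 3) Cartesian coordinates (x, y, z) of the radar points.
    v_r:    (N,) ego-motion compensated radial velocity per point.
    """
    beta = np.arctan2(points[:, 1], points[:, 0])            # quadrant-aware arctan(y / x)
    return np.stack([np.cos(beta) * v_r, np.sin(beta) * v_r], axis=1)

def pillar_offset_velocity(pillar_ids, velocity):
    """Subtract the per-pillar mean velocity from each point (subscript-m features).

    pillar_ids: (N,) integer pillar index of every point.
    velocity:   (N,) or (N, F) velocity feature, e.g. v_r or its x/y components.
    """
    velocity = np.atleast_2d(velocity.T).T                   # ensure shape (N, F)
    offsets = np.empty_like(velocity, dtype=float)
    for pid in np.unique(pillar_ids):
        mask = pillar_ids == pid
        offsets[mask] = velocity[mask] - velocity[mask].mean(axis=0)
    return offsets

# Toy example: two points falling into the same pillar.
pts = np.array([[10.0, 5.0, 0.2], [10.5, 5.1, 0.3]])
v_r = np.array([3.0, 2.8])
v_r_xy = decompose_radial_velocity(pts, v_r)                 # extra input features v_r,x and v_r,y
v_r_xy_m = pillar_offset_velocity(np.array([0, 0]), v_r_xy)  # offset variant v_r,xy,m
```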
B. PillarAttention

The pillar representation of 4D radar data as a 2D pseudo-image is very sparse, with only a few valid pillars. Due to this sparsity, pillars belonging to the same object are far apart. When processed by a convolutional backbone with a local field of view, this means that early layers cannot capture neighborhood dependencies. This is only achieved with subsequent layers and the resulting increase in the effective receptive field, or by the downsampling between network stages [22], [23]. As such, the aggregation of information belonging to the same object occurs late within the network backbone. However, downsampling can lead to the loss of information critical to small objects. The tokenization and grouping methods of point cloud transformers can have a similar negative effect.

Inspired by self-attention [11], we introduce PillarAttention to globally connect the local features of individual pillars across the entire pillar grid. We achieve this by capitalizing on the inherent sparsity of 4D radar data, treating each pillar as a token, allowing our method to be free of grouping or downsampling methods. PillarAttention diverges from conventional self-attention in the manner in which sparsity is handled. Given the largely empty nature of the pillar grid with size H, W, we employ a sparsity mask to exclusively gather the p occupied pillar features. Subsequently, we learn key (K), query (Q), and value (V) before applying standard self-attention. Conventionally, sparse values are masked during self-attention calculation. In contrast, our approach reduces the spatial complexity and memory requirements for self-attention from (HW)² to p². Nevertheless, it is essential to acknowledge that sparsity, and thus the number of valid pillars, varies between scans. Consequently, the sequence length of tokens fluctuates during both training and inference. Another difference to conventional self-attention is that we did not find the inclusion of position embedding necessary. This can be attributed to the fact that pillar features inherently contain position information derived from the point cloud. Moreover, since pillars are organized within a 2D grid, the order of tokens remains consistent across scans, allowing the model to learn contextual relationships between individual pillars. As such, specialized ordering algorithms such as octrees and space-filling curves are not needed. Also, PillarAttention is not reliant on specialized libraries and benefits from recent developments in the space such as Flash-Attention-2 [24].

We nest PillarAttention inside a transformer layer. This layer is encapsulated by two MLPs which control its hidden dimension E. Following PillarAttention, the transformed features are scattered back into their original pillar positions. The concept of PillarAttention is depicted in Figure 3.
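A minimal PyTorch sketch of the PillarAttention idea is given below: occupied pillars are gathered with a sparsity mask, standard scaled dot-product self-attention runs over the resulting p tokens, and the result is scattered back onto the H×W grid. The single-head formulation, class and tensor names are our assumptions; the layer described above additionally sits inside a transformer block whose FFN and surrounding MLPs (hidden dimension E) are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PillarAttentionSketch(nn.Module):
    """Illustrative single-head self-attention over occupied pillars only."""

    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.q = nn.Linear(channels, hidden)
        self.k = nn.Linear(channels, hidden)
        self.v = nn.Linear(channels, hidden)
        self.out = nn.Linear(hidden, channels)

    def forward(self, grid: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # grid: (H, W, C) pillar pseudo-image, mask: (H, W) True for occupied pillars.
        tokens = grid[mask]                                      # gather: (p, C) valid pillars
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)
        attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (p, p) instead of (HW, HW)
        out = grid.clone()
        out[mask] = self.out(attn @ v)                           # scatter back to grid positions
        return out

# Usage with an assumed 320x320 grid, C = 32 channels and two occupied pillars.
H, W, C = 320, 320, 32
grid = torch.zeros(H, W, C)
mask = torch.zeros(H, W, dtype=torch.bool)
mask[10, 17] = mask[200, 40] = True
grid[mask] = torch.randn(2, C)
updated = PillarAttentionSketch(C, hidden=32)(grid, mask)        # no position embedding used
```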
Fig. 3: Overview of our PillarAttention. We leverage the sparsity of radar point clouds by using a mask to gather features from non-empty pillars, reducing spatial size from H, W to p. Each pillar feature with C channels is treated as a token for the calculation of self-attention. Our PillarAttention is encapsulated in a transformer layer, with the feed-forward network (FFN) consisting of Layer Norm, followed by two MLPs with the GeLU activation between them. The hidden dimension E of PillarAttention is controlled by an MLP before and after the layer. Finally, the pillar features with C channels are scattered back to their original position within the grid. Our PillarAttention does not use position embedding.

C. Architecture and Scaling
Our architecture (see Figure 3) is loosely inspired by PointPillars [3]. Similar to PointPillars, we incorporate offset coordinates xc, yc, zc derived from the pillar center c as additional features within the point cloud. Subsequently, we employ a PointNet [25] layer to transform the point cloud into pillar features, resembling a 2D pseudo-image. These pillar features undergo processing via our novel PillarAttention mechanism, followed by a three-stage encoder. Each encoder stage contains 3x3 2D convolution layers, with the ReLU activation function and batch normalization. The first stage employs three layers, while subsequent stages employ five. Additionally, the initial convolution layers in stages two and three downsample features with a stride of two. The output features of each encoder stage undergo upsampling via transposed 2D convolution before being concatenated. Finally, we employ an SSD [26] detection head to derive predictions from these concatenated features.
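To illustrate this encoder layout, the sketch below stacks the three stages (three, then five, then five 3x3 convolutions with batch normalization and ReLU, with stride-2 downsampling at the start of stages two and three), upsamples each stage output with a transposed convolution and concatenates the results for the detection head. The uniform channel width C = 32 follows the text; the stride of the very first convolution and the upsampling factors are our assumptions, chosen so that a 320x320 pillar grid yields the 160x160 concatenated map reported in Section IV.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride=1):
    # 3x3 convolution followed by batch normalization and ReLU.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class UniformEncoderSketch(nn.Module):
    """Three-stage 2D encoder with a uniform channel width C and concatenated output."""

    def __init__(self, c: int = 32):
        super().__init__()
        self.stage1 = nn.Sequential(conv_block(c, c, stride=2),   # assumed initial stride
                                    *[conv_block(c, c) for _ in range(2)])
        self.stage2 = nn.Sequential(conv_block(c, c, stride=2),
                                    *[conv_block(c, c) for _ in range(4)])
        self.stage3 = nn.Sequential(conv_block(c, c, stride=2),
                                    *[conv_block(c, c) for _ in range(4)])
        # Transposed convolutions bring every stage back to the stage-1 resolution.
        self.up1 = nn.ConvTranspose2d(c, c, kernel_size=1, stride=1)
        self.up2 = nn.ConvTranspose2d(c, c, kernel_size=2, stride=2)
        self.up3 = nn.ConvTranspose2d(c, c, kernel_size=4, stride=4)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        # The concatenated map is what the SSD-style detection head consumes.
        return torch.cat([self.up1(f1), self.up2(f2), self.up3(f3)], dim=1)

out = UniformEncoderSketch(c=32)(torch.randn(1, 32, 320, 320))
print(out.shape)  # torch.Size([1, 96, 160, 160])
```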
The sparsity inherent in 4D radar data can severely impact neural network learning. Previous research [22], [23] has demonstrated in the context of LiDAR perception that sparsity propagates between layers, influencing the expressiveness of individual layers. This diminishes the network's capacity to extract meaningful features from the data, where certain neurons fail to activate due to insufficient input. Consequently, a network may struggle to generalize well to unseen data or exhibit suboptimal performance in tasks such as object detection or classification. Therefore, adapting to data sparsity is crucial for ensuring the robustness and efficiency of neural network-based approaches in 4D radar perception tasks.

In the View-of-Delft dataset, the ratio of LiDAR points to radar points is approximately 98.81. Despite this significant difference, current state-of-the-art 4D radar detection methods employ architectures originally designed for denser LiDAR point clouds. Given the limited points captured by 4D radar, we theorize that networks need less capacity, as only a limited amount of meaningful features can be learned.

We propose a solution by suggesting uniform scaling of neural network encoder stages when transitioning from LiDAR to 4D radar data. In the case of RadarPillars, we use the same amount of channels C in all encoder stages of the architecture. In contrast, networks based on PointPillars double the amount of channels C with each stage. Our approach is expected to enhance both performance through generalization and runtime efficiency.

IV. EVALUATION
We evaluate our network RadarPillars for object detection on 4D radar data on the View-of-Delft (VoD) dataset [2]. As there is no public benchmark or test-split evaluation, we follow established practice and perform all experiments on the validation split. Following VoD, we use the mean Average Precision (mAP) across both the entire sensor area and the driving corridor as metrics. During training, we augment the dataset by randomly flipping and scaling the point cloud. Data is normalized according to the mean and standard deviation. We adopt a OneCycle schedule [27] with a starting learning rate of 0.0003 and a maximum learning rate of 0.003. For loss functions, we utilize Focal Loss [28] for classification, smooth L1 loss for bounding box regression, and cross entropy loss for rotation. Our RadarPillars uses a backbone size of C = 32 for all encoder stages, a hidden dimension of E = 32 for PillarAttention, and vr,x, vr,y as additional features. This puts RadarPillars at only 0.27 M parameters with 1.99 GFLOPS. Our pillar grid size is set to 320 × 320 for 1-, 3- and 5-frame data. We set the concatenated feature size for the detection head to 160 × 160. We implement our network in the OpenPCDet framework [29], training all models on a Nvidia RTX 4070 Ti GPU with a batch size of 8 and float32 data type.
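For orientation, the optimization setup described above can be sketched as follows. This is a simplified stand-alone PyTorch snippet, not the OpenPCDet configuration that was actually used: the placeholder model, the assumed schedule length, the loss weighting and anchor handling are omitted or invented, and torchvision's sigmoid focal loss stands in for the classification loss.

```python
import torch
import torch.nn as nn
from torchvision.ops import sigmoid_focal_loss

model = nn.Linear(8, 8)                       # placeholder for the detection network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
steps_per_epoch, epochs = 1000, 80            # assumed schedule length for illustration
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-3, total_steps=steps_per_epoch * epochs,
    div_factor=10)                            # initial lr = max_lr / 10 = 0.0003

smooth_l1 = nn.SmoothL1Loss()                 # bounding box regression
cross_entropy = nn.CrossEntropyLoss()         # rotation classification

def detection_loss(cls_logits, cls_targets, box_pred, box_target, rot_logits, rot_target):
    # Focal loss for classification, smooth L1 for boxes, cross entropy for rotation.
    cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="mean")
    return cls + smooth_l1(box_pred, box_target) + cross_entropy(rot_logits, rot_target)
```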
Our ablation studies in Sections IV-B, IV-C and IV-D are carried out for 1-frame detection. In each ablation study, we only study the impact of a single method. We cover the combination of our methods to form our final model in Section IV-A.

A. RadarPillars

We present a comprehensive evaluation of our RadarPillars against state-of-the-art networks, detailing results in Table I. Given the nascent stage of 4D radar detection, we establish additional benchmarks by training LiDAR detection networks for 4D radar data: PV-RCNN [32], PV-RCNN++ [33], PillarNet [35], Voxel-RCNN [31], and SECOND [34]. For these networks, we utilize the same settings as Palffy et al. [2] used in their adaptation of PointPillars [3]. Following other work, we evaluate frame rate performance on an Nvidia Tesla V100, Nvidia RTX 3090 and Nvidia AGX Xavier 32GB.

TABLE I: Comparison of RadarPillars to different LiDAR and 4D radar models on the validation split of the View-of-Delft dataset. R and C indicate the 4D radar and camera modalities respectively for both training and inference. (L) indicates LiDAR during training only.
Model | Modality | Entire Area: mAP / Car AP50 / Pedestrian AP25 / Cyclist AP25 | Driving Corridor: mAP / Car AP50 / Pedestrian AP25 / Cyclist AP25 | Frame rate: V100 / RTX 3090 / AGX Xavier (Hz)
1-Frame Data
Point-RCNN* [30] | R | 29.7 / 31.0 / 16.2 / 42.1 | 55.7 / 59.5 / 32.2 / 75.0 | 23.5 / 63.2 / 10.1
Voxel-RCNN* [31] | R | 36.9 / 33.6 / 23.0 / 54.1 | 63.8 / 70.0 / 38.3 / 83.0 | 23.1 / 51.4 / 9.3
PV-RCNN* [32] | R | 43.6 / 39.0 / 32.8 / 59.1 | 64.5 / 71.5 / 43.5 / 78.6 | 15.2 / 34.3 / 4.1
PV-RCNN++* [33] | R | 40.7 / 36.2 / 28.7 / 57.1 | 61.5 / 68.3 / 39.1 / 77.3 | 9.9 / 20.1 / 2.7
SECOND* [34] | R | 33.2 / 32.8 / 22.8 / 44.0 | 56.1 / 69.0 / 33.9 / 65.3 | 34.6 / 88.6 / 11.6
PillarNet* [35] | R | 23.7 / 25.8 / 11.8 / 33.6 | 43.8 / 56.7 / 17.0 / 57.6 | 42.7 / 104.0 / 20.2
PointPillars* [3] | R | 39.5 / 30.2 / 25.6 / 62.8 | 60.9 / 61.5 / 36.8 / 84.5 | 77.0 / 182.3 / 20.6
MVFAN [5] | R | 39.4 / 34.1 / 27.3 / 57.1 | 64.4 / 69.8 / 38.7 / 84.9 | - / - / -
RadarPillars (ours) | R | 46.0 / 36.0 / 35.5 / 66.4 | 67.3 / 69.4 / 47.1 / 85.4 | 86.6 / 184.5 / 34.3
CM-FA [10] | R+(L) | 41.7 / 32.3 / 42.4 / 50.4 | - / - / - / - | - / 23.0 / -
GRC-Net [9] | R+C | 41.1 / 27.9 / 31.0 / 64.6 | - / - / - / - | - / - / -
RC-Fusion [7] | R+C | 49.7 / 41.7 / 39.0 / 68.3 | 69.2 / 71.9 / 47.5 / 88.3 | - / 10.8 / -
LXL [8] | R+C | 56.3 / 42.3 / 49.5 / 77.1 | 72.9 / 72.2 / 58.3 / 88.3 | 6.1 / - / -
RCBEV [36] | R+C | 49.9 / 40.6 / 38.8 / 70.4 | 69.8 / 72.4 / 49.8 / 87.0 | - / 21.0 / -
3-Frame Data
PointPillars* | R | 44.1 / 39.2 / 29.8 / 63.3 | 67.7 / 71.8 / 45.7 / 85.7 | 75.6 / 182.2 / 20.2
RadarPillars (ours) | R | 50.4 / 40.2 / 39.2 / 71.8 | 70.0 / 70.9 / 51.4 / 87.6 | 85.8 / 183.1 / 34.1
5-Frame Data
SRFF [6] | R | 46.2 / 36.7 / 36.8 / 65.0 | 66.9 / 69.1 / 47.2 / 84.3 | - / - / -
SMIFormer [37] | R | 48.7 / 39.5 / 41.8 / 64.9 | 71.1 / 77.04 / 53.4 / 82.9 | - / - / -
SMURF [4] | R | 51.0 / 42.3 / 39.1 / 71.5 | 69.7 / 71.7 / 50.5 / 86.9 | 30.3 / - / -
PointPillars* [3] | R | 46.7 / 38.8 / 34.4 / 66.9 | 67.8 / 71.9 / 45.1 / 88.4 | 78.4 / 178.4 / 20.6
RadarPillars (ours) | R | 50.7 / 41.1 / 38.6 / 72.6 | 70.5 / 71.1 / 52.3 / 87.9 | 82.8 / 179.1 / 34.4
* Re-implemented

Fig. 4: Combination of our proposed methods forming RadarPillars, in comparison to the baseline PointPillars [3]. Results for 1-frame object detection precision for the entire radar area on the View-of-Delft dataset [2]. The frame rate was evaluated on a Nvidia AGX Xavier 32GB. (mAP / frame rate in Hz: Baseline 39.5 / 20.6; + Velocity Components 43.3 / 19.8; + Uniform Scaling 43.8 / 35.2; + PillarAttention 46.0 / 34.3.)

Our comparison highlights the remarkable superiority of our RadarPillars over the current state-of-the-art. These findings firmly establish RadarPillars as a lightweight model with significantly reduced computational demands, outperforming all other 4D radar-only models. While RadarPillars matches SMURF [4] in precision (with a margin of +0.8 for the driving corridor and −0.3 for the entire radar area), its advantage in frame rate is substantial, outperforming SMURF by a factor of 2.73. Considering this difference, SMURF would likely struggle to achieve real-time capabilities on an embedded device such as an Nvidia AGX Xavier, whereas RadarPillars excels in this regard. In the 3-frame and 5-frame settings, RadarPillars performs on par with or better than the state of the art in terms of precision, while exceeding other methods in terms of frame rate. However, accumulating radar frames requires trajectory information. The accumulated data is already preprocessed in the View-of-Delft dataset. In a real-world application, waiting on and processing frames of multiple timesteps before passing them to the network would incur a delay in detection predictions. Such a delay could be detrimental depending on the application, such as reacting to a pedestrian crossing the street. Because of this, the 1-frame setting can be considered more meaningful.

Despite its simplicity compared to complex network architectures, RadarPillars sets a new standard for performance, even surpassing established LiDAR detection networks in both frame rate and precision. Compared to PointPillars, our network showcases a significant improvement in both mAP (+6.5) and frame rate (+13.7 Hz), accompanied by a drastic reduction in parameters (−94.4 %) from 4.84 M to 0.27 M. Furthermore, the computational complexity is reduced by 87.9 %, from 16.46 GFLOPS to 1.99 GFLOPS. These results establish RadarPillars as the new state-of-the-art for 4D radar-only object detection in terms of both performance and run-time.
While they are not directly comparable, RadarPillars achieves competitive results to multi-sensor methods fusing camera and radar data for detection. Interestingly, RadarPillars outperforms the precision of GRC-Net [9] without fusing image data, and of CM-FA [10], which uses LiDAR point clouds for training.

RadarPillars' performance stems from several key design choices, notably the decomposition of the compensated radial velocity vr into its x and y components as additional features, choosing a uniform channel size of C = 32 for all stages of the backbone, and incorporating PillarAttention. Figure 4 illustrates the impact of each method on model performance. Notably, the introduction of x and y components of radial velocity yields a substantial mAP boost of +3.8 without significant runtime overhead. We theorize that this leads to more meaningful point feature encoding before the points are grouped and projected, in turn leading to more meaningful pillar features. Furthermore, downscaling the backbone architecture through uniform scaling significantly enhances frame rate without compromising performance. Finally, PillarAttention contributes an increase in mAP of +1.6 at only a slight runtime increase. We delve into our design choices through the subsequent ablation studies.

B. 4D Radar Features

The results of our proposed construction of additional point features from the radial velocities are shown in Table II. For a description of our methods, please refer to Section III-A. The first finding of note is that the performance of the model is strongly dependent on the compensation of the radial velocity vrel through ego motion (leading to vr). If vr is not used as a feature, the detection precision of the model drops by 7.2. On the other hand, if the relative radial velocity vrel is not used as a feature, the precision of the model only drops by 0.9. This can naturally be explained by the fact that the measured relative radial velocities vrel are dependent on the ego motion of the recording vehicle. As the vehicle's driving velocity changes during a recording, the characteristic velocity profiles of the road users are distorted by their relative measurement to the vehicle velocity. Furthermore, the results show that decomposing the radial velocities vrel, vr into their respective x and y components leads to an increase in performance. The best result is achieved by constructing the x and y components of only the compensated radial velocity vr, which leads to a significant increase in precision of 3.8. Further processing of the velocities in the form of constructing an offset feature (denoted by the subscript m) to the average values within a pillar does not show any clear improvements.

TABLE II: Comparison of the results for the features that are additionally generated from the radial velocities. Feature combinations of x, y, z, RCS with vrel, vr and their decomposed (xy) and offset (m) variants are evaluated by mAP and per-class AP for the entire area and the driving corridor.

C. PillarAttention

We investigate how our PillarAttention layer described in Section III-B affects detection precision. Equal settings are used for all layers for fair comparison. The experimental results are summarized in Table III.

TABLE III: Comparison of different implementations of self-attention on the validation split of the View-of-Delft dataset.
Method | Entire Area: mAP / Car AP50 / Pedestrian AP25 / Cyclist AP25 | Driving Corridor: mAP / Car AP50 / Pedestrian AP25 / Cyclist AP25
None (Baseline) | 39.5 / 30.2 / 25.6 / 62.8 | 60.9 / 61.5 / 36.8 / 84.5
Point-Attention (unmasked) | 40.6 / 36.6 / 25.9 / 59.4 | 62.4 / 68.6 / 36.6 / 81.9
Point-Attention (masked) | 41.6 / 37.8 / 26.7 / 60.4 | 63.4 / 69.6 / 37.4 / 83.1
Pillar-Attention | 42.9 / 38.1 / 28.1 / 62.4 | 64.2 / 68.5 / 40.0 / 84.2
Feature-Attention | 41.3 / 37.7 / 28.2 / 58.1 | 62.5 / 70.5 / 34.2 / 82.9

We first contrast PillarAttention with what we describe as PointAttention. In PointAttention, point features are grouped (but not projected) by their pillar index, with each group zero-padded to a group size of 10. Then, standard self-attention inside a transformer layer is applied to these point features, before pillar-projecting the result as pillar features. To assess the impact of padding, we also train a masked version of PointAttention. In both PointAttention versions, self-attention is computed among all radar points in the point cloud, treating each point as a token, similar to PillarAttention. Thirdly, we compare with implementing self-attention between the concatenated features of all encoder stages, before processing by the detection head, similar in concept to SRFF [6]. In this scenario, the concatenated feature maps are flattened prior to self-attention calculation.

The results of Table III show that PillarAttention leads to the greatest increase in detection precision. Using attention directly on the points is less beneficial for both the masked and unmasked versions of PointAttention. We theorize that a cause of this could be that, while there is some ordering by pillar grouping, the points inside one of these groupings are still unordered. In contrast, PillarAttention has a defined order for every token, while still providing fine-grained detail. This result is shared by the use of late attention, indicating that a global receptive field is advantageous early on for 4D radar data. In a further experiment, we investigate the choice of the hidden dimension E of the PillarAttention layer. The results from Table IV show that the best precision is achieved with an embedding dimension of E = 128 channels.

D. Backbone Scaling
We study the uniform scaling strategy of RadarPillars, setting all three encoder stages to the same amount of channels C. This we compare to the common practice of doubling the amount of channels with each encoder stage, as is the case in PointPillars, leading to a backbone with C, 2C and 4C channels. Experimental results are shown in Table V. Uniform scaling with C = 64 leads to a precision increase of 3.1, outperforming the double-scaling baseline with C = 64, while reducing network parameters by 83.6 %, from 4.84 M to 0.79 M. This also results in reduced computational effort, which is reflected in the frame rates achieved. The increase in precision is consistent across all choices of C, indicating that uniform scaling is superior for 4D radar data. With real-time performance in mind we choose C = 32 for RadarPillars, reducing precision by 0.6, but increasing the frame rate by 15.5 Hz on a Nvidia AGX Xavier 32GB. This further reduces the parameter count to 0.26 M.

TABLE IV: Results for different embedding dimensions E of the PillarAttention module on the validation split of the View-of-Delft dataset.
Dim. | Entire Area: mAP / Car AP50 / Pedestrian AP25 / Cyclist AP25 | Driving Corridor: mAP / Car AP50 / Pedestrian AP25 / Cyclist AP25
E = 16 | 37.6 / 33.3 / 23.5 / 56.0 | 61.6 / 68.6 / 32.0 / 84.1
E = 32 | 39.6 / 36.3 / 23.4 / 59.1 | 62.6 / 69.7 / 34.7 / 83.6
E = 64 | 39.9 / 36.1 / 24.9 / 58.6 | 62.7 / 69.1 / 37.4 / 81.5
E = 128 | 42.9 / 38.1 / 28.1 / 62.4 | 64.2 / 68.5 / 40.0 / 84.2
E = 256 | 39.1 / 33.8 / 25.5 / 58.0 | 60.7 / 67.4 / 35.8 / 78.9
E = 512 | 37.5 / 33.0 / 20.3 / 59.1 | 59.9 / 68.9 / 29.4 / 81.3

TABLE V: We show that the uniform backbone scaling of RadarPillars outperforms traditional double-scaling in terms of precision and frame rate. All of our choices of channels C outperform this double-scaling strategy with C = 64.
Channels | Parameters (M) | Entire Area: mAP / Car AP50 / Pedestrian AP25 / Cyclist AP25 | Driving Corridor: mAP / Car AP50 / Pedestrian AP25 / Cyclist AP25 | Frame rate AGX Xavier (Hz)
(512, 512, 512) | 37.12 | 40.2 / 32.1 / 26.3 / 62.2 | 63.8 / 67.5 / 39.5 / 84.5 | 4.2
(256, 256, 256) | 9.72 | 40.9 / 33.5 / 26.8 / 62.3 | 64.7 / 70.2 / 37.9 / 85.9 | 9.3
(128, 128, 128) | 2.74 | 41.9 / 36.4 / 27.1 / 62.3 | 64.6 / 70.2 / 38.8 / 84.6 | 17.7
(64, 64, 64) | 0.79 | 42.6 / 36.3 / 28.6 / 63.0 | 65.0 / 69.1 / 39.7 / 86.1 | 28.3
(32, 32, 32) | 0.26 | 42.0 / 33.4 / 30.4 / 62.3 | 64.8 / 69.1 / 42.6 / 82.7 | 36.1
(16, 16, 16) | 0.11 | 40.2 / 31.8 / 28.4 / 60.5 | 61.0 / 65.8 / 38.8 / 78.3 | 35.9
Baseline [3] (64, 128, 256) | 4.84 | 39.5 / 30.2 / 25.6 / 62.8 | 60.9 / 61.5 / 36.8 / 84.5 | 20.6

We theorize that this phenomenon stems from the extreme sparsity of radar data, providing only little input for a neural network. As such, the network can only form weak connections during training, leaving most feature maps without impact. To provide additional context to strengthen this assumption, we perform a weight magnitude analysis. For this analysis, we first clip the weight values at a minimum of 0, as ReLU is used as the activation function in RadarPillars. Next, we divide by the maximum weight in the entire layer. This scales the weights of all layers independently of each other into a normalized magnitude range between 0 and 1. We then remove dead weights by using a minimum magnitude threshold of 0.001. The remaining weight magnitudes are plotted in a box plot to enable comparison independent of parameter counts in Figure 5. Outlier weights are not depicted for visual clarity, as these number in the thousands.

Fig. 5: Weight magnitude analysis comparing various channel sizes for uniformly scaling RadarPillars. Results show that the weight strength increases with decreased network size. This visualization excludes dead weights and outliers.
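The normalization used for this analysis can be written compactly. The sketch below clips each layer's weights at zero, scales them by the layer maximum and discards magnitudes below the 0.001 threshold before they are collected for the box plot; the model handling and the plotting itself are left out, and the function name is ours.

```python
import torch

def normalized_weight_magnitudes(model: torch.nn.Module, threshold: float = 1e-3):
    """Collect per-layer normalized weight magnitudes for the box-plot comparison."""
    magnitudes = []
    for module in model.modules():
        weight = getattr(module, "weight", None)
        if weight is None:
            continue
        w = weight.detach().clamp(min=0.0)       # clip at 0 (ReLU-oriented analysis)
        if w.max() == 0:
            continue                             # skip layers without positive weights
        w = w / w.max()                          # normalize each layer independently to [0, 1]
        magnitudes.append(w[w > threshold])      # drop dead weights below the threshold
    return torch.cat(magnitudes) if magnitudes else torch.empty(0)

# Example: compare uniformly scaled backbones (e.g. C = 32 vs. C = 256) by collecting
# their magnitudes with this function and drawing one box plot per model.
```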
The box plot shows that smaller backbones with fewer channels learn stronger connections, offering a possible explanation as to why a reduced parameter count is so beneficial. In conclusion, adapting LiDAR networks requires the downscaling of their backbones to adapt to the sparsity of the 4D radar data, as shown by the effectiveness of RadarPillars. In preliminary investigations we also tried removing an encoder stage or adding an additional stage; however, both were detrimental to performance and using three encoder stages was optimal for precision.

V. CONCLUSION

This work introduces RadarPillars, our novel approach for object detection utilizing 4D radar data. As a lightweight network of only 0.27 M parameters and 1.99 GFLOPS, our RadarPillars establishes a new benchmark in terms of detection performance while enabling real-time capabilities, thus significantly outperforming the current state-of-the-art. We investigate the optimal utilization of radar velocity to offer enhanced context for the network. Additionally, we introduce PillarAttention, a pioneering layer that treats each pillar as a token, while still ensuring efficiency. We demonstrate the benefits of uniformly scaled networks for both detection performance and real-time inference. Leveraging RadarPillars as a foundation, our future efforts will focus on enhancing runtime by optimizing the backbone and exploring anchorless detection heads. Another avenue of research involves investigating end-to-end object detection using transformer layers with PillarAttention exclusively, or adapting promising LiDAR methods [38], [39] to benefit radar. Lastly, we propose the potential extension of RadarPillars to other sensor data modalities, such as depth sensing or LiDAR.

ACKNOWLEDGMENT

This research was partially funded by the Federal Ministry of Education and Research Germany in the project PreciRaSe (01IS23023B).
REFERENCES

[1] M. Fürst, O. Wasenmüller, and D. Stricker, "Lrpd: Long range 3d pedestrian detection leveraging specific strengths of lidar and rgb," in Intelligent Transportation Systems Conference (ITSC), 2020.
[2] A. Palffy, E. Pool, S. Baratam, J. F. Kooij, and D. M. Gavrila, "Multi-class road user detection with 3+1d radar in the view-of-delft dataset," Robotics and Automation Letters (RA-L), 2022.
[3] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "Pointpillars: Fast encoders for object detection from point clouds," in Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[4] J. Liu, Q. Zhao, W. Xiong, T. Huang, Q.-L. Han, and B. Zhu, "Smurf: Spatial multi-representation fusion for 3d object detection with 4d imaging radar," Transactions on Intelligent Vehicles (T-IV), 2023.
[5] Q. Yan and Y. Wang, "Mvfan: Multi-view feature assisted network for 4d radar object detection," in International Conference on Neural Information Processing (ICONIP), 2023.
[6] L. Ruddat, L. Reichardt, N. Ebert, and O. Wasenmüller, "Sparsity-robust feature fusion for vulnerable road-user detection with 4d radar," Applied Sciences, 2024.
[7] L. Zheng, S. Li, B. Tan, L. Yang, S. Chen, L. Huang, J. Bai, X. Zhu, and Z. Ma, "Rcfusion: Fusing 4d radar and camera with bird's-eye view features for 3d object detection," Transactions on Instrumentation and Measurement (TIM), 2023.
[8] W. Xiong, J. Liu, T. Huang, Q.-L. Han, Y. Xia, and B. Zhu, "Lxl: Lidar excluded lean 3d object detection with 4d imaging radar and camera fusion," Transactions on Intelligent Vehicles (T-IV), 2023.
[9] L. Fan, C. Zeng, Y. Li, X. Wang, and D. Cao, "Grc-net: Fusing gat-based 4d radar and camera for 3d object detection," SAE Technical Paper, Tech. Rep., 2023.
[10] J. Deng, G. Chan, H. Zhong, and C. X. Lu, "Robust 3d object detection from lidar-radar point clouds via cross-modal feature augmentation," International Conference on Robotics and Automation (ICRA), 2024.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems (NeurIPS), 2017.
[12] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, "Point transformer," in International Conference on Computer Vision (ICCV), 2021.
[13] X. Wu, Y. Lao, L. Jiang, X. Liu, and H. Zhao, "Point transformer v2: Grouped vector attention and partition-based pooling," Advances in Neural Information Processing Systems (NeurIPS), 2022.
[14] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, "Point-bert: Pre-training 3d point cloud transformers with masked point modeling," in Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[15] Y.-Q. Yang, Y.-X. Guo, J.-Y. Xiong, Y. Liu, H. Pan, P.-S. Wang, X. Tong, and B. Guo, "Swin3d: A pretrained transformer backbone for 3d indoor scene understanding," arXiv preprint arXiv:2304.06906, 2023.
[16] P. Sun, M. Tan, W. Wang, C. Liu, F. Xia, Z. Leng, and D. Anguelov, "Swformer: Sparse window transformer for 3d object detection in point clouds," in European Conference on Computer Vision (ECCV), 2022.
[17] X. Lai, J. Liu, L. Jiang, L. Wang, H. Zhao, S. Liu, X. Qi, and J. Jia, "Stratified transformer for 3d point cloud segmentation," in Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[18] P.-S. Wang, "Octformer: Octree-based transformers for 3d point clouds," ACM Transactions on Graphics (TOG), 2023.
[19] Z. Liu, X. Yang, H. Tang, S. Yang, and S. Han, "Flatformer: Flattened window attention for efficient point cloud transformer," in Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[20] X. Lai, Y. Chen, F. Lu, J. Liu, and J. Jia, "Spherical transformer for lidar-based 3d recognition," in Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[21] X. Wu, L. Jiang, P.-S. Wang, Z. Liu, X. Liu, Y. Qiao, W. Ouyang, T. He, and H. Zhao, "Point transformer v3: Simpler, faster, stronger," in Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[22] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, "Sparsity invariant cnns," in International Conference on 3D Vision (3DV), 2017.
[23] L. Reichardt, P. Mangat, and O. Wasenmüller, "Dvmn: Dense validity mask network for depth completion," in Intelligent Transportation Systems Conference (ITSC), 2021.
[24] T. Dao, "Flashattention-2: Faster attention with better parallelism and work partitioning," in International Conference on Learning Representations (ICLR), 2024.
[25] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "Pointnet: Deep learning on point sets for 3d classification and segmentation," in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "Ssd: Single shot multibox detector," in European Conference on Computer Vision (ECCV), 2016.
[27] L. N. Smith and N. Topin, "Super-convergence: Very fast training of neural networks using large learning rates," in Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, 2019.
[28] T.-Y. Ross and G. Dollár, "Focal loss for dense object detection," in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[29] O. D. Team, "Openpcdet: An open-source toolbox for 3d object detection from point clouds," https://github.com/open-mmlab/OpenPCDet, 2020.
[30] S. Shi, X. Wang, and H. Li, "Pointrcnn: 3d object proposal generation and detection from point cloud," in Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[31] J. Deng, S. Shi, P. Li, W. Zhou, Y. Zhang, and H. Li, "Voxel r-cnn: Towards high performance voxel-based 3d object detection," in Conference on Artificial Intelligence, 2021.
[32] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, "Pv-rcnn: Point-voxel feature set abstraction for 3d object detection," in Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[33] S. Shi, L. Jiang, J. Deng, Z. Wang, C. Guo, J. Shi, X. Wang, and H. Li, "Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection," International Journal of Computer Vision (IJCV), 2023.
[34] Y. Yan, Y. Mao, and B. Li, "Second: Sparsely embedded convolutional detection," Sensors, 2018.
[35] G. Shi, R. Li, and C. Ma, "Pillarnet: Real-time and high-performance pillar-based 3d object detection," in European Conference on Computer Vision (ECCV), 2022.
[36] Z. Lin, Z. Liu, Z. Xia, X. Wang, Y. Wang, S. Qi, Y. Dong, N. Dong, L. Zhang, and C. Zhu, "Rcbevdet: Radar-camera fusion in bird's eye view for 3d object detection," in Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[37] W. Shi, Z. Zhu, K. Zhang, H. Chen, Z. Yu, and Y. Zhu, "Smiformer: Learning spatial feature representation for 3d object detection from 4d imaging radar via multi-view interactive transformers," Sensors, 2023.
[38] L. Reichardt, N. Ebert, and O. Wasenmüller, "360° from a single camera: A few-shot approach for lidar segmentation," in International Conference on Computer Vision (ICCV), 2023.
[39] L. Reichardt, L. Uhr, and O. Wasenmüller, "Text3daug – prompted instance augmentation for lidar perception," in International Conference on Intelligent Robots and Systems (IROS), 2024.
