RadarPillars: Efficient Object Detection From 4D Radar Point Clouds

1 Mannheim University of Applied Sciences, Germany

Abstract— Deep learning methods for 3D object detection, originally developed for LiDAR data, are often applied to 4D radar point clouds. However, this neglects the special characteristics of 4D radar data, such as the extreme sparsity and the optimal utilization of velocity information. To address these gaps in the state-of-the-art, we present RadarPillars, a pillar-based object detection network. By decomposing radial velocity data, introducing PillarAttention for efficient feature extraction, and studying layer scaling to accommodate radar sparsity, RadarPillars significantly outperforms state-of-the-art detection results on the View-of-Delft dataset. Importantly, this comes at a significantly reduced parameter count, surpassing existing methods in terms of efficiency and enabling real-time performance on edge devices.

Fig. 1: Example of our RadarPillars detection results on 4D radar. Cars are marked in red, pedestrians in green and cyclists in blue. The radial velocities of the points are indicated by arrows.

I. INTRODUCTION

In the context of autonomy and automotive applications, radar stands out as a pivotal sensing technology, enabling vehicles to detect objects and obstacles in their surroundings. This capability is crucial for ensuring the safety and efficiency of various autonomous driving functionalities, including collision avoidance, adaptive cruise control, and lane-keeping assistance. Recent advancements in radar technology have led to the development of 4D radar, incorporating three spatial dimensions along with an additional dimension for Doppler velocity. Unlike traditional radar systems, 4D radar introduces elevation information as its third dimension. This enhancement allows for the representation of radar data in 3D point clouds, akin to those generated by LiDAR or depth sensing cameras, thereby enabling the application of deep learning methodologies previously reserved for such sensors.

However, while deep learning techniques from the domain of LiDAR detection have been adapted to 4D radar data, they have not fully explored or adapted to its unique features. Compared to LiDAR data, 4D radar data is significantly less abundant. Regardless of this sparsity, radar uniquely provides velocity as a feature, which could help in the detection of moving objects in various scenarios, such as at long range where LiDAR traditionally struggles [1]. In the View-of-Delft dataset, an average 4D radar scan comprises only 216 points, while a LiDAR scan within the same field of view contains 21,344 points [2]. In response, we propose our RadarPillars, a novel 3D detection network tailored specifically for 4D radar data. Through RadarPillars we address gaps in the current state-of-the-art with the following contributions, significantly improving performance while maintaining real-time capabilities:

• Enhancement of velocity information utilization: We decompose radial velocity data, providing additional features to significantly enhance network performance.
• Adapting to radar sparsity: RadarPillars leverages the pillar representation [3] for efficient real-time processing. We capitalize on the sparsity inherent in 4D radar data and introduce PillarAttention, a novel self-attention layer treating each pillar as a token, while maintaining both efficiency and real-time performance.
• Scaling for sparse radar data: We demonstrate that the sparsity of radar data can lead to less informative features in the detection network. Through uniform network scaling, we not only improve performance but also significantly reduce parameter count, enhancing runtime efficiency.

II. RELATED WORK

A. 4D Radar Object Detection

Point clouds can be processed in various ways: as an unordered set of points, ordered by graphs, within a discrete voxel grid, or as range projections. Among these representations, pillars stand out as a distinct type, where each voxel is defined as a vertical column, enabling the reduction of the height dimension. This allows for pillar features to be cast into a 2D pseudo-image, with its height and width defined by the grid size used for the base of the pillars. This dimensionality reduction facilitates the application of 2D network architectures for bird's-eye-view processing.
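As a rough illustration of this representation, the sketch below (PyTorch) bins a point cloud into vertical columns on an x/y grid and collects the occupied pillars; the grid extents, cell size, and point-cloud shape are placeholder assumptions rather than values from any of the cited methods.

```python
# Illustrative sketch (not from the paper): how points are binned into pillars.
# `points` is an (N, F) tensor whose first two channels are x and y in meters.
import torch

def pillarize(points, x_range=(0.0, 51.2), y_range=(-25.6, 25.6), cell=0.16):
    """Assign every point to a pillar (vertical column) on a 2D x/y grid."""
    W = int((x_range[1] - x_range[0]) / cell)   # grid width  (pseudo-image W)
    H = int((y_range[1] - y_range[0]) / cell)   # grid height (pseudo-image H)

    ix = ((points[:, 0] - x_range[0]) / cell).long().clamp(0, W - 1)
    iy = ((points[:, 1] - y_range[0]) / cell).long().clamp(0, H - 1)
    pillar_id = iy * W + ix                      # flat index into the H*W grid

    # For sparse 4D radar scans only a handful of the H*W pillars are occupied.
    occupied, inverse = torch.unique(pillar_id, return_inverse=True)
    return occupied, inverse, (H, W)

# Toy point cloud with 216 points, roughly the average VoD radar scan size.
points = torch.rand(216, 5) * torch.tensor([50.0, 40.0, 5.0, 1.0, 1.0])
points[:, 1] -= 20.0                             # center y roughly around 0
occupied, inverse, (H, W) = pillarize(points)
print(f"{len(occupied)} occupied pillars out of {H * W}")
```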
PointPillars-based [3] networks have proven particularly effective for LiDAR data, balancing performance and runtime efficiently. Consequently, researchers have begun applying the pillar representation to 4D radar data. Currently, further exploration of alternative representation methods besides pillars for 4D radar data remains limited.

Palffy et al. [2] established a baseline by benchmarking PointPillars on their View-of-Delft dataset, adapting only the parameters of the pillar grid to match radar sensor specifications. Recognizing the sparsity inherent in 4D radar data, subsequent work aims to maximize information utilization through parallel branches or multi-scale fusion techniques. SMURF [4] introduces a parallel learnable branch to the pillar representation, integrating kernel density estimation. MVFAN [5] employs two parallel branches — one for cylindrical projection and the other for the pillar representation — merging features prior to passing them through an encoder-decoder network for detection. SRFF [6] does not use a parallel branch, instead incorporating an attention-based neck to fuse encoder-stage features, arguing that multi-scale fusion improves information extraction from sparse radar data. Further approaches like RC-Fusion [7], LXL [8] and GRC-Net [9] opt to fuse both camera and 4D radar data, taking a dual-modality approach to object detection. CM-FA [10] uses LiDAR data during training, but not during inference.

It is worth noting that the modifications introduced by these methods come at the cost of increased computational load and memory requirements, compromising the real-time advantage associated with the pillar representation. Furthermore, none of these methods fully explore the optimal utilization of radar features themselves. Herein lies untapped potential.

B. Transformers in Point Cloud Perception

The self-attention mechanism [11] dynamically weighs input elements in relation to each other, capturing long-range dependencies and allowing for a global receptive field for feature extraction. Self-attention incorporated in the transformer layer has benefited tasks like natural language processing, computer vision, and speech recognition, achieving state-of-the-art performance across domains. However, applying self-attention to point clouds poses distinct challenges. The computational cost is quadratic, limiting the number of tokens (context window) and hindering long-range processing compared to convolutional methods. Additionally, the inherent sparsity and varied point distributions complicate logical and geometric ordering, thus impeding the adoption of transformer-based architectures in point cloud processing.

Various strategies have been proposed to address these challenges. Point Transformer [12] utilizes k-nearest-neighbors (KNN) to group points before applying vector attention. However, the neighborhood size is limited, as KNN grouping is also quadratic in terms of memory requirements and complexity. On top of grouping, some approaches reduce the point cloud through pooling [13] or farthest-point-sampling [14], leading to information loss.

Others partition the point cloud into groups of equal geometric shape, employing window-based attention [15], [16], [17] or the octree representation [18]. The downside of geometric partitioning is that groups of equal shape each contain a different number of points. This is detrimental to parallelization, meaning that such methods are not real-time capable. Despite these efforts, partition-based attention is limited to the local context, with various techniques to facilitate information transfer between these groups, such as changing neighborhood size, downsampling, or shifting windows. The addition of constant shifting and reordering of data leads to further memory inefficiencies and increased latency. In response to these challenges, Flatformer [19] opts for computational efficiency by forming groups of equal size rather than equal geometric shape, sacrificing spatial proximity for better parallelization and memory efficiency. Similarly, SphereFormer [20] voxelizes point clouds based on exponential distance in the spherical coordinate system to achieve higher density voxel grids. Point Transformer v3 [21] first embeds voxels through sparse convolution and pooling, then orders and partitions the resulting tokens using space-filling curves. Through this, only the last group along the curve needs padding, thereby prioritizing efficiency through pattern-based ordering over spatial ordering or geometric partitioning.

These methods often require specialized attention libraries that do not leverage the efficient attention implementations available in standard frameworks.

III. METHOD

The current state-of-the-art in 4D radar object detection predominantly relies on LiDAR-based methods. As a result, there is a noticeable gap in research regarding the comprehensive utilization of velocity information to enhance detection performance. Despite incremental advancements in related works, these improvements often sacrifice efficiency and real-time usability. To address these issues, we delve into optimizing radar features to improve network performance through enhanced input data quality.

While various self-attention variants have been explored in point cloud perception, their restricted receptive fields, in conjunction with the sparsity and irregularity of point clouds, lead to computationally intensive layers. Leveraging the sparsity inherent in 4D radar data, we introduce PillarAttention, a novel self-attention layer providing a global receptive field by treating each pillar as a token. Contrary to existing layers, PillarAttention does not reduce features through tokenization or need complex ordering algorithms. Additionally, we investigate network scaling techniques to further enhance both runtime efficiency and performance in light of the radar data sparsity.
A. 4D Radar Features

The individual points within 4D radar point clouds are characterized by various parameters including range (r), azimuth (α), elevation (θ), RCS reflectivity, and relative radial velocity (vrel). The determination of radial velocity relies on the Doppler effect, reflecting the speed of an object in relation to the sensor's position. When dealing with a non-stationary radar sensor (e.g. mounted on a car), compensating vrel with the ego-motion yields the absolute radial velocity vr. The spherical coordinates (r, α, θ) can be converted into Cartesian coordinates (x, y, z). While these features are akin to LiDAR data, radar's unique capability lies in providing velocity information. Despite the commonality in coordinate systems between radar and LiDAR, radar's inclusion of velocity remains unique and underutilized. Current practices often incorporate velocity information merely as an additional feature within networks. Therefore, our investigation delves into the impact of both relative and absolute radial velocities. Through this analysis, we advocate for the creation of supplementary features derived from radial velocity, enriching the original data points.

Fig. 2: Absolute radial velocity vr compensated with the ego motion of the 4D radar. As an object moves, vr changes depending on its heading angle to the sensor. The car's actual velocity v remains unknown, as its heading cannot be determined. However, vr can be decomposed into its x and y components to provide additional features. The coordinate system and nomenclature follow the View-of-Delft dataset [2].

First, we explore decomposing vr into its x and y components, resulting in vectors vr,x and vr,y, respectively. This approach similarly applies to vrel. This concept is visualized in Figure 2. The velocity vectors of each point can be decomposed through the following equations. Note that Equation (1) and Equation (2) apply to both vr and vrel in the Cartesian coordinate system, in which arctan(y/x) = β.

vr,x = cos(arctan(y/x)) · vr    (1)

vr,y = sin(arctan(y/x)) · vr    (2)

Secondly, we construct new features by calculating the offset velocities inside a pillar. For this, we first average the velocities inside a pillar and then subtract this average from the velocity of each point to form an additional offset feature. These new features can be calculated for both radial velocities vrel, vr and their decomposed x, y variants. In later experiments we denote the use of these new offset features with the subscript m, for example vr,m when using the offset velocities for vr.

The construction of these additional point features is intended to make it easier for the model to learn dependencies from the data in order to increase performance in a way which does not influence the runtime of the model, beyond its input layer.
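As a concrete illustration of both feature constructions, the hedged PyTorch sketch below decomposes the radial velocity with Equations (1) and (2) and computes the per-pillar mean-offset features denoted by the subscript m. The tensor names, the flat pillar index, and the toy values are illustrative and not taken from the RadarPillars code.

```python
# Hedged sketch of the velocity features; `points` hold x, y and a radial
# velocity per point, `pillar_idx` maps each point to its (flat) pillar.
import torch

def decompose_radial_velocity(x, y, v_r):
    """Eq. (1)/(2): split v_r into x and y components via beta = arctan(y/x)."""
    # atan2 handles x = 0 and negative x; for the forward-facing radar field
    # of view (x > 0) this matches arctan(y/x) from the equations above.
    beta = torch.atan2(y, x)
    return torch.cos(beta) * v_r, torch.sin(beta) * v_r

def pillar_offset_feature(values, pillar_idx, num_pillars):
    """Subtract the per-pillar mean from each point's value (subscript-m features)."""
    sums = torch.zeros(num_pillars).index_add_(0, pillar_idx, values)
    counts = torch.zeros(num_pillars).index_add_(0, pillar_idx, torch.ones_like(values))
    means = sums / counts.clamp(min=1)
    return values - means[pillar_idx]

# Toy example: 6 points spread over 3 occupied pillars.
x   = torch.tensor([5.0, 5.1, 10.0, 10.2, 20.0, 20.1])
y   = torch.tensor([1.0, 0.9, -2.0, -2.1,  0.5,  0.4])
v_r = torch.tensor([3.0, 2.8,  0.0,  0.1, -1.0, -1.2])
pillar_idx = torch.tensor([0, 0, 1, 1, 2, 2])

v_rx, v_ry = decompose_radial_velocity(x, y, v_r)       # vr,x and vr,y features
v_rm = pillar_offset_feature(v_r, pillar_idx, num_pillars=3)   # vr,m feature
```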
B. PillarAttention

The pillar representation of 4D radar data as a 2D pseudo-image is very sparse, with only a few valid pillars. Due to this sparsity, pillars belonging to the same object are far apart. When processed by a convolutional backbone with a local field of view, this means that early layers cannot capture neighborhood dependencies. This is only achieved with subsequent layers and the resulting increase in the effective receptive field, or by the downsampling between network stages [22], [23]. As such, the aggregation of information belonging to the same object occurs late within the network backbone. However, downsampling can lead to the loss of information critical to small objects. The tokenization and grouping methods of point cloud transformers can have a similar negative effect.

Inspired by self-attention [11], we introduce PillarAttention to globally connect the local features of individual pillars across the entire pillar grid. We achieve this by capitalizing on the inherent sparsity of 4D radar data, treating each pillar as a token, allowing our method to be free of grouping or downsampling methods. PillarAttention diverges from conventional self-attention in the manner in which sparsity is handled. Given the largely empty nature of the pillar grid with size H, W, we employ a sparsity mask to exclusively gather the p occupied pillar features. Subsequently, we learn key (K), query (Q), and value (V) before applying standard self-attention. Conventionally, sparse values are masked during self-attention calculation. In contrast, our approach reduces the spatial complexity and memory requirements for self-attention from (HW)^2 to p^2. Nevertheless, it is essential to acknowledge that sparsity, and thus the number of valid pillars, varies between scans. Consequently, the sequence length of tokens fluctuates during both training and inference. Another difference to conventional self-attention is that we did not find the inclusion of position embedding necessary. This can be attributed to the fact that pillar features inherently contain position information derived from point clouds. Moreover, since pillars are organized within a 2D grid, the order of tokens remains consistent across scans, allowing the model to learn contextual relationships between individual pillars. As such, the use of specialized algorithms for ordering such as octrees and space-filling curves is not needed. Also, PillarAttention is not reliant on specialized libraries and benefits from recent developments in the space such as Flash-Attention-2 [24].

Next, we embed PillarAttention inside a transformer layer. This layer is encapsulated by two MLPs which control its hidden dimension E. Following PillarAttention, the transformed features are scattered back into their original pillar positions. The concept of PillarAttention is depicted in Figure 3.

Fig. 3: Overview of our PillarAttention. We leverage the sparsity of radar point clouds by using a mask to gather features from non-empty pillars, reducing the spatial size from H, W to p. Each pillar feature with C channels is treated as a token for the calculation of self-attention. Our PillarAttention is encapsulated in a transformer layer, with the feed-forward network (FFN) consisting of Layer Norm, followed by two MLPs with the GeLU activation between them. The hidden dimension E of PillarAttention is controlled by an MLP before and after the layer. Finally, the pillar features with C channels are scattered back to their original position within the grid. Our PillarAttention does not use position embedding.
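The PyTorch sketch below is one way to assemble a PillarAttention-style transformer layer from the blocks in Figure 3: a mask gathers the p occupied pillar tokens, an MLP maps the C channels to the hidden dimension E, standard scaled dot-product attention is applied without position embeddings, the LayerNorm/GeLU feed-forward network follows, and a final MLP maps back to C channels before scattering into the grid. This is an illustrative reading, not the authors' implementation; the residual connections, the single attention head, and all module names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PillarAttentionSketch(nn.Module):
    """Rough sketch of a PillarAttention-style transformer layer (not the paper's code)."""
    def __init__(self, c: int, e: int):
        super().__init__()
        self.pre = nn.Linear(c, e)            # MLP controlling hidden dimension E
        self.qkv = nn.Linear(e, 3 * e)        # learn Q, K, V from pillar tokens
        self.ffn = nn.Sequential(             # FFN: LayerNorm, MLP, GeLU, MLP
            nn.LayerNorm(e), nn.Linear(e, e), nn.GELU(), nn.Linear(e, e))
        self.post = nn.Linear(e, c)           # back to C channels for the grid

    def forward(self, pseudo_image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # pseudo_image: (C, H, W) pillar features; mask: (H, W) True for occupied pillars.
        tokens = pseudo_image.permute(1, 2, 0)[mask]          # (p, C): one token per pillar
        x = self.pre(tokens)                                  # (p, E)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = F.scaled_dot_product_attention(q[None], k[None], v[None])[0]  # (p, E)
        x = x + attn                                          # residual around attention (assumed)
        x = x + self.ffn(x)                                   # residual around the FFN (assumed)
        out = pseudo_image.permute(1, 2, 0).clone()           # scatter back into the grid
        out[mask] = self.post(x)
        return out.permute(2, 0, 1)                           # (C, H, W)

# Toy usage: a 320x320 grid with roughly 216 occupied pillars and C = E = 32.
grid = torch.zeros(32, 320, 320)
mask = torch.zeros(320, 320, dtype=torch.bool)
mask[torch.randint(0, 320, (216,)), torch.randint(0, 320, (216,))] = True
grid[:, mask] = torch.randn(32, int(mask.sum()))
out = PillarAttentionSketch(c=32, e=32)(grid, mask)
```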
C. Architecture and Scaling

Our architecture (see Figure 3) is loosely inspired by PointPillars [3]. Similar to PointPillars, we incorporate offset coordinates xc, yc, zc derived from the pillar center c as additional features within the point cloud. Subsequently, we employ a PointNet [25] layer to transform the point cloud into pillar features, resembling a 2D pseudo-image. These pillar features undergo processing via our novel PillarAttention mechanism, followed by a three-stage encoder. Each encoder stage contains 3x3 2D convolution layers, with the ReLU activation function and batch normalization. The first stage employs three layers, while subsequent stages employ five. Additionally, the initial convolution layers in stages two and three downsample features with a stride of two. The output features of each encoder stage undergo upsampling via transposed 2D convolution before being concatenated. Finally, we employ an SSD [26] detection head to derive predictions from these concatenated features.

The sparsity inherent in 4D radar data can severely impact neural network learning. Previous research [22], [23] has demonstrated in the context of LiDAR perception that sparsity propagates between layers, influencing the expressiveness of individual layers. This diminishes the network's capacity to extract meaningful features from the data, where certain neurons fail to activate due to insufficient input. Consequently, a network may struggle to generalize well to unseen data or exhibit suboptimal performance in tasks such as object detection or classification. Therefore, adapting to data sparsity is crucial for ensuring the robustness and efficiency of neural network-based approaches in 4D radar perception tasks.

In the View-of-Delft dataset, the ratio of LiDAR points to radar points is approximately 98.81. Despite this significant difference, current state-of-the-art 4D radar detection methods employ architectures originally designed for denser LiDAR point clouds. Given the limited points captured by 4D radar, we theorize that networks need less capacity, as only a limited amount of meaningful features can be learned.

We propose a solution by suggesting uniform scaling of neural network encoder stages when transitioning from LiDAR to 4D radar data. In the case of RadarPillars, we use the same number of channels C in all encoder stages of the architecture. In contrast, networks based on PointPillars double the number of channels C with each stage. Our approach is expected to enhance both performance through generalization and runtime efficiency.
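The sketch below illustrates this scaling choice: it builds the three-stage convolutional encoder described above from a tuple of per-stage channel counts, so a uniformly scaled variant such as (32, 32, 32) and a PointPillars-style doubling variant such as (64, 128, 256) differ only in that tuple. Layer counts, strides, and the use of batch normalization and ReLU follow the description in the text; the function names and input channel count are assumptions.

```python
import torch
import torch.nn as nn

def conv_block(c_in: int, c_out: int, stride: int = 1) -> nn.Sequential:
    """3x3 convolution with batch normalization and ReLU, as used in each stage."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

def build_encoder(in_channels: int, stage_channels: tuple) -> nn.ModuleList:
    """Three stages: 3 conv layers in stage one, 5 in stages two and three,
    with the first layer of stages two and three downsampling by stride 2."""
    stages, c_prev = [], in_channels
    for i, c in enumerate(stage_channels):
        n_layers = 3 if i == 0 else 5
        stride = 1 if i == 0 else 2
        layers = [conv_block(c_prev, c, stride)]
        layers += [conv_block(c, c) for _ in range(n_layers - 1)]
        stages.append(nn.Sequential(*layers))
        c_prev = c
    return nn.ModuleList(stages)

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

uniform = build_encoder(32, (32, 32, 32))       # RadarPillars-style uniform scaling
doubling = build_encoder(32, (64, 128, 256))    # PointPillars-style channel doubling
print(count_params(uniform), "vs", count_params(doubling), "encoder parameters")
```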
IV. EVALUATION
TABLE I: Comparison of RadarPillars to different LiDAR and 4D radar models on the validation split of the View-of-Delft dataset. R and C indicate the 4D radar and camera modalities respectively for both training and inference. (L) indicates LiDAR during training only. The Entire Area and Driving Corridor columns list mAP / AP50 Car / AP25 Pedestrian / AP25 Cyclist; frame rates are given in Hz on the V100 / RTX 3090 / AGX Xavier.

Model | Mod. | Entire Area | Driving Corridor | Hz
1-Frame Data
Point-RCNN* [30] | R | 29.7 / 31.0 / 16.2 / 42.1 | 55.7 / 59.5 / 32.2 / 75.0 | 23.5 / 63.2 / 10.1
Voxel-RCNN* [31] | R | 36.9 / 33.6 / 23.0 / 54.1 | 63.8 / 70.0 / 38.3 / 83.0 | 23.1 / 51.4 / 9.3
PV-RCNN* [32] | R | 43.6 / 39.0 / 32.8 / 59.1 | 64.5 / 71.5 / 43.5 / 78.6 | 15.2 / 34.3 / 4.1
PV-RCNN++* [33] | R | 40.7 / 36.2 / 28.7 / 57.1 | 61.5 / 68.3 / 39.1 / 77.3 | 9.9 / 20.1 / 2.7
SECOND* [34] | R | 33.2 / 32.8 / 22.8 / 44.0 | 56.1 / 69.0 / 33.9 / 65.3 | 34.6 / 88.6 / 11.6
PillarNet* [35] | R | 23.7 / 25.8 / 11.8 / 33.6 | 43.8 / 56.7 / 17.0 / 57.6 | 42.7 / 104.0 / 20.2
PointPillars* [3] | R | 39.5 / 30.2 / 25.6 / 62.8 | 60.9 / 61.5 / 36.8 / 84.5 | 77.0 / 182.3 / 20.6
MVFAN [5] | R | 39.4 / 34.1 / 27.3 / 57.1 | 64.4 / 69.8 / 38.7 / 84.9 | - / - / -
RadarPillars (ours) | R | 46.0 / 36.0 / 35.5 / 66.4 | 67.3 / 69.4 / 47.1 / 85.4 | 86.6 / 184.5 / 34.3
CM-FA [10] | R+(L) | 41.7 / 32.3 / 42.4 / 50.4 | - / - / - / - | - / 23.0 / -
GRC-Net [9] | R+C | 41.1 / 27.9 / 31.0 / 64.6 | - / - / - / - | - / - / -
RC-Fusion [7] | R+C | 49.7 / 41.7 / 39.0 / 68.3 | 69.2 / 71.9 / 47.5 / 88.3 | - / 10.8 / -
LXL [8] | R+C | 56.3 / 42.3 / 49.5 / 77.1 | 72.9 / 72.2 / 58.3 / 88.3 | 6.1 / - / -
RCBEV [36] | R+C | 49.9 / 40.6 / 38.8 / 70.4 | 69.8 / 72.4 / 49.8 / 87.0 | - / 21.0 / -
3-Frame Data
PointPillars* | R | 44.1 / 39.2 / 29.8 / 63.3 | 67.7 / 71.8 / 45.7 / 85.7 | 75.6 / 182.2 / 20.2
RadarPillars (ours) | R | 50.4 / 40.2 / 39.2 / 71.8 | 70.0 / 70.9 / 51.4 / 87.6 | 85.8 / 183.1 / 34.1
5-Frame Data
SRFF [6] | R | 46.2 / 36.7 / 36.8 / 65.0 | 66.9 / 69.1 / 47.2 / 84.3 | - / - / -
SMIFormer [37] | R | 48.7 / 39.5 / 41.8 / 64.9 | 71.1 / 77.04 / 53.4 / 82.9 | - / - / -
SMURF [4] | R | 51.0 / 42.3 / 39.1 / 71.5 | 69.7 / 71.7 / 50.5 / 86.9 | 30.3 / - / -
PointPillars* [3] | R | 46.7 / 38.8 / 34.4 / 66.9 | 67.8 / 71.9 / 45.1 / 88.4 | 78.4 / 178.4 / 20.6
RadarPillars (ours) | R | 50.7 / 41.1 / 38.6 / 72.6 | 70.5 / 71.1 / 52.3 / 87.9 | 82.8 / 179.1 / 34.4
* Re-implemented

Fig. 4: Combination of our proposed methods forming RadarPillars, in comparison to the baseline PointPillars [3]. Results for 1-frame object detection precision for the entire radar area on the View-of-Delft dataset [2]. The frame rate was evaluated on a Nvidia AGX Xavier 32GB.

Method | mAP (entire area) | Frame rate (Hz)
Baseline | 39.5 | 20.6
+ Velocity Components | 43.3 | 19.8
+ Uniform Scaling | 43.8 | 35.2
+ PillarAttention | 46.0 | 34.3
We evaluate our network RadarPillars for object detection on 4D radar data on the View-of-Delft (VoD) dataset [2]. As there is no public benchmark or test-split evaluation, we follow established practice and perform all experiments on the validation split. Following VoD, we use the mean Average Precision (mAP) across both the entire sensor area and the driving corridor as metrics. During training, we augment the dataset by randomly flipping and scaling the point cloud. Data is normalized according to the mean and standard deviation. We adopt a OneCycle schedule [27] with a starting learning rate of 0.0003 and a maximum learning rate of 0.003. For loss functions, we utilize Focal Loss [28] for classification, smooth L1-Loss for bounding box regression, and Cross Entropy loss for rotation. Our RadarPillars use a backbone size of C = 32 for all encoder stages, a hidden dimension of E = 32 for PillarAttention, and vr,x, vr,y as additional features. This puts RadarPillars at only 0.27 M parameters with 1.99 GFLOPS. Our pillar grid size is set to 320 × 320 for 1-, 3- and 5-frame data. We set the concatenated feature size for the detection head to 160 × 160. We implement our network in the OpenPCDet framework [29], training all models on an Nvidia RTX 4070 Ti GPU with a batch size of 8 and float32 data type.
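For reference, a minimal PyTorch sketch of the optimization setup described above: a OneCycle schedule rising from the starting learning rate of 0.0003 to the maximum of 0.003, with focal loss for classification, smooth L1 loss for box regression, and cross entropy for rotation. The optimizer, schedule length, and equal loss weighting are assumptions, as they are not stated here.

```python
import torch
import torchvision

# Assumed training skeleton, not the OpenPCDet configuration used in the paper.
model = torch.nn.Linear(32, 8)                        # stand-in for RadarPillars
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)   # optimizer is an assumption
epochs, steps_per_epoch = 80, 1000                    # placeholder schedule length
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-3, epochs=epochs, steps_per_epoch=steps_per_epoch)
# scheduler.step() would be called once per optimizer step during training.

def detection_losses(cls_logits, cls_targets, box_preds, box_targets,
                     rot_logits, rot_targets):
    """Classification: focal loss; boxes: smooth L1; rotation bins: cross entropy."""
    cls_loss = torchvision.ops.sigmoid_focal_loss(
        cls_logits, cls_targets, reduction="mean")
    box_loss = torch.nn.functional.smooth_l1_loss(box_preds, box_targets)
    rot_loss = torch.nn.functional.cross_entropy(rot_logits, rot_targets)
    return cls_loss + box_loss + rot_loss             # equal weighting assumed
```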
Our ablation studies in Sections IV-B, IV-C and IV-D are carried out for 1-frame detection. In each ablation study, we only study the impact of a single method. We cover the combination of our methods to form our final model in Section IV-A.

A. RadarPillars

We present a comprehensive evaluation of our RadarPillars against state-of-the-art networks, detailing results in Table I. Given the nascent stage of 4D radar detection, we establish additional benchmarks by training LiDAR detection networks for 4D radar data: PV-RCNN [32], PV-RCNN++ [33], PillarNet [35], Voxel-RCNN [31], and SECOND [34]. For these networks, we utilize the same settings as Palffy et al. [2] used in their adaptation of PointPillars [3]. Following other work, we evaluate frame rate performance on an Nvidia Tesla V100, Nvidia RTX 3090 and Nvidia AGX Xavier 32GB.

Our comparison highlights the remarkable superiority of our RadarPillars over the current state-of-the-art. These findings firmly establish RadarPillars as a lightweight model with significantly reduced computational demands, outperforming all other 4D radar-only models. While RadarPillars matches SMURF [4] in precision (with a margin of +0.8 for the driving corridor and −0.3 for the entire radar area), its advantage in frame rate is substantial, outperforming SMURF by a factor of 2.73. Considering this difference, SMURF would likely struggle to achieve real-time capabilities on an embedded device such as an Nvidia AGX Xavier, whereas RadarPillars excels in this regard. In the 3-frame and 5-frame settings, RadarPillars performs on par with or better than the state of the art in terms of precision, while exceeding other methods in terms of frame rate. However, accumulating radar frames requires trajectory information. The accumulated data is already preprocessed in the View-of-Delft dataset. In a real-world application, waiting on and processing frames of multiple timesteps before passing them to the network would incur a delay in detection predictions. Such a delay could be detrimental depending on the application, such as reacting to a pedestrian crossing the street. Because of this, the 1-frame setting can be considered more meaningful. Despite its simplicity compared to complex network architectures, RadarPillars sets a new standard for performance, even surpassing established LiDAR detection networks in both frame rate and precision. Compared to PointPillars, our network showcases a significant improvement in both mAP (+6.5) and frame rate (+13.7 Hz), accompanied by a drastic reduction in parameters (−94.4 %) from 4.84 M to 0.27 M. Furthermore, the computational complexity is reduced by −87.9 %, from 16.46 GFLOPS to 1.99 GFLOPS. These results establish RadarPillars as the new state-of-the-art for 4D radar-only object detection in terms of both performance and run-time. While they are not directly com-
TABLE II: Comparison of the results for the features that are additionally generated from the radial velocities. Feature sets compared (on top of x, y, z, RCS): vrel, vr, vrel,xy, vr,xy, vrel,m, vr,m, vrel,xy,m, vr,xy,m; results are reported for the Entire Area and the Driving Corridor (mAP / AP50 Car / AP25 Pedestrian / AP25 Cyclist).

TABLE III: Comparison of different implementations of self-attention on the validation split of the View-of-Delft dataset.

Dim. | Entire Area (mAP / AP50 Car / AP25 Ped. / AP25 Cyc.) | Driving Corridor (mAP / AP50 Car / AP25 Ped. / AP25 Cyc.)
E = 16 | 37.6 / 33.3 / 23.5 / 56.0 | 61.6 / 68.6 / 32.0 / 84.1
E = 32 | 39.6 / 36.3 / 23.4 / 59.1 | 62.6 / 69.7 / 34.7 / 83.6
E = 64 | 39.9 / 36.1 / 24.9 / 58.6 | 62.7 / 69.1 / 37.4 / 81.5
E = 128 | 42.9 / 38.1 / 28.1 / 62.4 | 64.2 / 68.5 / 40.0 / 84.2
E = 256 | 39.1 / 33.8 / 25.5 / 58.0 | 60.7 / 67.4 / 35.8 / 78.9
E = 512 | 37.5 / 33.0 / 20.3 / 59.1 | 59.9 / 68.9 / 29.4 / 81.3

Encoder channel scaling — Channels | Parameters (M) | Entire Area (mAP / AP50 Car / AP25 Ped. / AP25 Cyc.) | Driving Corridor (mAP / AP50 Car / AP25 Ped. / AP25 Cyc.) | F.Rate (Hz, AGX Xavier)
(512, 512, 512) | 37.12 | 40.2 / 32.1 / 26.3 / 62.2 | 63.8 / 67.5 / 39.5 / 84.5 | 4.2
(256, 256, 256) | 9.72 | 40.9 / 33.5 / 26.8 / 62.3 | 64.7 / 70.2 / 37.9 / 85.9 | 9.3
(128, 128, 128) | 2.74 | 41.9 / 36.4 / 27.1 / 62.3 | 64.6 / 70.2 / 38.8 / 84.6 | 17.7
(64, 64, 64) | 0.79 | 42.6 / 36.3 / 28.6 / 63.0 | 65.0 / 69.1 / 39.7 / 86.1 | 28.3
(32, 32, 32) | 0.26 | 42.0 / 33.4 / 30.4 / 62.3 | 64.8 / 69.1 / 42.6 / 82.7 | 36.1
(16, 16, 16) | 0.11 | 40.2 / 31.8 / 28.4 / 60.5 | 61.0 / 65.8 / 38.8 / 78.3 | 35.9
Baseline [3] (64, 128, 256) | 4.84 | 39.5 / 30.2 / 25.6 / 62.8 | 60.9 / 61.5 / 36.8 / 84.5 | 20.6