DENSER: 3D Gaussians Splatting For Scene Reconstruction of Dynamic Urban Environments
Fig. 2. DENSER Scene Composition Framework. The pipeline starts by processing raw sensor data to obtain a densified point cloud for each foreground object in its reference frame and for the static background. Object point clouds are used to initialize the 3D Gaussians of dynamic objects, for which wavelets are used to estimate their color appearance. The background point cloud initializes the 3D Gaussians of the static background, with appearance modelled using a traditional SH basis. All 3D Gaussians form a scene graph which can be jointly rendered for a novel view.
simply transform $G_i^O$ to the world frame by applying a homogeneous transformation using $T_i^W(t)$ as follows:

$$G_i^W(t) = T_i^W(t) \otimes G_i^O \qquad (5)$$

The set of all Gaussians used for rendering can be obtained by concatenating the sets of Gaussians of the static background node and of the transformed dynamic object nodes,

$$G^W = \bigcup_{j=0}^{m} G_j^W, \quad \forall j = 0, 1, 2, \ldots, m, \qquad (6)$$

where $j = 0$ represents the background, i.e. $G_b^W = G_0^W$, and the remaining sets of Gaussians are those of the dynamic object nodes.
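To make this composition step concrete, the sketch below shows one way the world-frame Gaussian set of Eqs. (5) and (6) could be assembled. It is a minimal illustration under assumed data layouts (dictionaries of means and rotation matrices; opacities, scales, and appearance coefficients would simply carry over unchanged), and the function names are not taken from the released implementation.

```python
# Minimal sketch of Eqs. (5)-(6): move each object's Gaussians into the world frame
# and concatenate them with the static background (assumed data layout, not the
# authors' code). Opacities, scales, and appearance coefficients carry over unchanged.
import numpy as np

def transform_gaussians(means, rotations, T_obj_to_world):
    """Apply a 4x4 homogeneous transform to object-frame Gaussian means and rotations."""
    R, t = T_obj_to_world[:3, :3], T_obj_to_world[:3, 3]
    return means @ R.T + t, R @ rotations   # (N, 3) centers, (N, 3, 3) orientations

def compose_scene(background, objects, poses_at_t):
    """Eq. (6): union of background Gaussians and all dynamic objects at time t."""
    means, rots = [background["means"]], [background["rotations"]]
    for obj, T in zip(objects, poses_at_t):          # one pose T_j^W(t) per tracked object
        m_w, r_w = transform_gaussians(obj["means"], obj["rotations"], T)
        means.append(m_w)
        rots.append(r_w)
    return np.concatenate(means, axis=0), np.concatenate(rots, axis=0)
```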
C. Scene Decomposition

This paper improves existing 3DGS composite scene reconstruction by enhancing the appearance modeling of transient objects, resulting in a more realistic and consistent scene representation. The input to DENSER is a sequence of $n$ frames. Each frame $F_i$ is defined in terms of a set of $m$ tracked objects, a sensor pose $T_i$, a LIDAR point cloud $P_i$, a set of camera images $I_i$, and optionally a depth map $D_i$, $\forall i \in \{1, 2, \ldots, n\}$. Each object $j$ in frame $i$, $O_i^j$, is defined by a bounding box, a tracking identifier, and an object class, $\forall j \in \{1, 2, \ldots, m\}$. Based on these inputs, DENSER starts by accumulating point clouds over all frames in the world frame $W$, using the object bounding boxes to filter out the points corresponding to foreground objects. The resulting point cloud $P_b^W$ is used to initialize the 3D Gaussians of the background $G_b^W$, i.e. the positions $\mu_b$, opacities $\alpha_b$, and covariances $\Sigma_b$ with the corresponding rotations $R_b$ and scales $S_b$, as described in (2) and in a manner similar to [6]. In addition, each Gaussian of the background is assigned a set of SH coefficients $H_b = \{h_{uv}^b \mid 0 \le u \le U, -u \le v \le V\}$, where $U$ and $V$ are defined by the order of the SH basis defining the view-dependent color $Y_{uv}^b(\theta, \phi)$, with $\theta$ and $\phi$ defining the viewing direction.

While the original 3DGS has been shown to represent static scenes efficiently, it struggles to represent scenes that include dynamic entities and varying appearances [7]. Representing the appearance of transient objects solely with SH coefficients tends to be insufficient [15]. This arises mainly from the sensitivity of SH to changes in the position of the objects in the scene and the associated changes in shadows and lighting induced by these motions. To maintain a consistent visual appearance, DENSER addresses this challenge in two ways. (i) We densify the object point clouds across all frames, which not only provides a strong prior for initializing the 3D Gaussians but also mitigates the pose calibration errors and noisy measurements inherent in the datasets. Using the sensor pose transformation matrix $T_j$ and the LIDAR point cloud $P_i$, one can apply an ROI filter defined by the bounding box of object $O_j$ to obtain the point cloud $P_i^j$ of object $j$ at frame $i$. Concatenating across all frames results in the densified point cloud $P_j^d$ used for initialization, as sketched below.
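A minimal sketch of this densification step is shown below. The per-frame data layout (world-frame sensor poses; boxes given by a center, a size, and a box-to-world rotation) and the function names are illustrative assumptions, not the released code.

```python
# Illustrative sketch (assumed data layout): accumulate per-object LIDAR points
# across frames into a densified cloud expressed in the object's reference frame.
import numpy as np

def box_roi_filter(points_world, box_center, box_size, box_rotation):
    """Keep points inside an oriented 3D bounding box; box_rotation maps box -> world."""
    local = (points_world - box_center) @ box_rotation      # world -> box frame
    inside = np.all(np.abs(local) <= box_size / 2.0, axis=1)
    return points_world[inside]

def densify_object_cloud(frames, obj_id):
    """Concatenate the ROI-filtered points of one tracked object over all frames."""
    chunks = []
    for f in frames:
        T = f["T_sensor_to_world"]                           # 4x4 sensor pose
        pts_world = f["lidar"] @ T[:3, :3].T + T[:3, 3]      # sensor -> world frame
        box = f["boxes"][obj_id]                             # per-frame center, size, rotation
        pts = box_roi_filter(pts_world, box["center"], box["size"], box["rotation"])
        chunks.append((pts - box["center"]) @ box["rotation"])  # world -> object frame
    return np.concatenate(chunks, axis=0)
```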
(ii) We use a time-dependent approximation of the SH bases to capture the varying appearance of dynamic objects, using an orthonormal basis of wavelets whose scale and translation parameters are optimizable. In DENSER, the Ricker wavelet is used,

$$\psi(t) = \frac{2}{\sqrt{3a}\,\pi^{1/4}} \left(1 - \frac{\tau^2}{a^2}\right) \exp\left(-\frac{\tau^2}{2a^2}\right), \qquad (7)$$

where $a$ is its scale parameter and $\tau = t - b$, with $b$ its translation parameter. The SH basis function $Y_{uv}^j(\theta, \phi)$ for object $j$ is approximated using a linear combination of child wavelets,

$$Y_{uv}^j(t) = \sum_{i=1}^{d} w_i\, \psi(t, a_i, b_i), \qquad (8)$$

where $d$ is the dimension of the wavelet basis and the weights $w_i$ are also optimizable parameters. Unlike the truncated Fourier transform used in [15], wavelets are known to capture higher-frequency content even with a finite basis dimension, resulting in significantly better performance in capturing dynamic object details as well as varying appearances. Both (i) and (ii) constitute the genuine contributions of the present paper.
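As an illustration of Eqs. (7) and (8), the sketch below implements a Ricker wavelet with learnable scales and translations and uses it to produce time-dependent SH coefficients for one object. The tensor shapes, the module name, and the default $d = 7$ (the value found best in the ablation of Section IV-D) are assumptions made for the sketch, not the authors' exact implementation.

```python
# Illustrative sketch of Eqs. (7)-(8): learnable Ricker-wavelet basis modulating
# per-object SH coefficients over time (hypothetical shapes and module name).
import torch

def ricker(t, a, b):
    """Ricker (Mexican-hat) wavelet with scale a and translation b, Eq. (7)."""
    tau = t - b
    norm = 2.0 / (torch.sqrt(3.0 * a) * torch.pi ** 0.25)
    return norm * (1.0 - tau**2 / a**2) * torch.exp(-tau**2 / (2.0 * a**2))

class WaveletSH(torch.nn.Module):
    """Time-dependent SH coefficients built from d child wavelets, Eq. (8)."""
    def __init__(self, n_sh_coeffs, d=7):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(n_sh_coeffs, d) * 0.01)  # weights w_i
        self.a = torch.nn.Parameter(torch.ones(d))                       # scales a_i
        self.b = torch.nn.Parameter(torch.linspace(0.0, 1.0, d))         # translations b_i

    def forward(self, t):
        basis = ricker(t, self.a, self.b)   # (d,) wavelet responses at time t
        return self.w @ basis               # (n_sh_coeffs,) coefficients at time t
```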
D. Optimization

To optimize our scene, we employ a composite loss function $\mathcal{L}$ defined as

$$\mathcal{L} = \mathcal{L}_{color} + \mathcal{L}_{depth} + \mathcal{L}_{accum}, \qquad (9)$$

where $\mathcal{L}_{color}$ is the reconstruction loss that ensures the predicted image $I_{pred}$ closely matches the ground-truth (GT) image $I_{gt}$. It combines an L1 loss and a Structural Similarity Index (SSIM) loss. The L1 loss is given by $\mathcal{L}_1 = \| I_{gt} - I_{pred} \|_1$ and the SSIM loss by $\mathcal{L}_{SSIM} = 1 - \mathrm{SSIM}(I_{gt}, I_{pred})$, which quantifies the similarity between two images, taking into account changes in luminance, contrast, and structure; SSIM is more sensitive to structural information. The total color loss is defined as $\mathcal{L}_{color} = (1 - \lambda_c)\,\mathcal{L}_1 + \lambda_c\,\mathcal{L}_{SSIM}$, where $\lambda_c$ is a parameter that encourages structural alignment between $I_{gt}$ and $I_{pred}$ [6]. $\mathcal{L}_{depth}$ is the mono-depth loss, which ensures that the predicted depth maps are consistent with the observed depth information and helps maintain the geometric consistency of the scene. It is computed as the L1 loss between the predicted depth $D_{pred}$ and the ground-truth depth $D_{gt}$, i.e. $\mathcal{L}_{depth} = \lambda_d \| D_{gt} - D_{pred} \|_1$. Finally, $\mathcal{L}_{accum}$ is the accumulation loss, which penalizes the deviation of accumulated object occupancy probabilities from the desired distribution. Specifically, it is an entropy-based loss that encourages balanced occupancy probabilities for each object, $\mathcal{L}_{accum} = -\left(\beta \log(\beta) + (1 - \beta)\log(1 - \beta)\right)$, where $\beta$ represents the object occupancy probability. This composite loss facilitates the simultaneous optimization of appearance, geometry, and occupancy probabilities, ensuring a coherent and realistic reconstruction of the scene.
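A hedged sketch of the composite loss in Eq. (9) is given below. The loss weights lambda_c and lambda_d, the use of per-pixel means for the L1 terms, and the externally supplied SSIM function are assumptions made for illustration; only the structure of the three terms follows the text.

```python
# Illustrative sketch (not the official implementation) of the composite loss, Eq. (9).
# lambda_c and lambda_d are the weighting parameters described in the text; their
# values here are placeholders.
import torch

def composite_loss(img_pred, img_gt, depth_pred, depth_gt, beta,
                   lambda_c=0.2, lambda_d=0.1, ssim_fn=None):
    # Color loss: (1 - lambda_c) * L1 + lambda_c * (1 - SSIM)
    l1 = torch.abs(img_gt - img_pred).mean()
    ssim_val = ssim_fn(img_pred, img_gt) if ssim_fn is not None else torch.tensor(1.0)
    l_color = (1.0 - lambda_c) * l1 + lambda_c * (1.0 - ssim_val)

    # Mono-depth consistency loss
    l_depth = lambda_d * torch.abs(depth_gt - depth_pred).mean()

    # Entropy-based accumulation loss on object occupancy probabilities beta in (0, 1)
    eps = 1e-6
    beta = beta.clamp(eps, 1.0 - eps)
    l_accum = -(beta * torch.log(beta) + (1.0 - beta) * torch.log(1.0 - beta)).mean()

    return l_color + l_depth + l_accum
```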
IV. EXPERIMENTS AND RESULTS

A. Dataset and Baselines

We conduct a comprehensive evaluation of DENSER for reconstructing dynamic scenes on the KITTI dataset [30], one of the standard benchmarks for scene reconstruction in urban environments. Data frames in KITTI are recorded at 10 Hz. We follow the same settings and evaluation methods used in NSG [17], MARS [14], and StreetGaussians [15], which constitute the recent methods we use as baselines for quantitative and qualitative comparisons.

B. Implementation Details

The training setup for our scene reconstruction uses the Adam optimizer across all parameters, with 30K iterations. The learning rate for the wavelet scale and translation parameters is set to $r = 0.001$ with $\epsilon = 1 \times 10^{-15}$. All experiments are conducted on an NVIDIA Tesla V100-SXM2-16GB GPU. In our comparative analysis, we observed that NSG [17] and MARS [14] trained their models for 200K and 350K iterations, respectively, while Street Gaussian [15] reported training for 30K iterations. To determine the optimal training regimen, we tested all of these configurations and found that the improvement in reconstruction quality was negligible beyond 30K iterations, with a gain of only about 0.2 in PSNR when extending from 30K to 350K iterations. Given the minimal improvement and the significant increase in training time, extending to 350K iterations was not justifiable: training for 30K iterations takes approximately 30 minutes, whereas 350K iterations would require around 5 hours.
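The stated optimizer settings could be wired up as in the sketch below. The parameter grouping and the learning rate for the remaining Gaussian parameters are placeholders; only the choice of Adam, the 0.001 learning rate and 1e-15 epsilon for the wavelet scale/translation parameters, and the 30K-iteration budget come from the text.

```python
# Sketch of the stated training configuration (Adam, 30K iterations); the parameter
# grouping and the non-wavelet learning rate are illustrative assumptions.
import torch

def build_optimizer(wavelet_params, other_params, other_lr=1.6e-4):
    return torch.optim.Adam(
        [
            {"params": wavelet_params, "lr": 1e-3, "eps": 1e-15},  # values from the paper
            {"params": other_params, "lr": other_lr},              # assumed placeholder
        ]
    )

NUM_ITERATIONS = 30_000  # training length reported in the paper
```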
C. Results and Evaluation

We conduct qualitative and quantitative comparisons against other state-of-the-art methods. These include NSG [17], which represents the background as multi-plane images and uses per-object learned latent codes with a shared decoder to model moving objects; MARS [14], which builds the scene graph based on Nerfstudio [31]; 3D Gaussians [6], which models the scene with a set of anisotropic Gaussians; and StreetGaussian [15], which represents the scene as composite 3D Gaussians for the foreground and background. Since we strictly followed the same procedure and settings used in MARS and StreetGaussians (SG), we directly use the metrics reported in their respective papers for comparison against our method. The rendering image resolution is 1242×375. Table I presents the quantitative comparison of our method with the baseline methods; our approach significantly outperforms previous methods. In the image reconstruction setting, the training and testing image sets are identical, whereas in novel view synthesis we render frames that are not included in the training data. Specifically, we hold out one in every four frames for the 75% split, one in every two frames for the 50% split, and use only every fourth frame for training in the 25% split, resulting in 75%, 50%, and 25% of the data being used for training, respectively. We adopt PSNR, SSIM, and LPIPS as the metrics to evaluate rendering quality. Our model achieves the best performance across all metrics: the results show significant improvements in Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS), as detailed in Table I, indicating that DENSER performs exceptionally well in reconstructing dynamic scenes compared with the baseline methods. The improvements in PSNR and SSIM highlight the effectiveness of our wavelet-based approach in maintaining high fidelity and structural integrity in complex environments. Furthermore, DENSER is capable of reconstructing fine details, e.g. the shadow at the back of the truck in Scene 0006, as shown in Fig. 3, while the baseline methods are not.
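As a reference for how these image-quality metrics can be computed, a hedged example using the torchmetrics package is shown below. The evaluation scripts of DENSER are not specified here, so the library choice and settings (e.g., the AlexNet LPIPS backbone) are assumptions.

```python
# Hedged example: computing PSNR, SSIM, and LPIPS with torchmetrics (library choice
# and settings are assumptions, not taken from the paper). Images are (N, 3, H, W)
# tensors with values in [0, 1].
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

def evaluate(pred, gt):
    """Return the three rendering-quality metrics for a batch of images."""
    return {
        "PSNR": psnr(pred, gt).item(),
        "SSIM": ssim(pred, gt).item(),
        "LPIPS": lpips(pred, gt).item(),
    }
```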
D. Ablation on the Dimension of Wavelet Basis

We conducted an ablation study to analyse the impact of the size of the wavelet basis, i.e. the number of wavelets used to approximate the SH functions. We ran our experiments while incrementing the dimension of the wavelet basis and analysing the impact on the performance metrics used for evaluation (PSNR↑, SSIM↑, LPIPS↓), in order to obtain the dimension giving the best performance. As shown in Fig. 4, the performance increases gradually up to 7 wavelets and starts to degrade gradually beyond that.
TABLE I
Quantitative results on KITTI [30] comparing our approach with baseline methods, MARS [14], SG [15], NSG [17], and 3DGS [6]
[Figure: qualitative comparison; rows labeled GT, NSG, MARS, SG, Ours]
E. Scene Editing Applications

DENSER enables photorealistic scene editing, such as swapping, translating, and rotating vehicles, to create diverse and realistic scenarios. This versatility allows autonomous systems to improve their performance and their ability to handle complex real-world conditions, from routine traffic to critical situations.
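Because objects live in their own reference frames within the scene graph, such an edit amounts to changing an object's pose (or swapping its Gaussian set) before composition. The short sketch below illustrates this idea; it reuses the hypothetical compose_scene helper introduced earlier and assumes 4x4 pose matrices.

```python
# Illustrative sketch: edit a scene by rotating and translating one vehicle's pose
# before re-composing the scene graph (compose_scene is the hypothetical helper above).
import numpy as np

def rotate_translate(T_obj_to_world, yaw_rad, offset_xyz):
    """Yaw the object about its own vertical axis, then shift it in the world frame."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    yaw = np.eye(4)
    yaw[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    edited = T_obj_to_world @ yaw                  # rotate in the object's local frame
    edited[:3, 3] += np.asarray(offset_xyz)        # translate in the world frame
    return edited

# e.g. rotate one tracked vehicle by 30 degrees and move it roughly one lane sideways,
# then render the composed scene as before:
# poses_at_t[k] = rotate_translate(poses_at_t[k], np.deg2rad(30.0), [0.0, 3.5, 0.0])
# means_w, rots_w = compose_scene(background, objects, poses_at_t)
```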
Fig. 4. Ablation: Impact of the dimension of wavelet basis on the
performance of scene reconstruction
Fig. 7. Rotation and Translation: The top row displays GT, illustrating
the original positions and orientations of the vehicles. In the middle and
bottom left images, the vehicles have been rotated. In the middle and
bottom right images, the vehicle has been both rotated and translated
to another lane.
Fig. 5. Object Removal: The top row shows the GT while the bottom
row displays the modified scenes where the bus has been removed.