
DENSER: 3D Gaussians Splatting for Scene Reconstruction of Dynamic Urban Environments

Mahmud A. Mohamad, Gamal Elghazaly, Arthur Hubert, and Raphael Frank

Mahmud A. Mohamad, Gamal Elghazaly, Arthur Hubert and Raphael Frank are with SnT - Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, 29 Avenue John F. Kennedy, L-1855 Luxembourg, Luxembourg. Email: {mahmud.ali, gamal.elghazaly, arthur.hubert, raphael.frank}@uni.lu

arXiv:2409.10041v1 [cs.CV] 16 Sep 2024

Abstract— This paper presents DENSER, an efficient and effective approach leveraging 3D Gaussian splatting (3DGS) for the reconstruction of dynamic urban environments. While several methods for photorealistic scene representations, both implicitly using neural radiance fields (NeRF) and explicitly using 3DGS, have shown promising results in scene reconstruction of relatively complex dynamic scenes, modeling the dynamic appearance of foreground objects tends to be challenging, limiting the applicability of these methods to capture subtleties and details of the scenes, especially for far dynamic objects. To this end, we propose DENSER, a framework that significantly enhances the representation of dynamic objects and accurately models the appearance of dynamic objects in the driving scene. Instead of directly using Spherical Harmonics (SH) to model the appearance of dynamic objects, we introduce and integrate a new method aiming at dynamically estimating SH bases using wavelets, resulting in a better representation of dynamic object appearance in both space and time. Besides object appearance, DENSER enhances object shape representation through densification of its point cloud across multiple scene frames, resulting in faster convergence of model training. Extensive evaluations on the KITTI dataset show that the proposed approach significantly outperforms state-of-the-art methods by a wide margin. Source code and models will be uploaded to this repository: https://github.com/sntubix/denser

Fig. 1. Scene decomposition using DENSER into static background and dynamic objects, and reconstruction: (a) Ground truth, (b) scene decomposition: static background, (c) scene decomposition: dynamic objects, (d) scene reconstruction.

I. INTRODUCTION

Modeling dynamic 3D urban environments from images has a wide range of important applications, including building city-scale digital twins and simulation environments that can significantly reduce the training and testing costs of autonomous driving systems. These applications demand efficient and high-fidelity 3D representation of the road environment from captured data and the ability to render high-quality novel views in real time. Simulation is crucial for developing and refining autonomous driving functions by providing a controlled, safe, and cost-effective testing environment. While traditional simulation tools like CARLA [1], LGSVL [2], and DeepDrive [3] have accelerated autonomous driving development, they all share a common limitation: a large sim-to-reality gap [4]. This gap is induced by limitations in asset modelling and rendering that hinder the ability of model-based simulation tools to fully replicate the complexities of the real world.

To close this gap, new data-driven and photorealistic techniques based on NeRFs [5] and 3DGS [6] have shown significant capabilities for 3D scene reconstruction with visually and geometrically realistic fidelity. While NeRFs and 3DGS excel in static and small-scale scene reconstruction, reconstructing highly dynamic and complex large urban scenes remains a significant challenge.

NeRFs [5] and 3DGS [6] are two distinct approaches that have shown ground-breaking scene representation results, enabling photorealistic rendering and synthesis of novel views of the 3D scene. While NeRFs implicitly encode a neural representation of the radiance field and density of the 3D scene, 3DGS explicitly represents the scene using a large set of anisotropic 3D Gaussians with associated color and opacity features. This explicit representation allows 3DGS to train and render faster than NeRFs, thanks to parallel rasterization on GPUs. Despite the significant potential of both NeRFs and 3DGS in static scene representation, their performance deteriorates considerably in dynamic scenes involving moving transient objects or when faced with changing conditions such as weather, exposure, and varying lighting [7], [8]. Numerous works have already attempted to address this challenge. Early approaches disregarded dynamic objects and focused solely on reconstructing the static components of the scene [9], [7], [8], [10]; rendered views from these approaches typically suffer from artefacts induced by transient objects. Two different approaches for dynamic scene representation have shown initial but promising results. The first represents the scene as a combination of static and time-varying radiance fields [11], [12], [13]. In the second approach, a graph is used to represent the scene: its nodes represent the static background and dynamic foreground objects, while edges maintain the relationships between static and dynamic entities needed for scene composition
over time [14], [15], [16], [17]. However, most of these scene graph-based approaches do not, or only insufficiently, consider the appearance of dynamic objects over time. This paper proposes DENSER, a scene graph-based framework that significantly enhances the representation of dynamic objects and accurately models the appearance of dynamic objects in the driving scene (Fig. 1). Instead of directly using Spherical Harmonics (SH) to model the appearance of dynamic objects, we introduce and integrate a new method aiming at dynamically estimating SH bases using wavelets, resulting in a better representation of dynamic object appearance in both space and time. Our proposed method achieves superior scene decomposition on the KITTI dataset.

The rest of this paper is organized as follows. Section II provides a review of related work in 3D scene reconstruction. Section III presents the proposed methodology. Section IV presents experimental results, demonstrating the effectiveness of our approach on the KITTI dataset. Finally, Section V concludes this paper.

II. RELATED WORK

Dynamic scene representation has seen remarkable progress, especially in the domain of 4D neural scene representations focusing on scenes of a single dynamic object, where time is considered as an additional dimension besides the spatial ones [18], [19], [20], [21], [22], [23], [24]. As an alternative to time modulation, dynamic scenes can be modelled by coupling a deformation network to map time-varying observations to canonical deformations [25], [26], [27]. These approaches are generally limited to small-scale scenes and slight movements and are considered inadequate for complex urban environments. Furthermore, these approaches are not designed to decouple dynamic scenes into their static and dynamic primitives, e.g. instance-aware decomposition; therefore their applicability in autonomous driving simulations is limited. Alternatively, explicit decomposition of the dynamic scene facilitates accessibility and editing to manipulate these objects for simulation purposes. Scene graphs have been used to model the relations between the entities composing the scene, as in Neural Scene Graphs (NSG) [17], MARS [14], UniSim [28], StreetGaussians [15], and [16]. However, scene graph-based methods handle objects with limited time-varying appearances. This paper uses wavelets to enhance scene graph-based methods and to accurately model the appearance of dynamic objects in the driving scene.

III. FRAMEWORK AND METHODOLOGY

A. Preliminaries

As introduced in [6], 3DGS represents a scene explicitly using a finite set of n anisotropic 3D Gaussians G = {G_i}, each defined by a 5-tuple G_i = ⟨µ, S, R, α, c⟩, ∀i = 1, 2, . . . , n, where µ ∈ R^3 represents its centroid, S ∈ R^3_+ is a scale vector, R ∈ SO(3) its rotation matrix, α ∈ (0, 1) is opacity, and c ∈ C^3 is a view-dependent color, often represented using a set of coefficients in a basis of SH. The 3D volume occupied by the Gaussian G_i can be expressed as

    G_i(x) = \exp\left( -\tfrac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right)    (1)

The covariance matrix Σ of G_i can be decomposed using the rotation matrix R and the scale vector S as

    \Sigma = R S S^{T} R^{T}    (2)

For rendering, these 3D Gaussians are projected to 2D, and their covariance matrices are transformed accordingly. This involves computing a new covariance matrix Σ′ in camera coordinates using the Jacobian of the affine approximation of the projective transformation J and a viewing transformation W [29]

    \Sigma' = J W \Sigma W^{T} J^{T}    (3)

The color c of a pixel is computed from the N ordered 2D splats covering it using α-blending

    c = \sum_{i=1}^{N} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)    (4)

While 3DGS performs well in static and object-centric small scenes, it faces challenges when dealing with scenes featuring transient objects and varying appearances [?]. This paper proposes a framework to model the appearance of dynamic objects by dynamically estimating the SH coefficients using wavelets, resulting in a better representation of dynamic object appearance in both space and time.
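For intuition, the following minimal NumPy sketch shows how Eqs. (2)-(4) fit together. It is an illustration written for this text, not the authors' implementation; function names are ours.

    import numpy as np

    def covariance_3d(R, s):
        """Eq. (2): build the 3D covariance from rotation matrix R and scale vector s."""
        S = np.diag(s)                      # scale as a diagonal matrix
        return R @ S @ S.T @ R.T            # Sigma = R S S^T R^T

    def covariance_2d(Sigma, J, W):
        """Eq. (3): project the 3D covariance into camera/image space."""
        return J @ W @ Sigma @ W.T @ J.T    # Sigma' = J W Sigma W^T J^T

    def alpha_blend(colors, alphas):
        """Eq. (4): front-to-back alpha compositing of N depth-ordered 2D splats."""
        c, transmittance = np.zeros(3), 1.0
        for c_i, a_i in zip(colors, alphas):
            c += c_i * a_i * transmittance  # weight by accumulated transparency
            transmittance *= (1.0 - a_i)
        return c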
[Fig. 2 diagram: Scene Raw Data → Preprocessing → Dynamic Object Model (3DGS, Wavelet Basis → SH Basis) and Background Model (3DGS, SH Basis) → Scene Graph → Rendering]

Fig. 2. DENSER Scene Composition Framework. The pipeline starts by processing raw sensor data to obtain a densified point cloud for each foreground object in its reference frame and for the static background. Object point clouds are used to initialize the 3D Gaussians of dynamic objects, for which wavelets are used to estimate their color appearance. The background point cloud initializes the 3D Gaussians of the static background, with appearance modelled using a traditional SH basis. All 3D Gaussians form a scene graph which can be jointly rendered for a novel view.

B. Scene Graph Representation

As shown in Fig. 2, the proposed framework is built on a scene graph representation accommodating both static background and dynamic objects. In DENSER, the scene is decomposed into a background node, representing the static entities in the environment such as roads and buildings, and object nodes, each representing a dynamic object in the scene, e.g. vehicles. Each of these nodes is represented using a set of 3D Gaussians as described in Section III-A that are optimized separately for each node. While the background node is directly optimized in the world reference frame W, the object nodes are optimized in their object reference frames O_i that can be transformed into the world reference frame. All Gaussians corresponding to both the background node and the dynamic object nodes are concatenated for rendering in a similar manner as proposed in [15], [6], [16].

Let us denote G_b^W as the set of 3D Gaussians representing the background node and G_i^O as the set of 3D Gaussians representing dynamic object i in its object reference frame O_i. Given the trajectory τ_i : t → T of object i, one can extract a pose transformation matrix T_i^W(t) ∈ SE(3) representing the position and orientation of object i at time t. Assuming the geometry of objects does not change from one pose to another, one can simply transform G_i^O to the world frame by applying a homogeneous transformation using T_i^W(t) as follows

    G_i^W(t) = T_i^W(t) \otimes G_i^O    (5)

The set of all Gaussians to be used for rendering can be obtained by concatenating the sets of Gaussians of the static background node and the transformed dynamic object nodes

    G^W = \bigoplus_{j=0}^{m} G_j^W, \quad \forall j = 0, 1, 2, \ldots, m,    (6)

with j = 0 representing the background, i.e. G_b^W = G_0^W, and the remaining sets of Gaussians being those of the dynamic object nodes.
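A compact sketch of the composition step in Eqs. (5)-(6) follows, assuming each node's Gaussians are stored as arrays of centroids and rotation matrices; dictionary keys and function names are illustrative, not the released code.

    import numpy as np

    def transform_gaussians(means_obj, rots_obj, T_world_obj):
        """Eq. (5): move an object's Gaussians from its object frame to the world frame.

        means_obj: (N, 3) centroids, rots_obj: (N, 3, 3) rotation matrices,
        T_world_obj: (4, 4) homogeneous pose T_i^W(t) of the object at time t.
        """
        R_wo, t_wo = T_world_obj[:3, :3], T_world_obj[:3, 3]
        means_world = means_obj @ R_wo.T + t_wo       # rotate then translate centroids
        rots_world = R_wo[None, :, :] @ rots_obj      # compose per-Gaussian rotations
        return means_world, rots_world

    def compose_scene(background, objects, poses_at_t):
        """Eq. (6): concatenate background Gaussians with all transformed object Gaussians."""
        means, rots = [background["means"]], [background["rots"]]
        for obj, T in zip(objects, poses_at_t):
            m, r = transform_gaussians(obj["means"], obj["rots"], T)
            means.append(m)
            rots.append(r)
        return np.concatenate(means), np.concatenate(rots)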
C. Scene Decomposition

This paper improves existing 3DGS composite scene reconstruction by enhancing the modeling of the appearance of transient objects, resulting in a more realistic and consistent scene representation. The input to DENSER is a sequence of n frames. A frame F_i is defined in terms of a set of m tracked objects, a sensor pose T_i, a LIDAR point cloud P_i, a set of camera images I_i, and optionally a depth map D_i, ∀i ∈ {1, 2, . . . , n}. Each object j in frame i, O_i^j, is defined by a bounding box, a tracking identifier, and an object class, ∀j ∈ {1, 2, . . . , m}. Based on these inputs, DENSER starts by accumulating point clouds over all frames in the world frame W while using object bounding boxes to filter out the points corresponding to foreground objects. The resulting point cloud P_b^W is used to initialize the 3D Gaussians of the background G_b^W, i.e. the position µ_b, opacity α_b, and covariance Σ_b with the corresponding rotation R_b and scale S_b as described in (2), in a similar manner to [6]. Besides, each Gaussian of the background is assigned a set of SH coefficients H_b = {h_uv^b | 0 ≤ u ≤ U, −u ≤ v ≤ V}, where U and V are defined by the order of the SH basis defining the view-dependent color Y_uv^b(θ, ϕ), with θ and ϕ defining the viewing direction. While for static scenes the original 3DGS has been shown to be capable of representing the scene efficiently, it struggles to represent scenes including dynamic entities and varying appearances [7]. Representing the appearance of transient objects solely using SH coefficients tends to be insufficient [15]. This arises mainly from the sensitivity of SH to changes in the position of the objects in the scene and the associated changes in shadows and lighting induced by these motions. To maintain a consistent visual appearance, DENSER handles this challenge by using (i) densification of object point clouds across all frames, which not only ensures a strong prior for the initialization of the 3D Gaussians, but also mitigates the pose calibration errors and noisy measurements inherent in the datasets. Using the sensor pose transformation matrix T_i and the LIDAR point cloud P_i, one can apply an ROI filter defined using the bounding box of object O_j to get the point cloud P_i^j of object j at frame i. Concatenating across all frames results in the densified point cloud P_j^d used for initialization. (ii) We use a time-dependent approximation of the SH bases to capture the varying appearance of dynamic objects, using an orthonormal basis of wavelets whose scale and translation parameters are optimizable. In DENSER, the Ricker wavelet is used

    \psi(t) = \frac{2}{\sqrt{3a}\,\pi^{1/4}} \left( 1 - \frac{\tau^{2}}{a^{2}} \right) \exp\left( -\frac{\tau^{2}}{2a^{2}} \right),    (7)

where a is its scale parameter and τ = t − b, with b its translation parameter. The SH basis function Y_uv^j(θ, ϕ) for object j is approximated using a linear combination of child wavelets

    Y_uv^j(t) = \sum_{i=1}^{d} w_i\, \psi(t, a_i, b_i)    (8)

where d is the dimension of the wavelet basis and the w_i are also optimizable parameters. Unlike the truncated Fourier transform used in [15], wavelets are known to capture higher-frequency content even with a finite wavelet basis dimension, resulting in a significant ability to capture dynamic object details as well as varying appearances. Both (i) and (ii) constitute the genuine contribution of the present paper.
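To make Eqs. (7)-(8) concrete, here is a small sketch assuming one learnable triple (w_i, a_i, b_i) per child wavelet for each SH coefficient; it is an illustration under these assumptions, not the released implementation.

    import numpy as np

    def ricker(t, a, b):
        """Eq. (7): Ricker wavelet with scale a and translation b (tau = t - b)."""
        tau = t - b
        return (2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)) \
            * (1.0 - tau ** 2 / a ** 2) * np.exp(-tau ** 2 / (2.0 * a ** 2))

    def sh_coefficient_at_time(t, weights, scales, translations):
        """Eq. (8): time-dependent SH coefficient as a weighted sum of d child wavelets."""
        return sum(w * ricker(t, a, b)
                   for w, a, b in zip(weights, scales, translations))

    # Example with d = 7 wavelets, the dimension the ablation in Sec. IV-D found to work best.
    rng = np.random.default_rng(0)
    w = rng.normal(size=7)
    a = np.abs(rng.normal(size=7)) + 0.1     # scales kept positive
    b = rng.uniform(0.0, 1.0, size=7)        # translations over the normalized time range
    y_uv_t = sh_coefficient_at_time(0.5, w, a, b)   # one SH coefficient evaluated at t = 0.5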
D. Optimization

To optimize our scene, we employ a composite loss function L defined as

    \mathcal{L} = \mathcal{L}_{color} + \mathcal{L}_{depth} + \mathcal{L}_{accum},    (9)

where L_color represents the reconstruction loss, which ensures that the predicted image I_pred closely matches the ground truth image I_gt. This is achieved through a combination of an L1 loss and a Structural Similarity Index (SSIM) loss. The L1 loss is given by L1 = ∥I_gt − I_pred∥_1 and the SSIM loss is given by L_SSIM = 1 − SSIM(I_gt, I_pred), where L_SSIM quantifies the similarity between two images, taking into account changes in luminance, contrast, and structure. SSIM evaluates image quality and is more sensitive to structural information. The total color loss is defined in terms of L1 and L_SSIM as L_color = (1 − λ_c) L1 + λ_c L_SSIM, where λ_c is a parameter to encourage structural alignment between I_gt and I_pred [6]. L_depth is the mono-depth loss, which ensures that the predicted depth maps are consistent with the observed depth information and helps maintain the geometric consistency of the scene. It is computed as the L1 loss between the predicted depth D_pred and the ground truth depth D_gt maps as L_depth = λ_d ∥D_gt − D_pred∥_1. L_accum is the accumulation loss, which penalizes the deviation of accumulated object occupancy probabilities from the desired distributions. Specifically, it includes an entropy-based loss to ensure balanced occupancy probabilities for each object as L_accum = −(β log(β) + (1 − β) log(1 − β)), where β represents the object occupancy probability. This composite loss function facilitates the simultaneous optimization of appearance, geometry, and occupancy probabilities, ensuring a coherent and realistic reconstruction of the scene.
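A compact PyTorch-style sketch of Eq. (9) is given below. It assumes image and depth tensors and an externally supplied SSIM function (e.g. from a metrics library); the λ values shown are placeholders, not the paper's settings.

    import torch

    def composite_loss(img_pred, img_gt, depth_pred, depth_gt, beta,
                       ssim_fn, lambda_c=0.2, lambda_d=0.1, eps=1e-6):
        """Eq. (9): L = L_color + L_depth + L_accum. Weights here are illustrative only."""
        # Color term: L_color = (1 - lambda_c) * L1 + lambda_c * (1 - SSIM).
        l1 = torch.abs(img_gt - img_pred).mean()
        l_ssim = 1.0 - ssim_fn(img_pred, img_gt)      # ssim_fn supplied by the caller
        l_color = (1.0 - lambda_c) * l1 + lambda_c * l_ssim

        # Depth term: L1 between predicted and observed (mono) depth maps.
        l_depth = lambda_d * torch.abs(depth_gt - depth_pred).mean()

        # Accumulation term: binary entropy of the object occupancy probability beta.
        b = beta.clamp(eps, 1.0 - eps)
        l_accum = (-(b * b.log() + (1.0 - b) * (1.0 - b).log())).mean()

        return l_color + l_depth + l_accum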
IV. EXPERIMENTS AND RESULTS

A. Dataset and Baselines

We conduct a comprehensive evaluation of DENSER for reconstructing dynamic scenes on the KITTI dataset [30], one of the standard benchmarks for scene reconstruction in urban environments. Data frames in KITTI are recorded at 10 Hz. We follow the same settings and evaluation methods used in NSG [17], MARS [14] and StreetGaussians [15], which constitute the recent methods we use as our baselines for quantitative and qualitative comparisons.

B. Implementation Details

The training setup for our scene reconstruction uses the Adam optimizer across all parameters, with 30K iterations. The learning rate for the wavelet scale and translation parameters is set to r = 0.001 with ϵ = 1 × 10^−15. All experiments are conducted on an NVIDIA Tesla V100-SXM2-16GB GPU. In our comparative analysis, we observed that NSG [17] and MARS [14] trained their models for 200K and 350K iterations, respectively, while Street Gaussian [15] reported training for 30K iterations. To determine the optimal training regimen, we tested all these configurations and found that the improvement in reconstruction quality was negligible beyond 30K iterations, with a gain of only about 0.2 in PSNR when extending from 30K to 350K iterations. Given the minimal improvement and the significant increase in training time, extending to 350K iterations was not justifiable. Specifically, training for 30K iterations takes approximately 30 minutes, whereas 350K iterations would require around 5.0 hours.

C. Results and Evaluation

We conduct qualitative and quantitative comparisons against other state-of-the-art methods. These methods include NSG [17], which represents the background as multi-plane images and utilizes per-object learned latent codes with a shared decoder to model moving objects; MARS [14], which builds the scene graph based on Nerfstudio [31]; 3D Gaussians [6], which models the scene with a set of anisotropic Gaussians; and StreetGaussian [15], which represents the scene as composite 3D Gaussians for foreground and background representation. We directly use the metrics reported in their respective papers to compare against our method, as we strictly followed the same procedure and settings used in MARS and StreetGaussians (SG) and can therefore legitimately borrow their results for comparison. Table I presents the quantitative comparison of our method with the baseline methods. The rendered image resolution is 1242×375. Our approach significantly outperforms previous methods. The training and testing image sets in the image reconstruction setting are identical, whereas in novel view synthesis we render frames that are not included in the training data. Specifically, we hold out one in every four frames for the 75% split, one in every two frames for the 50% split, and only every fourth frame is used for training in the 25% split, resulting in 75%, 50%, and 25% of the data being used for training, respectively. We adopt PSNR, SSIM, and LPIPS as metrics to evaluate rendering quality. Our model achieves the best performance across all metrics. Our experimental results indicate that DENSER performs exceptionally well in reconstructing dynamic scenes compared to the baseline methods. The results show significant improvements in Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) metrics, as detailed in Table I. The improvements in PSNR and SSIM highlight our wavelet-based approach's effectiveness in maintaining high fidelity and structural integrity in complex environments. Furthermore, DENSER has shown to be capable of reconstructing small details, e.g. the shadow at the back of the truck in Scene 0006 as shown in Fig. 3, while the other baseline methods are not.
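One possible reading of the frame hold-out protocol above, as a sketch; this is our interpretation of the text, not released evaluation code.

    def train_test_split(frame_ids, split):
        """75%: keep 3 of every 4 frames for training; 50%: every other frame;
        25%: train on every fourth frame only."""
        if split == 75:
            train = [f for i, f in enumerate(frame_ids) if i % 4 != 0]   # hold out 1 in 4
        elif split == 50:
            train = [f for i, f in enumerate(frame_ids) if i % 2 == 0]   # hold out 1 in 2
        elif split == 25:
            train = [f for i, f in enumerate(frame_ids) if i % 4 == 0]   # train on 1 in 4
        else:
            raise ValueError("split must be 75, 50, or 25")
        test = [f for f in frame_ids if f not in train]
        return train, test

    train_frames, test_frames = train_test_split(list(range(40)), split=75)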
TABLE I
Quantitative results on KITTI [30] comparing our approach with baseline methods, MARS [14], SG [15], NSG [17], and 3DGS [6]

                KITTI - 75%               KITTI - 50%               KITTI - 25%
            PSNR↑  SSIM↑  LPIPS↓      PSNR↑  SSIM↑  LPIPS↓      PSNR↑  SSIM↑  LPIPS↓
3DGS [6]    19.19  0.737  0.172       19.23  0.739  0.174       19.06   0.730  0.180
NSG [17]    21.53  0.673  0.254       21.26  0.659  0.266       20.00   0.632  0.281
MARS [14]   24.23  0.845  0.160       24.00  0.801  0.164       23.23   0.756  0.177
SG [15]     25.79  0.844  0.081       25.52  0.841  0.084       24.53   0.824  0.090
Ours        31.73  0.949  0.025       31.19  0.945  0.027       30.408  0.935  0.031

Fig. 3. Qualitative image reconstruction comparison on the KITTI dataset [30]. Rows (top to bottom): GT, NSG, MARS, SG, Ours; columns: (a) KITTI Scene 0006, (b) KITTI Scene 0002.

D. Ablation on the Dimension of Wavelet Basis

We conducted an ablation study to analyse the impact of the size of the wavelet basis, i.e. the number of wavelets used to approximate the SH functions. We ran our experiments while incrementing the dimension of the wavelet basis and analysing the impact on the performance metrics (PSNR↑, SSIM↑, LPIPS↓) used for evaluation, in order to obtain the dimension giving the best performance. As shown in Fig. 4, the performance increases gradually up to 7 wavelets and starts to degrade gradually after that.

Fig. 4. Ablation: Impact of the dimension of the wavelet basis on the performance of scene reconstruction.

E. Scene Editing Applications

DENSER enables photorealistic scene editing, such as swapping, translating, and rotating vehicles, to create diverse and realistic scenarios. This versatility allows autonomous systems to improve their performance and their ability to handle complex real-world conditions, from routine traffic to critical situations.

1) Object Removal: To remove an object, we simply construct a deletion mask that effectively filters out the Gaussian parameters associated with the objects to be removed. The deletion mask is then applied to the Gaussian parameters of the trained model, removing the attributes associated with the unwanted objects as shown in Fig. 5.

Fig. 5. Object Removal: The top row shows the GT while the bottom row displays the modified scenes where the bus has been removed.
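A minimal sketch of the deletion-mask idea described above; the field names are illustrative, since the trained model's actual parameter layout is not specified here.

    import numpy as np

    def remove_objects(gaussians, object_ids, ids_to_remove):
        """Keep only Gaussians whose object id is not in the removal set.

        gaussians: dict of per-Gaussian arrays (e.g. means, rotations, scales,
        opacities, SH/wavelet coefficients), all sharing the same leading dimension.
        object_ids: (N,) array assigning each Gaussian to a scene-graph node.
        """
        keep = ~np.isin(object_ids, list(ids_to_remove))   # deletion mask
        return {name: params[keep] for name, params in gaussians.items()}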
2) Object Swapping: Swapping vehicles within our representational framework is a straightforward process that involves a simple exchange of the unique track ids associated with the two target vehicles. This manipulation results in a dynamic alteration of the scene, wherein a vehicle assumes the spatial attributes, specifically location and orientation, of the vehicle with which it has been swapped, as depicted in Fig. 6.

Fig. 6. Object Swapping: The top shows the GT. In the bottom, the two vehicles within the red box of the top image have been replaced with different ones, and some vehicles have been removed for better visualization.

3) Object Rotation and Translation: Translation and orientation modifications are implemented to adjust an object's position and heading dynamically within a 3D environment. Given an object's position and rotation matrix at a specific timestep i, we can modify the translation and rotation to achieve a desired motion maneuver. For the sake of illustration in this paper, one can shift the translation component in the plane of motion to achieve translation, while for rotation, one can change the rotation angle about the normal to the plane of motion and compute the corresponding new rotation matrix to be used to replace the object's pose, as depicted in Fig. 7.

Fig. 7. Rotation and Translation: The top row displays the GT, illustrating the original positions and orientations of the vehicles. In the middle and bottom left images, the vehicles have been rotated. In the middle and bottom right images, the vehicle has been both rotated and translated to another lane.

4) Trajectory Alteration: A trajectory is defined as a sequence of poses. To edit the scene so that an object follows a new trajectory, one can generalize the change in rotation and translation not only between two configurations, as previously illustrated, but apply this change over time to obtain a smooth change in translation and rotation as a function of time, as illustrated in Fig. 8.

Fig. 8. Trajectory Alteration: The left column displays the GT trajectory and the right column shows the vehicle following a new modified path.
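As an illustration of the pose edits described in 3) and 4), assuming poses are 4x4 homogeneous matrices with the ground plane as the x-y plane; the function names and the 3.5 m lane offset are illustrative, not values from the paper.

    import numpy as np

    def edit_pose(T, delta_xy=(0.0, 0.0), delta_yaw=0.0):
        """Shift a pose in the plane of motion and rotate it about the plane normal (z)."""
        c, s = np.cos(delta_yaw), np.sin(delta_yaw)
        R_yaw = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        T_new = T.copy()
        T_new[:3, :3] = R_yaw @ T[:3, :3]            # new heading
        T_new[:2, 3] += np.asarray(delta_xy)         # in-plane translation shift
        return T_new

    def alter_trajectory(poses, lane_offset=3.5):
        """Apply a smoothly ramped lateral shift over a sequence of poses (Fig. 8 style)."""
        ramps = np.linspace(0.0, 1.0, len(poses))    # 0 -> full offset over the clip
        return [edit_pose(T, delta_xy=(0.0, r * lane_offset))
                for T, r in zip(poses, ramps)]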
V. CONCLUSION

In this paper, we presented DENSER, a novel and efficient framework leveraging 3DGS for the reconstruction of dynamic urban environments. By addressing the limitations of existing methods in modeling the appearance of dynamic objects, particularly in complex driving scenes, DENSER demonstrates significant improvements. Our approach introduces the dynamic estimation of Spherical Harmonics (SH) bases using wavelets, which enhances the representation of dynamic objects in both space and time. Furthermore, the densification of point clouds across multiple frames contributes to faster convergence during model training. Extensive evaluations on KITTI show that DENSER outperforms state-of-the-art techniques by a substantial margin, showcasing its effectiveness in dynamic scene reconstruction. Future work will focus on extending this approach to deformable dynamic objects such as pedestrians and cyclists.
REFERENCES

[1] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "Carla: An open urban driving simulator," in Conference on robot learning, pp. 1-16, PMLR, 2017.
[2] LGSVL, "Lgsvl simulator." https://www.svlsimulator.com. Accessed: 2024-08-03.
[3] DeepDrive, "Deepdrive." https://github.com/deepdrive/deepdrive. Accessed: 2024-08-03.
[4] F. Mütsch, H. Gremmelmaier, N. Becker, D. Bogdoll, M. R. Zofka, and J. M. Zöllner, "From model-based to data-driven simulation: Challenges and trends in autonomous driving," arXiv preprint arXiv:2305.13960, 2023.
[5] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "Nerf: Representing scenes as neural radiance fields for view synthesis," in ECCV, 2020.
[6] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, "3d gaussian splatting for real-time radiance field rendering," TOG, vol. 42, July 2023.
[7] H. Dahmani, M. Bennehar, N. Piasco, L. Roldao, and D. Tsishkou, "Swag: Splatting in the wild images with appearance-conditioned gaussians," arXiv preprint arXiv:2403.10427, 2024.
[8] R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth, "Nerf in the wild: Neural radiance fields for unconstrained photo collections," in CVPR, pp. 7210-7219, 2021.
[9] J. Guo, N. Deng, X. Li, Y. Bai, B. Shi, C. Wang, C. Ding, D. Wang, and Y. Li, "Streetsurf: Extending multi-view implicit surface reconstruction to street views," arXiv preprint arXiv:2306.04988, 2023.
[10] M. Tancik, V. Casser, X. Yan, S. Pradhan, B. Mildenhall, P. P. Srinivasan, J. T. Barron, and H. Kretzschmar, "Block-nerf: Scalable large scene neural view synthesis," in CVPR, 2022.
[11] H. Turki, J. Y. Zhang, F. Ferroni, and D. Ramanan, "Suds: Scalable urban dynamic scenes," in CVPR, 2023.
[12] J. Yang, B. Ivanovic, O. Litany, X. Weng, S. W. Kim, B. Li, T. Che, D. Xu, S. Fidler, M. Pavone, and Y. Wang, "Emernerf: Emergent spatial-temporal scene decomposition via self-supervision," arXiv preprint arXiv:2311.02077, 2023.
[13] T.-A.-Q. Nguyen, L. Roldão, N. Piasco, M. Bennehar, and D. Tsishkou, "Rodus: Robust decomposition of static and dynamic elements in urban scenes," arXiv preprint arXiv:2403.09419, 2024.
[14] Z. Wu, T. Liu, L. Luo, Z. Zhong, J. Chen, H. Xiao, C. Hou, H. Lou, Y. Chen, R. Yang, Y. Huang, X. Ye, Z. Yan, Y. Shi, Y. Liao, and H. Zhao, "Mars: An instance-aware, modular and realistic simulator for autonomous driving," CICAI, 2023.
[15] Y. Yan, H. Lin, C. Zhou, W. Wang, H. Sun, K. Zhan, X. Lang, X. Zhou, and S. Peng, "Street gaussians for modeling dynamic urban scenes," arXiv preprint arXiv:2401.01339, 2024.
[16] X. Zhou, Z. Lin, X. Shan, Y. Wang, D. Sun, and M.-H. Yang, "Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes," in CVPR, pp. 21634-21643, 2024.
[17] J. Ost, F. Mannan, N. Thuerey, J. Knodt, and F. Heide, "Neural scene graphs for dynamic scenes," in CVPR, pp. 2856-2865, 2021.
[18] B. Attal, J.-B. Huang, C. Richardt, M. Zollhoefer, J. Kopf, M. O'Toole, and C. Kim, "HyperReel: High-fidelity 6-DoF video with ray-conditioned sampling," in CVPR, 2023.
[19] S. Fridovich-Keil, G. Meanti, F. R. Warburg, B. Recht, and A. Kanazawa, "K-planes: Explicit radiance fields in space, time, and appearance," in CVPR, 2023.
[20] Z. Li, S. Niklaus, N. Snavely, and O. Wang, "Neural scene flow fields for space-time view synthesis of dynamic scenes," in CVPR, 2021.
[21] H. Lin, S. Peng, Z. Xu, T. Xie, X. He, H. Bao, and X. Zhou, "High-fidelity and real-time novel view synthesis for dynamic scenes," in SIGGRAPH Asia 2023 Conference Proceedings, pp. 1-9, 2023.
[22] K. Park, U. Sinha, P. Hedman, J. T. Barron, S. Bouaziz, D. B. Goldman, R. Martin-Brualla, and S. M. Seitz, "Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields," arXiv preprint arXiv:2106.13228, 2021.
[23] S. Peng, Y. Yan, Q. Shuai, H. Bao, and X. Zhou, "Representing volumetric videos as dynamic mlp maps," in CVPR, pp. 4252-4262, 2023.
[24] L. Song, A. Chen, Z. Li, Z. Chen, L. Chen, J. Yuan, Y. Xu, and A. Geiger, "Nerfplayer: A streamable dynamic scene representation with decomposed neural radiance fields," TVCG, vol. 29, no. 5, pp. 2732-2742, 2023.
[25] A. Pumarola, E. Corona, G. Pons-Moll, and F. Moreno-Noguer, "D-nerf: Neural radiance fields for dynamic scenes," in CVPR, pp. 10318-10327, 2021.
[26] Z. Yang, X. Gao, W. Zhou, S. Jiao, Y. Zhang, and X. Jin, "Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction," arXiv preprint arXiv:2309.13101, 2023.
[27] Z. Yang, X. Gao, W. Zhou, S. Jiao, Y. Zhang, and X. Jin, "Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction," arXiv preprint arXiv:2309.13101, 2023.
[28] Z. Yang, Y. Chen, J. Wang, S. Manivasagam, W.-C. Ma, A. J. Yang, and R. Urtasun, "Unisim: A neural closed-loop sensor simulator," in CVPR, 2023.
[29] M. Zwicker, H. Pfister, J. Van Baar, and M. Gross, "Ewa volume splatting," in Proceedings Visualization, 2001. VIS'01., pp. 29-538, IEEE, 2001.
[30] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? the kitti vision benchmark suite," in CVPR, 2012.
[31] M. Tancik, E. Weber, E. Ng, R. Li, B. Yi, J. Kerr, T. Wang, A. Kristoffersen, J. Austin, K. Salahi, A. Ahuja, D. McAllister, and A. Kanazawa, "Nerfstudio: A modular framework for neural radiance field development," in ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH '23, 2023.
