In this section, we describe the complete SLAM framework in detail; the system overview is shown in Figure 2. In Section 3.1, we first introduce the 3D Gaussian scene representation and the differentiable splatting-based rasterized rendering method. Section 3.2 presents the G-ICP-based tracking thread and the keyframe selection strategy. Section 3.3 presents the adaptive 3D Gaussian expansion strategy, the pruning strategy, and the construction of the loss function.
3.1. 3D Gaussian Scene Representation
We represent the underlying map of the environment as a set of 3D Gaussians; an overview is shown in Figure 3. Each 3D Gaussian initialized from the point cloud contains a center $\mu \in \mathbb{R}^{3}$, a spatial covariance $\Sigma \in \mathbb{R}^{3\times 3}$, an opacity $\alpha$, and a color $c$. The color $c$ is represented by spherical harmonics for view-dependent radiance. All properties are learnable and are optimized by means of back-propagation. The covariance is parameterized as
$$\Sigma = R S S^{T} R^{T},$$
where $S = \mathrm{diag}(s)$, $s \in \mathbb{R}^{3}$ is a 3D scale vector, and $R$ is the rotation matrix, which participates in the computation as a 4D quaternion. The influence of a single 3D Gaussian on a physical point $x$ in 3D space is calculated using the following equation:
$$f(x) = \exp\!\left(-\tfrac{1}{2}\,(x-\mu)^{T}\,\Sigma^{-1}\,(x-\mu)\right).$$
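For concreteness, the NumPy sketch below shows how the covariance $\Sigma = R S S^{T} R^{T}$ and the influence $f(x)$ can be evaluated for a single Gaussian; the function names and the (w, x, y, z) quaternion convention are illustrative choices, not the paper's implementation.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def gaussian_covariance(scale, quat):
    """Sigma = R S S^T R^T with S = diag(scale)."""
    R = quat_to_rotmat(quat)
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

def gaussian_influence(x, mu, cov):
    """f(x) = exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)))
```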
In order to obtain the rendered RGB-D image, we render the scene from the estimated camera pose and optimize the Gaussian parameters against the input image. The 3D Gaussians are projected onto the 2D image plane using the following equations:
$$\mu_{2D} = K\,\frac{E\,\mu}{d}, \qquad \Sigma_{2D} = J\,E\,\Sigma\,E^{T}J^{T},$$
where $K$ is the camera intrinsic matrix, $E$ is the extrinsic matrix capturing the rotation and translation of the camera, $J$ is the Jacobian of the affine approximation of the projective transformation, and $d$ is the center depth of the 3D Gaussian obtained by projecting the center onto the z-axis of the camera coordinate system.
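A minimal sketch of this projection step is given below, assuming a pinhole camera model and a 4x4 world-to-camera extrinsic matrix; the Jacobian is that of the perspective division, and all names are illustrative rather than the system's actual code.

```python
import numpy as np

def project_gaussian(mu, cov, K, E):
    """Project a 3D Gaussian (mu, cov) onto the image plane.

    K: 3x3 intrinsics, E: 4x4 world-to-camera extrinsics.
    Returns the 2D pixel center, the 2x2 image-plane covariance, and depth d.
    """
    W, t = E[:3, :3], E[:3, 3]
    mu_cam = W @ mu + t                      # center in camera coordinates
    d = mu_cam[2]                            # depth along the camera z-axis
    uv = K @ (mu_cam / d)                    # perspective projection to pixel coords
    fx, fy = K[0, 0], K[1, 1]
    x, y, z = mu_cam
    # Jacobian of the affine approximation of the projective transform
    J = np.array([[fx / z, 0.0,    -fx * x / z**2],
                  [0.0,    fy / z, -fy * y / z**2]])
    cov2d = J @ W @ cov @ W.T @ J.T          # Sigma_2D = J E Sigma E^T J^T
    return uv[:2], cov2d, d
```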
The 3D Gaussians projected onto the image plane are sorted in depth order, and front-to-back alpha-blended rendering is performed to synthesize a pixel color $C$:
$$C = \sum_{i=1}^{N} c_{i}\,\alpha_{i} \prod_{j=1}^{i-1}\left(1-\alpha_{j}\right).$$
Similarly, the depth $D$ is rendered by the following equation:
$$D = \sum_{i=1}^{N} d_{i}\,\alpha_{i} \prod_{j=1}^{i-1}\left(1-\alpha_{j}\right),$$
where $d_{i}$ is the center depth of the $i$-th Gaussian.
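The per-pixel blending can be sketched as follows, assuming the Gaussians covering a pixel are already depth-sorted and their opacities $\alpha_i$ at that pixel have been evaluated from the 2D covariance; the early-termination threshold is an illustrative detail.

```python
import numpy as np

def blend_pixel(colors, depths, alphas):
    """Front-to-back alpha blending for a single pixel.

    colors: (N, 3) RGB of the depth-sorted Gaussians covering the pixel
    depths: (N,)   their center depths
    alphas: (N,)   their opacities evaluated at this pixel
    Returns the blended color C and depth D.
    """
    C = np.zeros(3)
    D = 0.0
    transmittance = 1.0                      # prod_{j<i} (1 - alpha_j)
    for c, d, a in zip(colors, depths, alphas):
        w = a * transmittance
        C += w * c
        D += w * d
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:             # early termination once nearly opaque
            break
    return C, D
```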
3.2. Tracking
In the tracking thread, we use G-ICP to estimate the camera pose. G-ICP introduces probabilistic information on top of ICP. While maintaining the speed and simplicity of ICP, G-ICP allows outlier terms, measurement noise, and other probabilistic techniques to be incorporated, significantly improving the algorithm's robustness.
For the first frame of the RGB-D camera input, the tracking step is skipped, and the camera pose is set to the identity. The obtained point cloud is also added to G-ICP as the target point cloud, and the pose is estimated from the second input frame onwards. G-ICP assumes the existence of point sets $\hat{A}=\{\hat{a}_{i}\}$ and $\hat{B}=\{\hat{b}_{i}\}$, which generate the measured point clouds $A$ and $B$ according to the Gaussian probability models $a_{i}\sim\mathcal{N}(\hat{a}_{i}, C_{i}^{A})$ and $b_{i}\sim\mathcal{N}(\hat{b}_{i}, C_{i}^{B})$. Here, $a_{i}$ and $b_{i}$ are associated, $a_{i}$ and $b_{i}$ are the actual measurements of the point locations, and $C_{i}^{A}$ and $C_{i}^{B}$ are the covariance matrices associated with the measured points. The covariance of a 3D point is obtained by computing the covariance matrix of its k-nearest neighbors. G-ICP aims to find a transformation that maximally aligns the current-frame point cloud $A$ with the map point cloud $B$. For the corresponding points $\hat{a}_{i}$ and $\hat{b}_{i}$, there exists a ground-truth transformation $T^{*}$ such that $\hat{b}_{i}=T^{*}\hat{a}_{i}$. The error term of an arbitrary rigid transformation $T$ is:
$$d_{i}^{(T)} = b_{i} - T a_{i}.$$
Since $a_{i}$ and $b_{i}$ are independent of each other and both satisfy Gaussian distributions, $d_{i}^{(T^{*})}$ also satisfies a Gaussian distribution:
$$d_{i}^{(T^{*})} \sim \mathcal{N}\!\left(0,\; C_{i}^{B} + T^{*} C_{i}^{A} \left(T^{*}\right)^{T}\right).$$
$T$ can be viewed as a parameter to be estimated in this probability distribution. We use maximum likelihood estimation (MLE) to iteratively calculate $T$:
$$T = \underset{T}{\arg\max}\,\prod_{i} p\!\left(d_{i}^{(T)}\right) = \underset{T}{\arg\max}\,\sum_{i} \log p\!\left(d_{i}^{(T)}\right).$$
The above equation can be simplified as:
$$T = \underset{T}{\arg\min}\,\sum_{i} {d_{i}^{(T)}}^{T}\left(C_{i}^{B} + T C_{i}^{A} T^{T}\right)^{-1} d_{i}^{(T)},$$
where $T$ is the camera pose estimated by the system.
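As a concrete illustration, the sketch below first estimates each point's covariance from its k nearest neighbors, as described above, and then evaluates the simplified objective for a candidate transform $T$. The use of SciPy's cKDTree, the value of k, and the function names are assumptions for illustration, and the actual minimization over rigid transforms is carried out iteratively by the G-ICP solver.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_covariances(points, k=10):
    """Per-point 3x3 covariance from the k nearest neighbors (C_i^A, C_i^B)."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)              # indices of the k nearest neighbors
    covs = np.empty((len(points), 3, 3))
    for i, nn in enumerate(idx):
        covs[i] = np.cov(points[nn].T)            # sample covariance of the neighborhood
    return covs

def gicp_cost(T, src, tgt, cov_src, cov_tgt):
    """Evaluate sum_i d_i^T (C_i^B + R C_i^A R^T)^{-1} d_i with d_i = b_i - T a_i.

    T: 4x4 rigid transform; src/tgt: (N, 3) matched points of the current
    frame and the map; cov_src/cov_tgt: (N, 3, 3) their covariances.
    """
    R, t = T[:3, :3], T[:3, 3]
    cost = 0.0
    for a, b, Ca, Cb in zip(src, tgt, cov_src, cov_tgt):
        d = b - (R @ a + t)                       # residual d_i^(T)
        M = Cb + R @ Ca @ R.T                     # combined covariance
        cost += d @ np.linalg.solve(M, d)         # Mahalanobis distance
    return cost
```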
Since the 3D Gaussians need good observation coverage, and considering the densification strategy, we divide the keyframes into tracking and mapping keyframes, as shown in Figure 4. We use G-ICP to calculate the match ratio between the current-frame point cloud and the map point cloud. The frame is selected as a tracking keyframe if the match ratio is below a threshold. If the relative rotation of the current frame with respect to the latest keyframe is above a threshold, the frame is selected as a mapping keyframe. The relative rotation is expressed using Euler angles, and to ensure the stability of the results, we sum the Euler angles on the three axes and compare the sum to a single threshold. For tracking keyframes, only 3D Gaussians that do not overlap with the existing map are added to the optimization. The opposite holds for mapping keyframes, where only 3D Gaussians that overlap with the existing map are added to the optimization.
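A compact sketch of this dual keyframe test is shown below; the match-ratio and rotation thresholds are placeholder values, not those used by the system.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def select_keyframe(match_ratio, T_cur, T_last_kf,
                    match_thresh=0.7, rot_thresh_deg=20.0):
    """Return 'tracking', 'mapping', or None for the current frame.

    match_ratio: fraction of current-frame points matched to the map (from G-ICP)
    T_cur, T_last_kf: 4x4 camera poses of the current frame and latest keyframe
    """
    if match_ratio < match_thresh:               # poor overlap -> tracking keyframe
        return "tracking"
    R_rel = T_last_kf[:3, :3].T @ T_cur[:3, :3]  # relative rotation to latest keyframe
    euler = Rotation.from_matrix(R_rel).as_euler("xyz", degrees=True)
    if np.sum(np.abs(euler)) > rot_thresh_deg:   # summed Euler angles over threshold
        return "mapping"
    return None
```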
3.3. Mapping
RD-SLAM aims to create a coherent, high-fidelity 3D map. The 3D Gaussian scene representation is updated and optimized at each selected keyframe for stable mapping. In the mapping thread, we first use an adaptive densification strategy to add new 3D Gaussians to the scene representation, which is then rendered differentiably to obtain color and depth images. The updated 3D Gaussian scene representation is optimized by minimizing the color and depth losses between the input and rendered images, together with a regularization loss. Finally, anomalous 3D Gaussians are removed using the 3D Gaussian pruning strategy.
The Gaussian densification process is closely related to the keyframe selection strategy. We use all pixels of the first frame to initialize new 3D Gaussians. For each pixel, we add a new 3D Gaussian with the color of that pixel, centered at the spatial position of that pixel, with an opacity of 0.1. When the tracking thread uses G-ICP for pose estimation, the system simultaneously obtains the matching relationship between the current frame's spatial points and the map's spatial points. If the number of matched spatial points is below a certain threshold, the current frame is added as a tracking keyframe, and the unmatched spatial points are used to initialize new 3D Gaussians. When the current frame is detected as a mapping keyframe by the dual keyframe selection strategy, the spatial points of the current frame that match the map are used to initialize new 3D Gaussians.
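The densification rule described above can be summarized as follows; the initial opacity of 0.1 comes from the text, while the initial scale and the container layout are illustrative assumptions.

```python
import numpy as np

def densify(points, colors, matched_mask, kf_type):
    """Choose which keyframe points are promoted to new 3D Gaussians.

    points, colors: (N, 3) back-projected pixels of the keyframe
    matched_mask:   (N,) True where G-ICP matched the point to the map
    kf_type:        'tracking' or 'mapping'
    """
    if kf_type == "tracking":
        keep = ~matched_mask                     # only points NOT covered by the map
    else:  # mapping keyframe
        keep = matched_mask                      # only points overlapping the map
    new_gaussians = {
        "mu":      points[keep],
        "color":   colors[keep],
        "opacity": np.full(keep.sum(), 0.1),     # initial opacity from the text
        "scale":   np.full((keep.sum(), 3), 0.01),  # illustrative initial scale
    }
    return new_gaussians
```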
After splatting the 3D Gaussians onto the image plane, the color loss between the input color image and the rendered image is calculated. We compute the color loss as a weighted combination of $L_{1}$ and structural similarity index measure (SSIM) [33] losses:
$$L_{c} = (1-\lambda)\,\lVert I - \hat{I}\rVert_{1} + \lambda\left(1 - \mathrm{SSIM}(I, \hat{I})\right),$$
where $I$ is the original color image and $\hat{I}$ is the rendered color image. Depth optimization is achieved using an $L_{1}$ loss:
$$L_{d} = \lVert D - \hat{D}\rVert_{1},$$
where $D$ is the original depth image and $\hat{D}$ is the rendered depth image.
To prevent scale explosion of the 3D Gaussians and to keep them from becoming highly elongated along the camera viewing direction, we add a regularization loss $L_{r}$ on the scale. It effectively compensates for the 3D Gaussian scale distortion caused by insufficient observation viewpoints during SLAM. We use an $L_{1}$ loss for scale optimization:
$$L_{r} = \lVert \bar{s} - s_{0}\rVert_{1},$$
where $\bar{s}$ is the average of the scale and $s_{0}$ is the scale before 3D Gaussian optimization. Finally, the depth, color, and regularization losses are optimized together:
$$L = \lambda_{d} L_{d} + \lambda_{c} L_{c} + \lambda_{r} L_{r},$$
where $\lambda_{d}$, $\lambda_{c}$, and $\lambda_{r}$ are predefined hyperparameters weighting the depth, color, and scale regularization losses.
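A PyTorch-style sketch of the total mapping loss is given below, assuming an external `ssim_fn` implementation of SSIM; the weighting parameters and the exact form of the scale regularization follow the reconstruction above and are indicative rather than the authors' exact code.

```python
import torch
import torch.nn.functional as F

def mapping_loss(render_rgb, gt_rgb, render_depth, gt_depth,
                 scales, scales_before, ssim_fn,
                 lam_ssim=0.2, lam_d=1.0, lam_c=1.0, lam_r=1.0):
    """Weighted sum of color, depth, and scale-regularization losses."""
    # color: weighted L1 + SSIM term
    l_color = (1.0 - lam_ssim) * F.l1_loss(render_rgb, gt_rgb) \
              + lam_ssim * (1.0 - ssim_fn(render_rgb, gt_rgb))
    # depth: L1 between rendered and input depth
    l_depth = F.l1_loss(render_depth, gt_depth)
    # scale regularization: keep the mean scale close to its pre-optimization value
    l_reg = F.l1_loss(scales.mean(dim=-1), scales_before.mean(dim=-1))
    return lam_d * l_depth + lam_c * l_color + lam_r * l_reg
```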
For each keyframe, we perform a fixed number of iterations. First, we iterate on the current keyframe to fully optimize the 3D Gaussians for the current viewpoint. Then, the first n frames in the list of keyframes that overlap with the current keyframe are selected, and one of them is randomly chosen in each iteration, so that the whole map is fully optimized while avoiding falling into local minima. After a certain number of mapping iterations, some anomalous Gaussians appear due to the instability of the adaptive control of the Gaussians. We follow the pruning strategy in [18], and additionally remove 3D Gaussians whose locations are not near the scene surface.
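The per-keyframe optimization schedule and the final pruning step can be sketched as follows; the iteration counts, the number n of overlapping keyframes, and the callback structure are placeholders rather than the system's actual settings.

```python
import random

def optimize_keyframe(current_kf, overlapping_kfs, optimize_step, prune,
                      num_iters=100, num_current=50, n_overlap=10):
    """Per-keyframe optimization schedule described in the text.

    optimize_step(kf): one gradient step of the mapping loss for that keyframe
    prune():           remove anomalous / off-surface Gaussians
    """
    # 1) fully optimize the 3D Gaussians for the current viewpoint
    for _ in range(num_current):
        optimize_step(current_kf)
    # 2) randomly revisit earlier overlapping keyframes to avoid local minima
    candidates = overlapping_kfs[:n_overlap]
    for _ in range(num_iters - num_current):
        if candidates:
            optimize_step(random.choice(candidates))
    # 3) remove anomalous Gaussians after the iterations
    prune()
```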