¹ National University of Singapore, Singapore, Singapore
² Tsinghua University, Beijing, China
³ The Chinese University of Hong Kong, Hong Kong, China
Email: ymjin@nus.edu.sg

Deform3DGS: Flexible Deformation for
Fast Surgical Scene Reconstruction
with Gaussian Splatting

Shuojue Yang¹, Qian Li¹, Daiyun Shen², Bingchen Gong³, Qi Dou³, Yueming Jin¹ (✉)
Abstract

Tissue deformation poses a key challenge for accurate surgical scene reconstruction. Despite yielding high reconstruction quality, existing methods suffer from slow rendering speeds and long training times, limiting their intraoperative applicability. Motivated by recent progress in 3D Gaussian Splatting (3D GS), an emerging technology for real-time 3D rendering, this work presents a novel fast reconstruction framework, termed Deform3DGS, for deformable tissues during endoscopic surgery. Specifically, we introduce 3D GS into surgical scenes by integrating a point cloud initialization to improve reconstruction. Furthermore, we propose a novel flexible deformation modeling scheme (FDM) to learn tissue deformation dynamics at the level of individual Gaussians. Our FDM can model the surface deformation with efficient representations, allowing for real-time rendering performance. More importantly, FDM significantly accelerates surgical scene reconstruction, demonstrating considerable clinical value, particularly in intraoperative settings where time efficiency is crucial. Experiments on Da Vinci robotic surgery videos indicate the efficacy of our approach, showcasing superior reconstruction fidelity (PSNR: 37.90) and rendering speed (338.8 FPS) while substantially reducing training time to only 1 minute/scene. Our code is available at https://github.com/jinlab-imvr/Deform3DGS.

Footnote: D. Shen did this work during his internship at the National University of Singapore.

Keywords:
Fast 3D reconstruction · Surgical scene reconstruction · 3D Gaussian Splatting · Deformable scene.

1 Introduction

Three-dimensional (3D) reconstruction of surgical scenes has great potential to facilitate many downstream applications, including intraoperative navigation [12] and visualization enhancement [13, 9]. In addition, a high-quality, dynamic 3D scene model has demonstrated potential benefits for surgical training by shortening learning curves [1] and for remote surgical proctoring [17] by allowing immersive observation of surgical scenes.

Conventional reconstruction methods suffer from a redundant workflow including depth estimation, surface reconstruction, and texture mapping [14]. To enhance compactness and efficiency, many studies introduce Neural Radiance Field (NeRF) [11], an implicit 3D representation using tiny Multi-Layer Perceptrons (MLPs), to model geometric details and appearance from the captured video. This implicit representation can directly render photo-realistic novel views, simplifying the bulky conventional workflow. Works [15, 20] successfully adapt NeRF to endoscopic scene reconstruction, leading to promising performance in rendering quality and geometry fidelity. However, this line of work suffers from a long training time (hours) and low rendering speed, which significantly impedes its intraoperative applicability. LerPlane [18] encodes the spatial and temporal information using decomposed 4D feature planes to accelerate the reconstruction. This approach inherently models the deformation and directly outputs rendering parameters for deformed tissues, but it relies heavily on complex computations performed on feature planes and MLPs, which compromises the acceleration.

With the current progress in computer graphics, Gaussian Splatting (GS) [5] has emerged as a ground-breaking 3D representation. Driven by its superior performance, many efforts [19, 7, 6] have been made to adapt this technique to dynamic scene reconstruction through a deformation model that represents motions. 4DGS [16] is one of the pilot studies, presenting an efficient solution in which decomposed feature planes similar to [2] are used to model time-dependent deformations. Despite the compact feature encoding, 4DGS retains time-consuming feature plane interpolation and decoding with MLPs, which limits the acceleration and intraoperative applicability. Also, 4DGS inadequately utilizes geometric priors, leading to a highly sparse point initialization and a longer time for geometry reconstruction. Therefore, although GS holds great value for fast reconstruction of dynamic scenes, integrating it into the surgical scene reconstruction framework is still challenging, which inspires our work toward effectively adapting GS to intraoperative scenes.

In this paper, we develop a highly efficient framework for deformable surgical scene reconstruction, named Deform3DGS, by introducing the GS technique into surgical scenarios with a motion-aware point fusion (MAPF) scheme that densely initializes the Gaussian point cloud. In addition, we propose a novel flexible deformation modeling scheme (FDM) for efficiently representing tissue deformations. FDM models the tissue deformations via an efficient linear combination regression, where learnable basis functions are leveraged to improve both the representation capability and efficiency. Finally, we evaluate our method on the EndoNeRF [15] and StereoMIS [3] datasets collected from Da Vinci robotic surgery videos. Experiments indicate the efficacy of our approach, demonstrating superior reconstruction quality (PSNR: 37.90) and rendering speed (338.8 FPS) while substantially reducing training time to around 1 minute per scene.

2 Method

Pipeline. As shown in Fig. 1, during the training phase, our framework initializes the endoscopic scene with a Gaussian point cloud $\mathcal{G}$ using the MAPF scheme (Sec. 2.3). FDM then models deformed tissues by learning the time-dependent changes of each Gaussian point $\mathfrak{g}$ (Sec. 2.2). Next, rendering (Sec. 2.1) is performed to obtain the colored image and depth of the deformed Gaussian point cloud given the camera viewpoint. Finally, the rendered image and depth are supervised by the corresponding ground-truth (GT) image and stereo depth to optimize the framework (Sec. 2.4). During rendering (testing), given a query time and camera viewpoint, FDM deforms the learned Gaussian point cloud $\mathcal{G}$ and the view of the deformed Gaussians is rendered.

Figure 1: Illustration of our fast deformable tissue reconstruction framework, Deform3DGS, composed of (a) Point cloud initialization, (b) Flexible Deformation Modeling, and (c) 3D Gaussian Splatting.

Problem Setting. Our task is to train a deformable tissue reconstruction model $\boldsymbol{\Psi}$ from an endoscopic surgery video. Given the camera intrinsic matrix $\boldsymbol{K}$ and the extrinsic matrix $\boldsymbol{T}_i$ recording the camera viewpoint information at the $i$-th frame, a desired reconstruction model $\boldsymbol{\Psi}^{\ast}$ is supposed to render the $i$-th view $\boldsymbol{I}_i$ at timestamp $t_i$ as $\boldsymbol{I}_i = \boldsymbol{\Psi}^{\ast}(\boldsymbol{T}_i, t_i; \boldsymbol{K})$.

2.1 Preliminaries of Gaussian Splatting

3D Gaussian Splatting [5] is a static 3D scene representation that models a scene as a 3D Gaussian point cloud in the world coordinate frame. Each Gaussian point contains learnable attributes: position $\boldsymbol{\mu} \in \mathbb{R}^{3}$, rotation $\boldsymbol{r} \in \mathbb{R}^{4}$, scale $\boldsymbol{s} \in \mathbb{R}^{3}$, opacity $\boldsymbol{\alpha}$, and spherical harmonic ($sh$) coefficients. Given an arbitrary 3D coordinate $\boldsymbol{x} \in \mathbb{R}^{3}$ in the world frame, the spatial impact of a 3D Gaussian point on $\boldsymbol{x}$ is defined by a Gaussian distribution as follows:

$$f(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right) \tag{1}$$

$$\boldsymbol{\Sigma} = \boldsymbol{R}\boldsymbol{S}\boldsymbol{S}^{T}\boldsymbol{R}^{T}, \tag{2}$$

where $\boldsymbol{R}$ is the rotation matrix calculated from $\boldsymbol{r}$, and $\boldsymbol{S}$ is the diagonal matrix of $\boldsymbol{s}$. Next, the 3D Gaussian point is projected to the 2D image plane for rendering. The projected 2D Gaussian with position $\boldsymbol{\mu}^{2D}$ and covariance matrix $\boldsymbol{\Sigma}^{2D}$ can be analytically computed in the pixel coordinate frame given the camera intrinsic and extrinsic parameters ($\boldsymbol{K}$ and $\boldsymbol{T}$). Finally, $\alpha$-blending [5] is performed to render the colored image $\boldsymbol{\widehat{C}}$ and the corresponding depth $\boldsymbol{\widehat{D}}$.
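As a rough illustration of Eqs. (1) and (2), the NumPy sketch below builds the covariance from a rotation quaternion and a scale vector and evaluates the Gaussian falloff at a query point. The function names and the (w, x, y, z) quaternion ordering are assumptions made for this example; they are not taken from the released code.

```python
import numpy as np

def quat_to_rotmat(r):
    """Convert a quaternion r = (w, x, y, z) into a 3x3 rotation matrix.

    The (w, x, y, z) ordering is an assumption of this sketch.
    """
    w, x, y, z = r / np.linalg.norm(r)  # normalize so that R is a proper rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance(r, s):
    """Eq. (2): Sigma = R S S^T R^T with S = diag(s)."""
    R = quat_to_rotmat(np.asarray(r, dtype=float))
    S = np.diag(np.asarray(s, dtype=float))
    return R @ S @ S.T @ R.T

def gaussian_influence(x, mu, sigma):
    """Eq. (1): unnormalized Gaussian falloff at x for a point centred at mu."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solve Sigma * y = d instead of forming the explicit inverse.
    return float(np.exp(-0.5 * d @ np.linalg.solve(sigma, d)))
```

Factoring the covariance into a rotation and a scale keeps it symmetric and positive semi-definite for any parameter values, which is why the rotation and scale, rather than the covariance entries, are the learnable quantities.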

2.2 Flexible Deformation Modeling

MLP-based implicit deformation fields, as represented by HexPlane [2], add significant computational overhead to training and cannot meet the stringent real-time processing demands of surgical video analytics. To mitigate this, [8] proposed a computationally efficient explicit representation of the deformation field, in which Fourier and polynomial basis functions $b(t)$ parameterized by time are employed to learn the per-Gaussian motion curve. Each Gaussian is endowed with a set of learnable weights $\boldsymbol{\omega}$ that linearly combine the basis functions to generate the motion curve as $\psi(t; \boldsymbol{\omega}) = \sum_{j=1}^{B} \omega_j b_j(t)$. Although each Gaussian can adapt to deformation through weight adjustments, the temporal deformation is confined to canonical motion, leading to inconsistent deformation representations for different queried times. The model may be compelled to forgo certain specific movements to ensure a coherent overall trajectory, which is detrimental in scenes with intricate and nuanced deformations, such as instrument-tissue interactions. Furthermore, the deformation learned at a particular timestamp globally influences the entire trajectory, resulting in a sub-optimal local deformation representation at different queried times.

To overcome these limitations, we introduce a novel flexible deformation modeling scheme (FDM), which makes the basis functions flexible and adaptable through learnable parameters. We adopt the Gaussian function with a learnable center $\theta$ and a learnable width $\sigma$ (so that $\sigma^{2}$ is the variance):

$$\tilde{b}(t; \theta, \sigma) = \exp\left(-\frac{1}{2\sigma^{2}}(t-\theta)^{2}\right). \tag{3}$$

For each Gaussian $\mathfrak{g}$ in the point cloud, position $\boldsymbol{\mu}$ and rotation $\boldsymbol{r}$ are naturally related to tissue motions, and scale $\boldsymbol{s}$ keeps varying since tissues are prone to elastic deformations during instrument intervention. Thus, an additional set of learnable parameters $\boldsymbol{\Theta}^{\mu}$, $\boldsymbol{\Theta}^{r}$, $\boldsymbol{\Theta}^{s}$ is introduced for each Gaussian to describe temporal deformations in its position, rotation, and scale, respectively (see Fig. 1). Taking the positional change in the $x$ direction as an example, the deformation curve can be expressed by a set of parameters $\boldsymbol{\Theta}^{\mu,x} = \{\boldsymbol{\omega}^{\mu,x}, \boldsymbol{\theta}^{\mu,x}, \boldsymbol{\sigma}^{\mu,x}\}$ as:

$$\psi^{\mu,x}(t; \boldsymbol{\Theta}^{\mu,x}) = \sum_{j=1}^{B} \omega^{\mu,x}_{j}\, \tilde{b}(t; \theta_{j}^{\mu,x}, \sigma_{j}^{\mu,x}). \tag{4}$$

Because each Gaussian basis function is localized in time, our model ensures that deformations at adjacent moments remain continuous, while those across large time intervals are almost decoupled. Moreover, with its learnable parameters, this approach yields a deformation model that is not only computationally efficient but also capable of capturing intricate deformation dynamics.
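The deformation curve of Eqs. (3) and (4) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the array layout, the evenly spaced initial centers, the initial width of 0.1, and the treatment of the curve as an additive offset on the canonical position are all assumptions made for the example.

```python
import numpy as np

def gaussian_basis(t, theta, sigma):
    """Eq. (3): exp(-(t - theta)^2 / (2 sigma^2))."""
    return np.exp(-((t - theta) ** 2) / (2.0 * sigma ** 2))

def deformation(t, omega, theta, sigma):
    """Eq. (4): weighted sum of B learnable Gaussian bases.

    omega, theta, sigma have shape (..., B); the leading axes can index
    Gaussians and attribute channels (e.g. x, y, z of the position).
    """
    return np.sum(omega * gaussian_basis(t, theta, sigma), axis=-1)

# Illustrative setup: N Gaussians, 3 position channels, B = 17 bases.
rng = np.random.default_rng(0)
N, B = 4, 17
omega = 0.01 * rng.standard_normal((N, 3, B))
theta = np.tile(np.linspace(0.0, 1.0, B), (N, 3, 1))  # assumed initial centers
sigma = np.full((N, 3, B), 0.1)                        # assumed initial widths
mu_canonical = rng.standard_normal((N, 3))

# Assumption: the curve acts as an additive offset on the canonical position.
mu_t = mu_canonical + deformation(0.3, omega, theta, sigma)
```

Evaluating the curve needs only one exponential and one multiply-add per basis function, with no feature-plane interpolation or MLP decoding, which is consistent with the efficiency argument made above.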

2.3 Point Cloud Initialization

To further boost the reconstruction performance and stabilize the training, we introduce a Gaussian point cloud initialization before the deformation modeling. Specifically, we first employ the camera model and intrinsic matrix to extract the 3D tissue point cloud for each frame as:

$$\boldsymbol{P}_i = \boldsymbol{K}^{-1}\boldsymbol{D}_i(\boldsymbol{I}_i \odot \boldsymbol{M}_i), \tag{5}$$

where $\boldsymbol{P}_i$ and $\boldsymbol{I}_i$ respectively denote the 3D point cloud and the 2D pixel coordinates of the $i$-th frame, $\boldsymbol{D}_i$ and $\boldsymbol{M}_i$ denote the $i$-th depth map and valid foreground (i.e., tissue) mask, respectively, $\boldsymbol{K}$ is the intrinsic matrix, and $\odot$ denotes element-wise multiplication. By default, the first frame is selected to initialize the Gaussian point cloud as the canonical state, i.e., $\boldsymbol{P}_c = \boldsymbol{P}_0$. However, because of tool occlusions in the colored image, some pixels in $\boldsymbol{I}_0$ are filtered out by $\boldsymbol{M}_0$, resulting in voids and local sparsity in the initialized point cloud. This unevenly distributed initialization requires more time to densify the point cloud and leads to sub-optimal efficiency. Assuming that a dense Gaussian point distribution facilitates reconstruction of intensively deformed regions of the surgical scene, we develop a Motion-Aware Point Fusion (MAPF) scheme to selectively fuse points exhibiting intensive motions. Specifically, a motion-aware occlusion mask $\boldsymbol{F}$ is computed by combining occluded regions and pixels with large color differences from a pixel-wise-averaged image as follows:

$$\boldsymbol{F} = \mathbb{I}\left(\left|\boldsymbol{C}_0 - \textstyle\sum_{j}^{N}\boldsymbol{C}_j / N\right| > \tau\right) \cup (\boldsymbol{1} - \boldsymbol{M}_0), \tag{6}$$

where $\mathbb{I}(\cdot)$ refers to the indicator function, $\boldsymbol{C}_i$ represents the colored image of the $i$-th frame, $N$ is the total number of frames, and $\tau$ is the threshold determining pixels with significant motions. $\boldsymbol{F}$ masks regions with large motions and local sparsity of the initialized Gaussian points $\boldsymbol{P}_c$ (i.e., $\boldsymbol{P}_0$). Finally, the 3D points of $\boldsymbol{P}_i$ whose 2D projected pixels lie in mask $\boldsymbol{F}$ are fused with $\boldsymbol{P}_c$ to initialize the canonical state, as shown in Fig. 1.
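The initialization described above can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: the camera is static, so a 3D point of frame i projects back to its own pixel; the color difference is reduced over RGB with a maximum; and points are expressed in the camera frame. The function names are hypothetical, and the paper does not specify these details.

```python
import numpy as np

def backproject(depth, mask, K):
    """Eq. (5): lift masked pixels to 3D camera-frame points.

    depth: (H, W) depth map D_i; mask: (H, W) boolean tissue mask M_i;
    K: (3, 3) intrinsic matrix. Returns an (M, 3) array of points.
    """
    v, u = np.nonzero(mask)  # pixel rows (v) and columns (u) kept by the mask
    pix = np.stack([u, v, np.ones_like(u)], axis=0).astype(float)  # (3, M)
    return (np.linalg.inv(K) @ (pix * depth[v, u])).T

def motion_aware_mask(colors, mask0, tau):
    """Eq. (6): pixels that move strongly or are occluded in frame 0.

    colors: (N, H, W, 3) frames C_j; mask0: (H, W) boolean mask M_0.
    """
    diff = np.abs(colors[0] - colors.mean(axis=0))  # |C_0 - mean_j C_j|
    moving = diff.max(axis=-1) > tau                # assumption: max over RGB
    return moving | ~mask0                          # union with (1 - M_0)

def fuse_points(depths, masks, colors, K, tau):
    """Fuse P_0 with the points of later frames that fall inside F."""
    F = motion_aware_mask(colors, masks[0], tau)
    clouds = [backproject(depths[0], masks[0], K)]  # canonical P_c = P_0
    for D_i, M_i in zip(depths[1:], masks[1:]):
        # Static-camera assumption: pixel membership in F is tested directly.
        clouds.append(backproject(D_i, M_i & F, K))
    return np.concatenate(clouds, axis=0)
```

With a moving camera, the membership test in `fuse_points` would instead require projecting each 3D point into the first view using the extrinsic matrices.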

2.4 Optimization

Our proposed framework jointly optimizes the canonical Gaussians $\mathcal{G}_c$ and the deformation model given by $\boldsymbol{\Theta}$. Given the tissue mask $\boldsymbol{M}$, we train our deformable tissue reconstruction framework by supervising the rendered images and depths with ground-truth colored images and stereo depth maps as follows:

$$\mathcal{L}_C = \|\boldsymbol{M} \odot (\boldsymbol{\widehat{C}} - \boldsymbol{C})\|, \quad \mathcal{L}_D = \|\boldsymbol{M} \odot (\boldsymbol{\widehat{D}}^{-1} - \boldsymbol{D}^{-1})\|, \tag{7}$$

where $\boldsymbol{\widehat{C}}$, $\boldsymbol{\widehat{D}}$, $\boldsymbol{C}$, and $\boldsymbol{D}$ denote the rendered image, rendered depth, GT image, and stereo depth, respectively. The overall training loss is $\mathcal{L} = \mathcal{L}_C + \mathcal{L}_D$.
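A minimal NumPy sketch of the masked objective in Eq. (7) is given below. The text does not state which norm is used, so a mean absolute (L1) error over valid pixels is assumed here; the small `eps` guarding the depth inversion is also an addition of this example.

```python
import numpy as np

def masked_l1(pred, target, mask):
    """Mean absolute error restricted to pixels where mask is True."""
    m = mask.astype(float)
    if pred.ndim == 3:  # broadcast an (H, W) mask over the colour channels
        m = m[..., None]
    m = np.broadcast_to(m, pred.shape)
    return float(np.sum(m * np.abs(pred - target)) / max(m.sum(), 1.0))

def total_loss(C_hat, C, D_hat, D, M, eps=1e-6):
    """L = L_C + L_D, with the depth term computed on inverse depth.

    C_hat, C: (H, W, 3) rendered and GT images; D_hat, D: (H, W) rendered
    and stereo depth; M: (H, W) boolean tissue mask.
    """
    loss_c = masked_l1(C_hat, C, M)
    loss_d = masked_l1(1.0 / (D_hat + eps), 1.0 / (D + eps), M)  # eps avoids 1/0
    return loss_c + loss_d
```

Comparing inverse depths weights errors on nearby tissue more heavily than errors on distant background, and the mask keeps instrument pixels from contributing to either term.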

3 Experiment

3.1 Experiment Setting

Datasets and Evaluation. We evaluate the proposed method and compare it with existing works on two datasets: 1) The EndoNeRF dataset [15] is a collection of stereo endoscopic videos including 6 clips extracted from Da Vinci robotic prostatectomy data. Each clip is captured from a single camera viewpoint with complex surgical instrument occlusion and tissue deformations. 2) The StereoMIS dataset [3] is a stereo endoscopic video dataset captured from in-vivo porcine subjects, containing diverse anatomical structures and challenging scenes with large tissue deformations. Specifically, we use all 6 scenes of EndoNeRF and select 3 clips from videos P2_7 and P3 in the StereoMIS dataset, which has more diverse anatomical structures than the EndoNeRF dataset. Each selected clip lasts for $4 \sim 5$ s at 30 fps. Following [20], we split the frames of each scene into training and testing sets with a ratio of 7:1. We use PSNR, SSIM, and LPIPS to evaluate the reconstruction performance. In addition, training time and rendering speed are measured to evaluate efficiency.
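For reference, PSNR and a 7:1 frame split can be computed as in the sketch below. The images are assumed to be scaled to [0, 1], and holding out every eighth frame is only one way to realize the 7:1 ratio; the text does not specify which frames are held out, so `split_frames` is a hypothetical helper.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

def split_frames(num_frames, period=8):
    """Hold out every `period`-th frame for testing (7:1 when period = 8)."""
    test = [i for i in range(num_frames) if i % period == period - 1]
    train = [i for i in range(num_frames) if i % period != period - 1]
    return train, test
```

Because PSNR is logarithmic in the mean squared error, a gain of about 3 dB corresponds to halving the error.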

Implementation Details. For each scene, we normalize the video duration to $[0, 1]$ and empirically apply 17 learnable Gaussian basis functions to compose FDM. The training lasts for 3000 iterations, with an initial learning rate of $1.6 \times 10^{-3}$. To stabilize the training, we disable densification of the Gaussian points during the first 600 iterations. All the experiments are based on the PyTorch framework [4] and conducted with a single NVIDIA RTX A5000 GPU.

3.2 Comparison with State-of-the-art Methods

| Dataset | Method | PSNR ↑ | SSIM (%) ↑ | LPIPS ↓ | Time (sec) ↓ | Speed (fps) ↑ |
|---|---|---|---|---|---|---|
| EndoNeRF | EndoNeRF | 35.55 | 93.02 | 0.09 | ∼21600 | 0.03 |
| EndoNeRF | LerPlane | 36.56 | 94.36 | 0.07 | ∼600 | 1.45 |
| EndoNeRF | EndoGaussian | 37.66 | 95.89 | 0.06 | 138 | 128.13 |
| EndoNeRF | Ours | 37.90 | 95.84 | 0.06 | 64 | 338.80 |
| StereoMIS | EndoNeRF | 28.86 | 74.15 | 0.27 | ∼21600 | 0.03 |
| StereoMIS | LerPlane | 29.46 | 77.73 | 0.20 | ∼600 | 1.52 |
| StereoMIS | EndoGaussian | 30.25 | 82.75 | 0.21 | 151 | 134.50 |
| StereoMIS | Ours | 30.48 | 82.74 | 0.21 | 66 | 330.37 |

Table 1: Quantitative evaluation of our proposed framework against existing methods on endoscopic scene reconstruction. ‘Time’ and ‘Speed’ denote the training time and rendering speed (fps), respectively. The optimal and suboptimal results are shown in bold and underlined respectively.

| Method | PSNR ↑ | SSIM (%) ↑ | Time (min) ↓ |
|---|---|---|---|
| LerPlane | 22.75 | 82.13 | ∼1 |
| EndoGaussian | 35.81 | 94.67 | ∼1 |
| Ours | 37.90 | 95.84 | ∼1 |

Table 2: Quantitative comparison with SOTAs given the limited training time around 1 min.

| Method | PSNR ↑ | SSIM (%) ↑ | Time (sec) ↓ |
|---|---|---|---|
| Ours-HexPlane | 37.13 | 95.51 | 108 |
| Ours-FPS | 37.08 | 95.29 | 60 |
| w/o MAPF | 37.47 | 95.62 | 61 |
| Ours | 37.90 | 95.84 | 64 |

Table 3: Quantitative analysis of the key components on EndoNeRF dataset. ‘Time’ denotes the training time.

Figure 2: Visualization of the 3D reconstruction results.

We evaluate our proposed framework by comparing its performance with EndoNeRF [15] and two existing SOTA methods for fast reconstruction: LerPlane [18] and EndoGaussian [10]. EndoGaussian is a concurrent work using GS to accelerate endoscopic reconstruction. As listed in Table 1, although EndoNeRF reconstructs deformable tissues effectively, it needs hours of training to reconstruct a scene that lasts only seconds, which compromises its intraoperative usability. In comparison, the fast reconstruction method LerPlane accelerates the training phase to the minute level and achieves higher reconstruction quality than EndoNeRF. Our framework yields noticeable performance gains over LerPlane across all evaluation metrics on reconstruction quality, while improving training efficiency by roughly 10 times, to around 1 minute. Furthermore, we compare our method against EndoGaussian [10]. Similar to [16], this method relies on decomposed feature planes to model dynamic Gaussian Splatting, presenting remarkable performance against NeRF-based methods and reaching reconstruction quality comparable to that of our framework. However, benefiting from the highly efficient FDM, our method achieves over 2× acceleration in both training (138 sec → 64 sec) and rendering (128 fps → 338 fps). Note that our measured training and rendering speeds for EndoGaussian differ from the values reported in [10] because different hardware was used.
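The speed-up factors quoted above follow directly from the EndoNeRF rows of Table 1. The short script below redoes the arithmetic; the dictionary layout and the `speedup` helper are only for illustration, and the LerPlane training time is the approximate value (∼600 sec) listed in the table.

```python
# Values copied from Table 1 (EndoNeRF dataset).
lerplane = {"train_sec": 600.0, "fps": 1.45}
endogaussian = {"train_sec": 138.0, "fps": 128.13}
ours = {"train_sec": 64.0, "fps": 338.80}

def speedup(baseline, ours_value, higher_is_better):
    """Return how many times better `ours_value` is than `baseline`."""
    return ours_value / baseline if higher_is_better else baseline / ours_value

train_vs_endogaussian = speedup(endogaussian["train_sec"], ours["train_sec"], False)
render_vs_endogaussian = speedup(endogaussian["fps"], ours["fps"], True)
train_vs_lerplane = speedup(lerplane["train_sec"], ours["train_sec"], False)

print(f"training vs EndoGaussian:  {train_vs_endogaussian:.2f}x")   # 138 / 64
print(f"rendering vs EndoGaussian: {render_vs_endogaussian:.2f}x")  # 338.80 / 128.13
print(f"training vs LerPlane:      {train_vs_lerplane:.1f}x")       # 600 / 64
```

The ratios come out at about 2.16× and 2.64× against EndoGaussian and about 9.4× against LerPlane, matching the "over 2×" and "roughly 10 times" statements.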

Although the fully trained models reach comparable reconstruction quality, Table 2 shows that our proposed method performs significantly better when the training time is limited to around 1 minute, which demonstrates its advantage in intraoperative scenarios.

We also visualize several rendered scenes in Fig. 2 for qualitative evaluation. It can be observed that our method has an enhanced capability of preserving appearance details and modeling complex tissue motions. Additionally, the views rendered by EndoGaussian show rendering quality comparable to ours, without a visually perceivable difference. According to these results, our proposed method achieves state-of-the-art (SOTA) performance on endoscopic scene reconstruction, with particularly notable progress in fast training and real-time rendering, which indicates great clinical value in intraoperative applications.

3.3 Quantitative Evaluation of Key Components

We first investigate the effectiveness of the proposed FDM by comparing it with existing deformation modeling techniques on the EndoNeRF dataset. With the identical workflow shown in Fig. 1, we replace the FDM with other modeling methods, including a combination of Fourier and Polynomial series (Ours-FPS) and a HexPlane-based decomposed feature plane [2] following [16] (Ours-HexPlane). As illustrated in Table 3, despite the acceleration, Ours-FPS shows limited capability of representing complex deformations. On the other hand, using HexPlane to encode spatial and temporal information enhances the representation capability but leads to a relatively longer training time. Our proposed FDM achieves the best deformation modeling performance without impairing time efficiency. Furthermore, ‘w/o MAPF’ refers to initializing the Gaussian points with only the first-frame point cloud, which exhibits a performance drop and thus demonstrates the important role of the proposed MAPF scheme.

4 Conclusion

In this paper, we work toward intraoperative surgical scene reconstruction by proposing a fast and accurate deformable scene reconstruction framework. By utilizing Gaussian Splatting, our framework achieves high-quality rendering at a real-time level. To further accelerate tissue motion modeling, we introduce an efficient flexible deformation modeling scheme composed of learnable Gaussian basis functions, which maintains a strong motion-representation capability. In addition, combined with a motion-aware point fusion scheme for initialization, our framework reaches SOTA reconstruction quality while reducing the training time to only 1 min/scene, showing the possibility of reconstructing longer sequences with more challenging surgical scenes. Therefore, we believe that our work achieves significant progress in bridging the gap between high-quality rendering and intraoperative applications.

Acknowledgement

This work was supported by Ministry of Education Tier 1 Start up grant, NUS, Singapore (A-8001267-01-00); Ministry of Education Tier 1 grant, NUS, Singapore (A-8001946-00-00).

References

  • [1] Boedecker, C., Huettl, F., Saalfeld, P., Paschold, M., Kneist, W., Baumgart, J., Preim, B., Hansen, C., Lang, H., Huber, T.: Using virtual 3d-models in surgical planning: workflow of an immersive virtual reality application in liver surgery. Langenbeck’s archives of surgery 406, 911–915 (2021)
  • [2] Cao, A., Johnson, J.: Hexplane: A fast representation for dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 130–141 (2023)
  • [3] Hayoz, M., Hahne, C., Gallardo, M., Candinas, D., Kurmann, T., Allan, M., Sznitman, R.: Learning how to robustly estimate camera pose in endoscopic videos. International journal of computer assisted radiology and surgery 18(7), 1185–1192 (2023)
  • [4] Imambi, S., Prakash, K.B., Kanagachidambaresan, G.: Pytorch. Programming with TensorFlow: Solution for Edge Computing Applications pp. 87–104 (2021)
  • [5] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4) (2023)
  • [6] Kratimenos, A., Lei, J., Daniilidis, K.: Dynmf: Neural motion factorization for real-time dynamic view synthesis with 3d gaussian splatting. arXiv preprint arXiv:2312.00112 (2023)
  • [7] Li, Z., Chen, Z., Li, Z., Xu, Y.: Spacetime gaussian feature splatting for real-time dynamic view synthesis. arXiv preprint arXiv:2312.16812 (2023)
  • [8] Lin, Y., Dai, Z., Zhu, S., Yao, Y.: Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle. arXiv preprint arXiv:2312.03431 (2023)
  • [9] Liu, X., Stiber, M., Huang, J., Ishii, M., Hager, G.D., Taylor, R.H., Unberath, M.: Reconstructing sinus anatomy from endoscopic video–towards a radiation-free approach for quantitative longitudinal assessment. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23. pp. 3–13. Springer (2020)
  • [10] Liu, Y., Li, C., Yang, C., Yixuan, Y.: Endogaussian: Gaussian splatting for deformable surgical scene reconstruction. Arxiv (2024)
  • [11] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)
  • [12] Pelanis, E., Teatini, A., Eigl, B., Regensburger, A., Alzaga, A., Kumar, R.P., Rudolph, T., Aghayan, D.L., Riediger, C., Kvarnström, N., et al.: Evaluation of a novel navigation platform for laparoscopic liver surgery with organ deformation compensation using injected fiducials. Medical image analysis 69, 101946 (2021)
  • [13] Rodby, K.A., Turin, S., Jacobs, R.J., Cruz, J.F., Hassid, V.J., Kolokythas, A., Antony, A.K.: Advances in oncologic head and neck reconstruction: systematic review and future considerations of virtual surgical planning and computer aided design/computer aided modeling. Journal of Plastic, Reconstructive & Aesthetic Surgery 67(9), 1171–1185 (2014)
  • [14] Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4104–4113 (2016)
  • [15] Wang, Y., Long, Y., Fan, S.H., Dou, Q.: Neural rendering for stereo 3d reconstruction of deformable tissues in robotic surgery. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 431–441. Springer (2022)
  • [16] Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528 (2023)
  • [17] Wu, T.Y., Meng, Q., Yang, L., Kumari, S., Pirouz, M.: Amassing the security: An enhanced authentication and key agreement protocol for remote surgery in healthcare environment. CMES-Computer Modeling in Engineering & Sciences 134(1) (2023)
  • [18] Yang, C., Wang, K., Wang, Y., Yang, X., Shen, W.: Neural lerplane representations for fast 4d reconstruction of deformable tissues. arXiv preprint arXiv:2305.19906 (2023)
  • [19] Yang, Z., Gao, X., Zhou, W., Jiao, S., Zhang, Y., Jin, X.: Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101 (2023)
  • [20] Zha, R., Cheng, X., Li, H., Harandi, M., Ge, Z.: Endosurf: Neural surface reconstruction of deformable tissues with stereo endoscope videos. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 13–23. Springer (2023)