
IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 43, NO. 5, MAY 2024

Self-Supervised Lightweight Depth Estimation in Endoscopy Combining CNN and Transformer

Zhuoyue Yang, Junjun Pan, Ju Dai, Zhen Sun, and Yi Xiao

Manuscript received 1 August 2023; revised 6 November 2023; accepted 4 January 2024. Date of publication 10 January 2024; date of current version 2 May 2024. This work was supported in part by the National Key Research and Development Program of China under Grant 2022ZD0115902; and in part by the National Natural Science Foundation of China under Grant U20A20195, Grant 62272017, Grant 62172437, and Grant 62102208. (Corresponding authors: Junjun Pan; Ju Dai.)

Zhuoyue Yang and Junjun Pan are with the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Haidian, Beijing 100191, China, and also with the Peng Cheng Laboratory, Nanshan, Shenzhen 518000, China (e-mail: [email protected]; [email protected]).

Ju Dai is with the Peng Cheng Laboratory, Nanshan, Shenzhen 518000, China (e-mail: [email protected]).

Zhen Sun and Yi Xiao are with the Division of Colorectal Surgery, the Department of General Surgery, the Chinese Academy of Medical Sciences, and the Peking Union Medical College, Peking Union Medical College Hospital, Dongcheng, Beijing 100730, China (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TMI.2024.3352390

Abstract — In recent years, an increasing number of medical engineering tasks, such as surgical navigation, pre-operative registration, and surgical robotics, rely on 3D reconstruction techniques. Self-supervised depth estimation has attracted interest in endoscopic scenarios because it does not require ground truth. Most existing methods depend on expanding the size of parameters to improve their performance. Therefore, designing a lightweight self-supervised model that can obtain competitive results is a hot topic. We propose a lightweight network with a tight coupling of convolutional neural network (CNN) and Transformer for depth estimation. Unlike other methods that use CNN and Transformer to extract features separately and then fuse them at the deepest layer, we utilize CNN and Transformer modules to extract features at different scales in the encoder. This hierarchical structure leverages the advantages of CNN in texture perception and Transformer in shape extraction. At the same scale of feature extraction, the CNN acquires local features while the Transformer encodes global information. Finally, we add multi-head attention modules to the pose network to improve the accuracy of predicted poses. Experiments demonstrate that our approach obtains comparable results while effectively compressing the model parameters on two datasets.

Index Terms — Depth and ego-motion estimation, endoscopy, lightweight architecture, self-supervised learning, transformer and CNN.

I. INTRODUCTION

ENDOSCOPIC minimally invasive surgery has been widely used in recent years because it causes less bleeding and allows shorter recovery time than open surgery. However, due to the narrow field of view and lack of depth perception, endoscopic surgeries place stringent demands on the experience and skills of the surgeon. Nowadays, with the rapid development of VR/AR technology, an increasing number of researchers are choosing AR-based surgical navigation to address these difficulties [1], [2], [3]. These AR systems superimpose preoperative data with intraoperative endoscopic data through registration techniques [4], [5]. The accuracy of video-CT registration algorithms primarily relies on the quality of intraoperative reconstructions from endoscopic videos [6]. In addition, there are many tasks, such as surgical robots [7], medical image segmentation [8], surgery planning assistance [9], and surgical instrument recognition [10], that can benefit from the results of depth estimation.

Previous methods for depth estimation from image sequences are based on multi-view geometry principles, such as structure from motion (SfM) [11] and simultaneous localization and mapping (SLAM) [12]. Although depth estimation tasks have been developed in natural scenes for many years, this problem is more difficult in endoscopic scenes due to inconsistent lighting, sparse texture features, and soft tissues with non-Lambertian reflection characteristics. Geometry-based methods [13] rely heavily on feature extraction and matching. The smooth and repetitive soft tissue texture usually results in sparse features and wrong feature matching. Thus, traditional methods still fall short of desirable performance.

Deep learning-based methods for depth estimation [14], segmentation [8], and detection [15] in harsh natural environments have rapidly developed due to the publication of large datasets. However, it is very difficult to obtain large amounts of data with ground truth in endoscopic scenes. Unsupervised learning methods that only use visual images have therefore gained increasing attention in recent years. Researchers have tried to relieve these limitations for endoscopy images by utilizing self-supervised training strategies [6], [16], [17], [18]. Although many self-supervised methods have emerged, the depth networks of most of the works are similar and based on convolution layers. Some works design more complex and heavy networks to achieve better results.

For navigation applications, depth estimation networks not only need to ensure accuracy but also need to integrate with other modules such as registration. An effective and lightweight network structure is therefore an important topic. Currently, there are many advanced works analyzing existing network architectures and making interesting discoveries.


For example, the receptive field of a convolution operation is limited, while a Transformer [19] can model global information. The latest work has found that the most effective part of the Transformer is the entire framework rather than the multi-head attention (MHA) operations [20]. CNNs exhibit a strong texture bias, while Transformers exhibit a strong shape bias [21]. Based on the above findings, we propose a lightweight self-supervised depth estimation network for endoscopic images, which combines the advantages of CNN and Transformer at a fine-grained level.

Our contributions are summarized as follows:
• For the first time, we apply a lightweight network to endoscopic scenes. We present a novel hybrid architecture with an efficient combination of CNN and Transformer at different scales. In order to extract global and shape-aware features, we insert Transformer layers into CNN layers which are sensitive to local textures.
• We propose a pose network with several multi-head attention modules. Attention modules are added at different locations in order to find a solution with better generalization. We perform experiments on several long sequences to verify the performance improvement of the methods.
• Extensive experiments have demonstrated the effectiveness of our proposed method, which compresses the number of model parameters without a significant loss of accuracy. Qualitative experiments demonstrate that our method achieves comparable results with current state-of-the-art methods on the SCARED and clinical datasets.

II. RELATED WORK

In this section, we review the unsupervised depth estimation methods applied in endoscopic scenes, as well as the state-of-the-art (SOTA) network frameworks combining CNN (convolutional neural network) and Transformer applied in natural scenes.

A. Self-Supervised Learning

Depth estimation methods in natural scenes have been studied for several years and typically leverage real depth values as supervised signals to model the problem as a regression or classification problem. However, true depth values are difficult to obtain in an endoscopic environment. Only after unsupervised methods became widely used [22] were deep learning methods formally applied to endoscopic depth estimation tasks. Zhou et al. [17] propose an unsupervised training method using only monocular video sequences. The method uses the computed depth and poses as mediators and warps nearby views to the target view as supervised information. Godard et al. [14] leverage binocular videos instead of depth ground truth to train a fully convolutional network. The first article [16] applies unsupervised depth estimation to endoscopy. The authors use a fully convolutional depth estimation approach with a similar structure to the method in [17]. Godard et al. [23] propose Monodepth2 on the basis of the network framework in [14]. The predictor behind the decoder in the depth estimation network and the decoder in the pose estimation network are deleted. Most researchers find that the structure in Monodepth2 [23], including a depth network and a separate pose estimation network, could achieve better performance. Following [16] and [23], this structure became the baseline for subsequent methods, and unsupervised depth estimation is regarded as an image reconstruction problem at present. To deal with edge conditions, such as object motion and occlusion, predictive interpretable masks are used. Liu et al. [24] propose a self-supervised method to train convolutional neural networks for dense depth estimation from monocular endoscopic data. Supervised signals are derived from the poses and sparse point clouds produced by structure from motion. Recasens et al. [25] leverage Monodepth2 [23] to train an endoscopic depth estimation network to obtain the depth corresponding to each image. Ozyoruk et al. [18] put forward EndoSfMLearner, which is an unsupervised monocular depth and pose estimation method. This method combines residual networks and a spatial attention module to focus on highly textured tissue areas. Li et al. [26] add an LSTM module to the pose estimation network to model temporal information, thus improving the accuracy of pose estimation. Shao et al. [6] jointly use optical flow and appearance flow to deal with the brightness inconsistency problem. Zhang et al. [27] propose a network that shares an encoder and contains two branches in the decoder. The two branches estimate the depth information and normal information, respectively. Currently, most of the self-supervised deep networks applied to endoscopic images are convolutional neural networks. Most researchers [28], [29] focus on increasing model complexity and parameters to improve the performance of the network.

B. Network Architectures

With the development of the technology, the Transformer shows great potential for depth estimation tasks in natural scenes. Varma et al. [30] first evaluate the impact of the transformer on self-supervised monocular depth estimation. DPT [31] directly uses the Transformer as the encoder, and then fuses the results of each layer of the Transformer separately to generate depth estimation results. AdaBins [32] uses a ViT after general encoders and decoders, and then adaptively divides depth values based on the dynamic changes of the scene. TransDepth [33] also adds Transformer blocks to the ResNet [34] results to obtain long-distance information, then uses a decoder based on attention and gates to fuse features, and finally performs depth estimation through a prediction head. The vision transformer adapter [35] designs an adapter that runs in parallel with ViT, incorporating prior knowledge of images into the ViT backbone to provide reconstructed multi-scale features for dense depth estimation problems, preserving the flexibility of ViT and improving performance. DepthFormer [36] performs ViT and convolution operations separately in the encoder stage and designs a layered aggregation and interaction module to combine the two parts. To summarize, some researchers build independent Transformer-based encoders to obtain feature maps or add several modules to fuse features from a CNN. MonoViT [37] is the current state-of-the-art work in natural scene depth estimation tasks. The encoder of MonoViT [37] is constructed by stacking several MPViT [38] blocks, and the decoder is from HR-Depth [39].


Fig. 1. Overview of the proposed method. Our method includes a depth network (DepthNet) and a pose network (PoseNet). Our DepthNet
consists of an encoder with a combination of CNN and Transformer and a decoder. Our PoseNet is enhanced by the multi-head attention modules.

Each layer of MPViT has three transformer heads and a convolution head. MonoFormer [21] still relies on ViT [40], mainly by proposing an attention connection module and a feature fusion decoder. Zhang et al. [41] propose a dilated convolutional module to extract rich multi-scale local features and a self-attention-based feature interaction module to encode remote global information into features. Yu et al. [20] prove that the general architecture of Transformers, instead of the specific token mixer module, is more essential to the model's performance. CMT [42] inserts Transformer structures between different convolutional layers of a CNN. Its ablation experiments have shown that the widely used stage-wise design in CNNs is a better choice for promoting Transformer-based architectures. In summary, the integration of CNN and Transformer in architectures has evolved from coarse-grained stacking to fine-grained information exchange. The difference between our method and the above methods is that we stack the CNN layers and the Transformer layers alternately. We utilize this hybrid structure to obtain local and global features while also using textures and contours.

III. METHOD

A. Overall Architecture

The framework includes a depth estimation network (DepthNet), a pose estimation network (PoseNet), and a brightness calibration network, as shown in Fig. 1. Endoscopic images are segmented into groups of three in chronological order. The DepthNet estimates the multi-scale depth map of a single endoscopy image, while the PoseNet estimates the camera motion between adjacent images. We combine convolutional layers with transformer structures to build a hybrid DepthNet. We use the brightness calibration module proposed in [6] to compensate for lighting changes caused by endoscope movement. Then, according to the predicted camera poses and camera intrinsic parameters, the estimated depth is re-projected back to the two-dimensional plane, and the model is supervised and optimized by calculating the loss between the reconstructed image and the target image. The details of DepthNet and PoseNet are described below, and the utilized loss functions are then listed.

B. DepthNet

Following [14] and [23], we design our method as an encoder-decoder architecture. CNNs have better performance in extracting local textures, and Transformers are sensitive to global information and contours [21]. We present a novel hybrid encoder that is able to focus on both texture and contour features. The first and third layers are stacked with multiple layers of CNN modules, and several Transformer blocks are placed in sequence in the middle layer. Multi-scale features from the encoder are connected into a concise decoder.

1) Depth Encoder: The input image is first passed through a convolution stem containing three 3 × 3 convolutions. The first convolution has a stride of 2 and the next two have a stride of 1. The output channel is C1, and the size of the output feature map is H/2 × W/2. In the following stages, CNN-based layers and Transformer-based layers are alternately stacked. First, several symbols are defined to describe the input and output of each stage. We use Fi to represent the feature map output from the i-th layer. The image that has been pooled in the i-th layer is labeled as Ii. The feature obtained through the downsampling module of each layer is Di.
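To make the stem concrete, the following is a minimal PyTorch sketch of a convolution stem as described above (three 3 × 3 convolutions, the first with stride 2), assuming C1 = 48 as given later in the implementation details; the normalization and activation placed between the convolutions are our assumption, not a detail stated in the paper.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Three 3x3 convolutions: stride 2, then stride 1 twice (H x W -> H/2 x W/2)."""
    def __init__(self, in_ch=3, out_ch=48):  # out_ch plays the role of C1
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
        )

    def forward(self, x):
        return self.stem(x)

# Example: a 256x320 endoscopic frame becomes a 48-channel map of size 128x160.
feat = ConvStem()(torch.randn(1, 3, 256, 320))
print(feat.shape)  # torch.Size([1, 48, 128, 160])
```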


Fig. 2. CNN and Transformer blocks that are adopted in the depth encoder of DepthNet. (a) is the structure of the CNN block. (b) shows the architecture of the Transformer blocks. To distinguish between two different Transformer blocks, we name them based on the different operations used in the framework.

Fig. 3. Transformer-based layers that are adopted in the depth encoder of DepthNet. Transformer blocks using different operations are distinguished by different shades of yellow. The specific structure of each type of block is shown in Fig. 2.

Following [21] and [41], Fi−1, Di−1, and Ii are concatenated together and fed into the i-th layer. Inspired by [41], in the CNN-based layer, local and long-range features are extracted by stacking several dilated convolution blocks and a Transformer block. In the second layer, a Transformer-based architecture is adopted to enhance shape information, resulting in a depth feature of size H/8 × W/8 × C2. Then, the aggregated features are fed into dilated convolution modules, and depth features of size H/16 × W/16 × C3 are generated by the Transformer block.

The encoder consists of three layers, each of which is composed of multiple stacked blocks. We first introduce the basic blocks used in the depth encoder, and then explain the structures of the CNN-based layer and the Transformer-based layer. The dilated convolution and Transformer [20], [41] blocks are shown in Fig. 2. We first define the symbols used in this paper for convenience. X denotes the input features, and X̂ represents the output of the dilated convolution module. BN is batch normalization and LN is layer normalization. MLP is the abbreviation of a multi-layer perceptron. As shown in Fig. 2(a), X̂ is defined as follows:

$\hat{X} = X + \mathrm{MLP}(\mathrm{BN}(\mathrm{DConv}(X))), \quad (1)$

where DConv is the depth-wise dilated convolution operation with the dilation rate. We replace the MLP with an activation function (GELU) [43] in some blocks to reduce model parameters.

There are two types of Transformer blocks, as shown in Fig. 2(b). Ŷ is the output of the Transformer module and can be computed as follows:

$\tilde{X} = \mathrm{MixToken}(\mathrm{LN}(X)) + X,$
$\hat{Y} = \tilde{X} + \mathrm{MLP}(\mathrm{LN}(\tilde{X})). \quad (2)$

Pooling [20] and cross-covariance attention [44] operations are utilized as MixToken. As shown in Fig. 2(b), for the output of the cross-covariance attention block, $\hat{Y} = X + \mathrm{MLP}(\mathrm{LN}(\tilde{X}))$.
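As an illustration of Eqs. (1) and (2), here is a hedged PyTorch sketch of the two basic blocks, a dilated convolution block and a pooling-based Transformer block; the channel width, MLP ratio, and dilation rate are placeholders, and the official PoolFormer token mixer additionally subtracts the identity, whereas this sketch follows Eq. (2) literally.

```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """Eq. (1): X_hat = X + MLP(BN(DConv(X))), with a depth-wise dilated convolution."""
    def __init__(self, dim, dilation=2, mlp_ratio=4):
        super().__init__()
        self.dconv = nn.Conv2d(dim, dim, 3, padding=dilation, dilation=dilation, groups=dim)
        self.bn = nn.BatchNorm2d(dim)
        self.mlp = nn.Sequential(nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
                                 nn.Conv2d(dim * mlp_ratio, dim, 1))

    def forward(self, x):
        return x + self.mlp(self.bn(self.dconv(x)))

class PoolingTransformerBlock(nn.Module):
    """Eq. (2) with average pooling as the MixToken operation (MetaFormer-style [20])."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)   # channel-wise LayerNorm substitute for 2D maps
        self.mix = nn.AvgPool2d(3, stride=1, padding=1)
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
                                 nn.Conv2d(dim * mlp_ratio, dim, 1))

    def forward(self, x):
        x = self.mix(self.norm1(x)) + x      # X_tilde = MixToken(LN(X)) + X
        return x + self.mlp(self.norm2(x))   # Y_hat = X_tilde + MLP(LN(X_tilde))

x = torch.randn(1, 80, 32, 40)               # e.g. a C2 = 80 feature map
y = PoolingTransformerBlock(80)(DilatedConvBlock(80)(x))
print(y.shape)  # torch.Size([1, 80, 32, 40])
```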
The Transformer-based layer is illustrated in Fig. 3. The input of the Transformer layer is the concatenation of Fi−1, Di−1, and Ii. We first perform a convolution to reduce the dimensionality of the input. Then, two Transformer blocks with a pooling operation and one Transformer block with an attention block are used to extract shape-aware and long-range features (F). Subsequently, we concatenate F, Di, and Ii+1 together and feed them into the three stacked Transformer blocks again. The CNN-based layer consists of several convolution blocks and one Transformer block, as shown in Fig. 1. The number of CNN blocks in the third layer is twice the number of CNN blocks in the first layer.

2) Depth Decoder: Our decoder adopts the concise and effective U-Net [45] structure, as in [23]. Convolution layers and skip connections are employed in the decoder to receive multi-scale features from the encoder. Then, cross-layer connections and upsampling are used to increase the resolution. Finally, three prediction heads output inverse depth maps at different resolutions, according to the aggregated features. Each prediction head consists of a convolution layer, a bilinear upsample, and a sigmoid layer. All predicted multi-scale depth maps participate in the self-supervised learning optimization.
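A minimal sketch of one prediction head as described above (convolution, bilinear upsampling, sigmoid), producing an inverse depth map in (0, 1); the kernel size and the upsampling factor are assumptions, not values stated in the paper.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """One decoder prediction head: 3x3 conv -> bilinear upsample -> sigmoid,
    yielding a single-channel inverse depth map at one output scale."""
    def __init__(self, in_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):
        return torch.sigmoid(self.up(self.conv(x)))
```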
C. PoseNet

Most networks [46] utilize a pose estimation network similar to Monodepth2 [23], which takes two adjacent color images as input and outputs the 6-DoF relative pose between them. PoseNet uses a pre-trained ResNet [34], i.e., a structure with four superimposed convolutional layers, as an encoder. Considering the influence of light in the medical scene, we add multi-head attention modules [19] into the above architecture to improve the performance of the pose estimation network, as shown in Fig. 4.

Fig. 4. PoseNet with multi-head attention.

Two adjacent images (H × W × 3) are first fed into a convolution stem to obtain a feature map F of size H/2 × W/2. After passing through the maximum pooling layer, the output (F̃) of multi-head attention can be defined as:

$\tilde{F} = \mathrm{MultiHeadAtten}(Q, K, V) + F, \quad (3)$


where MultiHeadAtten(Q, K, V) is the concatenated output of k self-attention operations, each of which is applied as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V, \quad (4)$

where Q, K, and V are projected from F̃ and d is the dimension of the input. Then, feature extraction is performed through two superimposed ResNet [34] blocks to obtain feature maps with scales of H/4 × W/4 and H/8 × W/8, respectively. In addition, the extracted feature map passes through the multi-head attention layer again. Finally, the last two feature maps are obtained through two basic blocks. The feature maps are converted into pose matrices through convolutions.
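The residual multi-head attention of Eq. (3), applied to a CNN feature map, can be sketched as follows; the number of heads and the token layout are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class ResidualMHA(nn.Module):
    """Eq. (3): F_tilde = MultiHeadAtten(Q, K, V) + F, where Q, K, V are projected
    from the flattened feature map and attention follows Eq. (4)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, f):                       # f: (B, C, H, W)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)   # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return (out + tokens).transpose(1, 2).reshape(b, c, h, w)

f = torch.randn(1, 64, 16, 20)
print(ResidualMHA(64)(f).shape)  # torch.Size([1, 64, 16, 20])
```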
D. Self-Supervised Learning

Like other unsupervised learning methods, we also transform the task into 2D image reconstruction and supervise the consistency and accuracy of depth estimation by minimizing the dissimilarity between the re-projected image and the target image. The image reconstruction loss consists of the photometric loss (L_p) and the edge-aware loss (L_e). We define the source image as I†. Utilizing the pose estimation T and the intrinsic parameters of the camera P, the reconstructed image (Ĩ) can be re-projected (π) from the depth estimation D and I†. The reconstructed image (Ĩ) is defined as follows:

$\tilde{I} = \pi(I^{\dagger}, T, D, P). \quad (5)$

Due to inconsistent lighting in the endoscopic environment, the photometric loss is inaccurate. We apply a pre-trained optical flow network to calibrate the rotation and translation changes between two input images and use a pre-trained appearance flow network that produces C to compensate for the illumination. The modified image (Î) resulting from the target image I is as follows:

$\hat{I} = I + C. \quad (6)$

The image similarity (F) between the modified image (Î) and the reconstructed image (Ĩ) is defined as follows:

$F = \alpha \cdot \frac{1 - \mathrm{SSIM}(\hat{I}, \tilde{I})}{2} + (1 - \alpha) \cdot \left| \hat{I} - \tilde{I} \right|, \quad (7)$

where SSIM is the structural similarity index [47] and α = 0.85. The photometric loss L_p is the minimum value of F among two adjacent images with the visibility mask [6], [23]. In order to maintain the edges, an edge-aware loss is also used. As in previous works [17] and [23], the edge-aware loss is defined as:

$L_{e} = |\partial_{x} d|\, e^{-|\partial_{x} I|} + |\partial_{y} d|\, e^{-|\partial_{y} I|}, \quad (8)$

where d represents the mean-normalized inverse depth of I.
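A hedged sketch of the loss terms in Eqs. (7) and (8), following a Monodepth2-style implementation [23]; the reprojection π of Eq. (5), the appearance flow calibration of Eq. (6), and the per-pixel minimum with the visibility mask are omitted here, and the 3 × 3 SSIM window is an assumption.

```python
import torch
import torch.nn as nn

class DSSIM(nn.Module):
    """Returns (1 - SSIM)/2 per pixel, computed with 3x3 average pooling."""
    def __init__(self):
        super().__init__()
        self.pool, self.pad = nn.AvgPool2d(3, 1), nn.ReflectionPad2d(1)
        self.c1, self.c2 = 0.01 ** 2, 0.03 ** 2

    def forward(self, x, y):
        x, y = self.pad(x), self.pad(y)
        mu_x, mu_y = self.pool(x), self.pool(y)
        sig_x = self.pool(x * x) - mu_x ** 2
        sig_y = self.pool(y * y) - mu_y ** 2
        sig_xy = self.pool(x * y) - mu_x * mu_y
        ssim_n = (2 * mu_x * mu_y + self.c1) * (2 * sig_xy + self.c2)
        ssim_d = (mu_x ** 2 + mu_y ** 2 + self.c1) * (sig_x + sig_y + self.c2)
        return torch.clamp((1 - ssim_n / ssim_d) / 2, 0, 1)

def photometric_term(i_hat, i_rec, alpha=0.85):
    """Eq. (7): alpha * (1 - SSIM)/2 + (1 - alpha) * |I_hat - I_rec| (per pixel)."""
    dssim = DSSIM()(i_hat, i_rec).mean(1, keepdim=True)
    l1 = (i_hat - i_rec).abs().mean(1, keepdim=True)
    return alpha * dssim + (1 - alpha) * l1

def edge_aware_loss(disp, img):
    """Eq. (8): edge-aware smoothness on the mean-normalized inverse depth d."""
    d = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx_d = (d[:, :, :, :-1] - d[:, :, :, 1:]).abs()
    dy_d = (d[:, :, :-1, :] - d[:, :, 1:, :]).abs()
    dx_i = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```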
IV. DATASET AND RESULTS

A. Dataset

1) SCARED Dataset: We utilize the SCARED [48] dataset to evaluate our method's performance. The SCARED dataset was published for the endoscopic sub-challenge organized at MICCAI 2019 and contains 9 different sub-datasets collected from porcine cadavers. Each sub-dataset contains an endoscope video, the ground truth of the pose recorded by the surgical robot, and the ground truth of depth collected by structured light equipment. Therefore, we can evaluate the performance of pose estimation and depth estimation methods using this dataset. Following [6], we also refer to the Eigen-Zhou [17], [49] evaluation protocol to separate the training, validation, and test datasets, respectively.

2) Clinical Dataset: In order to verify the generalization performance of the method, we also collect videos during right hemicolectomy surgery with the assistance of surgeons. Four representative video clips are selected for quantitative experiments. Each video contains 150-200 images. The contents of the images include the liver, colon, small intestine, fat, etc., in the abdominal cavity. These four sequences are representative image sequences during the surgical navigation phase. This dataset is not utilized in the training process.

B. Implementation Details

Our method is implemented in PyTorch. In our experiments, we utilize a single NVIDIA V100 and the batch size is 12. The following training augmentations are performed, each with a 50% chance: random brightness, contrast, saturation, and hue jitter with respective ranges of ±0.2, ±0.2, ±0.2, and ±0.1 (see the sketch below). Our depth estimation network and pose estimation network use two AdamW [50] optimizers, respectively. The initial values of the learning rates are 1e-4. Drop-path is used to mitigate overfitting and the number of training epochs is set to 50. The specific values of C1, C2, and C3 are 48, 80, and 128.
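A minimal sketch of the color augmentation described above, using torchvision; applying it with a 50% chance matches the stated probability, while the rest of the data pipeline is omitted.

```python
import random
from torchvision import transforms

# Brightness/contrast/saturation jitter of +-0.2 and hue jitter of +-0.1,
# applied to an input image with 50% probability.
color_aug = transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1)

def augment(img):
    return color_aug(img) if random.random() < 0.5 else img
```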
Following [6], [17], and [23], we compute the five standard metrics (Abs Rel, Sq Rel, RMSE, RMSE log, δ < 1.25) proposed in [49] for evaluation. These metrics are defined as follows:

$\mathrm{Abs\;Rel} = \frac{1}{|D|}\sum_{d \in D} |d^{*} - d| / d^{*} \quad (9)$

$\mathrm{Sq\;Rel} = \frac{1}{|D|}\sum_{d \in D} |d^{*} - d|^{2} / d^{*} \quad (10)$

$\mathrm{RMSE\;log} = \sqrt{\frac{1}{|D|}\sum_{d \in D} |\log d^{*} - \log d|^{2}} \quad (11)$

$\mathrm{RMSE} = \sqrt{\frac{1}{|D|}\sum_{d \in D} |d^{*} - d|^{2}} \quad (12)$

$\delta = \frac{1}{|D|}\left|\left\{ d \in D \;\middle|\; \max\!\left(\frac{d^{*}}{d}, \frac{d}{d^{*}}\right) < 1.25 \right\}\right| \times 100\% \quad (13)$

where D is the set of predicted depths, and d and d* denote the predicted depth and the ground truth, respectively. We perform a 5-frame pose evaluation following [17] and adopt the metric of absolute trajectory error (ATE) [51].
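For reference, a NumPy sketch of Eqs. (9)-(13); masking of invalid ground-truth pixels and the median scaling discussed later are assumed to have been applied beforehand.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Eqs. (9)-(13) over predicted / ground-truth depths of the same shape."""
    pred, gt = np.asarray(pred, float).ravel(), np.asarray(gt, float).ravel()
    abs_rel = np.mean(np.abs(gt - pred) / gt)                       # (9)
    sq_rel = np.mean((gt - pred) ** 2 / gt)                         # (10)
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))   # (11)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))                       # (12)
    delta = np.mean(np.maximum(gt / pred, pred / gt) < 1.25) * 100  # (13), percent
    return abs_rel, sq_rel, rmse, rmse_log, delta
```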
C. Depth Estimation

1) Performance Comparison: We run experiments on the SCARED dataset to evaluate the depth error and accuracy of our model.


TABLE I. DepthNet performance. 'Encoder', 'Decoder', and 'Overall' represent the number of parameters utilized in DepthNet. 'Auxiliary' represents the number of auxiliary models' parameters. † means the best result we reproduce on our machine.

Fig. 5. Qualitative depth comparison. There are four examples from the SCARED dataset. The first row shows original images and the others are depth maps. The second and third rows are results from [6] and [23]. The last row shows our results.

The proposed method is compared with several SOTA self-supervised methods, including AF-SfM [6], Endo-SfM [18], Monodepth2 [23], Fang et al. [52], DeFeat-Net [53], and SC-SfMLearner [54]. To make up for the monocular scale ambiguity, following the same strategies indicated in [6] and [23], the estimated depth is scaled by the per-image median ground truth. Table I collects the quantitative results of our model against other typical self-supervised methods. The encoder, decoder, and overall columns in Table I report the size of parameters in the DepthNet. Our method achieves comparable performance to the state-of-the-art methods with the smallest number of parameters in the inference phase. We achieve the second-highest ranking result in accuracy. In Table I, the auxiliary parameters refer to the network parameters proposed in AF-SfM [6] for correcting illumination. With these two auxiliary networks only utilized in the training phase, both our model and AF-SfM [6] can achieve better performance. The performance of the compared methods on depth estimation is from [6]. According to Table I, our method achieves a lower result on RMSE. Fig. 5 shows that our model obtains satisfactory results compared with other methods.


TABLE II. Ablation study on the number of Transformer blocks in one layer.

We can observe that our method provides a more accurate depth estimation of the edges of organs while maintaining the global smoothness of soft tissues. These quantitative and qualitative results demonstrate the superiority of our method.
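The per-image median scaling mentioned above can be sketched as follows; the depth-range clamping applied by [6] and [23] before computing the metrics is omitted here.

```python
import numpy as np

def median_scale(pred, gt):
    """Align a monocular prediction to ground truth by multiplying it with
    median(gt) / median(pred), computed per image."""
    return pred * (np.median(gt) / np.median(pred))
```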
2) Ablation Study on DepthNet Architecture: To further demonstrate the validity of the proposed model, an ablation study is conducted to assess the importance of different designs in the architecture. We conduct experiments on the number of Transformer blocks and the structure of the Transformer layers. The number of modules in a single layer determines the foundation of the framework.

a) Ablation study on the number of transformer blocks: Table II collects the results with different numbers of Transformer blocks with the pooling operation in the middle Transformer layer. The Transformer block with attention remains unchanged in each experiment. The baseline is a simplified model whose second layer does not replace CNN layers with Transformer layers. We test the depth estimation results of 2, 3, and 4 Transformer blocks. Based on the results of the second and third rows, we find that adding a block with pooling can improve the performance of the model. However, based on the results of the third and fourth rows, we find that consistently stacking pooling Transformer blocks results in a decrease in performance. Therefore, we use the structure in Fig. 3 to achieve a stable performance improvement while increasing pooling Transformer blocks through cascading and convolution operations. Based on the results in the last row of Table I, our current structure can strike a balance between the depth estimation accuracy and the model size.

b) Ablation study on the architecture of transformer layers: The influence of different architectures on accuracy has also been studied. We compare the following three frameworks, as shown in Fig. 6. These three subfigures show the basic hybrid structure, consisting of 3, 4, and 5 layers, respectively. In both Fig. 6(a) and Fig. 6(c), CNN-based layers are used as the first and last layers. In both (b) and (c), there are two Transformer layers in the architecture.

Fig. 6. CNN and Transformer architectures that can be adopted in the depth encoder of DepthNet. (a), (b), and (c) are the hybrid architectures with 3, 4, and 5 layers.

TABLE III. Ablation study on Transformer and CNN architectures.

Table III shows the different results obtained by these three structures. Both (a) and (c) achieve good performance, which is comparable to the most advanced methods. However, the model parameters of (a) are the smallest. So in the performance analysis experiment, we report the results of (a). However, the structure in (c) can achieve smaller errors on the Sq Rel, RMSE, and RMSE log metrics.

D. Pose Estimation

TABLE IV. Pose performance.

We select two sequences with longer trajectories [6] in the SCARED dataset and label them as Sequence-1 (Seq.1) and Sequence-2 (Seq.2), respectively. Table IV shows the comparison of the proposed method with the other five methods. The performance of the compared methods is from AF-SfM [6]. Our method achieves the lowest error on the ATE. Most of the works use the same pose estimation network. We concatenate two input images and then estimate the 6DoF pose between the two images using features extracted by ResNet [34]. Feature-dependent approaches have higher immunity against light variations. We add attention mechanisms to enhance features, emphasize differences, and thus improve performance.

To further analyze the effect of the multi-head attention mechanism on the pose estimation network, we conduct ablation experiments. Table V collects the results of adding multiple attention mechanisms at different locations. These insertion locations include the first layer of convolution, the second layer of convolution, the third layer of convolution, and various combinations of these locations. For the scheme where MHA is added to the first layer of convolution, we note that while it achieves better results than Monodepth2 [23], it is not as good as the combined use of appearance flow. Interestingly, for Sequence-1, we find that adding multi-head attention in both the second and third layers achieves lower errors. However, for Sequence-2, only the MHA in the second layer yields a performance gain. Therefore, we use the addition of the MHA mechanism in the middle layer to obtain better generalization.


Fig. 7. Qualitative pose comparison. The first three columns are the trajectory results obtained using the comparative methods ([6], [18], [23]). The results in the last column are our trajectory results.

TABLE V. Ablation study on PoseNet.

Fig. 7 reports qualitative examples from these two trajectories. The performance of our model is superior to the other competitors in the middle of the trajectories.
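For clarity, one common way to compute the ATE of a short trajectory snippet is sketched below, assuming both trajectories start at the origin and differ only by an unknown scale; this is an illustration, not the exact evaluation script used in the paper.

```python
import numpy as np

def ate_rmse(pred_xyz, gt_xyz):
    """ATE as the RMSE of translational differences between two (N, 3) trajectories,
    after a least-squares scale alignment of the prediction to the ground truth."""
    scale = np.sum(gt_xyz * pred_xyz) / np.sum(pred_xyz ** 2)
    err = gt_xyz - scale * pred_xyz
    return np.sqrt(np.mean(np.sum(err ** 2, axis=1)))
```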

E. Surface Reconstruction

We can recover point clouds from the camera intrinsics and depth estimates, as shown in Fig. 8. The point clouds shown in Fig. 8 do not have any added colors, in order to display the geometric structure. We use the truncated signed distance function (TSDF) [55] to fuse multiple point clouds in order to extend the 3D model of the tissue surface (a sketch is given below). The implementation is developed with Open3D [56]. Readers can refer to [25] for the procedure of expanding multiple point clouds based on pose estimates. We further utilize laparoscopic images obtained from surgery for visual performance analysis.

Fig. 8. Point clouds on the SCARED and clinical datasets. (a) and (b) show two examples. Images in the first row are original images and figures in the second row are reconstructed point clouds.
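As referenced above, a minimal Open3D sketch of TSDF fusion; the voxel size, truncation distances, and intrinsics are placeholders rather than the settings used in the paper, and `frames` stands for (color, depth, 4 × 4 world-to-camera pose) triples produced by DepthNet and PoseNet.

```python
import open3d as o3d

# Placeholder intrinsics and an (empty) frame list; real values come from the
# endoscope calibration and from the DepthNet/PoseNet outputs.
width, height, fx, fy, cx, cy = 640, 512, 500.0, 500.0, 320.0, 256.0
frames = []  # list of (color uint8 HxWx3, depth float32 HxW, 4x4 pose matrix)

volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.002, sdf_trunc=0.01,
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)
intrinsic = o3d.camera.PinholeCameraIntrinsic(width, height, fx, fy, cx, cy)

for color, depth, pose in frames:
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        o3d.geometry.Image(color), o3d.geometry.Image(depth),
        depth_scale=1.0, depth_trunc=0.3, convert_rgb_to_intensity=False)
    volume.integrate(rgbd, intrinsic, pose)  # pose: world-to-camera extrinsic

mesh = volume.extract_triangle_mesh()
mesh.compute_vertex_normals()
```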
TABLE VI. Surface reconstruction.

Fig. 9 shows the surfaces reconstructed from the SCARED dataset. The subfigures in Fig. 10 are the surfaces recovered from the clinical dataset. The images in the first row demonstrate the texture and the second row shows the mesh. Through the mesh, we can more clearly see the structure of soft tissues in different scenarios. By adding textures, the entire scene can be visually reflected. Fig. 9 shows that our method preserves distinct tissue structures and keeps local soft tissues smooth and continuous. Table VI reflects that our scenes contain a large number of vertices. The average number of points for the surface models in the SCARED dataset and the clinical dataset is 1.8 and 1.5 million, respectively. The average processing time for each image is 0.2 seconds. We do not include network inference time here. The inference times of our method and other methods are shown in Table VII. Our method also reduces inference time.


Fig. 9. Our surface reconstructions on the SCARED dataset. (a), (b), (c) and (d) are 3D surfaces recovered from four images captured from
porcine cadavers.

Fig. 10. Recovered surfaces on the clinical dataset. (a), (b), (c), and (d) are 3D surfaces recovered from four representative laparoscopic images obtained during surgery, mainly including fat, intestines, and liver.

TABLE VII. DepthNet inference speed.

F. Limitations

Although our method is mainly trained and tested on laparoscopic images, we have also tested it in clinical experiments. However, there are still some disadvantages in the depth fusion, such as the presence of discrete points on the edge of the fourth image in Fig. 9. In addition, for dynamic scenarios, such as device movement and interaction between devices and soft tissues, current fusion methods may produce a significant overlap (Fig. 11). The reason for this phenomenon may be inconsistent depth estimates across multiple images. The problem of inconsistent depth between different laparoscopic images still exists due to the similar texture.

Fig. 11. An example of an unsatisfactory reconstruction result. The green box region shows the overlap.


V. CONCLUSION

A lightweight depth estimation network is applied to endoscopy images for the first time in this paper. We propose a self-supervised depth estimation network with a combination of CNN and Transformer for endoscopy images. CNN-based layers mixed with Transformer-based layers are utilized as the encoder to aggregate local texture information and global contour features. Our method achieves competitive results while also reducing the number of parameters. The proposed pose network obtains the minimum error on the SCARED dataset compared to the previous approaches. Detailed quantitative and qualitative experiments demonstrate the effectiveness of our method.

However, there are still some issues that need to be improved in the depth fusion task. The newest implicit scene representation methods, such as NeRF [57], can be used to solve the above challenge. In the future, we will attempt to improve the performance of the networks in dynamic object scenarios, such as surgical instruments and deformable tissues. Further validation is needed to apply our method in actual surgical scenarios. Animal studies with pigs will be done in the future, as pigs' gut environment and structure resemble those of humans. We will attempt to extend the method in this study for use in human research after conducting animal tests.
REFERENCES

[1] T. Collins et al., "Augmented reality guided laparoscopic surgery of the uterus," IEEE Trans. Med. Imag., vol. 40, no. 1, pp. 371-380, Jan. 2021, doi: 10.1109/TMI.2020.3027442.
[2] P. Zhang et al., "Real-time navigation for laparoscopic hepatectomy using image fusion of preoperative 3D surgical plan and intraoperative indocyanine green fluorescence imaging," Surgical Endoscopy, vol. 34, no. 8, pp. 3449-3459, Aug. 2020, doi: 10.1007/s00464-019-07121-1.
[3] R. Hussain, A. Lalande, R. Marroquin, K. B. Girum, C. Guigou, and A. B. Grayeli, "Real-time augmented reality for ear surgery," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), Sep. 2018, pp. 324-331, doi: 10.1007/978-3-030-00937-3_38.
[4] H. Luo et al., "Augmented reality navigation for liver resection with a stereoscopic laparoscope," Comput. Methods Programs Biomed., vol. 187, Apr. 2020, Art. no. 105099, doi: 10.1016/j.cmpb.2019.105099.
[5] R. Wei et al., "Stereo dense scene reconstruction and accurate localization for learning-based navigation of laparoscope in minimally invasive surgery," IEEE Trans. Biomed. Eng., vol. 70, no. 2, pp. 488-500, Feb. 2023, doi: 10.1109/TBME.2022.3195027.
[6] S. Shao et al., "Self-supervised monocular depth and ego-motion estimation in endoscopy: Appearance flow to the rescue," Med. Image Anal., vol. 77, Apr. 2022, Art. no. 102338, doi: 10.1016/j.media.2021.102338.
[7] Y. Li et al., "SuPer: A surgical perception framework for endoscopic tissue manipulation with surgical robotics," IEEE Robot. Autom. Lett., vol. 5, no. 2, pp. 2294-2301, Apr. 2020, doi: 10.1109/LRA.2020.2970659.
[8] H. Itoh et al., "Binary polyp-size classification based on deep-learned spatial information," Int. J. Comput. Assist. Radiol. Surg., vol. 16, no. 10, pp. 1817-1828, Oct. 2021, doi: 10.1007/s11548-021-02477-z.
[9] R. Tang et al., "Augmented reality technology for preoperative planning and intraoperative navigation during hepatobiliary surgery: A review of current methods," Hepatobiliary Pancreatic Diseases Int., vol. 17, no. 2, pp. 101-112, Apr. 2018, doi: 10.1016/j.hbpd.2018.02.002.
[10] D. Psychogyios, E. Mazomenos, F. Vasconcelos, and D. Stoyanov, "MSDESIS: Multitask stereo disparity estimation and surgical instrument segmentation," IEEE Trans. Med. Imag., vol. 41, no. 11, pp. 3218-3230, Nov. 2022, doi: 10.1109/TMI.2022.3181229.
[11] S. Rattanalappaiboon, T. Bhongmakapat, and P. Ritthipravat, "Fuzzy zoning for feature matching technique in 3D reconstruction of nasal endoscopic images," Comput. Biol. Med., vol. 67, pp. 83-94, Dec. 2015, doi: 10.1016/j.compbiomed.2015.09.021.
[12] Ó. G. Grasa, E. Bernal, S. Casado, I. Gil, and J. M. M. Montiel, "Visual SLAM for handheld monocular endoscope," IEEE Trans. Med. Imag., vol. 33, no. 1, pp. 135-146, Jan. 2014, doi: 10.1109/TMI.2013.2282997.
[13] M. Ye, S. Giannarou, A. Meining, and G.-Z. Yang, "Online tracking and retargeting with applications to optical biopsy in gastrointestinal endoscopic examinations," Med. Image Anal., vol. 30, pp. 144-157, May 2016, doi: 10.1016/j.media.2015.10.003.
[14] C. Godard, O. M. Aodha, and G. J. Brostow, "Unsupervised monocular depth estimation with left-right consistency," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6602-6611, doi: 10.1109/CVPR.2017.699.
[15] Q. Xu, Y. Li, M. Zhang, and W. Li, "COCO-Net: A dual-supervised network with unified ROI-loss for low-resolution ship detection from optical satellite image sequences," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1-15, 2022, Art. no. 5629115, doi: 10.1109/TGRS.2022.3201530.
[16] M. Turan et al., "Unsupervised odometry and depth learning for endoscopic capsule robots," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1801-1807, doi: 10.1109/IROS.2018.8593623.
[17] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6612-6619, doi: 10.1109/CVPR.2017.700.
[18] K. B. Ozyoruk et al., "EndoSLAM dataset and an unsupervised monocular visual odometry and depth estimation approach for endoscopic videos," Med. Image Anal., vol. 71, Jul. 2021, Art. no. 102058, doi: 10.1016/j.media.2021.102058.
[19] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2017, vol. 30, pp. 6000-6010, doi: 10.48550/arXiv.1706.03762.
[20] W. Yu et al., "MetaFormer is actually what you need for vision," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 10809-10819, doi: 10.1109/CVPR52688.2022.01055.
[21] J. Bae, S. Moon, and S. Im, "Deep digging into the generalization of self-supervised monocular depth estimation," in Proc. AAAI Conf. Artif. Intell., 2023, doi: 10.48550/arXiv.2205.11083.
[22] Q. Xu, Y. Li, J. Nie, Q. Liu, and M. Guo, "UPanGAN: Unsupervised pansharpening based on the spectral and spatial loss constrained generative adversarial network," Inf. Fusion, vol. 91, pp. 31-46, Mar. 2023, doi: 10.1016/j.inffus.2022.10.001.
[23] C. Godard, O. M. Aodha, M. Firman, and G. Brostow, "Digging into self-supervised monocular depth estimation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3827-3837, doi: 10.1109/ICCV.2019.00393.
[24] X. Liu et al., "Dense depth estimation in monocular endoscopy with self-supervised learning methods," IEEE Trans. Med. Imag., vol. 39, no. 5, pp. 1438-1447, May 2020, doi: 10.1109/TMI.2019.2950936.
[25] D. Recasens, J. Lamarca, J. M. Fácil, J. M. M. Montiel, and J. Civera, "Endo-depth-and-motion: Reconstruction and tracking in endoscopic videos using depth networks and photometric constraints," IEEE Robot. Autom. Lett., vol. 6, no. 4, pp. 7225-7232, Oct. 2021, doi: 10.1109/LRA.2021.3095528.
[26] L. Li, X. Li, S. Yang, S. Ding, A. Jolfaei, and X. Zheng, "Unsupervised-learning-based continuous depth and motion estimation with monocular endoscopy for virtual reality minimally invasive surgery," IEEE Trans. Ind. Informat., vol. 17, no. 6, pp. 3920-3928, Jun. 2021, doi: 10.1109/TII.2020.3011067.
[27] Y. Zhang et al., "ColDE: A depth estimation framework for colonoscopy reconstruction," Nov. 2021, arXiv:2111.10371, doi: 10.48550/arXiv.2111.10371.
[28] Y. Liu and S. Zuo, "Self-supervised monocular depth estimation for gastrointestinal endoscopy," Comput. Methods Programs Biomed., vol. 238, Aug. 2023, Art. no. 107619, doi: 10.1016/j.cmpb.2023.107619.
[29] Y. Yang et al., "A geometry-aware deep network for depth estimation in monocular endoscopy," Eng. Appl. Artif. Intell., vol. 122, Jun. 2023, Art. no. 105989, doi: 10.1016/j.engappai.2023.105989.
[30] A. Varma, H. Chawla, B. Zonooz, and E. Arani, "Transformers in self-supervised monocular depth estimation with unknown camera intrinsics," Feb. 2022, arXiv:2202.03131.
[31] R. Ranftl, A. Bochkovskiy, and V. Koltun, "Vision transformers for dense prediction," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 12159-12168, doi: 10.1109/ICCV48922.2021.01196.
[32] S. Farooq Bhat, I. Alhashim, and P. Wonka, "AdaBins: Depth estimation using adaptive bins," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 4008-4017, doi: 10.1109/CVPR46437.2021.00400.


[33] G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci, "Transformer-based attention networks for continuous pixel-wise prediction," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 16249-16259, doi: 10.1109/ICCV48922.2021.01596.
[34] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
[35] Z. Chen et al., "Vision transformer adapter for dense predictions," in Proc. Int. Conf. Learn. Represent. (ICLR), Feb. 2023, doi: 10.48550/arXiv.2205.08534.
[36] Z. Li, Z. Chen, X. Liu, and J. Jiang, "DepthFormer: Exploiting long-range correlation and local information for accurate monocular depth estimation," 2022, arXiv:2203.14211.
[37] C. Zhao et al., "MonoViT: Self-supervised monocular depth estimation with a vision transformer," in Proc. Int. Conf. 3D Vis. (3DV), Sep. 2022, pp. 668-678, doi: 10.1109/3DV57658.2022.00077.
[38] Y. Lee, J. Kim, J. Willette, and S. J. Hwang, "MPViT: Multi-path vision transformer for dense prediction," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 7277-7286, doi: 10.1109/CVPR52688.2022.00714.
[39] X. Lyu et al., "HR-Depth: High resolution self-supervised monocular depth estimation," in Proc. AAAI Conf. Artif. Intell., 2021, vol. 35, no. 3, pp. 2294-2301, doi: 10.1609/aaai.v35i3.16329.
[40] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," in Proc. Int. Conf. Learn. Represent. (ICLR), Jan. 2021, doi: 10.48550/arXiv.2010.11929.
[41] N. Zhang, F. Nex, G. Vosselman, and N. Kerle, "Lite-Mono: A lightweight CNN and transformer architecture for self-supervised monocular depth estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 18537-18546, doi: 10.48550/arXiv.2211.13202.
[42] J. Guo et al., "CMT: Convolutional neural networks meet vision transformers," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 12165-12175, doi: 10.1109/CVPR52688.2022.01186.
[43] D. Hendrycks and K. Gimpel, "Gaussian error linear units (GELUs)," Jun. 2016, arXiv:1606.08415, doi: 10.48550/arXiv.1606.08415.
[44] A. Ali et al., "XCiT: Cross-covariance image transformers," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 34, Dec. 2021, pp. 20014-20027, doi: 10.48550/arXiv.2106.0968.
[45] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), Nov. 2015, pp. 234-241, doi: 10.1007/978-3-319-24574-4_28.
[46] Z. Zhou, X. Fan, P. Shi, and Y. Xin, "R-MSFM: Recurrent multi-scale feature modulation for monocular depth estimating," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 12757-12766, doi: 10.1109/ICCV48922.2021.01254.
[47] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600-612, Apr. 2004, doi: 10.1109/TIP.2003.819861.
[48] M. Allan et al., "Stereo correspondence and reconstruction of endoscopic data challenge," 2021, arXiv:2101.01133.
[49] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Proc. Int. Conf. Neural Inf. Process. Syst. (NIPS), Dec. 2014, pp. 2366-2374, doi: 10.48550/arXiv.1406.2283.
[50] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in Proc. Int. Conf. Learn. Represent. (ICLR), Dec. 2018, doi: 10.48550/arXiv.1711.05101.
[51] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Trans. Robot., vol. 31, no. 5, pp. 1147-1163, Oct. 2015, doi: 10.1109/TRO.2015.2463671.
[52] Z. Fang, X. Chen, Y. Chen, and L. Van Gool, "Towards good practice for CNN-based monocular depth estimation," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2020, pp. 1080-1089, doi: 10.1109/WACV45572.2020.9093334.
[53] J. Spencer, R. Bowden, and S. Hadfield, "DeFeat-Net: General monocular depth via simultaneous unsupervised representation learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 14390-14401, doi: 10.1109/CVPR42600.2020.01441.
[54] J. Bian et al., "Unsupervised scale-consistent depth and ego-motion learning from monocular video," in Proc. 33rd Conf. Neural Inf. Process. Syst., Dec. 2019, pp. 35-45, doi: 10.48550/arXiv.1908.10553.
[55] B. Curless and M. Levoy, "A volumetric method for building complex models from range images," in Proc. 23rd Annu. Conf. Comput. Graph. Interact. Techn., Aug. 1996, pp. 303-312, doi: 10.1145/237170.237269.
[56] Q.-Y. Zhou, J. Park, and V. Koltun, "Open3D: A modern library for 3D data processing," Jan. 2018, arXiv:1801.09847, doi: 10.48550/arXiv.1801.09847.
[57] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," Commun. ACM, vol. 65, no. 1, pp. 99-106, Dec. 2021, doi: 10.1145/3503250.
