Self-Supervised Lightweight Depth Estimation in Endoscopy Combining CNN and Transformer
Abstract—In recent years, an increasing number of medical engineering tasks, such as surgical navigation, pre-operative registration, and surgical robotics, rely on 3D reconstruction techniques. Self-supervised depth estimation has attracted interest in endoscopic scenarios because it does not require ground truth. Most existing methods depend on expanding the number of parameters to improve their performance. Therefore, designing a lightweight self-supervised model that can obtain competitive results is an active research topic. We propose a lightweight network with a tight coupling of convolutional neural network (CNN) and Transformer for depth estimation. Unlike other methods that use CNN and Transformer to extract features separately and then fuse them at the deepest layer, we utilize CNN and Transformer modules to extract features at different scales in the encoder. This hierarchical structure leverages the advantages of the CNN in texture perception and the Transformer in shape extraction. At the same feature-extraction scale, the CNN acquires local features while the Transformer encodes global information. Finally, we add multi-head attention modules to the pose network to improve the accuracy of the predicted poses. Experiments demonstrate that our approach obtains comparable results while effectively compressing the model parameters on two datasets.

Index Terms—Depth and ego-motion estimation, endoscopy, lightweight architecture, self-supervised learning, Transformer and CNN.

Manuscript received 1 August 2023; revised 6 November 2023; accepted 4 January 2024. Date of publication 10 January 2024; date of current version 2 May 2024. This work was supported in part by the National Key Research and Development Program of China under Grant 2022ZD0115902 and in part by the National Natural Science Foundation of China under Grant U20A20195, Grant 62272017, Grant 62172437, and Grant 62102208. (Corresponding authors: Junjun Pan; Ju Dai.)

Zhuoyue Yang and Junjun Pan are with the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Haidian, Beijing 100191, China, and also with the Peng Cheng Laboratory, Nanshan, Shenzhen 518000, China (e-mail: yangzhuoyue@buaa.edu.cn; [email protected]).

Ju Dai is with the Peng Cheng Laboratory, Nanshan, Shenzhen 518000, China (e-mail: [email protected]).

Zhen Sun and Yi Xiao are with the Division of Colorectal Surgery, the Department of General Surgery, the Chinese Academy of Medical Sciences, and the Peking Union Medical College, Peking Union Medical College Hospital, Dongcheng, Beijing 100730, China (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TMI.2024.3352390

I. INTRODUCTION

Endoscopic minimally invasive surgery has been widely used in recent years because of less bleeding and shorter recovery time compared with open surgery. However, due to the narrow field of view and the lack of depth perception, endoscopic surgeries place stringent demands on the experience and skills of the surgeon. Nowadays, with the rapid development of VR/AR technology, an increasing number of researchers are choosing AR-based surgical navigation to address these difficulties [1], [2], [3]. These AR systems superimpose preoperative data on intraoperative endoscopic data through registration techniques [4], [5]. The accuracy of video-CT registration algorithms primarily relies on the quality of intraoperative reconstructions from endoscopic videos [6]. In addition, many tasks, such as surgical robots [7], medical image segmentation [8], surgery planning assistance [9], and surgical instrument recognition [10], can benefit from the results of depth estimation.

Previous methods for depth estimation from image sequences are based on multi-view geometry principles, such as structure from motion (SfM) [11] and simultaneous localization and mapping (SLAM) [12]. Although depth estimation has been studied in natural scenes for many years, the problem is more difficult in endoscopic scenes due to inconsistent lighting, sparse texture features, and soft tissues with non-Lambertian reflection characteristics. Geometry-based methods [13] rely heavily on feature extraction and matching. The smooth and repetitive soft tissue texture usually results in sparse features and wrong feature matching. Thus, traditional methods still fall short of desirable performance.

Deep learning-based methods for depth estimation [14], segmentation [8], and detection [15] in harsh natural environments have developed rapidly thanks to the publication of large datasets. However, it is very difficult to obtain large amounts of data with ground truth in endoscopic scenes. Unsupervised learning methods that only use visual images have therefore gained increasing attention in recent years. Researchers have tried to relieve these limitations for endoscopy images by utilizing self-supervised training strategies [6], [16], [17], [18]. Although many self-supervised methods have emerged, the depth networks of most works are similar and based on convolution layers. Some works design more complex and heavy networks to achieve better results.

For navigation applications, depth estimation networks not only need to ensure accuracy but must also integrate with other modules such as registration. An effective and lightweight network structure is therefore an important topic. Currently, many advanced works analyze existing network architectures and make interesting discoveries. For example, the receptive field of the convolution operation is limited, while the Transformer [19] can model global information.
The latest work finds that the most effective part of the Transformer is the overall framework rather than the multi-head attention (MHA) operations [20]. CNNs exhibit a strong texture bias, while Transformers exhibit a strong shape bias [21]. Based on the above findings, we propose a lightweight self-supervised depth estimation network for endoscopic images, which combines the advantages of CNN and Transformer at a fine-grained level.

Our contributions are summarized as follows:
• For the first time, we apply a lightweight network to endoscopic scenes. We present a novel hybrid architecture with an efficient combination of CNN and Transformer at different scales. In order to extract global and shape-aware features, we insert Transformer layers into CNN layers which are sensitive to local textures.
• We propose a pose network with several multi-head attention modules. Attention modules are added at different locations in order to find a solution with better generalization. We perform experiments on several long sequences to verify the performance improvement of the method.
• Extensive experiments demonstrate the effectiveness of our proposed method, which compresses the number of model parameters without a significant loss of accuracy. Qualitative experiments demonstrate that our method achieves comparable results with current state-of-the-art methods on the SCARED and clinical datasets.

II. RELATED WORK

In this section, we review the unsupervised depth estimation methods applied in endoscopic scenes, as well as the state-of-the-art (SOTA) network frameworks combining CNN (convolutional neural network) and Transformer applied in natural scenes.

A. Self-Supervised Learning

Depth estimation methods in natural scenes have been studied for several years and typically leverage real depth values as supervised signals to model the problem as a regression or classification task. However, true depth values are difficult to obtain in an endoscopic environment. It is not until unsupervised methods are widely used [22] that these deep learning methods are formally applied to endoscopic depth estimation tasks. Zhou et al. [17] propose an unsupervised training method using only monocular video sequences. The method uses the computed depth and poses as mediators and warps nearby views to the target view as supervised information. Godard et al. [14] leverage binocular videos instead of depth ground truth to train a fully convolutional network. The first article [16] that applies unsupervised depth estimation to endoscopy uses a fully convolutional depth estimation approach with a structure similar to the method in [17]. Godard et al. [23] propose Monodepth2 on the basis of the network framework in [14]; the predictor behind the decoder in the depth estimation network and the decoder in the pose estimation network are deleted. Most researchers find that the structure in Monodepth2 [23], including a depth network and a separate pose estimation network, achieves better performance. Following [16] and [23], this structure became the baseline for subsequent methods, and unsupervised depth estimation is at present regarded as an image reconstruction problem. To deal with edge conditions, such as object motion and occlusion, predictive interpretable masks are used. Liu et al. [24] propose a self-supervised method to train convolutional neural networks for dense depth estimation from monocular endoscopic data, where the supervised signals are derived from the poses and sparse point clouds of structure from motion. Recasens et al. [25] leverage Monodepth2 [23] to train an endoscopic depth estimation network that obtains the depth corresponding to each image. Ozyoruk et al. [18] put forward EndoSfMLearner, an unsupervised monocular depth and pose estimation method that combines residual networks with a spatial attention module to focus on highly textured tissue areas. Li et al. [26] add an LSTM module to the pose estimation network to model temporal information, thus improving the accuracy of pose estimation. Shao et al. [6] jointly use optical flow and appearance flow to deal with the brightness inconsistency problem. Zhang et al. [27] propose a network that shares an encoder and contains two branches in the decoder; the two branches estimate depth information and normal information, respectively. Currently, most of the self-supervised depth networks applied to endoscopic images are convolutional neural networks. Most researchers [28], [29] focus on increasing model complexity and parameters to improve the performance of the network.

B. Network Architectures

With the development of the technology, the Transformer shows great potential for depth estimation tasks in natural scenes. Varma et al. [30] first evaluate the impact of the Transformer on self-supervised monocular depth estimation. DPT [31] directly uses the Transformer as the encoder, and then fuses the results of each Transformer layer separately to generate depth estimation results. AdaBins [32] uses a ViT after general encoders and decoders, and then adaptively divides depth values based on the dynamic changes of the scene. TransDepth [33] also adds Transformer blocks to the ResNet [34] results to obtain long-distance information, then uses a decoder based on attention and gates to fuse features, and finally performs depth estimation through a prediction head. The Vision Transformer Adapter [35] designs an adapter that runs in parallel with the ViT, incorporating prior knowledge of images into the ViT backbone to provide reconstructed multi-scale features for dense depth estimation problems, preserving the flexibility of the ViT and improving performance. DepthFormer [36] performs ViT and convolution operations separately in the encoder stage and designs a layered aggregation and interaction module to combine the two parts. To summarize, some researchers build independent Transformer-based encoders to obtain feature maps or add several modules to fuse features from the CNN. MonoViT [37] is the current state-of-the-art work in natural-scene depth estimation tasks. The encoder of MonoViT [37] is constructed by stacking several MPViT [38] layers, and the decoder is from HR-Depth [39].
Fig. 1. Overview of the proposed method. Our method includes a depth network (DepthNet) and a pose network (PoseNet). Our DepthNet
consists of an encoder with a combination of CNN and Transformer and a decoder. Our PoseNet is enhanced by the multi-head attention modules.
Each layer of MPViT has three Transformer heads and a convolution head. MonoFormer [21] still relies on ViT [40], mainly by proposing an attention connection module and a feature fusion decoder. Zhang et al. [41] propose a dilated convolutional module to extract rich multi-scale local features and a self-attention-based feature interaction module to encode remote global information into features. Yu et al. [20] prove that the general architecture of the Transformer, instead of the specific token mixer module, is more essential to the model's performance. CMT [42] inserts Transformer structures between different convolutional layers of a CNN. Its ablation experiments show that the widely used staged design in CNNs is a better choice for promoting Transformer-based architectures. In summary, the integration of CNN and Transformer in one architecture has evolved from coarse-grained stacking to fine-grained information exchange. The difference between our method and the above methods is that we stack the CNN layers and the Transformer layers alternately. We utilize this hybrid structure to obtain local and global features while also exploiting textures and contours.

III. METHOD

A. Overall Architecture

The framework includes a depth estimation network (DepthNet), a pose estimation network (PoseNet), and a brightness calibration network, as shown in Fig. 1. Endoscopic images are segmented into groups of three in chronological order. The DepthNet estimates the multi-scale depth map of a single endoscopy image, while the PoseNet estimates the camera motion between adjacent images. We combine convolutional layers with Transformer structures to build a hybrid DepthNet. We use the brightness calibration module proposed in [6] to compensate for lighting changes caused by endoscope movement. Then, according to the predicted camera poses and the camera intrinsic parameters, the estimated depth is re-projected back to the two-dimensional plane, and the model is supervised and optimized by calculating the loss between the reconstructed image and the target image. The details of DepthNet and PoseNet are described below, and the utilized loss functions are then listed.

B. DepthNet

Following [14] and [23], we design our method as an encoder-decoder architecture. CNNs perform better at extracting local textures, whereas Transformers are sensitive to global information and contours [21]. We present a novel hybrid encoder that is able to focus on both texture and contour features. The first and third layers are stacked with multiple CNN modules, and several Transformer blocks are placed in sequence in the middle layer. Multi-scale features from the encoder are connected to a concise decoder.

1) Depth Encoder: The input image is first passed through a convolution stem containing three 3 × 3 convolutions. The first convolution has a stride of 2 and the next two have a stride of 1. The output channel is C1, and the size of the output feature map is H/2 × W/2. In the following stages, CNN-based layers and Transformer-based layers are alternately stacked. Firstly, several symbols are defined to describe the input and output of each stage. We use F_i to represent the feature map output from the i-th layer. The image that has been pooled in the i-th layer is labeled as I_i. The feature obtained through the downsampling module of each layer is D_i.
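As an illustration of the stem and the alternating layout described above, the following PyTorch sketch builds three 3 × 3 convolutions (one of stride 2, then two of stride 1) producing C1 = 48 channels at H/2 × W/2, followed by a skeleton of the alternating stages. The normalization and activation choices (BatchNorm, GELU) and the module names are assumptions for illustration, not the exact implementation.

```python
import torch.nn as nn


class ConvStem(nn.Module):
    """Three 3x3 convolutions: one stride-2 conv followed by two stride-1 convs,
    mapping the input image to C1 channels at H/2 x W/2."""

    def __init__(self, in_ch=3, c1=48):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, c1, 3, stride=2, padding=1), nn.BatchNorm2d(c1), nn.GELU(),
            nn.Conv2d(c1, c1, 3, stride=1, padding=1), nn.BatchNorm2d(c1), nn.GELU(),
            nn.Conv2d(c1, c1, 3, stride=1, padding=1), nn.BatchNorm2d(c1), nn.GELU(),
        )

    def forward(self, x):
        return self.stem(x)


class HybridEncoder(nn.Module):
    """Skeleton of the alternating layout: CNN stages for the first and third layers,
    Transformer blocks for the middle layer; the stage modules are passed in."""

    def __init__(self, cnn_stage1, transformer_stage2, cnn_stage3, c1=48):
        super().__init__()
        self.stem = ConvStem(c1=c1)
        self.stages = nn.ModuleList([cnn_stage1, transformer_stage2, cnn_stage3])

    def forward(self, x):
        feats = [self.stem(x)]
        for stage in self.stages:
            feats.append(stage(feats[-1]))  # multi-scale features F_i fed to the decoder
        return feats
```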
Fig. 2. CNN and Transformer blocks that are adopted in the depth
encoder of DepthNet. (a) is the structure of the CNN block. (b) shows
the architecture of the Transformer blocks. To distinguish between two
different Transformer blocks, we name them based on the different
operations used in the framework.
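For orientation, a generic pre-norm Transformer block of the kind sketched in Fig. 2(b) might look as follows in PyTorch; the operations inside the paper's two block variants differ, so this is an assumed baseline design rather than the actual module.

```python
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Generic pre-norm block: LayerNorm -> multi-head attention -> residual,
    then LayerNorm -> MLP (GELU) -> residual, over a B x N x C token sequence."""

    def __init__(self, dim, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))
```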
where MultiHeadAtten(Q, K, V) is the concatenated output of k self-attention operations, each of which is applied as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \tag{4}$$

where Q, K, and V are projected from F̃ and d is the dimension of the input. Then, feature extraction is performed through two superimposed ResNet [34] blocks to obtain feature maps with scales of H/4 × W/4 and H/8 × W/8, respectively. In addition, the extracted feature map passes through the multi-head attention layer again. Finally, the last two feature maps are obtained through two basic blocks. The feature maps are converted into pose matrices through convolutions.
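A minimal PyTorch sketch of Eq. (4) computed over several heads is given below; the projection matrices w_q, w_k, w_v and the head count are illustrative placeholders rather than the PoseNet's actual parameters.

```python
import torch
import torch.nn.functional as F


def multi_head_attention(x, w_q, w_k, w_v, num_heads):
    """Scaled dot-product attention (Eq. (4)) over num_heads heads; x is B x N x C."""
    B, N, C = x.shape
    d = C // num_heads
    # Project the input tokens to queries, keys, and values, split into heads.
    q = (x @ w_q).reshape(B, N, num_heads, d).transpose(1, 2)      # B x h x N x d
    k = (x @ w_k).reshape(B, N, num_heads, d).transpose(1, 2)
    v = (x @ w_v).reshape(B, N, num_heads, d).transpose(1, 2)
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)   # softmax(QK^T / sqrt(d))
    return (attn @ v).transpose(1, 2).reshape(B, N, C)             # concatenate the heads


# Example usage with random tokens and shared projection weights (illustrative only):
# x = torch.randn(2, 100, 128); w = torch.randn(128, 128)
# out = multi_head_attention(x, w, w, w, num_heads=4)
```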
D. Self-Supervised Learning

Like other unsupervised learning methods, we also transform the task into 2D image reconstruction and supervise the consistency and accuracy of the depth estimation by minimizing the difference between the re-projected image and the target image. The image reconstruction loss consists of the photometric loss (L_p) and the edge-aware loss (L_e). We define the source image as I†. Utilizing the pose estimation T and the intrinsic parameters of the camera P, the reconstructed image Ĩ can be re-projected (π) from the depth estimation D and I†. The reconstructed image Ĩ is defined as follows:

$$\tilde{I} = \pi(I^{\dagger}, T, D, P). \tag{5}$$
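A sketch of the re-projection π in Eq. (5) is shown below, assuming a pinhole intrinsic matrix, a 4 × 4 relative pose, and bilinear sampling; the helper and its tensor names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F


def reproject(i_src, depth_tgt, pose_tgt2src, K, K_inv):
    """Sketch of Eq. (5): synthesize the target view by sampling the source image I†
    at locations given by the target depth D, relative pose T, and intrinsics P."""
    b, _, h, w = depth_tgt.shape
    device = depth_tgt.device
    ys, xs = torch.meshgrid(torch.arange(h, device=device),
                            torch.arange(w, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()   # 3 x H x W
    pix = pix.view(1, 3, -1).expand(b, -1, -1)                        # B x 3 x HW
    cam = depth_tgt.view(b, 1, -1) * (K_inv @ pix)                    # back-project to 3D
    cam = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)        # homogeneous coordinates
    src = K @ (pose_tgt2src @ cam)[:, :3]                             # transform and project
    src = src[:, :2] / (src[:, 2:3] + 1e-7)
    grid = torch.stack([src[:, 0] / (w - 1), src[:, 1] / (h - 1)], dim=-1)
    grid = grid.view(b, h, w, 2) * 2 - 1                              # normalize for grid_sample
    return F.grid_sample(i_src, grid, padding_mode="border", align_corners=True)
```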
Due to inconsistent lighting in the endoscopic environment, the photometric loss is inaccurate. We apply a pre-trained optical flow network to calibrate the rotation and translation changes between two input images and use a pre-trained appearance flow network whose output C compensates the illumination. The modified image Î obtained from the target image I is as follows:

$$\hat{I} = I + C. \tag{6}$$

The image similarity F between the modified image Î and the reconstructed image Ĩ is defined as follows:

$$F = \alpha \cdot \frac{1 - \mathrm{SSIM}(\hat{I}, \tilde{I})}{2} + (1 - \alpha) \cdot \left|\hat{I} - \tilde{I}\right|, \tag{7}$$

where SSIM is the structural similarity index [47] and α = 0.85. The photometric loss L_p is the minimum value of F over the two adjacent images, combined with the visibility mask [6], [23]. In order to maintain edges, an edge-aware loss is also used. As in previous work [17] and [23], the edge-aware loss is defined as:

$$L_e = |\partial_x d|\, e^{-|\partial_x I|} + |\partial_y d|\, e^{-|\partial_y I|}, \tag{8}$$

where d represents the mean-normalized inverse depth of I.
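The two loss terms can be sketched in PyTorch as follows; the pooling-based SSIM is a common Monodepth2-style simplification and is an assumption rather than the exact implementation used here.

```python
import torch
import torch.nn.functional as F


def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM map computed with 3x3 average pooling."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp(num / den, 0, 1)


def photometric_loss(i_hat, i_tilde, alpha=0.85):
    """Eq. (7): alpha * (1 - SSIM)/2 + (1 - alpha) * L1, per pixel."""
    l1 = (i_hat - i_tilde).abs().mean(1, keepdim=True)
    return alpha * (1 - ssim(i_hat, i_tilde)).mean(1, keepdim=True) / 2 + (1 - alpha) * l1


def edge_aware_loss(disp, img):
    """Eq. (8): edge-aware smoothness on the mean-normalized inverse depth d."""
    d = disp / (disp.mean([2, 3], keepdim=True) + 1e-7)
    dx = (d[..., :, 1:] - d[..., :, :-1]).abs()
    dy = (d[..., 1:, :] - d[..., :-1, :]).abs()
    ix = (img[..., :, 1:] - img[..., :, :-1]).abs().mean(1, keepdim=True)
    iy = (img[..., 1:, :] - img[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()
```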
IV. DATASET AND RESULTS

A. Dataset

1) SCARED Dataset: We utilize the SCARED [48] dataset to evaluate our method's performance. The SCARED dataset was published for the endoscopic sub-challenge organized at MICCAI 2019 and contains 9 different sub-datasets collected from porcine cadavers. Each sub-dataset contains an endoscope video, the ground truth of the pose recorded by the surgical robot, and the ground truth of depth collected by structured light equipment. Therefore, we can evaluate the performance of pose estimation and depth estimation methods using this dataset. Following [6], we also refer to the Eigen-Zhou [17], [49] evaluation protocol to separate the training, validation, and test datasets.

2) Clinical Dataset: In order to verify the generalization performance of the method, we also collect videos during right hemicolectomy surgery with the assistance of surgeons. Four representative video clips are selected for quantitative experiments. Each video contains 150-200 images. The contents of the images include the liver, colon, small intestine, fat, etc., in the abdominal cavity. These four sequences are representative image sequences during the surgical navigation phase. This dataset is not utilized in the training process.

B. Implementation Details

Our method is implemented in PyTorch. In our experiments, we utilize a single NVIDIA V100 and the batch size is 12. The following training augmentations are performed, each with a 50% chance: random brightness, contrast, saturation, and hue jitter with respective ranges of ±0.2, ±0.2, ±0.2, and ±0.1. Our depth estimation network and pose estimation network use two AdamW [50] optimizers, respectively. The initial learning rates are 1e-4. Drop-path is used to mitigate overfitting, and the number of training epochs is set to 50. The specific values of C1, C2, and C3 are 48, 80, and 128.
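Under the PyTorch/torchvision setup described above, the augmentation and optimizer configuration could be written roughly as follows; the placeholder modules stand in for the actual DepthNet and PoseNet.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Color jitter applied with 50% probability; ranges follow the paper (±0.2, ±0.2, ±0.2, ±0.1).
augment = transforms.RandomApply(
    [transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1)],
    p=0.5)

# Placeholder modules standing in for DepthNet and PoseNet.
depth_net = nn.Conv2d(3, 1, 3, padding=1)
pose_net = nn.Conv2d(6, 6, 1)

# Two separate AdamW optimizers with an initial learning rate of 1e-4.
depth_optimizer = torch.optim.AdamW(depth_net.parameters(), lr=1e-4)
pose_optimizer = torch.optim.AdamW(pose_net.parameters(), lr=1e-4)
```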
Following [6], [17], and [23], we compute the 5 standard metrics (Abs Rel, Sq Rel, RMSE, RMSE log, δ < 1.25) proposed in [49] for evaluation. These metrics are defined as follows:

$$\mathrm{Abs\ Rel} = \frac{1}{|D|} \sum_{d \in D} |d^{*} - d| / d^{*} \tag{9}$$

$$\mathrm{Sq\ Rel} = \frac{1}{|D|} \sum_{d \in D} |d^{*} - d|^{2} / d^{*} \tag{10}$$

$$\mathrm{RMSE\ log} = \sqrt{\frac{1}{|D|} \sum_{d \in D} |\log d^{*} - \log d|^{2}} \tag{11}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{|D|} \sum_{d \in D} |d^{*} - d|^{2}}, \tag{12}$$

$$\delta = \frac{1}{|D|} \left|\left\{ d \in D \,\middle|\, \max\left(\frac{d^{*}}{d}, \frac{d}{d^{*}}\right) < 1.25 \right\}\right| \times 100\% \tag{13}$$

where D is the set of predicted depths, and d and d* denote the predicted depth and the ground truth, respectively. We perform a 5-frame pose evaluation following [17] and adopt the metric of absolute trajectory error (ATE) [51].
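As a reference sketch, Eqs. (9)-(13) can be computed with NumPy as below, over valid pixels flattened into 1-D arrays; this mirrors the standard evaluation code but is not necessarily the exact script used here.

```python
import numpy as np


def depth_metrics(pred, gt):
    """Standard depth metrics (Eqs. (9)-(13)); pred and gt are 1-D arrays of valid pixels."""
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    thresh = np.maximum(gt / pred, pred / gt)
    delta = np.mean(thresh < 1.25) * 100.0
    return abs_rel, sq_rel, rmse, rmse_log, delta
```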
TABLE I. DepthNet performance. 'Encoder', 'Decoder', and 'Overall' represent the number of parameters utilized in DepthNet. 'Auxiliary' represents the number of auxiliary models' parameters. † means the best result we reproduce on our machine.
Fig. 5. Qualitative depth comparison. There are four examples from the SCARED dataset. The first row shows the original images and the other rows show depth maps. The second and third rows are results from [6] and [23]. The last row shows our results.
C. Depth Estimation

1) Performance Comparison: We run experiments on the SCARED dataset to evaluate the depth error and accuracy of our model. The proposed method is compared with several SOTA self-supervised methods, including AF-SfM [6], Endo-SfM [18], Monodepth2 [23], Fang et al. [52], DeFeat-Net [53], and SC-SfMLearner [54]. To make up for the monocular scale ambiguity, following the same strategies indicated in [6] and [23], the estimated depth is scaled by the per-image median ground truth.
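This per-image median scaling can be sketched as follows; the clipping bounds are assumed values chosen only for illustration.

```python
import numpy as np


def median_scale(pred, gt, min_depth=1e-3, max_depth=150.0):
    """Scale the predicted depth by the ratio of the ground-truth and predicted medians."""
    scale = np.median(gt) / np.median(pred)
    return np.clip(pred * scale, min_depth, max_depth)
```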
Table I collects the quantitative results of our model against other typical self-supervised methods. The encoder, decoder, and overall columns in Table I report the number of parameters in the DepthNet. Our method achieves comparable performance to the state-of-the-art methods with the smallest number of parameters in the inference phase, and we achieve the second-highest ranking in accuracy. In Table I, the auxiliary parameters refer to the network parameters proposed in AF-SfM [6] for correcting illumination. With these two auxiliary networks used only in the training phase, both our model and AF-SfM [6] can achieve better performance. The performance of the compared methods on depth estimation is taken from [6]. According to Table I, our method achieves a lower RMSE. Fig. 5 shows that our model obtains satisfactory results compared with other methods. We can observe that our method provides a more accurate depth
TABLE II. Ablation Study on the Number of Transformer Blocks in One Layer.
Fig. 7. Qualitative pose comparison. The first three columns are the trajectory results of the compared methods ([6], [18], [23]). The results in the last column are our trajectory results.
TABLE V. Ablation Study on PoseNet.
E. Surface Reconstruction

We can recover point clouds from the camera intrinsics and depth estimates, as shown in Fig. 8. The point clouds shown in Fig. 8 do not have any added colors, in order to display the geometric structure. We use the truncated signed distance function (TSDF) [55] to fuse multiple point clouds in order to extend the 3D model of the tissue surface. The implementation is developed with Open3D [56]. Readers can refer to [25] for the procedure of expanding multiple point clouds based on pose estimates. We further utilize laparoscopic images obtained from surgery for visual performance analysis.

Fig. 8. Point clouds on the SCARED and clinical datasets. (a) and (b) show two examples. Images in the first row are original images and figures in the second row are reconstructed point clouds.

Fig. 9 shows the surfaces reconstructed from the SCARED dataset. Subfigures in Fig. 10 are the surfaces recovered from the clinical dataset. The images in the first row demonstrate the texture and the second row shows the mesh. Through the mesh, we can more clearly see the structure of soft tissues in different scenarios. By adding textures, the entire scene can be visually reflected. Fig. 9 shows that our method preserves distinct tissue structures and keeps local soft tissues smooth and continuous. Table VI reflects that our scenes contain a large number of vertices.

TABLE VI. Surface Reconstruction.

The average number of points for surface models in the SCARED dataset and the real dataset is 1.8 and 1.5 million, respectively. The average processing time for each image is 0.2 seconds. We do not include network inference time here. The inference times of our method and other methods are shown in Table VII. Our method also reduces inference time.
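A minimal Open3D sketch of the TSDF fusion step is shown below; the voxel size, truncation distance, depth truncation, and intrinsics are assumed example values, and the per-frame poses would come from the pose estimates.

```python
import numpy as np
import open3d as o3d


def fuse_tsdf(colors, depths, poses, intrinsic,
              voxel_length=0.005, sdf_trunc=0.02, depth_trunc=0.3):
    """Fuse per-frame depth maps into a TSDF volume and extract a surface mesh.

    colors: list of HxWx3 uint8 images; depths: list of HxW float32 depth maps (metres);
    poses: list of 4x4 camera-to-world matrices; intrinsic: o3d.camera.PinholeCameraIntrinsic.
    """
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=voxel_length, sdf_trunc=sdf_trunc,
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)
    for color, depth, pose in zip(colors, depths, poses):
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            o3d.geometry.Image(color), o3d.geometry.Image(depth),
            depth_scale=1.0, depth_trunc=depth_trunc, convert_rgb_to_intensity=False)
        # Open3D expects the extrinsic as a world-to-camera transform.
        volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))
    return volume.extract_triangle_mesh()
```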
Fig. 9. Our surface reconstructions on the SCARED dataset. (a), (b), (c) and (d) are 3D surfaces recovered from four images captured from
porcine cadavers.
Fig. 10. Recovered surfaces on the clinical dataset. (a), (b), (c), and (d) are 3D surfaces recovered from four representative laparoscopic images obtained during surgery, mainly including fat, intestines, and liver.
TABLE VII. DepthNet Inference Speed.
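For context, GPU inference time of the kind reported in Table VII is typically measured with warm-up iterations and explicit synchronization, as in the sketch below; the input resolution and iteration counts are assumptions, not the protocol used for Table VII.

```python
import time
import torch


@torch.no_grad()
def time_inference(model, input_size=(1, 3, 256, 320), iters=100, warmup=10):
    """Average per-image forward time, with warm-up and CUDA synchronization."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / iters
```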
F. Limitations
[33] G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci, "Transformer-based attention networks for continuous pixel-wise prediction," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 16249–16259, doi: 10.1109/ICCV48922.2021.01596.
[34] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[35] Z. Chen et al., "Vision transformer adapter for dense predictions," in Proc. Int. Conf. Learn. Represent. (ICLR), Feb. 2023, doi: 10.48550/arXiv.2205.08534.
[36] Z. Li, Z. Chen, X. Liu, and J. Jiang, "DepthFormer: Exploiting long-range correlation and local information for accurate monocular depth estimation," 2022, arXiv:2203.14211.
[37] C. Zhao et al., "MonoViT: Self-supervised monocular depth estimation with a vision transformer," in Proc. Int. Conf. 3D Vis. (3DV), Sep. 2022, pp. 668–678, doi: 10.1109/3DV57658.2022.00077.
[38] Y. Lee, J. Kim, J. Willette, and S. J. Hwang, "MPViT: Multi-path vision transformer for dense prediction," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 7277–7286, doi: 10.1109/CVPR52688.2022.00714.
[39] X. Lyu et al., "HR-Depth: High resolution self-supervised monocular depth estimation," in Proc. AAAI Conf. Artif. Intell., 2021, vol. 35, no. 3, pp. 2294–2301, doi: 10.1609/aaai.v35i3.16329.
[40] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," in Proc. Int. Conf. Learn. Represent. (ICLR), Jan. 2021, doi: 10.48550/arXiv.2010.11929.
[41] N. Zhang, F. Nex, G. Vosselman, and N. Kerle, "Lite-Mono: A lightweight CNN and transformer architecture for self-supervised monocular depth estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 18537–18546, doi: 10.48550/arXiv.2211.13202.
[42] J. Guo et al., "CMT: Convolutional neural networks meet vision transformers," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 12165–12175, doi: 10.1109/CVPR52688.2022.01186.
[43] D. Hendrycks and K. Gimpel, "Gaussian error linear units (GELUs)," Jun. 2016, arXiv:1606.08415, doi: 10.48550/arXiv.1606.08415.
[44] A. Ali et al., "XCiT: Cross-covariance image transformers," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 34, Dec. 2021, pp. 20014–20027, doi: 10.48550/arXiv.2106.0968.
[45] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), Nov. 2015, pp. 234–241, doi: 10.1007/978-3-319-24574-4_28.
[46] Z. Zhou, X. Fan, P. Shi, and Y. Xin, "R-MSFM: Recurrent multi-scale feature modulation for monocular depth estimating," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 12757–12766, doi: 10.1109/ICCV48922.2021.01254.
[47] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004, doi: 10.1109/TIP.2003.819861.
[48] M. Allan et al., "Stereo correspondence and reconstruction of endoscopic data challenge," 2021, arXiv:2101.01133.
[49] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Proc. Int. Conf. Neural Inf. Process. Syst. (NIPS), Dec. 2014, pp. 2366–2374, doi: 10.48550/arXiv.1406.2283.
[50] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in Proc. Int. Conf. Learn. Represent. (ICLR), Dec. 2018, doi: 10.48550/arXiv.1711.05101.
[51] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Trans. Robot., vol. 31, no. 5, pp. 1147–1163, Oct. 2015, doi: 10.1109/TRO.2015.2463671.
[52] Z. Fang, X. Chen, Y. Chen, and L. Van Gool, "Towards good practice for CNN-based monocular depth estimation," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2020, pp. 1080–1089, doi: 10.1109/WACV45572.2020.9093334.
[53] J. Spencer, R. Bowden, and S. Hadfield, "DeFeat-Net: General monocular depth via simultaneous unsupervised representation learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 14390–14401, doi: 10.1109/CVPR42600.2020.01441.
[54] J. Bian et al., "Unsupervised scale-consistent depth and ego-motion learning from monocular video," in Proc. 33rd Conf. Neural Inf. Process. Syst., Dec. 2019, pp. 35–45, doi: 10.48550/arXiv.1908.10553.
[55] B. Curless and M. Levoy, "A volumetric method for building complex models from range images," in Proc. 23rd Annu. Conf. Comput. Graph. Interact. Techn., Aug. 1996, pp. 303–312, doi: 10.1145/237170.237269.
[56] Q.-Y. Zhou, J. Park, and V. Koltun, "Open3D: A modern library for 3D data processing," Jan. 2018, arXiv:1801.09847, doi: 10.48550/arXiv.1801.09847.
[57] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," Commun. ACM, vol. 65, no. 1, pp. 99–106, Dec. 2021, doi: 10.1145/3503250.