Self-Supervised Lightweight Depth Estimation in Endoscopy Combining CNN and Transformer
Abstract—In recent years, an increasing number of medical engineering tasks, such as surgical navigation, pre-operative registration, and surgical robotics, rely on 3D reconstruction techniques. Self-supervised depth estimation has attracted interest in endoscopic scenarios because it does not require ground truth. Most existing methods depend on expanding the number of parameters to improve their performance. Therefore, designing a lightweight self-supervised model that can obtain competitive results is an active research topic. We propose a lightweight network with a tight coupling of convolutional neural network (CNN) and Transformer for depth estimation. Unlike other methods that use CNN and Transformer to extract features separately and then fuse them at the deepest layer, we utilize CNN and Transformer modules to extract features at different scales in the encoder. This hierarchical structure leverages the advantages of the CNN in texture perception and the Transformer in shape extraction. At the same feature-extraction scale, the CNN acquires local features while the Transformer encodes global information. Finally, we add multi-head attention modules to the pose network to improve the accuracy of the predicted poses. Experiments demonstrate that our approach obtains comparable results while effectively compressing the model parameters on two datasets.

Index Terms—Depth and ego-motion estimation, endoscopy, lightweight architecture, self-supervised learning, Transformer and CNN.

Manuscript received 1 August 2023; revised 6 November 2023; accepted 4 January 2024. Date of publication 10 January 2024; date of current version 2 May 2024. This work was supported in part by the National Key Research and Development Program of China under Grant 2022ZD0115902 and in part by the National Natural Science Foundation of China under Grant U20A20195, Grant 62272017, Grant 62172437, and Grant 62102208. (Corresponding authors: Junjun Pan; Ju Dai.)

Zhuoyue Yang and Junjun Pan are with the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Haidian, Beijing 100191, China, and also with the Peng Cheng Laboratory, Nanshan, Shenzhen 518000, China (e-mail: yangzhuoyue@buaa.edu.cn; [email protected]).

Ju Dai is with the Peng Cheng Laboratory, Nanshan, Shenzhen 518000, China (e-mail: [email protected]).

Zhen Sun and Yi Xiao are with the Division of Colorectal Surgery, the Department of General Surgery, the Chinese Academy of Medical Sciences, and the Peking Union Medical College, Peking Union Medical College Hospital, Dongcheng, Beijing 100730, China (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TMI.2024.3352390

I. INTRODUCTION

Endoscopic minimally invasive surgery has been widely used in recent years because of less bleeding and shorter recovery time compared with open surgery. However, due to the narrow field of view and the lack of depth perception, endoscopic surgeries place stringent demands on the experience and skills of the surgeon. Nowadays, with the rapid development of VR/AR technology, an increasing number of researchers are choosing AR-based surgical navigation to address these difficulties [1], [2], [3]. These AR systems superimpose preoperative data on intraoperative endoscopic data through registration techniques [4], [5]. The accuracy of video-CT registration algorithms primarily relies on the quality of intraoperative reconstructions from endoscopic videos [6]. In addition, many tasks, such as surgical robots [7], medical image segmentation [8], surgery planning assistance [9], and surgical instrument recognition [10], can benefit from the results of depth estimation.

Previous methods for depth estimation from image sequences are based on multi-view geometry principles, such as structure from motion (SfM) [11] and simultaneous localization and mapping (SLAM) [12]. Although depth estimation has been studied in natural scenes for many years, the problem is more difficult in endoscopic scenes due to inconsistent lighting, sparse texture features, and soft tissues with non-Lambertian reflection characteristics. Geometry-based methods [13] rely heavily on feature extraction and matching. The smooth and repetitive soft tissue texture usually results in sparse features and wrong feature matching. Thus, traditional methods still fall short of desirable performance.

Deep learning-based methods for depth estimation [14], segmentation [8], and detection [15] in harsh natural environments have developed rapidly thanks to the publication of large datasets. However, it is very difficult to obtain large amounts of data with ground truth in endoscopic scenes. Unsupervised learning methods that only use visual images have therefore gained increasing attention in recent years. Researchers have tried to relieve these limitations for endoscopy images by utilizing self-supervised training strategies [6], [16], [17], [18]. Although many self-supervised methods have emerged, the depth networks of most works are similar and based on convolution layers. Some works design more complex and heavy networks to achieve better results.

For navigation applications, depth estimation networks not only need to ensure accuracy but must also integrate with other modules such as registration. An effective and lightweight network structure is therefore an important topic. Currently, many advanced works analyze existing network architectures and make interesting discoveries. For example, the receptive field of the convolution operation is limited, while the Transformer [19] can model global information.
The latest work finds that the most effective part of the Transformer is the overall framework rather than the multi-head attention (MHA) operations [20]. CNNs exhibit a strong texture bias, while Transformers exhibit a strong shape bias [21]. Based on the above findings, we propose a lightweight self-supervised depth estimation network for endoscopic images, which combines the advantages of CNN and Transformer at a fine-grained level.

Our contributions are summarized as follows:
• For the first time, we apply a lightweight network to endoscopic scenes. We present a novel hybrid architecture with an efficient combination of CNN and Transformer at different scales. In order to extract global and shape-aware features, we insert Transformer layers into CNN layers which are sensitive to local textures.
• We propose a pose network with several multi-head attention modules. Attention modules are added at different locations in order to find a solution with better generalization. We perform experiments on several long sequences to verify the performance improvement of the method.
• Extensive experiments demonstrate the effectiveness of our proposed method, which compresses the number of model parameters without a significant loss of accuracy. Qualitative experiments demonstrate that our method achieves comparable results with current state-of-the-art methods on the SCARED and clinical datasets.

II. RELATED WORK

In this section, we review the unsupervised depth estimation methods applied in endoscopic scenes, as well as the state-of-the-art (SOTA) network frameworks combining CNN (convolutional neural network) and Transformer applied in natural scenes.

A. Self-Supervised Learning

Depth estimation methods in natural scenes have been studied for several years and typically leverage real depth values as supervised signals to model the problem as a regression or classification task. However, true depth values are difficult to obtain in an endoscopic environment. It is not until unsupervised methods are widely used [22] that these deep learning methods are formally applied to endoscopic depth estimation tasks. Zhou et al. [17] propose an unsupervised training method using only monocular video sequences. The method uses the computed depth and poses as mediators and warps nearby views to the target view as supervised information. Godard et al. [14] leverage binocular videos instead of depth ground truth to train a fully convolutional network. The first article [16] that applies unsupervised depth estimation to endoscopy uses a fully convolutional depth estimation approach with a structure similar to the method in [17]. Godard et al. [23] propose Monodepth2 on the basis of the network framework in [14]; the predictor behind the decoder in the depth estimation network and the decoder in the pose estimation network are deleted. Most researchers find that the structure in Monodepth2 [23], including a depth network and a separate pose estimation network, achieves better performance. Following [16] and [23], this structure became the baseline for subsequent methods, and unsupervised depth estimation is at present regarded as an image reconstruction problem. To deal with edge conditions, such as object motion and occlusion, predictive interpretable masks are used. Liu et al. [24] propose a self-supervised method to train convolutional neural networks for dense depth estimation from monocular endoscopic data, where the supervised signals are derived from the poses and sparse point clouds of structure from motion. Recasens et al. [25] leverage Monodepth2 [23] to train an endoscopic depth estimation network that obtains the depth corresponding to each image. Ozyoruk et al. [18] put forward EndoSfMLearner, an unsupervised monocular depth and pose estimation method that combines residual networks with a spatial attention module to focus on highly textured tissue areas. Li et al. [26] add an LSTM module to the pose estimation network to model temporal information, thus improving the accuracy of pose estimation. Shao et al. [6] jointly use optical flow and appearance flow to deal with the brightness inconsistency problem. Zhang et al. [27] propose a network that shares an encoder and contains two branches in the decoder; the two branches estimate depth information and normal information, respectively. Currently, most of the self-supervised depth networks applied to endoscopic images are convolutional neural networks. Most researchers [28], [29] focus on increasing model complexity and parameters to improve the performance of the network.

B. Network Architectures

With the development of the technology, the Transformer shows great potential for depth estimation tasks in natural scenes. Varma et al. [30] first evaluate the impact of the Transformer on self-supervised monocular depth estimation. DPT [31] directly uses the Transformer as the encoder, and then fuses the results of each Transformer layer separately to generate depth estimation results. AdaBins [32] uses a ViT after general encoders and decoders, and then adaptively divides depth values based on the dynamic changes of the scene. TransDepth [33] also adds Transformer blocks to the ResNet [34] results to obtain long-distance information, then uses a decoder based on attention and gates to fuse features, and finally performs depth estimation through a prediction head. The Vision Transformer Adapter [35] designs an adapter that runs in parallel with the ViT, incorporating prior knowledge of images into the ViT backbone to provide reconstructed multi-scale features for dense depth estimation problems, preserving the flexibility of the ViT and improving performance. DepthFormer [36] performs ViT and convolution operations separately in the encoder stage and designs a layered aggregation and interaction module to combine the two parts. To summarize, some researchers build independent Transformer-based encoders to obtain feature maps or add several modules to fuse features from the CNN. MonoViT [37] is the current state-of-the-art work in natural-scene depth estimation tasks. The encoder of MonoViT [37] is constructed by stacking several MPViT [38] layers, and the decoder is from HR-Depth [39].
Fig. 1. Overview of the proposed method. Our method includes a depth network (DepthNet) and a pose network (PoseNet). Our DepthNet
consists of an encoder with a combination of CNN and Transformer and a decoder. Our PoseNet is enhanced by the multi-head attention modules.
Each layer of MPViT has three Transformer heads and a convolution head. MonoFormer [21] still relies on ViT [40], mainly by proposing an attention connection module and a feature fusion decoder. Zhang et al. [41] propose a dilated convolutional module to extract rich multi-scale local features and a self-attention-based feature interaction module to encode remote global information into features. Yu et al. [20] prove that the general architecture of the Transformer, instead of the specific token mixer module, is more essential to the model's performance. CMT [42] inserts Transformer structures between different convolutional layers of a CNN. Its ablation experiments show that the widely used staged design in CNNs is a better choice for promoting Transformer-based architectures. In summary, the integration of CNN and Transformer in one architecture has evolved from coarse-grained stacking to fine-grained information exchange. The difference between our method and the above methods is that we stack the CNN layers and the Transformer layers alternately. We utilize this hybrid structure to obtain local and global features while also exploiting textures and contours.

III. METHOD

A. Overall Architecture

The framework includes a depth estimation network (DepthNet), a pose estimation network (PoseNet), and a brightness calibration network, as shown in Fig. 1. Endoscopic images are segmented into groups of three in chronological order. The DepthNet estimates the multi-scale depth map of a single endoscopy image, while the PoseNet estimates the camera motion between adjacent images. We combine convolutional layers with Transformer structures to build a hybrid DepthNet. We use the brightness calibration module proposed in [6] to compensate for lighting changes caused by endoscope movement. Then, according to the predicted camera poses and the camera intrinsic parameters, the estimated depth is re-projected back to the two-dimensional plane, and the model is supervised and optimized by calculating the loss between the reconstructed image and the target image. The details of DepthNet and PoseNet are described below, and the utilized loss functions are then listed.

B. DepthNet

Following [14] and [23], we design our method as an encoder-decoder architecture. CNNs perform better at extracting local textures, whereas Transformers are sensitive to global information and contours [21]. We present a novel hybrid encoder that is able to focus on both texture and contour features. The first and third layers are stacked with multiple CNN modules, and several Transformer blocks are placed in sequence in the middle layer. Multi-scale features from the encoder are connected to a concise decoder.

1) Depth Encoder: The input image is first passed through a convolution stem containing three 3 × 3 convolutions. The first convolution has a stride of 2 and the next two have a stride of 1. The output channel is C1, and the size of the output feature map is H/2 × W/2. In the following stages, CNN-based layers and Transformer-based layers are alternately stacked. Firstly, several symbols are defined to describe the input and output of each stage. We use F_i to represent the feature map output from the i-th layer. The image that has been pooled in the i-th layer is labeled as I_i. The feature obtained through the downsampling module of each layer is D_i.
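As an illustration of the stem and the alternating layout described above, the following PyTorch sketch builds three 3 × 3 convolutions (one of stride 2, then two of stride 1) producing C1 = 48 channels at H/2 × W/2, followed by a skeleton of the alternating stages. The normalization and activation choices (BatchNorm, GELU) and the module names are assumptions for illustration, not the exact implementation.

```python
import torch.nn as nn


class ConvStem(nn.Module):
    """Three 3x3 convolutions: one stride-2 conv followed by two stride-1 convs,
    mapping the input image to C1 channels at H/2 x W/2."""

    def __init__(self, in_ch=3, c1=48):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, c1, 3, stride=2, padding=1), nn.BatchNorm2d(c1), nn.GELU(),
            nn.Conv2d(c1, c1, 3, stride=1, padding=1), nn.BatchNorm2d(c1), nn.GELU(),
            nn.Conv2d(c1, c1, 3, stride=1, padding=1), nn.BatchNorm2d(c1), nn.GELU(),
        )

    def forward(self, x):
        return self.stem(x)


class HybridEncoder(nn.Module):
    """Skeleton of the alternating layout: CNN stages for the first and third layers,
    Transformer blocks for the middle layer; the stage modules are passed in."""

    def __init__(self, cnn_stage1, transformer_stage2, cnn_stage3, c1=48):
        super().__init__()
        self.stem = ConvStem(c1=c1)
        self.stages = nn.ModuleList([cnn_stage1, transformer_stage2, cnn_stage3])

    def forward(self, x):
        feats = [self.stem(x)]
        for stage in self.stages:
            feats.append(stage(feats[-1]))  # multi-scale features F_i fed to the decoder
        return feats
```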
Fig. 2. CNN and Transformer blocks that are adopted in the depth
encoder of DepthNet. (a) is the structure of the CNN block. (b) shows
the architecture of the Transformer blocks. To distinguish between two
different Transformer blocks, we name them based on the different
operations used in the framework.
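For orientation, a generic pre-norm Transformer block of the kind sketched in Fig. 2(b) might look as follows in PyTorch; the operations inside the paper's two block variants differ, so this is an assumed baseline design rather than the actual module.

```python
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Generic pre-norm block: LayerNorm -> multi-head attention -> residual,
    then LayerNorm -> MLP (GELU) -> residual, over a B x N x C token sequence."""

    def __init__(self, dim, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))
```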
where MultiHeadAtten(Q, K, V) is the concatenated output of k self-attention operations, each of which is applied as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \tag{4}$$

where Q, K, and V are projected from F̃ and d is the dimension of the input. Then, feature extraction is performed through two superimposed ResNet [34] blocks to obtain feature maps with scales of H/4 × W/4 and H/8 × W/8, respectively. In addition, the extracted feature map passes through the multi-head attention layer again. Finally, the last two feature maps are obtained through two basic blocks. The feature maps are converted into pose matrices through convolutions.
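A minimal PyTorch sketch of Eq. (4) computed over several heads is given below; the projection matrices w_q, w_k, w_v and the head count are illustrative placeholders rather than the PoseNet's actual parameters.

```python
import torch
import torch.nn.functional as F


def multi_head_attention(x, w_q, w_k, w_v, num_heads):
    """Scaled dot-product attention (Eq. (4)) over num_heads heads; x is B x N x C."""
    B, N, C = x.shape
    d = C // num_heads
    # Project the input tokens to queries, keys, and values, split into heads.
    q = (x @ w_q).reshape(B, N, num_heads, d).transpose(1, 2)      # B x h x N x d
    k = (x @ w_k).reshape(B, N, num_heads, d).transpose(1, 2)
    v = (x @ w_v).reshape(B, N, num_heads, d).transpose(1, 2)
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)   # softmax(QK^T / sqrt(d))
    return (attn @ v).transpose(1, 2).reshape(B, N, C)             # concatenate the heads


# Example usage with random tokens and shared projection weights (illustrative only):
# x = torch.randn(2, 100, 128); w = torch.randn(128, 128)
# out = multi_head_attention(x, w, w, w, num_heads=4)
```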
D. Self-Supervised Learning

Like other unsupervised learning methods, we also transform the task into 2D image reconstruction and supervise the consistency and accuracy of the depth estimation by minimizing the difference between the re-projected image and the target image. The image reconstruction loss consists of the photometric loss (L_p) and the edge-aware loss (L_e). We define the source image as I†. Utilizing the pose estimation T and the intrinsic parameters of the camera P, the reconstructed image Ĩ can be re-projected (π) from the depth estimation D and I†. The reconstructed image Ĩ is defined as follows:

$$\tilde{I} = \pi(I^{\dagger}, T, D, P). \tag{5}$$
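A sketch of the re-projection π in Eq. (5) is shown below, assuming a pinhole intrinsic matrix, a 4 × 4 relative pose, and bilinear sampling; the helper and its tensor names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F


def reproject(i_src, depth_tgt, pose_tgt2src, K, K_inv):
    """Sketch of Eq. (5): synthesize the target view by sampling the source image I†
    at locations given by the target depth D, relative pose T, and intrinsics P."""
    b, _, h, w = depth_tgt.shape
    device = depth_tgt.device
    ys, xs = torch.meshgrid(torch.arange(h, device=device),
                            torch.arange(w, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()   # 3 x H x W
    pix = pix.view(1, 3, -1).expand(b, -1, -1)                        # B x 3 x HW
    cam = depth_tgt.view(b, 1, -1) * (K_inv @ pix)                    # back-project to 3D
    cam = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)        # homogeneous coordinates
    src = K @ (pose_tgt2src @ cam)[:, :3]                             # transform and project
    src = src[:, :2] / (src[:, 2:3] + 1e-7)
    grid = torch.stack([src[:, 0] / (w - 1), src[:, 1] / (h - 1)], dim=-1)
    grid = grid.view(b, h, w, 2) * 2 - 1                              # normalize for grid_sample
    return F.grid_sample(i_src, grid, padding_mode="border", align_corners=True)
```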
Due to inconsistent lighting in the endoscopic environment, the photometric loss is inaccurate. We apply a pre-trained optical flow network to calibrate the rotation and translation changes between two input images and use a pre-trained appearance flow network whose output C compensates the illumination. The modified image Î obtained from the target image I is as follows:

$$\hat{I} = I + C. \tag{6}$$

The image similarity F between the modified image Î and the reconstructed image Ĩ is defined as follows:

$$F = \alpha \cdot \frac{1 - \mathrm{SSIM}(\hat{I}, \tilde{I})}{2} + (1 - \alpha) \cdot \left|\hat{I} - \tilde{I}\right|, \tag{7}$$

where SSIM is the structural similarity index [47] and α = 0.85. The photometric loss L_p is the minimum value of F over the two adjacent images, combined with the visibility mask [6], [23]. In order to maintain edges, an edge-aware loss is also used. As in previous work [17] and [23], the edge-aware loss is defined as:

$$L_e = |\partial_x d|\, e^{-|\partial_x I|} + |\partial_y d|\, e^{-|\partial_y I|}, \tag{8}$$

where d represents the mean-normalized inverse depth of I.
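The two loss terms can be sketched in PyTorch as follows; the pooling-based SSIM is a common Monodepth2-style simplification and is an assumption rather than the exact implementation used here.

```python
import torch
import torch.nn.functional as F


def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM map computed with 3x3 average pooling."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp(num / den, 0, 1)


def photometric_loss(i_hat, i_tilde, alpha=0.85):
    """Eq. (7): alpha * (1 - SSIM)/2 + (1 - alpha) * L1, per pixel."""
    l1 = (i_hat - i_tilde).abs().mean(1, keepdim=True)
    return alpha * (1 - ssim(i_hat, i_tilde)).mean(1, keepdim=True) / 2 + (1 - alpha) * l1


def edge_aware_loss(disp, img):
    """Eq. (8): edge-aware smoothness on the mean-normalized inverse depth d."""
    d = disp / (disp.mean([2, 3], keepdim=True) + 1e-7)
    dx = (d[..., :, 1:] - d[..., :, :-1]).abs()
    dy = (d[..., 1:, :] - d[..., :-1, :]).abs()
    ix = (img[..., :, 1:] - img[..., :, :-1]).abs().mean(1, keepdim=True)
    iy = (img[..., 1:, :] - img[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()
```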
IV. DATASET AND RESULTS

A. Dataset

1) SCARED Dataset: We utilize the SCARED [48] dataset to evaluate our method's performance. The SCARED dataset was published for the endoscopic sub-challenge organized at MICCAI 2019 and contains 9 different sub-datasets collected from porcine cadavers. Each sub-dataset contains an endoscope video, the ground truth of the pose recorded by the surgical robot, and the ground truth of depth collected by structured light equipment. Therefore, we can evaluate the performance of pose estimation and depth estimation methods using this dataset. Following [6], we also refer to the Eigen-Zhou [17], [49] evaluation protocol to separate the training, validation, and test datasets.

2) Clinical Dataset: In order to verify the generalization performance of the method, we also collect videos during right hemicolectomy surgery with the assistance of surgeons. Four representative video clips are selected for quantitative experiments. Each video contains 150-200 images. The contents of the images include the liver, colon, small intestine, fat, etc., in the abdominal cavity. These four sequences are representative image sequences during the surgical navigation phase. This dataset is not utilized in the training process.

B. Implementation Details

Our method is implemented in PyTorch. In our experiments, we utilize a single NVIDIA V100 and the batch size is 12. The following training augmentations are performed, each with a 50% chance: random brightness, contrast, saturation, and hue jitter with respective ranges of ±0.2, ±0.2, ±0.2, and ±0.1. Our depth estimation network and pose estimation network use two AdamW [50] optimizers, respectively. The initial learning rates are 1e-4. Drop-path is used to mitigate overfitting, and the number of training epochs is set to 50. The specific values of C1, C2, and C3 are 48, 80, and 128.
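Under the PyTorch/torchvision setup described above, the augmentation and optimizer configuration could be written roughly as follows; the placeholder modules stand in for the actual DepthNet and PoseNet.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Color jitter applied with 50% probability; ranges follow the paper (±0.2, ±0.2, ±0.2, ±0.1).
augment = transforms.RandomApply(
    [transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1)],
    p=0.5)

# Placeholder modules standing in for DepthNet and PoseNet.
depth_net = nn.Conv2d(3, 1, 3, padding=1)
pose_net = nn.Conv2d(6, 6, 1)

# Two separate AdamW optimizers with an initial learning rate of 1e-4.
depth_optimizer = torch.optim.AdamW(depth_net.parameters(), lr=1e-4)
pose_optimizer = torch.optim.AdamW(pose_net.parameters(), lr=1e-4)
```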
Following [6], [17], and [23], we compute the 5 standard metrics (Abs Rel, Sq Rel, RMSE, RMSE log, δ < 1.25) proposed in [49] for evaluation. These metrics are defined as follows:

$$\mathrm{Abs\ Rel} = \frac{1}{|D|} \sum_{d \in D} |d^{*} - d| / d^{*} \tag{9}$$

$$\mathrm{Sq\ Rel} = \frac{1}{|D|} \sum_{d \in D} |d^{*} - d|^{2} / d^{*} \tag{10}$$

$$\mathrm{RMSE\ log} = \sqrt{\frac{1}{|D|} \sum_{d \in D} |\log d^{*} - \log d|^{2}} \tag{11}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{|D|} \sum_{d \in D} |d^{*} - d|^{2}}, \tag{12}$$

$$\delta = \frac{1}{|D|} \left|\left\{ d \in D \,\middle|\, \max\left(\frac{d^{*}}{d}, \frac{d}{d^{*}}\right) < 1.25 \right\}\right| \times 100\% \tag{13}$$

where D is the set of predicted depths, and d and d* denote the predicted depth and the ground truth, respectively. We perform a 5-frame pose evaluation following [17] and adopt the metric of absolute trajectory error (ATE) [51].
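As a reference sketch, Eqs. (9)-(13) can be computed with NumPy as below, over valid pixels flattened into 1-D arrays; this mirrors the standard evaluation code but is not necessarily the exact script used here.

```python
import numpy as np


def depth_metrics(pred, gt):
    """Standard depth metrics (Eqs. (9)-(13)); pred and gt are 1-D arrays of valid pixels."""
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    thresh = np.maximum(gt / pred, pred / gt)
    delta = np.mean(thresh < 1.25) * 100.0
    return abs_rel, sq_rel, rmse, rmse_log, delta
```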
TABLE I. DepthNet performance. 'Encoder', 'Decoder', and 'Overall' represent the number of parameters utilized in DepthNet. 'Auxiliary' represents the number of auxiliary models' parameters. † means the best result we reproduce on our machine.
Fig. 5. Qualitative depth comparison. There are four examples from the SCARED dataset. The first row shows the original images and the other rows show depth maps. The second and third rows are results from [6] and [23]. The last row shows our results.
C. Depth Estimation

1) Performance Comparison: We run experiments on the SCARED dataset to evaluate the depth error and accuracy of our model. The proposed method is compared with several SOTA self-supervised methods, including AF-SfM [6], Endo-SfM [18], Monodepth2 [23], Fang et al. [52], DeFeat-Net [53], and SC-SfMLearner [54]. To make up for the monocular scale ambiguity, following the same strategies indicated in [6] and [23], the estimated depth is scaled by the per-image median ground truth.
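This per-image median scaling can be sketched as follows; the clipping bounds are assumed values chosen only for illustration.

```python
import numpy as np


def median_scale(pred, gt, min_depth=1e-3, max_depth=150.0):
    """Scale the predicted depth by the ratio of the ground-truth and predicted medians."""
    scale = np.median(gt) / np.median(pred)
    return np.clip(pred * scale, min_depth, max_depth)
```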
Table I collects the quantitative results of our model against other typical self-supervised methods. The encoder, decoder, and overall columns in Table I report the number of parameters in the DepthNet. Our method achieves comparable performance to the state-of-the-art methods with the smallest number of parameters in the inference phase, and we achieve the second-highest ranking in accuracy. In Table I, the auxiliary parameters refer to the network parameters proposed in AF-SfM [6] for correcting illumination. With these two auxiliary networks used only in the training phase, both our model and AF-SfM [6] can achieve better performance. The performance of the compared methods on depth estimation is taken from [6]. According to Table I, our method achieves a lower RMSE. Fig. 5 shows that our model obtains satisfactory results compared with other methods. We can observe that our method provides a more accurate depth
TABLE II. Ablation Study on the Number of Transformer Blocks in One Layer.
Fig. 7. Qualitative pose comparison. The first three columns are the trajectory results of the compared methods ([6], [18], [23]). The results in the last column are our trajectory results.
TABLE V. Ablation Study on PoseNet.
E. Surface Reconstruction

We can recover point clouds from the camera intrinsics and depth estimates, as shown in Fig. 8. The point clouds shown in Fig. 8 do not have any added colors, in order to display the geometric structure. We use the truncated signed distance function (TSDF) [55] to fuse multiple point clouds in order to extend the 3D model of the tissue surface. The implementation is developed with Open3D [56]. Readers can refer to [25] for the procedure of expanding multiple point clouds based on pose estimates. We further utilize laparoscopic images obtained from surgery for visual performance analysis.

Fig. 8. Point clouds on the SCARED and clinical datasets. (a) and (b) show two examples. Images in the first row are original images and figures in the second row are reconstructed point clouds.

Fig. 9 shows the surfaces reconstructed from the SCARED dataset. Subfigures in Fig. 10 are the surfaces recovered from the clinical dataset. The images in the first row demonstrate the texture and the second row shows the mesh. Through the mesh, we can more clearly see the structure of soft tissues in different scenarios. By adding textures, the entire scene can be visually reflected. Fig. 9 shows that our method preserves distinct tissue structures and keeps local soft tissues smooth and continuous. Table VI reflects that our scenes contain a large number of vertices.

TABLE VI. Surface Reconstruction.

The average number of points for surface models in the SCARED dataset and the real dataset is 1.8 and 1.5 million, respectively. The average processing time for each image is 0.2 seconds. We do not include network inference time here. The inference times of our method and other methods are shown in Table VII. Our method also reduces inference time.
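A minimal Open3D sketch of the TSDF fusion step is shown below; the voxel size, truncation distance, depth truncation, and intrinsics are assumed example values, and the per-frame poses would come from the pose estimates.

```python
import numpy as np
import open3d as o3d


def fuse_tsdf(colors, depths, poses, intrinsic,
              voxel_length=0.005, sdf_trunc=0.02, depth_trunc=0.3):
    """Fuse per-frame depth maps into a TSDF volume and extract a surface mesh.

    colors: list of HxWx3 uint8 images; depths: list of HxW float32 depth maps (metres);
    poses: list of 4x4 camera-to-world matrices; intrinsic: o3d.camera.PinholeCameraIntrinsic.
    """
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=voxel_length, sdf_trunc=sdf_trunc,
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)
    for color, depth, pose in zip(colors, depths, poses):
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            o3d.geometry.Image(color), o3d.geometry.Image(depth),
            depth_scale=1.0, depth_trunc=depth_trunc, convert_rgb_to_intensity=False)
        # Open3D expects the extrinsic as a world-to-camera transform.
        volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))
    return volume.extract_triangle_mesh()
```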
Fig. 9. Our surface reconstructions on the SCARED dataset. (a), (b), (c) and (d) are 3D surfaces recovered from four images captured from
porcine cadavers.
Fig. 10. Recovered surfaces on the clinical dataset. (a), (b), (c), and (d) are 3D surfaces recovered from four representative laparoscopic images obtained during surgery, mainly including fat, intestines, and liver.
TABLE VII. DepthNet Inference Speed.
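For context, GPU inference time of the kind reported in Table VII is typically measured with warm-up iterations and explicit synchronization, as in the sketch below; the input resolution and iteration counts are assumptions, not the protocol used for Table VII.

```python
import time
import torch


@torch.no_grad()
def time_inference(model, input_size=(1, 3, 256, 320), iters=100, warmup=10):
    """Average per-image forward time, with warm-up and CUDA synchronization."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / iters
```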
F. Limitations
[33] G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci, "Transformer-based attention networks for continuous pixel-wise prediction," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 16249–16259, doi: 10.1109/ICCV48922.2021.01596.
[34] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[35] Z. Chen et al., "Vision transformer adapter for dense predictions," in Proc. Int. Conf. Learn. Represent. (ICLR), Feb. 2023, doi: 10.48550/arXiv.2205.08534.
[36] Z. Li, Z. Chen, X. Liu, and J. Jiang, "DepthFormer: Exploiting long-range correlation and local information for accurate monocular depth estimation," 2022, arXiv:2203.14211.
[37] C. Zhao et al., "MonoViT: Self-supervised monocular depth estimation with a vision transformer," in Proc. Int. Conf. 3D Vis. (3DV), Sep. 2022, pp. 668–678, doi: 10.1109/3DV57658.2022.00077.
[38] Y. Lee, J. Kim, J. Willette, and S. J. Hwang, "MPViT: Multi-path vision transformer for dense prediction," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 7277–7286, doi: 10.1109/CVPR52688.2022.00714.
[39] X. Lyu et al., "HR-Depth: High resolution self-supervised monocular depth estimation," in Proc. AAAI Conf. Artif. Intell., 2021, vol. 35, no. 3, pp. 2294–2301, doi: 10.1609/aaai.v35i3.16329.
[40] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," in Proc. Int. Conf. Learn. Represent. (ICLR), Jan. 2021, doi: 10.48550/arXiv.2010.11929.
[41] N. Zhang, F. Nex, G. Vosselman, and N. Kerle, "Lite-Mono: A lightweight CNN and transformer architecture for self-supervised monocular depth estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 18537–18546, doi: 10.48550/arXiv.2211.13202.
[42] J. Guo et al., "CMT: Convolutional neural networks meet vision transformers," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 12165–12175, doi: 10.1109/CVPR52688.2022.01186.
[43] D. Hendrycks and K. Gimpel, "Gaussian error linear units (GELUs)," Jun. 2016, arXiv:1606.08415, doi: 10.48550/arXiv.1606.08415.
[44] A. Ali et al., "XCiT: Cross-covariance image transformers," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 34, Dec. 2021, pp. 20014–20027, doi: 10.48550/arXiv.2106.0968.
[45] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), Nov. 2015, pp. 234–241, doi: 10.1007/978-3-319-24574-4_28.
[46] Z. Zhou, X. Fan, P. Shi, and Y. Xin, "R-MSFM: Recurrent multi-scale feature modulation for monocular depth estimating," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 12757–12766, doi: 10.1109/ICCV48922.2021.01254.
[47] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004, doi: 10.1109/TIP.2003.819861.
[48] M. Allan et al., "Stereo correspondence and reconstruction of endoscopic data challenge," 2021, arXiv:2101.01133.
[49] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Proc. Int. Conf. Neural Inf. Process. Syst. (NIPS), Dec. 2014, pp. 2366–2374, doi: 10.48550/arXiv.1406.2283.
[50] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in Proc. Int. Conf. Learn. Represent. (ICLR), Dec. 2018, doi: 10.48550/arXiv.1711.05101.
[51] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Trans. Robot., vol. 31, no. 5, pp. 1147–1163, Oct. 2015, doi: 10.1109/TRO.2015.2463671.
[52] Z. Fang, X. Chen, Y. Chen, and L. Van Gool, "Towards good practice for CNN-based monocular depth estimation," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2020, pp. 1080–1089, doi: 10.1109/WACV45572.2020.9093334.
[53] J. Spencer, R. Bowden, and S. Hadfield, "DeFeat-Net: General monocular depth via simultaneous unsupervised representation learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 14390–14401, doi: 10.1109/CVPR42600.2020.01441.
[54] J. Bian et al., "Unsupervised scale-consistent depth and ego-motion learning from monocular video," in Proc. 33rd Conf. Neural Inf. Process. Syst., Dec. 2019, pp. 35–45, doi: 10.48550/arXiv.1908.10553.
[55] B. Curless and M. Levoy, "A volumetric method for building complex models from range images," in Proc. 23rd Annu. Conf. Comput. Graph. Interact. Techn., Aug. 1996, pp. 303–312, doi: 10.1145/237170.237269.
[56] Q.-Y. Zhou, J. Park, and V. Koltun, "Open3D: A modern library for 3D data processing," Jan. 2018, arXiv:1801.09847, doi: 10.48550/arXiv.1801.09847.
[57] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," Commun. ACM, vol. 65, no. 1, pp. 99–106, Dec. 2021, doi: 10.1145/3503250.