

Metric3D v2: A Versatile Monocular Geometric


Foundation Model for Zero-shot Metric Depth
and Surface Normal Estimation
Mu Hu1*, Wei Yin2*†, Chi Zhang3, Zhipeng Cai4, Xiaoxiao Long5†, Hao Chen6, Kaixuan Wang1, Gang Yu3, Chunhua Shen6, Shaojie Shen1
Abstract—We introduce Metric3D v2, a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image, which is crucial for metric 3D recovery. While depth and normal are geometrically related and highly complementary, they present distinct challenges. State-of-the-art (SoTA) monocular depth methods achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. Meanwhile, SoTA normal estimation methods have limited zero-shot performance due to the lack of large-scale labeled data. To tackle these issues, we propose solutions for both metric depth estimation and surface normal estimation. For metric depth estimation, we show that the key to a zero-shot single-view model lies in resolving the metric ambiguity from various camera models and large-scale data training. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problem and can be effortlessly plugged into existing monocular models. For surface normal estimation, we propose a joint depth-normal optimization module to distill diverse data knowledge from metric depth, enabling normal estimators to learn beyond normal labels. Equipped with these modules, our depth-normal models can be stably trained with over 16 million images from thousands of camera models with different types of annotations, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Our method currently ranks 1st on various zero-shot and non-zero-shot benchmarks for metric depth, affine-invariant depth, and surface normal prediction, as shown in Fig. 1. Notably, we surpass the recent MarigoldDepth and DepthAnything on various depth benchmarks, including NYUv2 and KITTI. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale drift issues of monocular SLAM (Fig. 3), leading to high-quality metric-scale dense mapping. These applications highlight the versatility of Metric3D v2 models as geometric foundation models.
Our project page is at https://JUGGHM.github.io/Metric3Dv2.

Index Terms—Monocular metric depth estimation, surface normal estimation, 3D scene shape estimation

1 INTRODUCTION

Monocular metric depth and surface normal estimation is the task of predicting absolute distance and surface direction from a single image. As crucial 3D representations, depth and normals are geometrically related and highly complementary. While metric depth excels in capturing data at scale, surface normals offer superior preservation of local geometry and are devoid of metric ambiguity compared to metric depth. These unique attributes render both depth and surface normals indispensable in various computer vision applications, including 3D reconstruction [1], [2], [3], neural rendering (NeRF) [4], [5], [6], [7], autonomous driving [8], [9], [10], and robotics [11], [12], [13]. Currently, the community still lacks a robust, generalizable geometry foundation model [14], [15], [16] capable of producing high-quality metric depth and surface normal from a single image.

• * Equal contribution.
• † Both WY and XL are corresponding authors ([email protected], [email protected]).
• Contact MH for technical concerns ([email protected]).
• 1 HKUST 2 Adelaide University 3 Tencent 4 Intel 5 HKU 6 Zhejiang University
April 25, 2024.

(Fig. 1 radar chart: axes cover KITTI, NYU, ScanNet, DDAD, NYU (Normal), ScanNet (Normal), and zero-shot ♠ ibims1, ♠ Eth3d, ♠ DIODE-Full, ♠ DIODE-Indoor, ♠ DIODE-Outdoor, ♠ Nuscenes, ♠ KITTI, ♠ NYU, ♠ NYU (Normal), ♠ ibims1 (Normal); each axis plots our score against the best prior method, e.g., IEBins, Polymax, DPT-L, HDN, LeReS, ZeroDepth, ZoeDepth, IronDepth, and Bae et al.)
Fig. 1 – Comparisons with SoTA methods on 16 depth and normal benchmarks. Radar map of our Metric3D v2 vs. SoTA methods from different works, on (1) metric depth benchmarks, see '(Metric-depth)'; (2) affine-invariant depth benchmarks, see '(Affine-invariant-depth)'; (3) surface normal benchmarks, see '(Normal)'. Zero-shot testing is denoted by '♠'. Here δ1 percentage accuracy is used for depth benchmarks and 30° percentage accuracy for normal; higher is better for both. We establish new SoTA on a wide range of depth and normal benchmarks.
(Fig. 2 image grid; rows: RGB, Ours-D, Marigold-D, Ours-N, Marigold-N.)

Fig. 2 – Surface normal (N) and monocular depth (D) comparisons on diverse web images. Our method, which directly estimates metric depths and surface normals, shows powerful generalization in a variety of scenarios, including indoor, outdoor, poor-visibility, motion-blurred, and fisheye images. Visualized results come from our ViT-large-backbone estimator. Marigold is a strong and robust diffusion-based monocular depth estimation method, but surface normals recovered from its depth show various artifacts.

Metric depth estimation and surface normal estimation confront distinct challenges. Existing depth estimation methods are categorized into learning metric depth [17], [18], [19], [20], relative depth [21], [22], [23], [24], and affine-invariant depth [25], [26], [27], [28], [29]. Although the metric depth methods [17], [18], [19], [20], [30] have achieved impressive accuracy on various benchmarks, they must be trained and tested on data with the same camera intrinsics. Therefore, the training datasets of metric depth methods are often small, as it is hard to collect a large dataset covering diverse scenes using one identical camera. The consequence is that all these models generalize poorly in zero-shot testing, not to mention that the camera parameters of test images can vary too. A compromise is to learn relative depth [21], [23], which only represents whether one point is farther or closer than another, so its applications are very limited. Learning affine-invariant depth finds a trade-off between the above two categories of methods, i.e., the depth is up to an unknown scale and shift. With large-scale data, these methods decouple the metric information during training and achieve impressive robustness and generalization, such as MiDaS [27], DPT [28], LeReS [25], [26], and HDN [29]. The problem is that the unknown shift causes 3D reconstruction distortions [26], and non-metric depth cannot satisfy various downstream applications.

In the meantime, these models cannot generate surface normals. Although lifting depths to 3D point clouds can do so, it places high demands on the accuracy and fine details of the predicted depths; otherwise, various artifacts remain in the transformed normals. For example, Fig. 2 shows noisy normals derived from the depths of Marigold [31], which excels in producing high-resolution fine depths. Instead of direct transformation, state-of-the-art (SoTA) surface normal estimation methods [32], [33], [34] tend to train estimators on high-quality normal annotations. These annotations, unlike sensor-captured ground truth (GT), are derived from meticulously and densely reconstructed scenes, which have extremely rigorous requirements for both the capturing equipment and the scene. Consequently, data sources primarily consist of either synthetic creation or 3D indoor reconstruction [35]; real and diverse outdoor scenes are exceedingly rare (refer to our data statistics in Tab. 5). Limited by this label deficiency, SoTA surface normal methods [32], [33], [34] typically struggle to achieve strong zero-shot generalization. This work endeavors to tackle these challenges by developing a multi-task foundation model for zero-shot, single-view metric depth and surface normal estimation.

We propose targeted solutions for the challenges of zero-shot metric depth and surface normal estimation. For metric-scale recovery, we first analyze the metric ambiguity issues in monocular depth estimation and study the different camera parameters involved, including the pixel size, focal length, and sensor size. We observe that the focal length is the critical factor for accurate metric recovery. By design, affine-invariant depth methods do not take the focal length into account during training: as shown in Sec. 3.1, from the image appearance alone, varying focal lengths cause metric ambiguity, so these methods decouple the depth scale in training. To solve the problem of varying focal lengths, CamConv [38] encodes the camera model in the network, which forces the network to implicitly understand camera models from the image appearance and then bridge the imaging size to the real-world size. However, its training data contains limited images and types of cameras, which challenges data diversity and network capacity.
(Fig. 3 panels. Top: single-view metric reconstruction and metrology from two phones, an iPhone 14 Pro (focal 24 mm, pixel size 2.44 μm, resolution 4032×3024) and a Samsung Galaxy S23 (focal 35 mm, pixel size 1 μm, resolution 4000×3000); ground-truth sizes are marked in yellow/blue and measured sizes in red. Bottom: dense SLAM mapping, comparing Droid-SLAM with Droid-SLAM + Ours against the ground-truth trajectory.)

Fig. 3 – Top (metrology for a complex scene): we use two phones (iPhone 14 Pro and Samsung Galaxy S23) to capture the scene and measure the size of several objects, including a drone that never occurs in the training set. With the photos' metadata, we perform 3D metric reconstruction and then measure object sizes (marked in red), which are close to the ground truth (marked in yellow). Compared with ZoeDepth [36], our measured sizes are closer to the ground truth. Bottom (dense SLAM mapping): existing SoTA mono-SLAM methods usually face scale drift problems (see the red arrows) in large-scale scenes and cannot recover the metric scale, while by naively inputting our metric depth, Droid-SLAM [37] recovers a much more accurate trajectory and performs metric dense mapping (see the red measurements). Note that all testing data are unseen to our model.

We propose a canonical camera transformation method in training, inspired by the canonical pose space used in human body reconstruction methods [39]. We transform all training data to a canonical camera space, in which the processed images can be coarsely regarded as captured by the same camera. To achieve such a transformation, we propose two different methods: the first adjusts the image appearance to simulate the canonical camera, while the other transforms the ground-truth labels used for supervision. Camera models are not encoded in the network, making our method easily applicable to existing architectures. During inference, a de-canonical transformation is employed to recover the metric information. To further boost the depth accuracy, we propose a random proposal normalization loss. It is inspired by the scale-shift invariant loss [25], [27], [29], which decouples the depth scale to emphasize the single image's distribution. However, those losses operate on the whole image, which inevitably squeezes fine-grained depth differences. We instead propose to randomly crop several patches from images and enforce the scale-shift invariant loss [25], [27] on them. Our loss emphasizes the local geometry and distribution of the single image.

For surface normal, the biggest challenge is the lack of diverse (outdoor) annotations. Compared to reconstruction-based annotation methods [35], [40], directly producing normal labels from network-predicted depth is more efficient and scalable. The quality of such pseudo-normal labels, however, is bounded by the accuracy of the depth network. Fortunately, we observe that robust metric depth models are scalable geometric learners, containing abundant information for normal estimation. Weak supervision from pseudo normal annotations transformed from the learned metric depth can effectively prevent the normal estimator from collapsing due to GT absence. Furthermore, this supervision can guide the normal estimator to generalize on large-scale unlabeled data. Based on this observation, we propose a joint depth-normal optimization module to distill knowledge from diverse depth datasets. During optimization, our normal estimator learns from three sources: (1) ground-truth normal labels, though they are much fewer than depth annotations; (2) an explicit learning objective that constrains depth-normal consistency; and (3) implicit and thorough knowledge transfer from depth to normal through feature fusion, which is more tolerant to unsatisfactory initial predictions than the explicit counterparts [41], [42]. To achieve this, we implement the optimization module using deep recurrent blocks. While previous researchers have employed similar recurrent modules to optimize depth [42], [43], [44], disparity [45], ego-motion [37], or optical flow [46], this is the first time that normal is iteratively optimized together with depth in a learning-based scheme. Benefiting from the joint optimization module, our models can efficiently learn normal knowledge from large-scale depth datasets even without normal labels.

With the proposed method, we can stably scale up model training to 16 million images from 18 datasets of diverse scene types (indoor and outdoor, real or synthetic data), camera models (tens of thousands of different cameras), and annotation categories (with or without normal), leading to zero-shot transferability and significantly improved accuracy.
(Fig. 4 diagram. Top: our pipeline, trained once for all applications, maps a single RGB image through one network to a unified metric distribution and predicts metric depth and surface normal end-to-end, supervised by depth labels and optional normal labels. Bottom: ZoeDepth performs large-scale zero-shot relative depth training (stage 1) followed by scene-specific metric fine-tuning heads for indoor/outdoor metric depth (stage 2), while Omnidata annotates normals via dense reconstruction from existing and newly labeled assets.)

Fig. 4 – Overall methodology. Our method takes a single image to predict the metric depth and surface normal simultaneously.
We apply large-scale data training directly for metric depth estimation rather than affine invariant depth, enabling end-to-end
zero-shot metric depth estimation for various applications using a single model. For normals, we enable learning from depth
labels only, alleviating the demand for dense reconstruction to generate large-scale normal labels.

Fig. 4 illustrates how large-scale data with depth annotations directly facilitates metric depth and surface normal learning. The metric depth and normal given by our model directly broaden the applications in downstream tasks. We achieve state-of-the-art performance on over 16 depth and normal benchmarks, see Fig. 1. Our model can accurately reconstruct metric 3D from randomly collected Internet images, enabling plausible single-image metrology. As an example (Fig. 3), with the predicted metric depths from our model, we significantly reduce the scale drift of monocular SLAM [37], [47] systems, achieving much better mapping quality with real-world metric recovery. Our model also enables large-scale 3D reconstruction [48]. To summarize, our main contributions are:

• We propose a canonical and de-canonical camera transformation method to solve the metric depth ambiguity problems arising from various camera settings. It enables the learning of strong zero-shot monocular metric depth models from large-scale datasets.
• We propose a random proposal normalization loss to effectively boost the depth accuracy.
• We propose a joint depth-normal optimization module to learn normal on large-scale datasets without normal annotation, distilling knowledge from the metric depth estimator.
• Our models rank 1st on a wide variety of depth and surface normal benchmarks. They can perform high-quality 3D metric structure recovery in the wild and benefit several downstream tasks, such as mono-SLAM [37], [49], 3D scene reconstruction [48], and metrology [50].

2 RELATED WORK
3D reconstruction from a single image. Reconstructing various objects from a single image has been well studied [51], [52], [53]. These methods can produce high-quality 3D models of cars, planes, tables, and human bodies [54], [55]. The main challenges are how to best recover objects' details, how to represent them with limited memory, and how to generalize to more diverse objects. However, all these methods rely on learning priors specific to a certain object class or instance, typically from 3D supervision, and therefore cannot work for full scene reconstruction. Apart from these object reconstruction works, several works focus on scene reconstruction from a single image. Saxena et al. [56] construct the scene based on the assumption that the whole scene can be segmented into several small planes; with the planes' orientations and locations, the 3D structure can be represented. Recently, LeReS [25] proposed to use a strong monocular depth estimation model for scene reconstruction, but it can only recover the shape up to a scale. Zhang et al. [57] recently proposed a zero-shot geometry-preserving depth estimation model that is capable of making depth predictions up to an unknown scale, without requiring scale-invariant depth annotations for training. In contrast to these works, our method can recover the metric 3D structure.
Supervised monocular depth estimation. After several benchmarks [58], [59] were established, neural-network-based methods [17], [19], [30] have dominated. Several approaches regress continuous depth from the aggregation of information in an image [60]. As the depth distributions corresponding to different RGB images can vary to a large extent, some methods [19], [30] discretize the depth and formulate the problem as classification [18], which often achieves better performance.
The generalization issue of deep models for 3D metric recovery is related to two problems. The first is generalizing to diverse scenes, while the other is predicting accurate metric information under various camera settings. The first problem has been well addressed by recent methods. Some works [18], [21], [22] propose to construct large-scale relative depth datasets, such as DIW [24] and OASIS [23], and then target learning the relative relations. However, relative depth loses geometric structure information. To improve the recovered geometry quality, affine-invariant depth learning methods, such as MiDaS [27], LeReS [25], and HDN [29], have been proposed. By mixing large-scale data, state-of-the-art performance and the generalization over scenes have improved continuously. Note that, by design, these methods are unable to recover metric information. How to achieve both strong generalization and accurate metric information over diverse scenes is the key problem that we attempt to tackle.
Surface normal estimation. Compared to metric depth, surface normal suffers no metric ambiguity and preserves local geometry better. These properties attract researchers to apply normals in various vision tasks like localization [11], mapping [61], and 3D scene reconstruction [6], [62]. Currently, learning-based methods [32], [33], [34], [42], [62], [63], [64], [65], [66], [67], [68], [69], [70] dominate monocular surface normal estimation. Since the normal labels required for training cannot be directly captured by sensors, previous works [41], [58], [63], [65] use kernel functions to annotate normals from dense indoor depth maps [58]. These annotations become incomplete on reflective surfaces and inaccurate at object boundaries. To learn from such imperfect annotations, GeoNet [41] proposes to enforce depth-normal consistency with mutual transformation modules, ASN [69], [70] proposes a novel adaptive surface normal constraint to facilitate joint depth-normal learning, and Bae et al. [33] propose an uncertainty-based learning objective. Nonetheless, it is challenging for such methods to further increase their generalization, due to the limited dataset size and diversity of scenes, especially for outdoor scenarios. Omni-data [35] advances to fill this gap by building 1300M frames of normal annotations. Normal-in-the-wild [71] proposes a pipeline for efficient normal labeling. However, further scaling up normal labels remains difficult. This underscores the research significance of finding an efficient way to distill priors from other types of annotation.
Deep iterative refinement for geometry. Iterative refinement enables multi-step coarse-to-fine prediction and benefits a wide range of geometry estimation tasks, such as optical flow estimation [46], [72], [73], depth completion [43], [74], [75], and stereo matching [45], [76], [77]. Classical iterative refinements [72], [74] optimize directly on high-resolution outputs using high-computing-cost operators, limiting researchers from applying more iterations for better predictions. To address this limitation, RAFT [46] proposes to optimize an intermediate low-resolution prediction using ConvGRU modules. For monocular depth estimation, IEBins [44] employs similar methods to optimize a depth-bin distribution. Differently, IronDepth [42] propagates depth on pre-computed local surfaces. Regarding surface normal refinement, Lenssen et al. [78] propose a deep iterative method to optimize normals from point clouds. Zhao et al. [79] design a solver to refine depth and normal jointly, but it requires multi-view priors and per-sample post optimization; without multi-view priors, such a non-learnable optimization method can fail due to unsatisfactory initial predictions. All the monocular methods [42], [44], [78], however, iterate over either depth or normal independently. In contrast, our joint optimization module tightly couples depth and normal with each other.
Large-scale data training. Recently, various natural language and computer vision problems [80], [81], [82] have achieved impressive progress with large-scale data training. CLIP [81] is a promising classification model, which is trained on billions of paired image and language description data; it achieves state-of-the-art performance over several classification benchmarks by zero-shot testing. DINOv2 [83] collects 142M images to conduct vision-only self-supervised learning for vision transformers [84]. Generative models like LDM [85] have also undergone billion-level data pre-training. For depth prediction, large-scale data training has been widely applied: Ranftl et al. [27] mix over 2 million images in training, LeReS [26] collects over 300 thousand images, and Eftekhar et al. [35] also merge millions of images to build a strong depth prediction model. For surface normal estimation, Omni-data [35] performs dense reconstruction to generate 14M frames with surface normal annotations, aggregating data from six different datasets. Furthermore, it established multi-task training involving these annotations.

3 METHOD
Preliminaries. We consider the pin-hole camera model with intrinsic parameters formulated as
$$K = \begin{bmatrix} \hat f/\delta & 0 & u_0 \\ 0 & \hat f/\delta & v_0 \\ 0 & 0 & 1 \end{bmatrix},$$
where $\hat f$ is the focal length (in micrometers), $\delta$ is the pixel size (in micrometers), and $(u_0, v_0)$ is the principal center. $f = \hat f/\delta$ is the pixel-represented focal length.

3.1 Ambiguity Issues in Metric Depth Estimation
Fig. 5 presents an example of photos taken by different cameras and at different distances. From the image appearance alone, one may think the last two photos are taken at a similar location by the same camera. In fact, due to different focal lengths, they are captured at different distances. Thus, camera intrinsic parameters are critically important for metric estimation from a single image, as otherwise the problem is ill posed. To avoid such metric ambiguity, recent methods, such as MiDaS [27] and LeReS [25], decouple the metric from the supervision and compromise by learning affine-invariant depth.
Fig. 6 (A) shows a simple pin-hole perspective projection. Object A located at distance $d_a$ is projected to $A'$. Based on the principle of similarity, we have
$$d_a = \frac{\hat f}{\hat S'}\,\hat S = \hat S \cdot \alpha, \qquad (1)$$
where $\hat S$ and $\hat S'$ are the real and imaging size respectively, and $\hat\cdot$ denotes variables in physical units (e.g., millimeters). To recover $d_a$ from a single image, the focal length, the imaging size of the object, and the real-world object size must be available. Estimating the focal length from a single image is a challenging and ill-posed problem; although several methods [25], [86] have explored it, the accuracy is still far from satisfactory.
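To make Eq. (1) concrete, here is a minimal numeric sketch of the relation $d_a = (\hat f/\hat S')\,\hat S$ and of why an unknown focal length makes the problem ill posed. The object size and imaging size below are made-up illustrative numbers, not values from the paper.

```python
# Eq. (1): d_a = (f_hat / S_hat') * S_hat = S_hat * alpha.

def metric_distance(f_hat_mm: float, imaging_size_mm: float, real_size_m: float) -> float:
    """Distance to an object from its real size, imaging size on the sensor, and focal length."""
    alpha = f_hat_mm / imaging_size_mm      # dimensionless ratio f_hat / S_hat'
    return real_size_m * alpha              # d_a = S_hat * alpha

# A 1.5 m tall chair imaged at 19.5 mm on the sensor with a 26 mm lens:
print(metric_distance(26.0, 19.5, 1.5))     # -> 2.0 (meters)

# The same imaging size with a 52 mm lens implies the chair is twice as far away,
# illustrating the focal-length ambiguity discussed above:
print(metric_distance(52.0, 19.5, 1.5))     # -> 4.0 (meters)
```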
(Fig. 5 photos: focal = 26 mm, depth = 2 m; focal = 52 mm, depth = 2 m; focal = 26 mm, depth = 1 m.)
Fig. 5 – Photos of a chair captured at different distances with different cameras. The first two photos are captured at the same distance but with different cameras, while the last one is taken at a closer distance with the same camera as the first one.

Fig. 7 – Illustration of two cameras with different focal lengths at different distances. As $\hat f_1 = 2\hat f_2$ and $d_1 = 2 d_2$, A is projected onto the two image planes with the same imaging size (i.e., $A'_1 = A'_2$).

Here, we simplify the problem by assuming that the focal length of a training/test image is available. In contrast, understanding the imaging size is much easier for a neural network. To obtain the real-world object size, a neural network needs to understand the semantic scene layout and the object, at which a neural network excels. We define $\alpha = \hat f / \hat S'$, so $d_a$ is proportional to $\alpha$.
We make the following observations regarding sensor size, pixel size, and focal length.
O1: Sensor size and pixel size do not affect metric depth estimation. Based on the perspective projection (Fig. 6 (A)), the sensor size only affects the field of view (FOV) and is irrelevant to $\alpha$, thus it does not affect metric depth estimation. For the pixel size, assume two cameras with different pixel sizes ($\delta_1 = 2\delta_2$) but the same focal length $\hat f$ capture the same object located at $d_a$. Fig. 6 (B) shows their captured photos. According to the preliminaries, the pixel-represented focal length $f_1 = \frac{1}{2} f_2$. As the second camera has a smaller pixel size, although the projected imaging size $\hat S'$ is the same, the pixel-represented image resolution is $S'_1 = \frac{1}{2} S'_2$. According to Eq. (1), $\frac{\hat f}{\delta_1 S'_1} = \frac{\hat f}{\delta_2 S'_2}$, i.e., $\alpha_1 = \alpha_2$, so $d_1 = d_2$. Therefore, different camera sensors do not affect metric depth estimation.
O2: The focal length is vital for metric depth estimation. Fig. 5 shows the metric ambiguity issue caused by an unknown focal length, and Fig. 7 illustrates it. If two cameras ($\hat f_1 = 2\hat f_2$) are at distances $d_1 = 2 d_2$, the imaging sizes on the cameras are the same. Thus, from appearance alone, the network is confused when supervised with different labels. Based on this observation, we propose a canonical camera transformation method to resolve the conflict between supervision and image appearance.
Unlike depth, surface normal does not have any metric ambiguity problem. In Fig. 8, we illustrate this concept with two depth maps at varying scales, denoted as $D_1$ and $D_2$, featuring distinct metrics $d_1$ and $d_2$, respectively, where $d_1 < d_2$. After unprojecting the depths to 3D point clouds, the dolls are at different distances, i.e., $d_1$ and $d_2$ with $d_1 < d_2$. However, despite the variation in depth scales, the surface normals $\mathbf{n}_1$ and $\mathbf{n}_2$ corresponding to a certain pixel $A' \in I$ remain the same.

Fig. 8 – The metric-agnostic property of normals. With differently predicted metrics $d_1$ and $d_2$, the pixel $A$ on the image is back-projected to 3D points $A_1$ and $A_2$, respectively. The surface normal $\mathbf{n}_1$ at $A_1$ and $\mathbf{n}_2$ at $A_2$ remain the same.

Fig. 6 – Pinhole camera model. (A) Object A at distance $d_a$ is projected onto the image plane. (B) Using two cameras to capture the car; the left one has a larger pixel size. Although the projected imaging sizes are the same, the pixel-represented images (resolutions, 8×12 vs. 16×24) are different.

3.2 Canonical Camera Transformation
The core idea is to set up a canonical camera space ($(f^c_x, f^c_y)$, with $f^c_x = f^c_y = f^c$ in experiments) and transform all training data to this space. Consequently, all data can roughly be regarded as captured by the canonical camera. We propose two transformation methods, i.e., either transforming the input image ($I \in \mathbb{R}^{H\times W\times 3}$) or the ground-truth (GT) label ($D \in \mathbb{R}^{H\times W}$). The original intrinsics are $\{f, u_0, v_0\}$.
Method 1: transforming depth labels (CSTM label). The ambiguity in Fig. 5 concerns depths, so our first method directly transforms the ground-truth depth labels to solve this problem. Specifically, we scale the ground-truth depth $D^*$ with the ratio $\omega_d = \frac{f^c}{f}$ in training, i.e., $D^*_c = \omega_d D^*$. The original camera model is transformed to $\{f^c, u_0, v_0\}$. In inference, the predicted depth $D_c$ is in the canonical space and needs a de-canonical transformation to recover the metric information, i.e., $D = \frac{1}{\omega_d} D_c$. Note that the input $I$ does not undergo any transformation, i.e., $I_c = I$.
Method 2: transforming input images (CSTM image). From another view, the ambiguity is caused by similar image appearance, so this method transforms the input image to simulate the canonical camera's imaging effect. Specifically, the image $I$ is resized with the ratio $\omega_r = \frac{f^c}{f}$, i.e., $I_c = T(I, \omega_r)$, where $T(\cdot)$ denotes image resizing. The optical center is resized as well, so the canonical camera model is $\{f^c, \omega_r u_0, \omega_r v_0\}$. The ground-truth labels are resized without any value scaling, i.e., $D^*_c = T(D^*, \omega_r)$. In inference, the de-canonical transformation resizes the prediction back to the original size without scaling, i.e., $D = T(D_c, \frac{1}{\omega_r})$.
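Below is a minimal sketch of the two canonical transformations and their de-canonical counterparts, under assumptions of our own: the canonical focal length value, the function names, and the tensor layout (N, C, H, W) are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as Fn

F_CANON = 1000.0  # assumed canonical pixel-represented focal length f^c

def cstm_label(depth_gt: torch.Tensor, f: float):
    """CSTM label: keep the image, scale GT depth by w_d = f^c / f (D*_c = w_d * D*)."""
    w_d = F_CANON / f
    return depth_gt * w_d, w_d

def cstm_image(image: torch.Tensor, depth_gt: torch.Tensor, f: float):
    """CSTM image: resize image and labels by w_r = f^c / f; depth values are unchanged."""
    w_r = F_CANON / f
    size = [round(image.shape[-2] * w_r), round(image.shape[-1] * w_r)]
    image_c = Fn.interpolate(image, size=size, mode="bilinear", align_corners=False)
    depth_c = Fn.interpolate(depth_gt, size=size, mode="nearest")
    return image_c, depth_c, w_r

def decanonical(pred_depth_c: torch.Tensor, f: float, orig_hw=None, label_variant=True):
    """Undo the canonical transform at inference to recover metric depth."""
    if label_variant:
        return pred_depth_c * (f / F_CANON)            # D = D_c / w_d
    return Fn.interpolate(pred_depth_c, size=orig_hw,  # D = T(D_c, 1 / w_r)
                          mode="bilinear", align_corners=False)
```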
(Fig. 9 diagram: the input $I$ in the real world is mapped by the canonical camera transform to $I_c$ in the canonical camera space $(f^c, u^c, v^c)$; an encoder-decoder produces an initial depth $\hat D^0_c$ and initial normal $\hat N^0$, which the joint depth-normal optimization module refines for $t = 0, 1, \ldots, T$; the de-canonical transform maps the final depth back to the real world; supervision uses the depth GT $D^*$ transformed to $D^*_c$ ($L_d$), the optional normal GT $N^*$ ($L_n$), and the depth-normal consistency loss $L_{d\text{-}n}$.)

Fig. 9 – Pipeline. Given an input image $I$, we first transform it to the canonical space using CSTM. The transformed image $I_c$ is fed into a standard depth-normal estimation model to produce the predicted metric depth $D_c$ in the canonical space and the metric-agnostic surface normal $N$. During training, $D_c$ is supervised by the GT depth $D^*_c$, which is also transformed into the canonical space. In inference, after producing the metric depth $D_c$ in the canonical space, we perform a de-canonical transformation to convert it back to the space of the original input $I$. The canonical and de-canonical transformations are executed using the camera intrinsics. The predicted normal $N$ is supervised by depth-normal consistency via the recovered metric depth $D$ as well as the GT normal $N^*$, if available.

Fig. 9 shows the pipeline. After performing either transformation, we randomly crop a patch for training. The cropping only adjusts the FOV and the optical center, thus not causing any metric ambiguity issues. In the label transformation method, $\omega_r = 1$ and $\omega_d = \frac{f^c}{f}$, while $\omega_d = 1$ and $\omega_r = \frac{f^c}{f}$ in the image transformation method. During training, the transformed ground-truth depth labels $D^*_c$ are then used as supervision. Notably, as surface normals do not suffer from any metric ambiguity, we impose no transformation on the normal labels $N^*$.
Mixed-data training is an effective way to boost generalization. We collect 18 datasets for training, see Tab. 5 for details. The mixed data covers over 10K different cameras. All collected training data include paired camera intrinsic parameters, which are used in our canonical transformation module.

3.3 Jointly Optimizing Depth and Normal
We propose to optimize metric depth and surface normal jointly in an end-to-end manner. This optimization is primarily aimed at leveraging the large amount of annotation knowledge available in depth datasets to improve normal estimation, particularly in outdoor scenarios, where depth datasets contain significantly more annotations than normal datasets. In our experiments, we collect from the community 9488K images with depth annotations across 14 outdoor datasets but fewer than 20K outdoor normal-labeled images, as presented in Tab. 5.
To facilitate knowledge flow across depth and normal, we implement the learning-based optimization with recurrent refinement blocks, as depicted in Fig. 10. Unlike previous monocular methods [42], [44], our method updates both depth and normal iteratively through these blocks. Inspired by RAFT [45], [46], we iteratively optimize the intermediate low-resolution depth $\hat D_c$ and unnormalized normal $\hat N_u$, where $\hat\cdot$ denotes a low-resolution prediction, $\hat D_c \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4}}$, $\hat N_u \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 3}$, and the subscript $c$ means the depth $\hat D_c$ is in the canonical space. As sketched in Fig. 10, $\hat D^t_c$ and $\hat N^t_u$ represent the low-resolution depth and normal optimized after step $t$, where $t = 0, 1, 2, \ldots, T$ denotes the step index. Initially, at step $t = 0$, $\hat D^0_c$ and $\hat N^0_u$ are given by the decoder. In addition to updating depth and normal, the optimization module also updates hidden feature maps $H^t$, which are initialized by the decoder. During each iteration, the learned recurrent block $\mathcal{F}$ outputs updates $\Delta \hat D_c$, $\Delta \hat N_u$ and renews the hidden features $H$:
$$\Delta \hat D^{t+1}_c,\ \Delta \hat N^{t+1}_u,\ H^{t+1} = \mathcal{F}(\hat D^t_c, \hat N^t_u, H^t, H^0). \qquad (2)$$
The updates are then applied to the predictions:
$$\hat D^{t+1}_c = \hat D^t_c + \Delta \hat D^{t+1}_c, \qquad \hat N^{t+1}_u = \hat N^t_u + \Delta \hat N^{t+1}_u. \qquad (3)$$
To be more specific, the recurrent block $\mathcal{F}$ comprises a ConvGRU sub-block and two projection heads. First, the ConvGRU sub-block updates the hidden features $H^t$, taking all the variables as inputs. Subsequently, the two branched projection heads $\mathcal{G}_d$ and $\mathcal{G}_n$ estimate the updates $\Delta \hat D^{t+1}$ and $\Delta \hat N^{t+1}$ respectively. A more comprehensive representation of Eq. (2), therefore, can be written as:
$$H^{t+1} = \mathrm{ConvGRU}(\hat D^t, \hat N^t, H^0, H^t), \qquad \Delta \hat D^{t+1} = \mathcal{G}_d(H^{t+1}), \quad \Delta \hat N^{t+1} = \mathcal{G}_n(H^{t+1}). \qquad (4)$$
For detailed structures of the refinement module $\mathcal{F}$, we refer readers to the supplementary materials.
After $T+1$ iterative steps, we obtain the well-optimized low-resolution predictions $\hat D^{T+1}_c$ and $\hat N^{T+1}_u$. These predictions are then up-sampled and post-processed to generate the final depth $D_c$ and surface normal $N$:
$$D_c = \mathcal{H}_d(\mathrm{upsample}(\hat D^{T+1}_c)), \qquad N = \mathcal{H}_n(\mathrm{upsample}(\hat N^{T+1}_u)), \qquad (5)$$
where $\mathcal{H}_d$ is the ReLU function to guarantee that depth is non-negative, and $\mathcal{H}_n$ is a normalization to ensure $\|\mathbf{n}\| = 1$ for all pixels.
In a general formulation, the end-to-end network in Fig. 10 can be rewritten as:
$$D_c, N = \mathcal{N}_{d\text{-}n}(I_c, \theta), \qquad (6)$$
where $\theta$ are the parameters of the network $\mathcal{N}_{d\text{-}n}$.
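The following PyTorch sketch illustrates the recurrent update scheme of Eqs. (2)-(5): a ConvGRU renews the hidden state and two projection heads predict additive updates for depth and normal, after which the refined maps are upsampled, the depth is passed through ReLU and the normal is normalized. Channel sizes, the number of iterations, and the ConvGRU layout are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class ConvGRU(nn.Module):
    def __init__(self, hidden: int, inp: int):
        super().__init__()
        self.convz = nn.Conv2d(hidden + inp, hidden, 3, padding=1)
        self.convr = nn.Conv2d(hidden + inp, hidden, 3, padding=1)
        self.convq = nn.Conv2d(hidden + inp, hidden, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))                     # update gate
        r = torch.sigmoid(self.convr(hx))                     # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q                            # H^{t+1}

class JointRefiner(nn.Module):
    def __init__(self, hidden: int = 64, iters: int = 4):
        super().__init__()
        self.gru = ConvGRU(hidden, inp=hidden + 4)            # inputs: H^0 ++ depth(1) ++ normal(3)
        self.head_d = nn.Conv2d(hidden, 1, 3, padding=1)      # projection head G_d
        self.head_n = nn.Conv2d(hidden, 3, 3, padding=1)      # projection head G_n
        self.iters = iters

    def forward(self, d0, n0, h0):
        d, n, h = d0, n0, h0
        for _ in range(self.iters):                           # t = 0 .. T
            x = torch.cat([h0, d, n], dim=1)                  # condition on H^0, D^t, N^t
            h = self.gru(h, x)                                # Eq. (4): H^{t+1}
            d = d + self.head_d(h)                            # Eq. (3): D^{t+1} = D^t + dD
            n = n + self.head_n(h)                            # Eq. (3): N^{t+1} = N^t + dN
        depth = torch.relu(Fn.interpolate(d, scale_factor=4, mode="bilinear", align_corners=False))
        normal = Fn.normalize(Fn.interpolate(n, scale_factor=4, mode="bilinear", align_corners=False), dim=1)
        return depth, normal                                  # Eq. (5)
```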
(Fig. 10 diagram: the encoder-decoder takes $I_c$ and produces the hidden state $H^0$ and initial predictions $\hat D^0_c$, $\hat N^0_u$; a chain of ConvGRU + projection blocks predicts updates $\Delta$ that refine $\hat D^t_c$, $\hat N^t_u$ up to $\hat D^{T+1}_c$, $\hat N^{T+1}_u$, which are upsampled and post-processed into the final depth $D_c$ and normal $N$.)

Fig. 10 – Joint depth and normal optimization. In the canonical space, we deploy recurrent blocks composed of ConvGRU sub-blocks (C-GRU) and projection heads (Proj.) to predict the updates $\Delta$. During optimization, the intermediate low-resolution depth and normal $\hat D^0_c$, $\hat N^0_u$ are initially given by the decoder and then iteratively refined by the predicted updates $\Delta$. After $T+1$ iterations, the optimized intermediate predictions $\hat D^{T+1}_c$, $\hat N^{T+1}_u$ are upsampled and post-processed to obtain the final depth $D_c$ in the canonical space and the final normal $N$.

3.4 Supervision
The training objective is:
$$\min_{\theta}\ L(\mathcal{N}_{d\text{-}n}(I_c, \theta),\ D^*_c,\ N^*), \qquad (7)$$
where $D^*_c$ and $I_c$ are the transformed ground-truth depth labels and images in the canonical space $c$, $N^*$ denotes the normal labels, and $L$ is the supervision loss illustrated in the following.
Random proposal normalization loss. To boost the performance of depth estimation, we propose a random proposal normalization loss (RPNL). The scale-shift invariant loss [25], [27] is widely applied for affine-invariant depth estimation; it decouples the depth scale to emphasize the distribution of the single image. However, such normalization based on the whole image inevitably squeezes fine-grained depth differences, particularly in close regions. Inspired by this, we propose to randomly crop several patches ($p_{i(i=0,\ldots,M)} \in \mathbb{R}^{h_i \times w_i}$) from the ground truth $D^*_c$ and the predicted depth $D_c$. Then we employ median absolute deviation normalization [87] for the paired patches. By normalizing the local statistics, we can enhance local contrast. The loss function is as follows:
$$\mathcal{L}_{\mathrm{RPNL}} = \frac{1}{MN}\sum_{p_i}^{M}\sum_{j}^{N}\left|\frac{d^*_{p_i,j}-\mu(d^*_{p_i,j})}{\frac{1}{N}\sum_{j}^{N}\left|d^*_{p_i,j}-\mu(d^*_{p_i,j})\right|} - \frac{d_{p_i,j}-\mu(d_{p_i,j})}{\frac{1}{N}\sum_{j}^{N}\left|d_{p_i,j}-\mu(d_{p_i,j})\right|}\right|, \qquad (8)$$
where $d^* \in D^*_c$ and $d \in D_c$ are the ground-truth and predicted depth respectively, and $\mu(\cdot)$ is the median of the depth. $M$ is the number of proposal crops, which is set to 32. During training, proposals are randomly cropped from the image at 0.125 to 0.5 of the original size. Furthermore, several other losses are employed, including the scale-invariant logarithmic loss [60] $\mathcal{L}_{\mathrm{silog}}$, the pair-wise normal regression loss [25] $\mathcal{L}_{\mathrm{PWN}}$, and the virtual normal loss [18] $\mathcal{L}_{\mathrm{VNL}}$. Note that $\mathcal{L}_{\mathrm{silog}}$ is a variant of the L1 loss. The overall depth loss is:
$$\mathcal{L}_d = \mathcal{L}_{\mathrm{PWN}} + \mathcal{L}_{\mathrm{VNL}} + \mathcal{L}_{\mathrm{silog}} + \mathcal{L}_{\mathrm{RPNL}}. \qquad (9)$$
Normal loss. To supervise normal prediction, we employ two distinct loss functions depending on the availability of ground-truth (GT) normals $N^*$. As presented in Fig. 9, when GT normals are provided, we utilize an aleatoric uncertainty-aware loss [33] ($\mathcal{L}_n(\cdot)$) to supervise the prediction $N$. Alternatively, in the absence of GT normals, we propose a consistency loss $\mathcal{L}_{d\text{-}n}(D, N)$ to align the predicted depth and normal. This loss is computed based on the similarity between a pseudo-normal map generated from the predicted depth using the least squares method [41] and the predicted normal itself. Different from previous methods [33], [41], this loss operates as a self-supervision mechanism, requiring no depth or normal ground-truth labels. Note that here we use the depth $D$ in the real world instead of $D_c$ in the canonical space to compute depth-normal consistency. The overall loss is as follows:
$$L = w_d \mathcal{L}_d(D_c, D^*_c) + w_n \mathcal{L}_n(N, N^*) + w_{d\text{-}n} \mathcal{L}_{d\text{-}n}(N, D), \qquad (10)$$
where $w_d = 0.5$, $w_n = 1$, and $w_{d\text{-}n} = 0.01$ serve as weights to balance the loss terms.

4 EXPERIMENTS
Dataset details. We have curated a comprehensive dataset comprising 18 public RGB-D datasets, totaling over 16 million data points for training. It spans diverse indoor and outdoor scenes. Notably, approximately 10 million frames are annotated with normals, but the majority of these annotations pertain to indoor scenes exclusively. It is worth mentioning that all datasets provide camera intrinsic parameters. Apart from the test splits of the training datasets, we collect 7 unseen datasets for robustness and generalization evaluation. Details of the employed training and testing data are reported in Tab. 5.
Implementation details. In our experiments, we employ different network architectures, aiming to provide diverse choices for the community, including convnets and transformers. For convnets, we employ a UNet architecture with the ConvNeXt-Large [99] backbone; ImageNet-22K pre-trained weights are used for initialization. For transformers, we apply DINOv2-reg [83], [100] vision transformers [84] (ViT) as backbones and DPT [28] as decoders.
We use AdamW with a batch size of 192, an initial learning rate of 0.0001 for all layers, and polynomial decay with a power of 0.9. We train our models on 48 A100 GPUs for 800k iterations. Following DiverseDepth [18], we balance all datasets in a mini-batch to ensure each dataset accounts for an almost equal ratio. During training, images are processed by the canonical camera transformation module, flipped horizontally with a 50% chance, and then randomly cropped to 512 × 960 pixels for convnets and 616 × 1064 for vision transformers.
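Below is a minimal PyTorch sketch of the RPNL term in Eq. (8), assuming square random crops between 0.125 and 0.5 of the image size and M = 32 patches as stated above; the helper names are illustrative and valid-pixel masking is omitted for brevity.

```python
import torch

def mad_normalize(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Median absolute deviation normalization of a flattened patch: (x - med) / mean|x - med|."""
    med = x.median()
    return (x - med) / ((x - med).abs().mean() + eps)

def rpnl(pred: torch.Tensor, gt: torch.Tensor, num_patches: int = 32) -> torch.Tensor:
    """pred, gt: (H, W) depth maps in the canonical space; GT assumed valid everywhere."""
    h, w = gt.shape
    losses = []
    for _ in range(num_patches):
        s = int(torch.empty(1).uniform_(0.125, 0.5).item() * min(h, w))  # crop side length
        y = torch.randint(0, h - s + 1, (1,)).item()
        x = torch.randint(0, w - s + 1, (1,)).item()
        p = pred[y:y + s, x:x + s].flatten()
        g = gt[y:y + s, x:x + s].flatten()
        losses.append((mad_normalize(g) - mad_normalize(p)).abs().mean())
    return torch.stack(losses).mean()
```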
TABLE 1 – Quantitative comparison on the NYUv2 and KITTI metric depth benchmarks. Methods overfitting the benchmark are marked with grey, while robust depth estimation methods are in blue. 'ZS' denotes zero-shot testing, and 'FT' means the method is further fine-tuned on the benchmark. Among all zero-shot (ZS) results, our method performs the best and is even better than the overfitting methods. Further fine-tuning (FT) helps our method surpass all known methods, ranked by the averaged ranking among all metrics. Best results are in bold and second bests are underlined.

NYUv2 Metric Depth Benchmark (Method, δ1↑, δ2↑, δ3↑, AbsRel↓, log10↓, RMS↓)
Li et al. [88]  0.788  0.958  0.991  0.143  0.063  0.635
Laina et al. [89]  0.811  0.953  0.988  0.127  0.055  0.573
VNL [30]  0.875  0.976  0.994  0.108  0.048  0.416
TrDepth [20]  0.900  0.983  0.996  0.106  0.045  0.365
Adabins [19]  0.903  0.984  0.997  0.103  0.044  0.364
NeWCRFs [17]  0.922  0.992  0.998  0.095  0.041  0.334
IEBins [44]  0.936  0.992  0.998  0.087  0.038  0.314
ZeroDepth [90] ZS  0.901  0.961  -  0.100  -  0.380
Polymax [34] ZS  0.969  0.996  0.999  0.067  0.029  0.250
ZoeDepth [36] FT  0.953  0.995  0.999  0.077  0.033  0.277
ZeroDepth [90] FT  0.954  0.995  1.000  0.074  0.103  0.269
DepthAnything [91] FT  0.984  0.998  1.000  0.056  0.024  0.206
Ours Conv-L CSTM image ZS  0.925  0.983  0.994  0.092  0.040  0.341
Ours Conv-L CSTM label ZS  0.944  0.986  0.995  0.083  0.035  0.310
Ours ViT-L CSTM label ZS  0.975  0.994  0.998  0.063  0.028  0.251
Ours ViT-g CSTM label ZS  0.980  0.997  0.999  0.067  0.030  0.260
Ours ViT-L CSTM label FT  0.989  0.998  1.000  0.047  0.020  0.183
Ours ViT-g CSTM label FT  0.987  0.997  0.999  0.045  0.015  0.187

KITTI Metric Depth Benchmark (Method, δ1↑, δ2↑, δ3↑, AbsRel↓, RMS↓, RMS log↓)
Guo et al. [92]  0.902  0.969  0.986  0.090  3.258  0.168
VNL [30]  0.938  0.990  0.998  0.072  3.258  0.117
TrDepth [20]  0.956  0.994  0.999  0.064  2.755  0.098
Adabins [19]  0.964  0.995  0.999  0.058  2.360  0.088
NeWCRFs [17]  0.974  0.997  0.999  0.052  2.129  0.079
IEBins [44]  0.978  0.998  0.999  0.050  2.011  0.075
ZeroDepth [90] ZS  0.910  0.980  0.996  0.102  4.044  0.172
ZoeDepth [36] FT  0.971  0.996  0.999  0.057  2.281  0.082
ZeroDepth [90] FT  0.968  0.995  0.999  0.053  2.087  0.083
DepthAnything [91] FT  0.982  0.998  1.000  0.046  1.869  0.069
Ours Conv-L CSTM image ZS  0.967  0.995  0.999  0.060  2.843  0.087
Ours Conv-L CSTM label ZS  0.964  0.993  0.998  0.058  2.770  0.092
Ours ViT-L CSTM label ZS  0.974  0.995  0.999  0.052  2.511  0.074
Ours ViT-g CSTM label ZS  0.977  0.996  0.999  0.051  2.403  0.080
Ours ViT-L CSTM label FT  0.985  0.998  0.999  0.044  1.985  0.064
Ours ViT-g CSTM label FT  0.989  0.998  1.000  0.039  1.766  0.060

TABLE 2 – Quantitative comparison of surface normals on the NYUv2, iBims-1, and ScanNet normal benchmarks. 'ZS' means zero-shot testing and 'FT' means post fine-tuning on the target dataset. Methods trained only on NYU are highlighted with grey. Best results are in bold and second bests are underlined. Our method ranks first over all benchmarks.

NYUv2 Normal Benchmark (Method, 11.25°↑, 22.5°↑, 30°↑, mean↓, median↓, RMS normal↓)
Ladicky et al. [67]  0.275  0.490  0.587  33.5  23.1  -
Fouhey et al. [93]  0.405  0.541  0.589  35.2  17.9  -
Deep3D [64]  0.420  0.612  0.682  20.9  13.2  -
Eigen et al. [65]  0.444  0.672  0.759  20.9  13.2  -
SkipNet [94]  0.479  0.700  0.778  19.8  12.0  28.2
SURGE [95]  0.473  0.689  0.766  20.6  12.2  -
GeoNet [41]  0.484  0.715  0.795  19.0  11.8  26.9
PAP [96]  0.488  0.722  0.798  18.6  11.7  25.5
GeoNet++ [63]  0.502  0.732  0.807  18.5  11.2  26.7
Bae et al. [33]  0.622  0.793  0.852  14.9  7.5  23.5
FrameNet [40] ZS  0.507  0.720  0.795  18.6  11.0  26.8
VPLNet [97] ZS  0.543  0.738  0.807  18.0  9.8  -
TiltedSN [32] ZS  0.598  0.774  0.834  16.1  8.1  25.1
Bae et al. [33] ZS  0.597  0.775  0.837  16.0  8.4  24.7
Polymax [34] ZS  0.656  0.822  0.878  13.1  7.1  20.4
Ours ViT-l CSTM label ZS  0.662  0.831  0.881  13.1  7.1  21.1
Ours ViT-g CSTM label ZS  0.664  0.831  0.881  13.3  7.0  21.3
Ours ViT-l CSTM label FT  0.688  0.849  0.898  12.0  6.5  19.2
Ours ViT-g CSTM label FT  0.662  0.837  0.889  13.2  7.5  20.2

ibims-1 Normal Benchmark
VNL [30] ZS  0.179  0.386  0.494  39.8  30.4  51.0
BTS [98] ZS  0.130  0.295  0.400  44.0  37.8  53.5
Adabins [19] ZS  0.180  0.387  0.506  37.1  29.6  46.9
IronDepth [42] ZS  0.431  0.639  0.716  25.3  14.2  37.4
Ours ViT-L CSTM label ZS  0.694  0.758  0.785  19.4  5.7  34.9
Ours ViT-L CSTM label ZS  0.697  0.762  0.788  19.6  5.7  35.2

ScanNet Normal Benchmark
FrameNet [40]  0.625  0.801  0.858  14.7  7.7  22.8
VPLNet [97]  0.663  0.818  0.870  12.6  6.0  21.1
TiltedSN [32]  0.693  0.839  0.886  12.6  6.0  21.1
Bae et al. [33]  0.711  0.854  0.898  11.8  5.7  20.0
Ours ViT-L CSTM label  0.760  0.885  0.923  9.9  5.3  16.4
Ours ViT-g CSTM label  0.778  0.901  0.935  9.2  5.0  15.3

In the ablation experiments, the training settings are different: we sample 5000 images from each dataset for training and train on 8 GPUs for 150K iterations. Details of network architectures, training setups, and efficiency analysis are presented in the supplementary materials. Fine-tuning experiments on KITTI and NYU are conducted on 8 GPUs with 20K further steps.
Evaluation details for monocular depth and normal estimation. a) To show the robustness of our metric depth estimation method, we test on 8 zero-shot benchmarks, including NYUv2 [58], KITTI [59], NuScenes [103], iBims-1 [104], DIODE [105] (indoor, outdoor, and full), and ETH3D [106]. Following previous works [17], the absolute relative error (AbsRel), the accuracy under threshold ($\delta_i < 1.25^i$, $i = 1, 2, 3$), the root mean squared error (RMS), the root mean squared error in log space (RMS log), and the log10 error (log10) metrics are employed. For the KITTI and NYU benchmarks, we report both zero-shot and fine-tuning testing results. b) For normal estimation tasks and ablations, we employ several error metrics to assess performance. Specifically, we calculate the mean (mean), median (median), and root mean square (RMS normal) of the angular error, as well as the accuracy under thresholds of {11.25°, 22.5°, 30.0°}, consistent with methodologies established in previous studies [33]. We conduct in-domain evaluation using the ScanNet dataset, while the NYU and iBims-1 datasets are reserved for zero-shot generalization testing. c) Furthermore, we also follow current affine-invariant depth benchmarks [25], [29] (Tab. 4) to evaluate the generalization ability on 5 zero-shot datasets, i.e., NYUv2, DIODE, ETH3D, ScanNet [107], and KITTI. We mainly compare with models trained on large-scale data. Note that in this benchmark we follow existing methods and apply scale-shift alignment before evaluation.
We report results with different canonical transformation methods (CSTM label and CSTM image) on the ConvNeXt-Large model (Conv-L in Tab. 1 and Tab. 2). As CSTM label is slightly better, more results using this method from multi-size ViT models (ViT-S for Small, ViT-L for Large, ViT-g for giant2) are reported. Note that all models for zero-shot testing use the same checkpoints, except for the fine-tuning experiments.
Evaluation details for reconstruction and SLAM. a) To evaluate our metric 3D reconstruction quality, we randomly sample 9 unseen scenes from NYUv2 and use COLMAP [108] to obtain the camera poses for multi-frame reconstruction. The Chamfer $l_1$ distance and the F-score [109] are used to evaluate the reconstruction accuracy. b) In dense-SLAM experiments, following Li et al. [110], we test on the KITTI odometry benchmark [59] and evaluate the average translational RMS drift (%, $t_{rel}$) and rotational RMS drift (°/100m, $r_{rel}$) errors [59].
Evaluation on metric depth benchmarks. To evaluate the accuracy of predicted metric depth, we first compare with state-of-the-art (SoTA) metric depth prediction methods on NYUv2 [58] and KITTI [111]. We use the same model for all evaluations. Results are reported in Tab. 1. Firstly, comparing with existing overfitting methods, which are trained on the benchmarks for hundreds of epochs, our zero-shot testing ('ZS' in the table) without any fine-tuning or metric adjustment already achieves comparable or even better performance on some metrics.
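For reference, the sketch below implements the depth metrics named above (AbsRel and the δ-threshold accuracies) together with the least-squares scale-shift alignment used for the affine-invariant benchmarks. It is a minimal illustration with hypothetical function names, not the benchmark's official evaluation code.

```python
import torch

def absrel_and_deltas(pred: torch.Tensor, gt: torch.Tensor, mask: torch.Tensor):
    """AbsRel and delta_i accuracies (ratio max(p/g, g/p) < 1.25^i) over valid pixels."""
    p, g = pred[mask], gt[mask]
    absrel = ((p - g).abs() / g).mean()
    ratio = torch.max(p / g, g / p)
    deltas = [(ratio < 1.25 ** i).float().mean() for i in (1, 2, 3)]
    return absrel, deltas

def align_scale_shift(pred: torch.Tensor, gt: torch.Tensor, mask: torch.Tensor):
    """Solve min_{s,t} ||s * pred + t - gt||^2 on valid pixels, then apply it to the prediction."""
    p, g = pred[mask], gt[mask]
    A = torch.stack([p, torch.ones_like(p)], dim=1)         # (N, 2) design matrix
    st = torch.linalg.lstsq(A, g.unsqueeze(1)).solution     # scale and shift
    return pred * st[0] + st[1]
```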

TABLE 3 – Quantitative comparison with SoTA metric depth methods on 5 unseen benchmarks. For SoTA methods, we use
their NYUv2 and KITTI models for indoor and outdoor scene evaluation respectively, while we use the same model for all
zero-shot testing.
Method  Metric Head  DIODE(Indoor)  iBIMS-1  DIODE(Outdoor)  ETH3D  NuScenes
(Indoor scenes: AbsRel↓/RMS↓; outdoor scenes: AbsRel↓/RMS↓)
Adabins [19] KITTI or NYU † 0.443 / 1.963 0.212 / 0.901 0.865 / 10.35 1.271 / 6.178 0.445 / 10.658
NewCRFs [17] KITTI or NYU † 0.404 / 1.867 0.206 / 0.861 0.854 / 9.228 0.890 / 5.011 0.400 / 12.139
ZoeDepth [36] KITTI and NYU ‡ 0.400 / 1.581 0.169 / 0.711 0.269 / 6.898 0.545 / 3.112 0.504 / 7.717
Ours Conv-L CSTM label Unified 0.252 / 1.440 0.160 / 0.521 0.414 / 6.934 0.416 / 3.017 0.154 / 7.097
Ours Conv-L CSTM image Unified 0.268 / 1.429 0.144 / 0.646 0.535 / 6.507 0.342 / 2.965 0.147 / 5.889
Ours ViT-L CSTM image Unified 0.093 / 0.389 0.185 / 0.592 0.221 / 3.897 0.357 / 2.980 0.165 / 9.001
Ours ViT-g CSTM image Unified 0.081 / 0.359 0.249 / 0.611 0.201 / 3.671 0.363 / 2.999 0.129 / 6.993
†: Two different metric heads are trained on KITTI and NYU respectively. ‡: Both metric heads are ensembled by an additional router.

TABLE 4 – Comparison with SoTA affine-invariant depth methods on 5 zero-shot transfer benchmarks. Our model significantly
outperforms previous methods and sets new state-of-the-art. Following the benchmark setting, all methods have manually
aligned the scale and shift.
Method  Backbone  #Params  #Data (Pretrain / Train)  NYUv2 (AbsRel↓ δ1↑)  KITTI (AbsRel↓ δ1↑)  DIODE(Full) (AbsRel↓ δ1↑)  ScanNet (AbsRel↓ δ1↑)  ETH3D (AbsRel↓ δ1↑)
DiverseDepth [18] ResNeXt50 [101] 25M 1.3M 320K 0.117 0.875 0.190 0.704 0.376 0.631 0.108 0.882 0.228 0.694
MiDaS [27] ResNeXt101 88M 1.3M 2M 0.111 0.885 0.236 0.630 0.332 0.715 0.111 0.886 0.184 0.752
Leres [25] ResNeXt101 1.3M 354K 0.090 0.916 0.149 0.784 0.271 0.766 0.095 0.912 0.171 0.777
Omnidata [35] ViT-Base 1.3M 12.2M 0.074 0.945 0.149 0.835 0.339 0.742 0.077 0.935 0.166 0.778
HDN [29] ViT-Large [84] 306M 1.3M 300K 0.069 0.948 0.115 0.867 0.246 0.780 0.080 0.939 0.121 0.833
DPT-large [28] ViT-Large 1.3M 188K 0.098 0.903 0.100 0.901 0.182 0.758 0.078 0.938 0.078 0.946
DepthAnything [28] ViT-Large 142M 63.5M 0.043 0.981 0.076 0.947 - - - - 0.127 0.882
Marigold [28] Latent diffusion V2 [85] 899M 5B 74K 0.055 0.961 0.099 0.916 0.308 0.773 0.064 0.951 0.065 0.960
Ours CSTM label ViT-Small 22M 142M 16M 0.056 0.965 0.064 0.950 0.247 0.789 0.033† 0.985† 0.062 0.955
Ours CSTM image ConvNeXt-Large [99] 198M 14.2M 8M 0.058 0.963 0.053 0.965 0.211 0.825 0.074 0.942 0.064 0.965
Ours CSTM label ConvNeXt-Large 14.2M 8M 0.050 0.966 0.058 0.970 0.224 0.805 0.074 0.941 0.066 0.964
Ours CSTM label  ViT-Large  306M  142M  16M  0.042  0.980  0.046  0.979  0.141  0.882  0.021†  0.993†  0.042  0.987
Ours CSTM label  ViT-giant [102]  1011M  142M  16M  0.043  0.981  0.044  0.982  0.136  0.895  0.022†  0.994†  0.042  0.983

†: ScanNet is partly annotated with normal [40]. For samples without normal annotations, these models use depth labels to facilitate normal learning.

Then, comparing with robust monocular depth estimation methods, such as ZeroDepth [90] and ZoeDepth [36], our zero-shot testing is also better. Further post fine-tuning ('FT' in the table) lifts our method to the 1st rank.
Furthermore, we collect 5 unseen datasets for additional metric accuracy evaluation. These datasets contain a wide range of indoor and outdoor scenes, including rooms, buildings, and driving scenes; the camera models are also varied. We mainly compare with the SoTA metric depth estimation methods and take their NYUv2 and KITTI models for indoor and outdoor scene evaluation respectively. From Tab. 3, we observe that although NuScenes is similar to KITTI, existing methods face a noticeable performance decrease. In contrast, our model is more robust.
Generalization over diverse scenes. Affine-invariant depth benchmarks decouple the scale's effect and aim to evaluate a model's generalization ability to diverse scenes. Recent impactful works, such as MiDaS, LeReS, and DPT, achieved promising performance on them. Following them, we test on 5 datasets and manually align the scale and shift to the ground-truth depth before evaluation. Results are reported in Tab. 4. Although our method enforces the network to recover the more challenging metric information, it outperforms these methods by a large margin on all datasets.
Evaluation on surface normal benchmarks. We evaluate our method on the ScanNet, NYU, and iBims-1 surface normal benchmarks. Results are reported in Tab. 2. Firstly, we organize a zero-shot testing benchmark on the NYU dataset, see methods denoted with 'ZS' in the table. We compare with existing methods which are trained on ScanNet or Taskonomy and have achieved promising performance on them, such as Polymax [34] and Bae et al. [33]. Our method surpasses them on most metrics. Then, comparing with methods that have overfitted the NYU data domain for hundreds of epochs (see methods marked with blue), our zero-shot testing outperforms them on all metrics. Our post-fine-tuned models (see 'FT' marks) further boost the performance. Similarly, we also achieve SoTA performance on the iBims-1 and ScanNet benchmarks. For the iBims-1 dataset, we follow IronDepth [42] to generate the ground-truth normal annotations.

TABLE 5 – Training and testing datasets used for experiments.
Datasets  Scenes  Source  Label  Size  #Cam.
Training Data
DDAD [112]  Outdoor  Real-world  Depth  ∼80K  36+
Lyft [113]  Outdoor  Real-world  Depth  ∼50K  6+
Driving Stereo (DS) [114]  Outdoor  Real-world  Depth  ∼181K  1
DIML [115]  Outdoor  Real-world  Depth  ∼122K  10
Argoverse2 [116]  Outdoor  Real-world  Depth  ∼3515K  6+
Cityscapes [117]  Outdoor  Real-world  Depth  ∼170K  1
DSEC [118]  Outdoor  Real-world  Depth  ∼26K  1
Mapillary PSD [119]  Outdoor  Real-world  Depth  750K  1000+
Pandaset [120]  Outdoor  Real-world  Depth  ∼48K  6
UASOL [121]  Outdoor  Real-world  Depth  ∼1370K  1
Virtual KITTI [122]  Outdoor  Synthesized  Depth  37K  2
Waymo [123]  Outdoor  Real-world  Depth  ∼1M  5
Matterport3d [124]  In/Out  Real-world  Depth + Normal  144K  3
Taskonomy [124]  Indoor  Real-world  Depth + Normal  ∼4M  ∼1M
Replica [125]  Indoor  Real-world  Depth + Normal  ∼150K  1
ScanNet† [107]  Indoor  Real-world  Depth + Normal  ∼2.5M  1
HM3d [126]  Indoor  Real-world  Depth + Normal  ∼2000K  1
Hypersim [127]  Indoor  Synthesized  Depth + Normal  54K  1
Testing Data
NYU [58]  Indoor  Real-world  Depth + Normal  654  1
KITTI [59]  Outdoor  Real-world  Depth  652  4
ScanNet† [107]  Indoor  Real-world  Depth + Normal  700  1
NuScenes (NS) [103]  Outdoor  Real-world  Depth  10K  6
ETH3D [106]  Outdoor  Real-world  Depth  431  1
DIODE [105]  In/Out  Real-world  Depth  771  1
iBims-1 [104]  Indoor  Real-world  Depth  100  1
†: ScanNet is a non-zero-shot testing dataset for our ViT models.
(Fig. 11 image grid; columns: RGB, GT Depth, Ours Depth, ZoeDepth, GT Normal, Ours Normal, Bae et al., OmniData.)
Fig. 11 – Qualitative comparisons of metric depth and surface normals for iBims, DIODE, NYU, Eth3d, Nuscenes, and
self-collected drone datasets. We present visualization results of our predictions (‘Ours Depth’ / ‘Ours Normal’), groundtruth
labels (‘GT Depth’ / ‘GT Normal’) and results from other metric depth (‘ZoeDepth’ [36]) and surface normal methods (‘Bae et
al.’ [33] and ‘OmniData’ [35]).
12

RGB Ours Depth ZoeDepth Ours Normal Bae et al. OmniData


Fig. 12 – Qualitative comparisons of metric depth and surface normals in the wild. We present visualization results of our
predictions (‘Ours Depth’ / ‘Ours Normal’) and results from other metric depth (‘ZoeDepth’ [36]) and surface normal methods
(‘Bae et al.’ [33] and ‘OmniData’ [35]).

4.1 Zero-shot Generalization depth for each frame. Although our method does not aim
Qualitative comparisons of surface normals and depths. for the video or multi-view reconstruction problem, our
We visualize our predictions in Fig. 11. A comparison with method can achieve promising consistency between frames
another widely used generalized metric depth method, and reconstruct much more accurate 3D scenes than others
ZoeDepth [36], demonstrates that our approach produces on these zero-shot scenes. From the qualitative comparison
depth maps with superior details on fine-grained structures in Fig. 13. our reconstructions have much less noise and
(objects in row1, suspension lamp in row4, beam in row8), outliers.
and better foreground/background distinction (row 11, 12). Dense-SLAM mapping. Monocular SLAM is an important
In terms of surface normal prediction, our normal maps robotics application. It only relies on a monocular video
exhibits significantly finer details compared to Bae. et al. [33] input to create the trajectory and dense 3D mapping. Owing
and can handle some cases where their method fail (row7, to limited photometric and geometric constraints, existing
8, 9). Our method not only generalizes well across diverse methods face serious scale drift problems in large scenes
scenarios but can also be directly applied to unseen camera and cannot recover the metric information. Our robust
models like the fisheye camera shown in row 12. More metric depth estimation method is a strong depth prior to
visualization results for in-the-wild images are presented the SLAM system. To demonstrate this benefit, we naively
in Fig. 12, including comic-style (Row 2) and CG(computer input our metric depth to the SoTA SLAM system, Droid-
graphics)-generated objects (Row5) SLAM [37], and evaluate the trajectory on KITTI. We do
not do any tuning on the original system. Trajectory com-
parisons are reported in Tab. 7. As Droid-SLAM can access
4.2 Applications Based on Our Method accurate per-frame metric depth, like an RGB-D SLAM, the
In these experiments, we apply the CSTM image model to translation drift (trel ) decreases significantly. Furthermore,
various tasks. with our depths, Droid-SLAM can perform denser and more
3D scene reconstruction . To demonstrate our work can accurate 3D mapping. An example is shown in Fig. 3 and
recover the 3D metric shape in the wild, we first do the more cases are shown in the supplementary materials.
quantitative comparison on 9 NYUv2 scenes, which are We also test on the ETH3D SLAM benchmarks. Results
unseen during training. We predict the per-frame met- are reported in Tab. 8. Droid with our depths has much
ric depth and then fuse them together with provided better SLAM performance. As the ETH3D scenes are all
camera poses. Results are reported in Tab. 6. We com- small-scale indoor scenes, the performance improvement is
pare with the video consistent depth prediction method less than that on KITTI.
(RCVD [128]), the unsupervised video depth estimation Metrology in the wild. To show the robustness and ac-
method (SC-DepthV2 [129]), the 3D scene shape recov- curacy of our recovered metric 3D, we download Flickr
ery method (LeReS [25]), affine-invariant depth estimation photos captured by various cameras and collect coarse
method (DPT [28]), and the multi-view stereo reconstruction camera intrinsic parameters from their metadata. We use
method (DPSNet [48]). Apart from DPSNet and our method, our CSTM image model to reconstruct their metric shape
other methods have to align the scale with the ground truth and measure structures’ sizes (marked in red in Fig. 14),
13

TABLE 6 – Quantitative comparison of 3D scene reconstruction with LeReS [25], DPT [28], RCVD [128], SC-DepthV2 [129],
and a learning-based MVS method (DPSNet [48]) on 9 unseen NYUv2 scenes. Apart from DPSNet and ours, other methods
have to align the scale with ground truth depth for each frame. As a result, our reconstructed 3D scenes achieve the best
performance.
Basement 0001a Bedroom 0015 Dining room 0004 Kitchen 0008 Classroom 0004 Playroom 0002 Office 0024 Office 0004 Dining room 0033
Method
C-l1 ↓ F-score ↑ C-l1 ↓ F-score ↑ C-l1 ↓ F-score ↑ C-l1 ↓ F-score ↑ C-l1 ↓ F-score ↑ C-l1 ↓ F-score ↑ C-l1 ↓ F-score ↑ C-l1 ↓ F-score ↑ C-l1 ↓ F-score ↑
RCVD [128] 0.364 0.276 0.074 0.582 0.462 0.251 0.053 0.620 0.187 0.327 0.791 0.187 0.324 0.241 0.646 0.217 0.445 0.253
SC-DepthV2 [129] 0.254 0.275 0.064 0.547 0.749 0.229 0.049 0.624 0.167 0.267 0.426 0.263 0.482 0.138 0.516 0.244 0.356 0.247
DPSNet [48] 0.243 0.299 0.195 0.276 0.995 0.186 0.269 0.203 0.296 0.195 0.141 0.485 0.199 0.362 0.210 0.462 0.222 0.493
DPT [25] 0.698 0.251 0.289 0.226 0.396 0.364 0.126 0.388 0.780 0.193 0.605 0.269 0.454 0.245 0.364 0.279 0.751 0.185
LeReS [25] 0.081 0.555 0.064 0.616 0.278 0.427 0.147 0.289 0.143 0.480 0.145 0.503 0.408 0.176 0.096 0.497 0.241 0.325
Ours 0.042 0.736 0.059 0.610 0.159 0.485 0.050 0.645 0.145 0.445 0.036 0.814 0.069 0.638 0.045 0.700 0.060 0.663
LeReS
DPSNet
Ours
GT

Fig. 13 – Reconstruction of zero-shot scenes with multiple views. We sample several NYUv2 scenes for 3D reconstruction
comparison. As our method can predict accurate metric depth, thus all frame’s predictions are fused together for scene
reconstruction. By contrast, LeReS [25]’s depth is up to an unknown scale and shift, which causes noticeable distortions.
DPSNet [48] is a multi-view stereo method, which cannot work well on low-texture regions.

TABLE 7 – Comparison with SoTA SLAM methods on the reconstruction quality of our recovered metric depth, we
KITTI. We input predicted metric depth to the Droid- randomly collect images from the internet and recover their
SLAM [37] (‘Droid+Ours’), which outperforms others by a metric 3D and normals. As there is no focal length provided,
large margin on trajectory accuracy.
we select proper focal lengths according to the reconstructed
Seq 00 Seq 02 Seq 05 Seq 06 Seq 08 Seq 09 Seq 10
Method
Translational RMS drift (trel , ↓) / Rotational RMS drift (rrel , ↓) shape and normal maps. The reconstructed pointclouds are
GeoNet [130] 27.6/5.72 42.24/6.14 20.12/7.67 9.28/4.34 18.59/7.85 23.94/9.81 20.73/9.1
VISO2-M [131] 12.66/2.73 9.47/1.19 15.1/3.65 6.8/1.93 14.82/2.52 3.69/1.25 21.01/3.26
colorized by their corresponding normals (Different views
ORB-V2 [12] 11.43/0.58 10.34/0.26 9.04/0.26 14.56/0.26 11.46/0.28 9.3/0.26 2.57/0.32 are marked by red and orange arrays in Fig. 15).
Droid [37] 33.9/0.29 34.88/0.27 23.4/0.27 17.2/0.26 39.6/0.31 21.7/0.23 7/0.25
Droid+Ours 1.44/0.37 2.64/0.29 1.44/0.25 0.6/0.2 2.2/0.3 1.63/0.22 2.73/0.23
4.3 Ablation Study

TABLE 8 – Comparison of VO error on ETH3D benchmark.


Ablation on canonical transformation. We study the effect
Droid SLAM system is input with our depth (‘Droid + of our proposed canonical transformation for the input
Ours’), and ground-truth depth (‘Droid + GT’). The average images (‘CSTM input’) and the canonical transformation
trajectory error is reported. for the ground-truth labels (‘CSTM output’). Results are
Einstein Plant- sfm house sfm lab
global
Manquin4 Motion1
scene3 loop room2 reported in Tab. 9. We train the model on sampled mixed
Droid 4.7 0.88
Average trajectory error (↓)
0.83 0.78 5.64 0.55
data (90K images) and test it on 6 datasets. A naive base-
Droid + Ours 1.5 0.69 0.62 0.34 4.03 0.53 line (‘Ours w/o CSTM’) is to remove CSTM modules and
Droid + GT 0.7 0.006 0.024 0.006 0.96 0.013
enforce the same supervision as ours. Without CSTM, the
model is unable to converge when training on mixed metric
while the ground-truth sizes are in blue. It shows that our datasets and cannot achieve metric prediction ability on
measured sizes are very close to the ground-truth sizes. zero-shot datasets. This is why recent mixed-data training
Monocular reconstruction in the wild. To further visualize methods compromise learning the affine-invariant depth to
14

Nikon

TABLE 9 – Effectiveness of our CSTM. CamConvs [38]

15.6m/GT: 26m
D200

49.2m /GT: 52m


directly encodes various camera models in the network,
while we perform a simple yet effective transformation
to solve the metric ambiguity. Without CSTM, the model
iPhone X
achieve transferable metric prediction ability.
DDAD Lyft DS NS KITTI NYU
Method
Test set of train. data (AbsRel↓) Zero-shot test set (AbsRel↓)

2.2m/GT 1.7m
w/o CSTM 0.530 0.582 0.394 1.00 0.568 0.584
CamConvs [38] 0.295 0.315 0.213 0.423 0.178 0.333
2.5m/GT 1.9m
Ours CSTM image 0.190 0.235 0.182 0.197 0.097 0.210
Canon
Ours CSTM label 0.183 0.221 0.201 0.213 0.081 0.212
600D

TABLE 10 – Effectiveness of random proposal normaliza-

3.1m/GT3.7m
tion loss. Baseline is supervised by ‘LPWN + LVNL + Lsilog ’.
SSIL is the scale-shift invariant loss proposed in [27].
10.9m /GT 12m
2.6m/GT 2.5m

DDAD Lyft DS NS KITTI NYUv2


Method
Test set of train. data (AbsRel↓) Zero-shot test set (AbsRel↓)
Fig. 14 – Metrology of in-the-wild scenes. We collect sev- baseline 0.204 0.251 0.184 0.207 0.104 0.230
eral Flickr photos, which are captured by various cameras. baseline + SSIL [27] 0.197 0.263 0.259 0.206 0.105 0.216
With photos’ metadata, we reconstruct the 3D metric shape baseline + RPNL 0.190 0.235 0.182 0.197 0.097 0.210
and measure structures’ sizes. Red and blue marks are ours
and ground-truth sizes respectively.
canonical camera, i.e., the canonical focal length. We train
the model on the small sampled dataset and test it on the
validation set of training data and testing data. The average
AbsRel error is calculated. We experiment on 3 different
focal lengths, i.e., 500, 1000, 1500. Experiments show that
f ocal = 1000 has slightly better performance than others,
see supplementary for more details.
Effectiveness of the random proposal normalization loss.
To show the effectiveness of our proposed random proposal
normalization loss (RPNL), we experiment on the sampled
small dataset. Results are shown in Tab. 10. We test on
the DDAD, Lyft, DrivingStereo (DS), NuScenes (NS), KITTI,
and NYUv2. The ‘baseline’ employs all losses except our
RPNL. We compare it with ‘baseline + RPNL’ and ‘baseline
+ SSIL [27]’. We can observe that our proposed random
proposal normalization loss can further improve the per-
formance. In contrast, the scale-shift invariant loss [27],
which does the normalization on the whole image, can only
slightly improve the performance.
Effectiveness of joint optimization. We assess the impact
of joint optimization on both depth and normal estimation
using small datasets sampled with ViT-small models over a
4-step iteration. The evaluation is conducted on the NYU
indoor dataset and the DIODE outdoor dataset, both of
RGB Recon. (View A) Recon. (View B) which include normal labels for the convenience of eval-
Fig. 15 – Reconstruction from in-the-wild single images. uation. In Tab. 11, we start by training the same-architecture
We collect web images and select proper focal lengths. The networks ‘without depth’ or ‘without normal’ prediction.
reconstructed pointclouds are colorized by normals. Compared to our joint optimization approach, both single-
avoid metric issues. In contrast, our two CSTM methods modality models exhibit slightly worse performance.To fur-
both can enable the model to achieve the metric predic- ther demonstrate the benefit of joint optimization and the in-
tion ability, and they can achieve comparable performance. corporation of large-scale outdoor data prior to normal esti-
Tab. 1 also shows comparable performance. Therefore, both mation, we train a model using only the Taskonomy dataset
adjusting the supervision and the input image appearance (i.e., ’W.o. mixed datasets’), which shows inferior results on
during training can solve the metric ambiguity issues. Fur- DIODE(outdoor). We also verify the effectiveness of the re-
thermore, we compare with CamConvs [38], which encodes current blocks and the consistency loss. Removing either of
the camera model in the decoder with a 4-channel feature. them (‘W.o. consistency’ / ‘W.o. recurrent block’) could lead
‘CamConvs’ employ the same training schedule, model, and to drastic performance degradation for normal estimation,
training data as ours. This method enforces the network particularly for outdoor scenarios like DIODE(Outdoor).
to implicitly understand various camera models from the Furthermore, we present some visualization comparisons in
image appearance and then bridges the imaging size to the Fig 16. Training surface normal and depth together without
real-world size. We believe that this method challenges the the consistency loss (’W.o. consistency’) results in notably
data diversity and network capacity, thus their performance poorer predicted normals compared to our full method
is worse than ours. (’Ours normal’). Additionally, if the model learns the normal
Ablation on canonical space. We study the effect of the individually (’W.o. depth’), the performance also degrades.
15

steps to refine depth and normal. Table 13 illustrates that in-


creasing the number of iteration steps does not consistently
improve results. Moreover, the ideal number of steps may
differ based on the model size, with larger models generally
benefiting from more extensive optimization.
TABLE 13 – Select the best joint optimizing steps for
different ViT models. We find the best step varying with
model size and choose the best-fitted steps according to
the following experiment results. All models are trained
following the settings in Tab. 11
Backbone ViT-Small ViT-Large ViT-giant
# Steps KITTI Depth (AbsRel↓) / NYU v2 Normal (Med. err.↓)
2 0.102/9.01 0.070/8.40 0.069/8.25
4 0.088/8.77 0.067/8.24 0.067/8.23
8 0.090/8.80 0.065/8.21 0.064/8.22
16 0.095/8.79 0.068/8.30 0.065/8.27

5 C ONCLUSION
In this paper, we introduce a family of geometric foun-
RGB Ours normal W.o. consistency W.o. depth dation models for zero-shot monocular metric depth and
Fig. 16 – Effect of joint depth-normal optimization. We com- surface normal estimation. We propose solutions to address
pare normal maps learned by different strategies on several challenges in both metric depth estimation and surface
outdoor examples. Learning normal only ‘without depth’ normal estimation. To resolve depth ambiguity caused by
leads to flattened surfaces, since most of the normal labels varying focal lengths, we present a novel canonical camera
lie on planes. In addition, ‘without consistency’ imposed space transformation method. Additionally, to overcome
between depth and normal, the predictions become much the scarcity of outdoor normal data labels, we introduce a
coarser.
joint depth-normal optimization framework that leverages
TABLE 11 – Effectiveness of joint optimization. Joint op-
knowledge from large-scale depth annotations.
timization surpasses independent estimation. For outdoor
normal estimation, this module introduces geometry clues Our approach enables the integration of millions of data
from large-scale depth data. The proposed recurrent block samples captured by over 10,000 cameras to train a unified
and depth-normal consistency constraint are essential for metric-depth and surface-normal model. To enhance the
the optimization model’s robustness, we curate a dataset comprising over 16
Method
DIODE(Outdoor) NYU DIODE(Outdoor) NYU million samples for training. Zero-shot evaluations demon-
Depth (AbsRel↓) Normal (Med. error↓)
W.o. normal 0.315 0.119 - -
strate the effectiveness and robustness of our method. For
W.o. depth - - 16.25 8.78 downstream applications, our models are capable of recon-
W.o mixed datasets 0.614 0.116 18.94 9.50
W.o. recurrent block 0.309 0.127 16.51 9.31 structing metric 3D from a single view, enabling metrology
W.o. consistency 0.310 0.121 16.45 9.72 on randomly collected internet images and dense mapping
Ours 0.304 0.114 14.91 8.77
of large-scale scenes. With their precision, generalization,
and versatility, Metric3D v2 models serve as geometric
The efficiency analysis of the joint optimization module is
foundational models for monocular perception.
presented in the supplementary materials.
Selection of intermediate normal representation. During Authors’ photographs and biographies not available at
optimization, unnormalized normal vectors are utilized as the time of submission.
the intermediate representation. Here we explore three addi-
R EFERENCES
tional representations (1) A vector defined in so3 represent-
ing 3D rotation upon a reference direction. We implement [1] B. Yang, S. Rosa, A. Markham, N. Trigoni, and H. Wen, “Dense
3d object reconstruction from a single depth view,” IEEE trans-
this vector by lietorch [37]. (2) An azimuthal angle and a po- actions on pattern analysis and machine intelligence, vol. 41, no. 12,
lar angle. (3) A 2D homogeneous vector [79]. All the repre- pp. 2820–2834, 2018. 1
sentations investigated are additive and can be surjectively [2] J. Ju, C. W. Tseng, O. Bailo, G. Dikov, and M. Ghafoorian, “Dg-
recon: Depth-guided neural 3d scene reconstruction,” in Proceed-
transferred into surface normal. In this experiment, we only
ings of the IEEE/CVF International Conference on Computer Vision,
change the representations and compare the performances. pp. 18184–18194, 2023. 1
Surprisingly, according to Table 12, the naive unnormalized [3] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and
normal performs the best. We hypothesize that this simplest A. Geiger, “Occupancy networks: Learning 3d reconstruction
in function space,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn.,
representation reduces the learning difficulty. pp. 4460–4470, 2019. 1
[4] K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan, “Depth-supervised
TABLE 12 – More selection of intermediate normal repre- nerf: Fewer views and faster training for free,” in Proceedings of the
sentation. IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Taskonomy Scannet DIODE(Outdoor) NYU pp. 12882–12891, 2022. 1
Representation [5] B. Roessle, J. T. Barron, B. Mildenhall, P. P. Srinivasan, and
Test set (Med. err.↓) Zero-shot testing (Med. err.↓)
3D rotation vector 5.28 8.92 16.00 9.45 M. Nießner, “Dense depth priors for neural radiance fields from
Azi. and polar angles 5.34 9.01 15.21 9.21 sparse input views,” in Proceedings of the IEEE/CVF Conference on
Homo. 2D vector [79] 5.02 8.50 15.40 8.79
Ours (unnormalized 3D) 5.01 8.41 14.91 8.77 Computer Vision and Pattern Recognition, pp. 12892–12901, 2022. 1
[6] Z. Yu, S. Peng, M. Niemeyer, T. Sattler, and A. Geiger, “Monosdf:
Exploring monocular geometric cues for neural implicit surface
Best optimizing steps To determine the optimal number of reconstruction,” Advances in neural information processing systems,
optimization steps for various ViT models, we vary different vol. 35, pp. 25018–25032, 2022. 1, 5
16

[7] C. Jiang, H. Zhang, P. Liu, Z. Yu, H. Cheng, B. Zhou, and S. Shen, [30] W. Yin, Y. Liu, C. Shen, and Y. Yan, “Enforcing geometric con-
“H2-mapping: Real-time dense mapping using hierarchical hy- straints of virtual normal for depth prediction,” in Proc. IEEE Int.
brid representation,” arXiv preprint arXiv:2306.03207, 2023. 1 Conf. Comp. Vis., 2019. 2, 4, 9
[8] Y. Li, Z. Ge, G. Yu, J. Yang, Z. Wang, Y. Shi, J. Sun, and Z. Li, [31] B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and
“Bevdepth: Acquisition of reliable depth for multi-view 3d object K. Schindler, “Repurposing diffusion-based image generators for
detection,” arXiv: Comp. Res. Repository, p. 2206.10092, 2022. 1 monocular depth estimation,” arXiv preprint arXiv:2312.02145,
[9] Z. Li, Z. Yu, D. Austin, M. Fang, S. Lan, J. Kautz, and J. M. 2023. 2
Alvarez, “Fb-occ: 3d occupancy prediction based on forward- [32] T. Do, K. Vuong, S. I. Roumeliotis, and H. S. Park, “Surface nor-
backward view transformation,” arXiv preprint arXiv:2307.01492, mal estimation of tilted images via spatial rectifier,” in Computer
2023. 1 Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August
[10] R. Fan, H. Wang, P. Cai, and M. Liu, “Sne-roadseg: Incorporat- 23–28, 2020, Proceedings, Part IV 16, pp. 265–280, Springer, 2020.
ing surface normal information into semantic segmentation for 2, 5, 9
accurate freespace detection,” in European Conference on Computer [33] G. Bae, I. Budvytis, and R. Cipolla, “Estimating and exploiting the
Vision, pp. 340–356, Springer, 2020. 1 aleatoric uncertainty in surface normal estimation,” in Proceed-
[11] J. Behley and C. Stachniss, “Efficient surfel-based slam using 3d ings of the IEEE/CVF International Conference on Computer Vision,
laser range data in urban environments.,” in Robotics: Science and pp. 13137–13146, 2021. 2, 5, 8, 9, 10, 11, 12
Systems, vol. 2018, p. 59, 2018. 1, 5 [34] X. Yang, L. Yuan, K. Wilber, A. Sharma, X. Gu, S. Qiao, S. Debats,
[12] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: an open-source H. Wang, H. Adam, M. Sirotenko, et al., “Polymax: General dense
SLAM system for monocular, stereo and RGB-D cameras,” IEEE prediction with mask transformer,” in Proceedings of the IEEE/CVF
Trans. Robot., vol. 33, no. 5, pp. 1255–1262, 2017. 1, 13 Winter Conference on Applications of Computer Vision, pp. 1050–
[13] T. Schops, T. Sattler, and M. Pollefeys, “Bad slam: Bundle ad- 1061, 2024. 2, 5, 9, 10
justed direct rgb-d slam,” in Proceedings of the IEEE/CVF Confer- [35] A. Eftekhar, A. Sax, J. Malik, and A. Zamir, “Omnidata: A scalable
ence on Computer Vision and Pattern Recognition, pp. 134–144, 2019. pipeline for making multi-task mid-level vision datasets from 3d
1 scans,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 10786–
[14] H. Zhu, H. Yang, X. Wu, D. Huang, S. Zhang, X. He, T. He, 10796, 2021. 2, 3, 5, 10, 11, 12
H. Zhao, C. Shen, Y. Qiao, et al., “Ponderv2: Pave the way for [36] S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller, “Zoedepth:
3d foundataion model with a universal pre-training paradigm,” Zero-shot transfer by combining relative and metric depth,” arXiv
arXiv preprint arXiv:2310.08586, 2023. 1 preprint arXiv:2302.12288, 2023. 3, 9, 10, 11, 12
[15] J. Zhou, J. Wang, B. Ma, Y.-S. Liu, T. Huang, and X. Wang, [37] Z. Teed and J. Deng, “Droid-slam: Deep visual slam for monocu-
“Uni3d: Exploring unified 3d representation at scale,” arXiv lar, stereo, and rgb-d cameras,” vol. 34, pp. 16558–16569, 2021. 3,
preprint arXiv:2310.06773, 2023. 1 4, 12, 13, 15
[16] H. Xu, J. Zhang, J. Cai, H. Rezatofighi, F. Yu, D. Tao, and [38] J. Facil, B. Ummenhofer, H. Zhou, L. Montesano, T. Brox, and
A. Geiger, “Unifying flow, stereo and depth estimation,” IEEE J. Civera, “CAM-Convs: camera-aware multi-scale convolutions
Transactions on Pattern Analysis and Machine Intelligence, 2023. 1 for single-view depth,” in Proc. IEEE Conf. Comp. Vis. Patt.
[17] W. Yuan, X. Gu, Z. Dai, S. Zhu, and P. Tan, “New CRFs: Neural Recogn., pp. 11826–11835, 2019. 2, 14
window fully-connected CRFs for monocular depth estimation,” [39] S. Peng, S. Zhang, Z. Xu, C. Geng, B. Jiang, H. Bao, and X. Zhou,
in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2022. 1, 2, 4, 9, 10 “Animatable neural implicit surfaces for creating avatars from
[18] W. Yin, Y. Liu, and C. Shen, “Virtual normal: Enforcing geometric videos,” arXiv: Comp. Res. Repository, p. 2203.08133, 2022. 3
constraints for accurate and robust depth prediction,” IEEE Trans. [40] J. Huang, Y. Zhou, T. Funkhouser, and L. J. Guibas, “Framenet:
Pattern Anal. Mach. Intell., 2021. 1, 2, 4, 5, 8, 10 Learning local canonical frames of 3d surfaces from a single rgb
[19] S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation image,” in Proceedings of the IEEE/CVF International Conference on
using adaptive bins,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Computer Vision, pp. 8638–8647, 2019. 3, 9, 10
pp. 4009–4018, 2021. 1, 2, 4, 9, 10 [41] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, “Geonet: Geometric
[20] G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci, “Transformer- neural network for joint depth and surface normal estimation,”
based attention networks for continuous pixel-wise prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
in Proc. IEEE Int. Conf. Comp. Vis., 2021. 1, 2, 9 Recognition, pp. 283–291, 2018. 3, 5, 8, 9
[21] K. Xian, C. Shen, Z. Cao, H. Lu, Y. Xiao, R. Li, and Z. Luo, [42] G. Bae, I. Budvytis, and R. Cipolla, “Irondepth: Iterative re-
“Monocular relative depth perception with web stereo data su- finement of single-view depth using surface normal and its
pervision,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 311– uncertainty,” arXiv preprint arXiv:2210.03676, 2022. 3, 5, 7, 9, 10
320, 2018. 2, 5 [43] J. Park, K. Joo, Z. Hu, C.-K. Liu, and I. So Kweon, “Non-local
[22] K. Xian, J. Zhang, O. Wang, L. Mai, Z. Lin, and Z. Cao, “Structure- spatial propagation network for depth completion,” in Computer
guided ranking loss for single image depth prediction,” in Proc. Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August
IEEE Conf. Comp. Vis. Patt. Recogn., pp. 611–620, 2020. 2, 5 23–28, 2020, Proceedings, Part XIII 16, pp. 120–136, Springer, 2020.
[23] W. Chen, S. Qian, D. Fan, N. Kojima, M. Hamilton, and J. Deng, 3, 5
“Oasis: A large-scale dataset for single image 3d in the wild,” in [44] S. Shao, Z. Pei, X. Wu, Z. Liu, W. Chen, and Z. Li, “Iebins:
Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 679–688, 2020. 2, 5 Iterative elastic bins for monocular depth estimation,” arXiv
[24] W. Chen, Z. Fu, D. Yang, and J. Deng, “Single-image depth preprint arXiv:2309.14137, 2023. 3, 5, 7, 9
perception in the wild,” in Proc. Advances in Neural Inf. Process. [45] L. Lipson, Z. Teed, and J. Deng, “Raft-stereo: Multilevel recurrent
Syst., pp. 730–738, 2016. 2, 5 field transforms for stereo matching,” in Int. Conf. 3D. Vis., 2021.
[25] W. Yin, J. Zhang, O. Wang, S. Niklaus, L. Mai, S. Chen, and 3, 5, 7
C. Shen, “Learning to recover 3d scene shape from a single [46] Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms
image,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2021. 2, 3, for optical flow,” in Computer Vision–ECCV 2020: 16th European
4, 5, 8, 9, 10, 12, 13 Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II
[26] W. Yin, J. Zhang, O. Wang, S. Niklaus, S. Chen, Y. Liu, and 16, pp. 402–419, Springer, 2020. 3, 5, 7
C. Shen, “Towards accurate reconstruction of 3d scene shape [47] L. Sun, W. Yin, E. Xie, Z. Li, C. Sun, and C. Shen, “Improving
from a single monocular image,” IEEE Trans. Pattern Anal. Mach. monocular visual odometry using learned depth,” IEEE Transac-
Intell., 2022. 2, 5 tions on Robotics, vol. 38, no. 5, pp. 3173–3186, 2022. 4
[27] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, [48] S. Im, H.-G. Jeon, S. Lin, and I.-S. Kweon, “Dpsnet: End-to-end
“Towards robust monocular depth estimation: Mixing datasets deep plane sweep stereo,” in Proc. Int. Conf. Learn. Representations,
for zero-shot cross-dataset transfer,” IEEE Trans. Pattern Anal. 2019. 4, 12, 13
Mach. Intell., 2020. 2, 3, 5, 8, 10, 14 [49] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source
[28] R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for slam system for monocular, stereo, and rgb-d cameras,” IEEE
dense prediction,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 12179– transactions on robotics, vol. 33, no. 5, pp. 1255–1262, 2017. 4
12188, 2021. 2, 8, 10, 12, 13 [50] R. Zhu, X. Yang, Y. Hold-Geoffroy, F. Perazzi, J. Eisenmann,
[29] C. Zhang, W. Yin, Z. Wang, G. Yu, B. Fu, and C. Shen, “Hierarchi- K. Sunkavalli, and M. Chandraker, “Single view metrology in
cal normalization for robust monocular depth estimation,” Proc. the wild,” in Proc. Eur. Conf. Comp. Vis., pp. 316–333, Springer,
Advances in Neural Inf. Process. Syst., 2022. 2, 3, 5, 9, 10 2020. 4
17

[51] J. T. Barron and J. Malik, “Shape, illumination, and reflectance [72] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for
from shading,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, optical flow using pyramid, warping, and cost volume,” in
no. 8, pp. 1670–1687, 2014. 4 Proceedings of the IEEE conference on computer vision and pattern
[52] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang, recognition, pp. 8934–8943, 2018. 5
“Pixel2mesh: Generating 3d mesh models from single RGB im- [73] H. Liu, T. Lu, Y. Xu, J. Liu, and L. Wang, “Learning optical flow
ages,” in Proc. Eur. Conf. Comp. Vis., pp. 52–67, 2018. 4 and scene flow with bidirectional camera-lidar fusion,” arXiv
[53] J. Wu, C. Zhang, X. Zhang, Z. Zhang, W. Freeman, and J. Tenen- preprint arXiv:2303.12017, 2023. 5
baum, “Learning shape priors for single-view 3d completion and [74] X. Cheng, P. Wang, and R. Yang, “Depth estimation via affinity
reconstruction,” in Proc. Eur. Conf. Comp. Vis., pp. 646–662, 2018. learned with convolutional spatial propagation network,” in
4 Proceedings of the European conference on computer vision (ECCV),
[54] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and pp. 103–119, 2018. 5
H. Li, “Pifu: Pixel-aligned implicit function for high-resolution [75] M. Hu, S. Wang, B. Li, S. Ning, L. Fan, and X. Gong, “Penet:
clothed human digitization,” in Proc. IEEE Conf. Comp. Vis. Patt. Towards precise and efficient image guided depth completion,”
Recogn., pp. 2304–2314, 2019. 4 in 2021 IEEE International Conference on Robotics and Automation
[55] S. Saito, T. Simon, J. Saragih, and H. Joo, “Pifuhd: Multi-level (ICRA), pp. 13656–13662, IEEE, 2021. 5
pixel-aligned implicit function for high-resolution 3d human [76] Z. Ma, Z. Teed, and J. Deng, “Multiview stereo with cascaded
digitization,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 84– epipolar raft,” in European Conference on Computer Vision, pp. 734–
93, 2020. 4 750, Springer, 2022. 5
[56] A. Saxena, M. Sun, and A. Y. Ng, “Make3d: Learning 3d scene [77] G. Xu, X. Wang, X. Ding, and X. Yang, “Iterative geometry encod-
structure from a single still image,” IEEE Trans. Pattern Anal. ing volume for stereo matching,” in Proceedings of the IEEE/CVF
Mach. Intell., vol. 31, no. 5, pp. 824–840, 2008. 4 Conference on Computer Vision and Pattern Recognition, pp. 21919–
[57] C. Zhang, W. Yin, G. Yu, Z. Wang, T. Chen, B. Fu, J. T. Zhou, and 21928, 2023. 5
C. Shen, “Robust geometry-preserving depth estimation using [78] J. E. Lenssen, C. Osendorfer, and J. Masci, “Deep iterative surface
differentiable rendering,” in Proc. IEEE Int. Conf. Comp. Vis., 2023. normal estimation,” in Proceedings of the ieee/cvf conference on
4 computer vision and pattern recognition, pp. 11247–11256, 2020. 5
[58] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor seg- [79] W. Zhao, S. Liu, Y. Wei, H. Guo, and Y.-J. Liu, “A confidence-
mentation and support inference from rgbd images,” in Proc. Eur. based iterative solver of depths and surface normals for deep
Conf. Comp. Vis., pp. 746–760, Springer, 2012. 4, 5, 9, 10 multi-view stereo,” in Proceedings of the IEEE/CVF International
[59] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets Conference on Computer Vision, pp. 6168–6177, 2021. 5, 15
robotics: The kitti dataset,” Int. J. Robot. Res., 2013. 4, 9, 10 [80] W. Yin, Y. Liu, C. Shen, A. v. d. Hengel, and B. Sun, “The devil
[60] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction is in the labels: Semantic segmentation from sentences,” arXiv:
from a single image using a multi-scale deep network,” in Proc. Comp. Res. Repository, p. 2202.02002, 2022. 5
Advances in Neural Inf. Process. Syst., pp. 2366–2374, 2014. 4, 8
[81] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar-
[61] K. Wang, F. Gao, and S. Shen, “Real-time scalable dense surfel wal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning
mapping,” in 2019 International conference on robotics and automa- transferable visual models from natural language supervision,”
tion (ICRA), pp. 6919–6925, IEEE, 2019. 5 in Proc. Int. Conf. Mach. Learn., pp. 8748–8763, PMLR, 2021. 5
[62] J. Wang, P. Wang, X. Long, C. Theobalt, T. Komura, L. Liu,
[82] J. Lambert, Z. Liu, O. Sener, J. Hays, and V. Koltun, “Mseg: A
and W. Wang, “Neuris: Neural reconstruction of indoor scenes
composite dataset for multi-domain semantic segmentation,” in
using normal priors,” in European Conference on Computer Vision,
Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2879–2888, 2020. 5
pp. 139–155, Springer, 2022. 5
[83] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec,
[63] X. Qi, Z. Liu, R. Liao, P. H. Torr, R. Urtasun, and J. Jia, “Geonet++:
V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al.,
Iterative geometric neural network with edge-aware refinement
“Dinov2: Learning robust visual features without supervision,”
for joint depth and surface normal estimation,” IEEE Transactions
arXiv preprint arXiv:2304.07193, 2023. 5, 8
on Pattern Analysis and Machine Intelligence, vol. 44, no. 2, pp. 969–
984, 2020. 5, 9 [84] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly,
[64] X. Wang, D. Fouhey, and A. Gupta, “Designing deep networks for
J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words:
surface normal estimation,” in Proceedings of the IEEE conference
Transformers for image recognition at scale,” Proc. Int. Conf.
on computer vision and pattern recognition, pp. 539–547, 2015. 5, 9
Learn. Representations, 2021. 5, 8, 10
[65] D. Eigen and R. Fergus, “Predicting depth, surface normals
and semantic labels with a common multi-scale convolutional [85] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer,
architecture,” in Proceedings of the IEEE international conference on “High-resolution image synthesis with latent diffusion models,”
computer vision, pp. 2650–2658, 2015. 5, 9 in Proceedings of the IEEE/CVF conference on computer vision and
[66] S. Liao, E. Gavves, and C. G. Snoek, “Spherical regression: Learn- pattern recognition, pp. 10684–10695, 2022. 5, 10
ing viewpoints, surface normals and 3d rotations on n-spheres,” [86] Y. Hold-Geoffroy, K. Sunkavalli, J. Eisenmann, M. Fisher, E. Gam-
in Proceedings of the IEEE/CVF Conference on Computer Vision and baretto, S. Hadap, and J.-F. Lalonde, “A perceptual measure for
Pattern Recognition, pp. 9759–9767, 2019. 5 deep single image camera calibration,” in Proc. IEEE Conf. Comp.
[67] L. Ladickỳ, B. Zeisl, and M. Pollefeys, “Discriminatively trained Vis. Patt. Recogn., pp. 2354–2363, 2018. 5
dense surface normal estimation,” in Computer Vision–ECCV [87] D. Singh and B. Singh, “Investigating the impact of data normal-
2014: 13th European Conference, Zurich, Switzerland, September 6- ization on classification performance,” Applied Soft Computing,
12, 2014, Proceedings, Part V 13, pp. 468–484, Springer, 2014. 5, 2019. 8
9 [88] J. Li, R. Klein, and A. Yao, “A two-streamed network for estimat-
[68] B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He, “Depth ing fine-scaled depth maps from single rgb images,” in Proc. IEEE
and surface normal estimation from monocular images using Conf. Comp. Vis. Patt. Recogn., pp. 3372–3380, 2017. 9
regression on deep features and hierarchical crfs,” in Proceedings [89] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab,
of the IEEE conference on computer vision and pattern recognition, “Deeper depth prediction with fully convolutional residual net-
pp. 1119–1127, 2015. 5 works,” in 2016 Fourth international conference on 3D vision (3DV),
[69] X. Long, Y. Zheng, Y. Zheng, B. Tian, C. Lin, L. Liu, H. Zhao, pp. 239–248, IEEE, 2016. 9
G. Zhou, and W. Wang, “Adaptive surface normal constraint [90] V. Guizilini, I. Vasiljevic, D. Chen, R. Ambrus, , and A. Gaidon,
for geometric estimation from monocular images,” arXiv preprint “Towards zero-shot scale-aware monocular depth estimation,” in
arXiv:2402.05869, 2024. 5 Proceedings of the IEEE/CVF International Conference on Computer
[70] X. Long, C. Lin, L. Liu, W. Li, C. Theobalt, R. Yang, and W. Wang, Vision, pp. 9233–9243, 2023. 9, 10
“Adaptive surface normal constraint for depth estimation,” in [91] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth
Proceedings of the IEEE/CVF international conference on computer anything: Unleashing the power of large-scale unlabeled data,”
vision, pp. 12849–12858, 2021. 5 arXiv:2401.10891, 2024. 9
[71] W. Chen, D. Xiang, and J. Deng, “Surface normals in the wild,” in [92] X. Guo, H. Li, S. Yi, J. Ren, and X. Wang, “Learning monocular
Proceedings of the IEEE International Conference on Computer Vision, depth by distilling cross-domain stereo networks,” in Proc. Eur.
pp. 1557–1566, 2017. 5 Conf. Comp. Vis., pp. 484–500, 2018. 9
18

[93] D. F. Fouhey, A. Gupta, and M. Hebert, “Unfolding an indoor tonomous driving scenarios,” in Proc. IEEE Conf. Comp. Vis. Patt.
origami world,” in Computer Vision–ECCV 2014: 13th European Recogn., 2019. 10
Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, [115] J. Cho, D. Min, Y. Kim, and K. Sohn, “DIML/CVL RGB-D dataset:
Part VI 13, pp. 687–702, Springer, 2014. 9 2m RGB-D images of natural indoor and outdoor scenes,” arXiv:
[94] A. Bansal, B. Russell, and A. Gupta, “Marr revisited: 2d-3d Comp. Res. Repository, 2021. 10
alignment via surface normal prediction,” in Proceedings of the [116] B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal,
IEEE conference on computer vision and pattern recognition, pp. 5965– B. Pan, R. Kumar, A. Hartnett, J. K. Pontes, D. Ramanan, P. Carr,
5974, 2016. 9 and J. Hays, “Argoverse 2: Next generation datasets for self-
[95] P. Wang, X. Shen, B. Russell, S. Cohen, B. Price, and A. L. Yuille, driving perception and forecasting,” in Proc. Advances in Neural
“Surge: Surface regularized geometry estimation from a single Inf. Process. Syst., 2021. 10
image,” Advances in Neural Information Processing Systems, vol. 29, [117] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler,
2016. 9 R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes
[96] Z. Zhang, Z. Cui, C. Xu, Y. Yan, N. Sebe, and J. Yang, “Pattern- dataset for semantic urban scene understanding,” in Proc. IEEE
affinitive propagation across depth, surface normal and semantic Conf. Comp. Vis. Patt. Recogn., 2016. 10
segmentation,” in Proceedings of the IEEE/CVF conference on com- [118] M. Gehrig, W. Aarents, D. Gehrig, and D. Scaramuzza, “Dsec: A
puter vision and pattern recognition, pp. 4106–4115, 2019. 9 stereo event camera dataset for driving scenarios,” IEEE Robotics
[97] R. Wang, D. Geraghty, K. Matzen, R. Szeliski, and J.-M. Frahm, and Automation Letters, 2021. 10
“Vplnet: Deep single view normal estimation with vanishing [119] M. Lopez-Antequera, P. Gargallo, M. Hofinger, S. R. Bulò,
points and lines,” in Proceedings of the IEEE/CVF Conference on Y. Kuang, and P. Kontschieder, “Mapillary planet-scale depth
Computer Vision and Pattern Recognition, pp. 689–698, 2020. 9 dataset,” in Proc. Eur. Conf. Comp. Vis., vol. 12347, pp. 589–604,
[98] J. H. Lee, M.-K. Han, D. W. Ko, and I. H. Suh, “From big to 2020. 10
small: Multi-scale local planar guidance for monocular depth [120] P. Xiao, Z. Shao, S. Hao, Z. Zhang, X. Chai, J. Jiao, Z. Li, J. Wu,
estimation,” arXiv: Comp. Res. Repository, p. 1907.10326, 2019. 9 K. Sun, K. Jiang, Y. Wang, and D. Yang, “Pandaset: Advanced
[99] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, sensor suite dataset for autonomous driving,” in IEEE Int. Intelli-
“A convnet for the 2020s,” in Proc. IEEE Conf. Comp. Vis. Patt. gent Transportation Systems Conf., 2021. 10
Recogn., pp. 11976–11986, 2022. 8, 10 [121] Z. Bauer, F. Gomez-Donoso, E. Cruz, S. Orts-Escolano, and
[100] T. Darcet, M. Oquab, J. Mairal, and P. Bojanowski, “Vision trans- M. Cazorla, “Uasol, a large-scale high-resolution outdoor stereo
formers need registers,” arXiv preprint arXiv:2309.16588, 2023. 8 dataset,” Scientific data, vol. 6, no. 1, pp. 1–14, 2019. 10
[101] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated [122] Y. Cabon, N. Murray, and M. Humenberger, “Virtual kitti 2,”
residual transformations for deep neural networks,” in Proc. IEEE arXiv preprint arXiv:2001.10773, 2020. 10
Conf. Comp. Vis. Patt. Recogn., pp. 1492–1500, 2017. 10 [123] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik,
[102] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., “Scalability
vision transformers,” in Proceedings of the IEEE/CVF Conference in perception for autonomous driving: Waymo open dataset,”
on Computer Vision and Pattern Recognition, pp. 12104–12113, 2022. in Proceedings of the IEEE/CVF conference on computer vision and
10 pattern recognition, pp. 2446–2454, 2020. 10
[124] A. Zamir, A. Sax, , W. Shen, L. Guibas, J. Malik, and S. Savarese,
[103] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu,
“Taskonomy: Disentangling task transfer learning,” in Proc. IEEE
A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A
Conf. Comp. Vis. Patt. Recogn., IEEE, 2018. 10
multimodal dataset for autonomous driving,” in Proc. IEEE Conf.
Comp. Vis. Patt. Recogn., pp. 11621–11631, 2020. 9, 10 [125] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. En-
gel, R. Mur-Artal, C. Ren, S. Verma, et al., “The replica dataset: A
[104] T. Koch, L. Liebel, F. Fraundorfer, and M. Korner, “Evaluation of
digital replica of indoor spaces,” arXiv preprint arXiv:1906.05797,
cnn-based single-image depth estimation methods,” in Eur. Conf.
2019. 10
Comput. Vis. Worksh., pp. 0–0, 2018. 9, 10
[126] S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets,
[105] I. Vasiljevic, N. Kolkin, S. Zhang, R. Luo, H. Wang, F. Z. Dai,
A. Clegg, J. Turner, E. Undersander, W. Galuba, A. Westbury,
A. F. Daniele, M. Mostajabi, S. Basart, M. R. Walter, et al., “Diode:
A. X. Chang, et al., “Habitat-matterport 3d dataset (hm3d): 1000
A dense indoor and outdoor depth dataset,” arXiv: Comp. Res.
large-scale 3d environments for embodied ai,” arXiv preprint
Repository, p. 1908.00463, 2019. 9, 10
arXiv:2109.08238, 2021. 10
[106] T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler, [127] M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista,
M. Pollefeys, and A. Geiger, “A multi-view stereo benchmark N. Paczan, R. Webb, and J. M. Susskind, “Hypersim: A photore-
with high-resolution images and multi-camera videos,” in Proc. alistic synthetic dataset for holistic indoor scene understanding,”
IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3260–3269, 2017. 9, 10 in Proceedings of the IEEE/CVF international conference on computer
[107] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, vision, pp. 10912–10922, 2021. 10
and M. Nießner, “Scannet: Richly-annotated 3d reconstructions [128] J. Kopf, X. Rong, and J.-B. Huang, “Robust consistent video depth
of indoor scenes,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., estimation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2021. 12,
pp. 5828–5839, 2017. 9, 10 13
[108] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm, [129] J.-W. Bian, H. Zhan, N. Wang, T.-J. Chin, C. Shen, and I. Reid,
“Pixelwise view selection for unstructured multi-view stereo,” “Auto-rectify network for unsupervised indoor depth estima-
in Proc. Eur. Conf. Comp. Vis., 2016. 9 tion,” IEEE Trans. Pattern Anal. Mach. Intell., 2021. 12, 13
[109] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, “Tanks and [130] Z. Yin and J. Shi, “Geonet: Unsupervised learning of dense
temples: Benchmarking large-scale scene reconstruction,” ACM depth, optical flow and camera pose,” in Proceedings of the IEEE
Trans. Graph., vol. 36, no. 4, pp. 1–13, 2017. 9 conference on computer vision and pattern recognition, pp. 1983–1992,
[110] S. Li, X. Wu, Y. Cao, and H. Zha, “Generalizing to the open 2018. 13
world: Deep visual odometry with online adaptation,” in Proc. [131] S. Song, M. Chandraker, and C. C. Guest, “High accuracy monoc-
IEEE Conf. Comp. Vis. Patt. Recogn., pp. 13184–13193, 2021. 9 ular sfm and scale correction for autonomous driving,” IEEE
[111] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for au- Trans. Pattern Anal. Mach. Intell., vol. 38, no. 4, pp. 730–743, 2015.
tonomous driving? the kitti vision benchmark suite,” in Proc. 13
IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3354–3361, IEEE, 2012.
9
[112] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon,
“3d packing for self-supervised monocular depth estimation,” in
Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2020. 10
[113] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni,
A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari,
S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang,
and V. Shet, “Level 5 perception dataset 2020.” https://level-5.
global/level5/data/, 2019. 10
[114] G. Yang, X. Song, C. Huang, Z. Deng, J. Shi, and B. Zhou,
“Drivingstereo: A large-scale dataset for stereo matching in au-
Supplementary Materials for Metric3D v2: A Versatile Monocular
Geometric Foundation Model for Zero-shot Metric Depth and
Surface Normal Estimation
arXiv:2404.15506v1 [cs.CV] 22 Mar 2024

April 25, 2024

1 Details for Models (Ht1/14 ) to the finest (Ht1/4 ). For instance, the refined
feature map at the 14 1
scale Ht+1
1/14 is fed into the sec-
Details for ConvNet models. In our work, our en-
coder employs the ConvNext [?] networks, whose pre- ond ConvGRU sub-block to refine the 17 -scale feature
trained weight is from the official released ImageNet- map Ht1/7 . Finally, the projection heads Gd , Gn em-
22k pretraining. The decoder follows the adabins [?]. ploys a concatenation of original predictions D̂tc , N̂t
We set the depth bins number to 256, and the depth and the to finest feature map Ht1/4 to predict the up-
range is [0.3m, 150m]. We establish 4 flip connections date items ∆D̂t+1
c , ∆N̂
t+1
. Both projection heads
from different levels of encoder blocks to the decoder are composed of two linear layers with a sandwiched
to merge more low-level features. An hourglass sub- ReLU activation layer.
network is attached to the head of the decoders to
Resource comparison of different models. We
enhance background predictions.
compare the resource and performance among our
Details for ViT models. We use dino-v2 trans-
model families in Tab. 1. All inference-time and GPU
formers [?] with registers [?] as our encoders, which
memory results are computed on an Nvida-A100 40G
are pretrained on a curated dataset with 142M im-
GPU with the original pytorch implemented mod-
ages. DPT [?] is used as the decoders. For the ViT-S
els (No engineering optimization like TensorRT or
and ViT-L variants, the DPT decoders take only the
last-layer normalized encoder features as the input for
stabilized training. The giant ViT-g model instead 𝑡𝑡
𝐇𝐇1/14 C-GRU
𝑡𝑡+1
𝐇𝐇1/14 𝐇𝐇𝑠𝑠𝑡𝑡 Hidden feature map
takes varying-layer features, the same as the original down

� 𝑡𝑡𝑐𝑐 𝐍𝐍� 𝑡𝑡
𝑡𝑡
up 𝐃𝐃 Intermediate predictions
DPT settings. Different from the convnets models 𝐇𝐇1/7 c C-GRU
𝑡𝑡+1
𝐇𝐇1/7 up down Up/downsample
down

above, we use depth bins ranging from [0.1m, 200m] 𝑡𝑡


up
𝑡𝑡+1 c Concatenate
𝐇𝐇1/4 c C-GRU 𝐇𝐇1/4
for ViT models. C-GRU ConvGRU

Details for recurrent blocks. As illustrated in � 𝑡𝑡𝑐𝑐


𝐃𝐃 𝒢𝒢𝑑𝑑 � 𝑡𝑡+1
∆𝐃𝐃 𝑐𝑐
Linear
Linear
Linear
Linear

ReLU

c c
ReLU

Fig 1. Each recurrent block updates hierarchical fea- � 𝑡𝑡+1


1 1 1 � 𝑡𝑡
𝐍𝐍 𝒢𝒢𝑛𝑛 ∆𝐍𝐍
tures maps {Ht1/14 , Ht1/7 , Ht1/4 } at { 14 , 7 , 4 } scales
and the intermediate predictions sD̂tc , N̂t at each it- Figure 1: Detailed structure of the update
eration step t. This block compromises three Con- block. Inspired by RAFTStereo [?], we build slow-
vGRU sub-blocks to refine feature maps at different fast ConvGRU sub-blocks (denoted as ‘C-GRU’) to
scales, and two projection heads Gd and Gn to pre- refine hierarchical hidden feature maps. Projection
dict updates for depth and normal respectively. The heads Gd , Gn are attached to the end of the final Con-
feature maps are gradually refined from the coarsest vGRU to prediction update items for the predictions.

1
ONNX). Generally, the enormous ViT-Large/giant- the ADAM optimizer with the initial learning rate
backbone models enjoy better performance, while the beginning at 10−6 and linear decayed to 10−7 within
others are more deployment-friendly. In addition, 6K steps. Notably, such finetune does not require
our models built in classical en-decoder schemes run a large batch size like large-scale training. We use
much faster than the recent diffusion counterpart [?]. in practice a batch-size of 16 for the ViT-g model
and 32 for ViT-L. The models will converge quickly
1.1 Datasets and Training and Testing in approximately 2K steps. To stabilize finetuning,
we also leverage the predictions D̃c Ñ of the pre-
We collect over 16M data from 18 public datasets for finetuned model as alternative pseudo labels. These
training. Datasets are listed in Tab. 2. When train- labels can sufficiently impose supervision upon the
ing the ConvNeXt-backbone models, we use a smaller annotation-absent regions. The complete loss can be
collection containing the following 11 datasets with formulated as:
8M images: DDAD [?], Lyft [?], DrivingStereo [?], ∗ ∗
DIML [?], Argoverse2 [?], Cityscapes [?], DSEC [?], Lf t = 0.01Ld−n (Dc , N) + Ld (Dc , Dc ) + Ln (N, N )
Maplillary PSD [?], Pandaset [?], UASOL[?], and +0.01(Ld (Dc , D̃c ) + Ln (N, Ñ)),
Taskonomy [?]. In the autonomous driving datasets, (1)
including DDAD [?], Lyft [?], DrivingStereo [?],
Argoverse2 [?], DSEC [?], and Pandaset [?], have where Dc and N are the predicted depth in the
provided LiDar and camera intrinsic and extrin- canonical space and surface normal, D∗c and N∗ are
sic parameters. We project the LiDar to image the groundtruth labels, Ld , Ln , and Ld−n are the
planes to obtain ground-truth depths. In contrast, losses for depth, normal, and depth-normal consis-
Cityscapes [?], DIML [?], and UASOL [?] only pro- tency introduced in the main text.
vide calibrated stereo images. We use raftstereo [?] Evaluation of zero-shot 3D scene reconstruc-
to achieve pseudo ground-truth depths. Mapillary tion. In this experiment, we use all methods’ re-
PSD [?] dataset provides paired RGB-D, but the leased models to predict each frame’s depth and use
depth maps are achieved from a structure-from- the ground-truth poses and camera intrinsic param-
motion method. The camera intrinsic parameters eters to reconstruct point clouds. When evaluat-
are estimated from the SfM. We believe that such ing the reconstructed point cloud, we employ the it-
achieved metric information is noisy. Thus we do not erative closest point (ICP) [?] algorithm to match
enforce learning-metric-depth loss on this data, i.e., the predicted point clouds with ground truth by a
Lsilog , to reduce the effect of noises. For the Taskon- pose transformation matrix. Finally, we evaluate the
omy [?] dataset, we follow LeReS [?] to obtain the Chamfer ℓ1 distance and F-score on the point cloud.
instance planes, which are employed in the pair-wise Reconstruction of in-the-wild scenes. We col-
normal regression loss. During training, we employ lect several photos from Flickr. From their associated
the training strategy from [?] to balance all datasets camera metadata, we can obtain the focal length fˆ
in each training batch. and the pixel size δ. According to fˆ/δ, we can ob-
The testing data is listed in Tab. 2. All of them tain the pixel-represented focal length for 3D recon-
are captured by high-quality sensors. In testing, we struction and achieve the metric information. We use
employ their provided camera intrinsic parameters meshlab software to measure some structures’ size on
to perform our proposed canonical space transforma- point clouds. More visual results are shown in Fig.
tion. 6.
Generalization of metric depth estimation. To
evaluate our method’s robustness of metric recovery,
1.2 Details for Some Experiments
we test on 7 zero-shot datasets, i.e. NYU, KITTI,
Finetuning protocols. To finetune the large-scale- DIODE (indoor and outdoor parts), ETH3D, iBims-
data trained models on some specific datasets, we use 1, and NuScenes. Details are reported in Tab. 2. We

2
Table 1: Comparative analysis of resource and performance across our model families includes evaluation of
resource utilization metrics such as inference speed, memory usage, and the proportion of optimization mod-
ules. Additionally, we assess metric depth performance on KITTI/NYU datasets and normal performance
on NYUv2 dataset, with results derived from checkpoints without fine-tuning. For ViT models, inference
speed is measured using 16-bit precision (Bfloat16), which is the same precision as the training setup.

Model Resource KITTI Depth NYUv2 Depth NYUv2 Normal


Encoder Decoder Optim. Speed GPU Memory Optim. time AbsRel↓ δ1 ↑ AbsRel↓ δ1 ↑ Median↓ 30◦ ↑
Marigold[?] VAE+U-net U-net+VAE - 0.13 fps 17.3G - No metric No metric No metric No metric - -
Ours ConvNeXt-Large Hourglass - 10.5 fps 4.2G - 0.053 0.965 0.083 0.944 - -
Ours ViT-Small DPT 4 steps 11.6 fps 2.9G 3.4% 0.070 0.937 0.084 0.945 7.7 0.870
Ours ViT-Large DPT 8 steps 9.5 fps 7.0G 9.5% 0.052 0.974 0.063 0.975 7.0 0.881
Ours ViT-giant DPT 8 steps 5.0 fps 15.6G 25% 0.051 0.977 0.067 0.980 7.1 0.881

use the officially provided focal length to predict the directly run their released codes on KITTI. With
metric depths. All benchmarks use the same depth Droid-SLAM predicted poses, we unproject depths to
model for evaluation. We don’t perform any scale the 3D point clouds and fuse them together to achieve
alignment. dense metric mapping. More qualitative results are
Evaluation on affine-invariant depth bench- shown in Fig. 5.
marks. We follow existing affine-invariant depth es-
timation methods to evaluate 5 zero-shot datasets. 1.3 More Visual Results
Before evaluation, we employ the least square fitting
to align the scale and shift with ground truth [?]. Qualitative comparison of depth and normal
Previous methods’ performance is cited from their estimation. In Figs 2, 3, we compare visualized
papers. depth and normal maps from the Vit-g CSTM label
Dense-SLAM Mapping. This experiment is con- model with ZoeDepth [?], Bae etal [?], and Om-
ducted on the KITTI odometry benchmark. We use nidata [?]. In Figs. 4, 8, 9, and 10, We show the
our model to predict metric depths, and then naively qualitative comparison of our depth maps from the
input them to the Droid-SLAM system as an ini- ConvNeXt-L CSTM label model with Adabins [?],
tial depth. We do not perform any finetuning but NewCRFs [?], and Omnidata [?]. Our results have
Table 2: Training and testing datasets used for ex- much fine-grained details and less artifacts.
periments. Reconstructing 360◦ NuScenes scenes. Cur-
rent autonomous driving cars are equipped with sev-
Datasets Scenes Source Label Size #Cam. eral pin-hole cameras to capture 360◦ views. Cap-
Training Data
DDAD [?] Outdoor Real-world Depth ∼80K 36+ turing the surround-view depth is important for au-
Lyft [?] Outdoor Real-world Depth ∼50K 6+
Driving Stereo (DS) [?] Outdoor Real-world Depth ∼181K 1 tonomous driving. We sampled some scenes from the
DIML [?] Outdoor Real-world Depth ∼122K 10
Arogoverse2 [?] Outdoor Real-world Depth ∼3515K 6+ testing data of NuScenes. With our depth model,
Cityscapes [?] Outdoor Real-world Depth ∼170K 1
DSEC [?] Outdoor Real-world Depth ∼26K 1 we can obtain the metric depths for 6-ring cameras.
Mapillary PSD [?] Outdoor Real-world Depth 750K 1000+
Pandaset [?] Outdoor Real-world Depth ∼48K 6 With the provided camera intrinsic and extrinsic pa-
UASOL [?] Outdoor Real-world Depth ∼1370K 1
Virtual KITTI [?] Outdoor Synthesized Depth 37K 2 rameters, we unproject the depths to the 3D point
Waymo [?] Outdoor Real-world Depth ∼1M 5
Matterport3d [?] In/Out Real-world Depth + Normal 144K 3 cloud and merge all views together. See Fig. 7 for de-
Taskonomy [?] Indoor Real-world Depth + Normal ∼4M ∼1M
Replica [?] Indoor Real-world Depth + Normal ∼150K 1 tails. Note that 6-ring cameras have different camera
ScanNet†[?] Indoor Real-world Depth + Normal ∼2.5M 1
HM3d [?] Indoor Real-world Depth + Normal ∼2000K 1 intrinsic parameters. We can observe that all views’
Hypersim [?] Indoor Synthesized Depth + Normal 54K 1
Testing Data point clouds can be fused together consistently.
NYU [?] Indoor Real-world Depth+Normal 654 1
KITTI [?] Outdoor Real-world Depth 652 4
ScanNet†[?] Indoor Real-world Depth+Normal 700 1
NuScenes (NS) [?] Outdoor Real-world Depth 10K 6
ETH3D [?] Outdoor Real-world Depth 431 1
DIODE [?] In/Out Real-world Depth 771 1
iBims-1 [?] Indoor Real-world Depth 100 1

ScanNet is a non-zero-shot testing dataset for our ViT models.

3
RGB GT Depth Ours Depth ZoeDepth GT Normal Ours Normal Bae et al.
4
Figure 2: Depth and normal estimation. The visual comparison of predicted depth and normal on
indoor/outdoor scenes from NYUv2, iBims, Eth3d, and ScanNet. Our depth and normal maps come from
the ViT-g CSTM label model.
RGB GT Depth Ours Depth ZoeDepth Ours Normal Bae et al.
Figure 3: Depth and normal estimation. The visual comparison of predicted depth and normal on
driving scenes from KITTI, Nuscenes, DIML, DDAD, and Waymo. Our depth and normal maps come from
the ViT-g CSTM label model.
5
RGB GT Ours NewCRFs Adabins Omnidata
Figure 4: The visual comparison of predicted depth on iBims, ETH3D, and DIODE. Our depth maps come
from the ConvNeXt-L CSTM label model.

6
Droid-SLAM

Droid-SLAM
Ours

Ours
GT
Trajectory GT
Trajectory
Droid-SLAM
Droid-SLAM

GT
Trajectory
Ours
Ours

GT
GT Trajectory
Trajectory

Figure 5: Dense-SLAM Mapping. Existing SOTA mono-SLAM methods usually face scale drift problems
in large-scale scenes and are unable to achieve the metric scale. We show the ground-truth trajectory and
Droid-SLAM [?] predicted trajectory and their dense mapping. Then, we naively input our metric depth to
Droid-SLAM, which can recover a much more accurate trajectory and perform the metric dense mapping.

7
Kodak
C913

2.2m
1.3m

3.3m

3.4m
3.4m
Olympus
X450

4.3m

4.5m
Panasonic
DMC-FS40

15.7m

6.5m
5.6m

Fujifilm
X-T10
4.9m

16.8m

RGB Point Cloud (view 1) Point Cloud (view 2)


Figure 6: 3D metric reconstruction of in-the-wild images. We collect several Flickr images and use
our model to reconstruct the scene. The focal length information is collected from the photo’s metadata.
From the reconstructed point cloud, we can measure some structures’ sizes. We can observe that sizes are
in a reasonable range.

8
Point Cloud of Car

Ring RGBs & Depth Point Cloud

Figure 7: 3D reconstruction of 360◦ views. Current autonomous driving cars are equipped with several
pin-hole cameras to capture 360◦ views. With our model, we can reconstruct each view and smoothly fuse
them together. We can see that all views can be well merged together without scale inconsistency problems.
Testing data are from NuScenes. Note that the front view camera has a different focal length from other
views.

9
RGB GT Ours NewCRFs Adabins Omnidata

Figure 8: Depth estimation. The visual comparison of predicted depth on iBims, ETH3D, and DIODE.
Our depth maps come from the ConvNeXt-L CSTM label model.

10
RGB GT Ours NewCRFs Adabins Omnidata

Figure 9: Depth estimation. The visual comparison of predicted depth on iBims, ETH3D, and DIODE.
Our depth maps come from the ConvNeXt-L CSTM label model.

11
RGB GT Ours NewCRFs Adabins Omnidata

Figure 10: Depth estimation. The visual comparison of predicted depth on iBims, ETH3D, and DIODE.
Our depth maps come from the ConvNeXt-L CSTM label model.

12

You might also like