
Multiview Compressive Coding for 3D Reconstruction

Chao-Yuan Wu Justin Johnson Jitendra Malik Christoph Feichtenhofer Georgia Gkioxari


FAIR, Meta AI

arXiv:2301.08247v1 [cs.CV] 19 Jan 2023

Abstract

A central goal of visual recognition is to understand objects and scenes from a single image. 2D recognition has witnessed tremendous progress thanks to large-scale learning and general-purpose representations. Comparatively, 3D poses new challenges stemming from occlusions not depicted in the image. Prior works try to overcome these by inferring from multiple views or rely on scarce CAD models and category-specific priors which hinder scaling to novel settings. In this work, we explore single-view 3D reconstruction by learning generalizable representations inspired by advances in self-supervised learning. We introduce a simple framework that operates on 3D points of single objects or whole scenes coupled with category-agnostic large-scale training from diverse RGB-D videos. Our model, Multiview Compressive Coding (MCC), learns to compress the input appearance and geometry to predict the 3D structure by querying a 3D-aware decoder. MCC's generality and efficiency allow it to learn from large-scale and diverse data sources with strong generalization to novel objects imagined by DALL·E 2 or captured in-the-wild with an iPhone.

[Figure 1: (a) MCC overview, showing an RGB-D image, unprojection, encodings, attention, and queries producing a 3D reconstruction; (b) input/output examples of 3D reconstructions by MCC.]
Figure 1. Multiview Compressive Coding (MCC). (a): MCC encodes an input RGB-D image and uses an attention-based model to predict the occupancy and color of query points to form the final 3D reconstruction. (b): MCC generalizes to novel objects captured with iPhones (left) or imagined by DALL·E 2 [48] (middle). It is also general – it works not only on objects but also scenes (right).

1. Introduction

Images depict objects and scenes in diverse settings. Popular 2D visual tasks, such as object classification [8] and segmentation [33, 83], aim to recognize them on the image plane. But image planes do not capture scenes in their entirety. Consider Fig. 1a. The toy's left arm is not visible in the image. This is framed by the task of 3D reconstruction: given an image, fully reconstruct the scene in 3D.

3D reconstruction is a longstanding problem in AI with applications in robotics and AR/VR. Structure from Motion [19, 61] lifts images to 3D by triangulation. Recently, NeRF [38] optimizes radiance fields to synthesize novel views. These approaches require many views of the same scene during inference and do not generalize to novel scenes from a single image. Others [17, 68] predict 3D from a single image but rely on expensive CAD supervision [6, 60].

Reminiscent of generalized cylinders [41], some introduce object-specific priors via category-specific 3D templates [26, 30, 32], pose [43] or symmetries [73]. While impressive, these methods cannot scale as they rely on onerous 3D annotations and category-specific priors which are not generally true. Alas, large-scale learning, which has shown promising generalization results for images [46] and language [3], is largely underexplored for 3D reconstruction.

Image-based recognition is entering a new era thanks to domain-agnostic architectures, like transformers [11, 65], and large-scale category-agnostic learning [20]. Motivated by these advances, we present a scalable, general-purpose model for 3D reconstruction from a single image. We introduce a simple, yet effective, framework that operates directly on 3D points. 3D points are general as they can capture any objects or scenes and are more versatile and efficient than meshes and voxels. Their generality and efficiency enable large-scale category-agnostic training. In turn, large-scale training makes our 3D model effective.

Project page: https://mcc3d.github.io

Note: Computer-aided design (CAD) is a way to digitally create 2D drawings and 3D models of real-world products before they are ever manufactured. Category-agnostic vision models aim to build a generic model that can be applied to any category of a particular task.



Central to our approach is an input encoding and a queriable 3D-aware decoder. The input to our model is a single RGB-D image, which returns the visible (seen) 3D points via unprojection. Image and points are encoded with transformers. A new 3D point, sampled from 3D space, queries a transformer decoder conditioned on the input to predict its occupancy and its color. The decoder reconstructs the full, seen and unseen, 3D geometry, as shown in Fig. 1a. Our occupancy-based formulation, introduced in [37], frames 3D reconstruction as a binary classification problem and removes constraints pertinent to specialized representations (e.g., deformations of a 3D template) or a fixed resolution. Being tasked with predicting the unseen 3D geometry of diverse objects or scenes, our decoder learns a strong 3D representation. This finding directly connects to recent advances in image-based self-supervised learning and masked autoencoders (MAE) [20], which learn powerful image representations by predicting masked (unseen) image patches.

Our model inputs single RGB-D images, which are ubiquitous thanks to advances in hardware. Nowadays, depth sensors are found in iPhone's front and back cameras. We show results from iPhone captures in §4 and Fig. 1b. Our decoder predicts point cloud occupancies. Supervision is sourced from multiple RGB-D views, e.g., video frames, with relative camera poses, e.g., from COLMAP [55, 56]. The posed views produce 3D point clouds which serve as proxy ground truth. These point clouds are far from "perfect" as they are susceptible to sensor and camera pose noise. However, we show that when used at scale they are sufficient for our model. This suggests that 3D annotations, which are expensive to acquire, can be replaced with many RGB-D video captures, which are much easier to collect.

We call our approach Multiview Compressive Coding (MCC), as it learns from many views, compresses appearance and geometry and learns a 3D-aware decoder. We demonstrate the generality of MCC by experimenting on six diverse data sources: CO3D [51], Hypersim [52], Taskonomy [81], ImageNet [8], in-the-wild iPhone captures and DALL·E 2 [48] generations. These datasets range from large-scale captures of more than 50 common object types, to holistic scenes, such as warehouses, auditoriums, lofts, restaurants, and imaginary objects. We compare to state-of-the-art methods, tailored for single objects [21, 51, 79] and scene reconstruction [31], and show our model's superiority in both settings with a unified architecture. Enabled by MCC's general-purpose design, we show the impact of large-scale learning in terms of reconstruction quality and zero-shot generalization on novel object and scene types.

2. Related Work

Multiview 3D reconstruction is a longstanding problem in computer vision. Traditional techniques include binocular stereopsis [70], SfM [19, 54, 61–63], and SLAM [5, 58]. Reconstruction by analysis [12] or synthesis via volume rendering [25] of implicit [38, 82] and explicit [34, 57] representations has been shown to produce strong results. Supervised approaches predict voxels [67, 74] or meshes [69, 71] by training deep nets. These techniques produce high-quality outputs, but rely on multiple views at test time. In this work, we assume a single RGB-D image during inference.

Single-view 3D reconstruction is challenging. One line of work trains models that predict 3D geometry via CAD [17, 68], mesh [31, 75], voxel [16, 72] or point cloud [13, 37] supervision. Results are commonly demonstrated on synthetic, simplistic benchmarks, such as ShapeNet [6], or for a small set of object categories, as in Pix3D [60]. Weakly supervised approaches use category-specific priors via 3D shape templates [18, 26, 30] and pose [43], or learn via 2D silhouettes and re-projection on posed views [7, 27, 35, 50]. While impressive, these approaches are limited to specific objects from a closed-world vocabulary. Some [66, 77] explore category-agnostic models, but focus on synthetic datasets. In this work, we learn a general-purpose 3D representation from RGB-D views from a diverse and large set of data sources of real-world objects and scenes.

Shape completion methods complete the 3D geometry of partial reconstructions. For objects, methods directly output full point clouds [22, 79, 80] or deploy generative models [76, 84], but are typically tied to a fixed resolution. For scenes, techniques include plane fitting [39], 3D model fitting and retrieval [15, 40], or leverage symmetries [28] and predict 3D semantics [4, 14, 59]. Our model tackles both objects and scenes with a unified architecture and outputs any-resolution 3D geometry with a 3D-aware decoder. We compare to recent shape completion techniques.

Implicit 3D representations such as SDFs [44, 53] and occupancy nets (OccNets) [37] have proven effective 3D representations. NeRF [38] optimizes per-scene neural fields for view synthesis. NeRF extensions target scene generalization by encoding input views with deep nets [21, 47, 78] or improve reconstruction quality by supervising with depth [9]. MCC adopts an occupancy-based representation, similar to OccNets [37], with an attention mechanism on encoded appearance and geometric cues which allows it to predict in any 3D region, even outside the camera frustum, efficiently. We show that this strategy outperforms the global-feature strategy from OccNets [37] or the single-location features used in NeRF-based methods [21, 51].

Self-supervised learning has advanced image [2, 20, 46] and language [3, 10] understanding. For images, masked autoencoders [20] paired with transformers and large-scale category-agnostic training learn general representations for 2D recognition. We draw from these findings and extend the architecture and learning for the task of 3D reconstruction.

Figure 2. Model Overview. Given an RGB-D image, MCC unprojects the pixels of the input RGB image I to the corresponding 3D points P. An image encoder E^RGB and a geometry encoder E^XYZ encode I and P into a 3D-aware representation R. A decoder predicts the occupancy σ_i and color c_i of query q_i, i = 0, ..., N^q − 1, conditioned on R. The predicted colored points form the final 3D reconstruction.
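The unprojection from the input RGB-D image to the seen points P (Fig. 2) is a standard pinhole-camera operation that the paper does not spell out. Below is a minimal NumPy sketch under the assumption of known intrinsics (fx, fy, cx, cy) and metric depth; it is illustrative, not the released implementation.

```python
import numpy as np

def unproject_rgbd(depth, fx, fy, cx, cy):
    """Lift an H x W depth map to per-pixel 3D points P of shape (H, W, 3).

    Assumes a pinhole camera with intrinsics (fx, fy, cx, cy). Pixels with
    invalid depth (<= 0 or non-finite) are returned as NaN, so that a special
    embedding can be used for them downstream, as described in Sec. 3.4.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    z = depth.astype(np.float32)
    x = (u - cx) / fx * z                            # X = (u - cx) * Z / fx
    y = (v - cy) / fy * z                            # Y = (v - cy) * Z / fy
    P = np.stack([x, y, z], axis=-1)                 # (H, W, 3)
    invalid = ~np.isfinite(z) | (z <= 0)
    P[invalid] = np.nan                              # mark unknown-depth pixels
    return P

# Example: a dummy 224 x 224 depth map with made-up intrinsics.
depth = np.full((224, 224), 2.0, dtype=np.float32)
P = unproject_rgbd(depth, fx=200.0, fy=200.0, cx=112.0, cy=112.0)
print(P.shape)  # (224, 224, 3)
```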

3. Multiview Compressive Coding (MCC)

MCC adopts an encoder-decoder architecture. The input RGB-D image is fed to the encoder to produce encoding R. The decoder inputs a query 3D point q_i ∈ R^3, along with R, to predict its occupancy probability σ_i ∈ [0, 1], as in [37], and RGB color c_i ∈ [0, 1]^3. Fig. 2 illustrates our model.

During training, we supervise MCC with "true" points derived from posed RGB-D views. These point clouds serve as ground truth: q_i is labeled as positive if it is close to the ground truth and negative otherwise. Intuitively, the other views guide the model to reason about what parts of the unseen space belong to the object or scene. As a result, the input encoding R learns a representation of the full 3D geometry and guides the decoder to make the right prediction.

During inference, the model predicts occupancy and color for a grid of points at any desired resolution. The set of occupied colored points forms the final reconstruction.

MCC requires only points for supervision, extracted from posed RGB-D views, e.g., video frames. Note that the derived point clouds, which serve as ground truth, are far from perfect due to noise in the captures and pose estimation. However, when used at scale they are sufficient. This deviates from OccNets [37] and other distance-based works [44, 53] which rely on clean CAD models or 3D meshes. This is an important finding as it suggests that expensive CAD supervision can be replaced with cheap RGB-D video captures. This property of MCC allows us to train on a wide range of diverse data. In §4, we show that large-scale training is crucial for high-quality reconstruction.

3.1. MCC Encoder

The input to our model is a single RGB-D image. Let I ∈ R^{H×W×3} be the RGB image and ∆ ∈ R^{H×W} the associated depth. We use ∆ to unproject the pixels into their positions P ∈ R^{H×W×3} in 3D. I and P are encoded into a single representation R via

    R := f(E^RGB(I), E^XYZ(P)) ∈ R^{N^enc × C}    (1)

E^RGB and E^XYZ are two transformers [65]. E^RGB follows a ViT architecture [11] to encode the input image I. E^XYZ processes the input points P similar to a ViT, but encodes 3D coordinates instead of RGB color channels. We explain in detail how to adapt a ViT to encode the input points P in §3.4. f concatenates the two outputs from the transformers along the channel dimension, followed by a linear projection to C dimensions. N^enc is the number of tokens used in the transformers. Fig. 2 shows an illustration.

The proposed two-tower design is general and performant. Alternative designs are ablated in §4.

3.2. MCC Decoder

The decoder takes as input the output of the encoder, R, and N^q 3D point queries q_i, i = 0, ..., N^q − 1, to predict occupancy and colors for each point,

    (σ_0, c_0), (σ_1, c_1), ... := Dec(R, q_0, q_1, ...)    (2)

The decoder Dec linearly projects each query q_i to C dimensions (the same as R), concatenates them with R in the token dimension, and then uses a transformer to model the interactions between R and the queries. We draw inspiration from MAE [20] for this design. The output feature of each query token is passed through a binary classification head that predicts its occupancy σ_i, and a 256-way classification head that predicts its RGB color c_i [64].

As described in Eq. 2, we feed multiple queries to the decoder for efficiency via parallelization, which significantly speeds up training and inference. However, since all tokens attend to all tokens in a standard transformer, this creates undesirable dependencies among queries. To break the unwanted dependencies, we mask out the attention weights such that tokens cannot attend to the other queries (except for self). This masking pattern is illustrated in Fig. 3.

MCC's attention architecture differentiates it from prior 3D reconstruction approaches. In [37, 42], points condition on a globally pooled image feature; in [21, 47, 78] they condition on the projected locations of the image feature map. In §4 we show that MCC's design performs better.

The computation of the decoder grows with the number of queries, while the encoder embeds the input image once regardless of the final output resolution. By using a relatively lightweight decoder, our inference is made efficient even at high resolutions, and the encoder cost is amortized. This allows us to dynamically change output resolutions and does not require re-computing the input encoding R.
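To make Eq. 1 and Eq. 2 concrete, here is a minimal PyTorch-style sketch of the two-tower fusion and the query decoder with its occupancy and 256-way color heads, including one way to realize the query masking described above (Fig. 3). All names (MCCSketch, rgb_tokens, etc.), layer counts, and widths are illustrative assumptions; the patch embeddings, the [cls] token, and other details of the released model are omitted.

```python
import torch
import torch.nn as nn

class MCCSketch(nn.Module):
    """Illustrative sketch of the MCC encoder-decoder (Eq. 1 and Eq. 2).

    Pre-computed token sequences stand in for the ViT patch embeddings of the
    real model; widths and layer counts here are placeholders.
    """

    def __init__(self, C=768, dec_dim=512, nhead=8):
        super().__init__()
        def block(dim):
            layer = nn.TransformerEncoderLayer(dim, nhead=nhead, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.enc_rgb = block(C)              # E^RGB tower
        self.enc_xyz = block(C)              # E^XYZ tower
        self.fuse = nn.Linear(2 * C, C)      # f in Eq. 1: channel concat + linear
        self.proj_r = nn.Linear(C, dec_dim)
        self.proj_q = nn.Linear(3, dec_dim)  # linear projection of query points
        self.dec = block(dec_dim)
        self.occ_head = nn.Linear(dec_dim, 1)        # binary occupancy logit
        self.rgb_head = nn.Linear(dec_dim, 3 * 256)  # 256-way logits per channel

    def forward(self, rgb_tokens, xyz_tokens, queries):
        # Eq. 1: R = f(E^RGB(I), E^XYZ(P)), concatenated along channels.
        R = self.fuse(torch.cat([self.enc_rgb(rgb_tokens),
                                 self.enc_xyz(xyz_tokens)], dim=-1))
        # Eq. 2: concatenate queries with R along the token dimension.
        n_r, n_q = R.shape[1], queries.shape[1]
        tokens = torch.cat([self.proj_r(R), self.proj_q(queries)], dim=1)
        # One way to realize the Fig. 3 property: queries attend to R and to
        # themselves only, and R tokens do not attend to queries (True = blocked).
        mask = torch.zeros(n_r + n_q, n_r + n_q, dtype=torch.bool)
        mask[:n_r, n_r:] = True
        mask[n_r:, n_r:] = ~torch.eye(n_q, dtype=torch.bool)
        out = self.dec(tokens, mask=mask)[:, n_r:]              # keep query tokens
        occ = self.occ_head(out).squeeze(-1)                    # (B, Nq)
        rgb = self.rgb_head(out).view(*out.shape[:2], 3, 256)   # (B, Nq, 3, 256)
        return occ, rgb

# Example with dummy token sequences and 550 query points in [-3, 3]^3.
model = MCCSketch()
occ, rgb = model(torch.randn(1, 197, 768), torch.randn(1, 197, 768),
                 torch.rand(1, 550, 3) * 6 - 3)
print(occ.shape, rgb.shape)
```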
Figure 3. Attention Masking Pattern in MCC's Decoder. The masking in MCC's decoder ensures a query cannot depend on another, apart from itself. cls is a learnable global summary token, following [10, 11].

3.3. Query Sampling

Training. MCC samples N^q = 550 queries from the 3D world space uniformly and per training example. We ablate sampling strategies in §4. A query is considered "occupied" (positive) if it is located within radius τ = 0.1 of a ground truth point, and "unoccupied" (negative) otherwise. The ground truth is defined as the union of all unprojected points from all RGB-D views of the scene.

Inference. We uniformly sample a grid of points covering the 3D space. Queries with occupancy score greater than a threshold of 0.1, together with their color predictions, form the final reconstruction. Techniques such as octrees [36] could be easily integrated to further speed up test-time sampling.

3.4. Implementation Details

E^XYZ Patch Embeddings. Note that the depth values, and consequently the 3D locations in P, might be unknown for some points (e.g., due to sensor uncertainty). Thus, the convolution-based patch embedding design in a ViT [11] is not directly applicable. We use a self-attention-based design instead. First, the 3D coordinates are transformed. For pixels with unknown depth, we learn a special C-dimensional embedding. For pixels with valid depth, their 3D points are linearly transformed to a C-dimensional vector. This results in a 16×16×C representation for each 16×16 patch. A transformer, shared across patches, converts each patch to a C-dimensional vector via a learned patch token which summarizes the patch [10]. This results in W/16 × H/16 tokens (and thus N^enc = W/16 × H/16 + 1 with the additional global token used in a ViT [11]).

E^RGB Patch Embeddings. For RGB, we follow standard ViTs [11] and embed each 16×16 patch with a convolution.

Architecture. The E^RGB and E^XYZ encoders use a 12-layer 768-dimensional "ViT-Base" architecture [11, 65]. The input image size is 224×224. Our decoder is a lighter-weight 8-layer 512-dimensional transformer, following MAE [20]. Detailed specifications can be found in the Appendix.

4. Object Reconstruction Experiments

MCC works naturally for both objects and scenes. In §4, we show results and compare to competing methods for single object reconstruction. In §5, we show results on scenes.

Dataset. We use CO3D-v2 [51] as our main dataset for single object reconstruction. It consists of ∼37k short videos of 51 object categories; it is the largest dataset of 3D objects in the wild. To show generalization to new objects, we hold out 10 randomly selected categories for evaluation and train on the remaining 41. The list of held-out categories is available in the Appendix. Since CO3D is object-centric, we focus on foreground objects specified by segmentation masks provided in CO3D. Full 3D annotations, such as 3D meshes, are not available. CO3D extracts point clouds from the videos via COLMAP [55, 56], which are inevitably noisy and are used to train our model. Despite imperfect supervision, we show that MCC learns to reconstruct 3D shapes and texture and even corrects the noisy depth inputs.

Metrics. Following Kulkarni et al. [31], we report: accuracy (acc), the percentage of predicted points within ρ of a ground truth point; completeness (cmp), the percentage of ground truth points within ρ of a predicted point; and their F-score (F1), which drives our comparisons. ρ is 0.1.

Training Details. We train with Adam [29] for 150k iterations with an effective batch size of 512 using 32 GPUs, a base learning rate of 10^−4 with a cosine schedule and a linear warm-up for the first 5% of iterations. Training takes ∼2.5 days. We randomly scale-augment images by s ∈ [0.8, 1.2]. We also perform 3D augmentations by randomly rotating 3D points along each axis by θ ∈ [−180°, 180°]. Rotation is applied to the seen points P, the queries and the ground truth. Image I and points P are aligned through the concatenation of their encodings (Eq. 1). Points P and queries are consistent as well, as both are rotated. Essentially, our 3D augmentations build in rotation equivariance.

Coordinate System. We adopt the original CO3D coordinate system from [51], where objects are normalized to have zero mean and unit variance. Training and testing points are sampled from [−3, 3] along each axis. Evaluation points are sampled with a granularity of 0.1.

4.1. Qualitative Results on Novel Categories

Fig. 4 shows qualitative results on the CO3D test set of novel categories. We show reconstructions for a variety of shapes and object types. MCC tackles heavy self-occlusions, e.g., the mug handle is barely visible in the input image, and complex shapes, e.g., the toy airplane. In addition to shape, MCC predicts texture, which is difficult especially for unseen regions. For instance, the left and back sides of the kids' backpack are completely invisible, but MCC predicts to propagate the color from the right side. We also note that MCC is robust to noisy depth from COLMAP, present at varying degrees and depicted in the seen points of each example (top row). MCC corrects and completes the geometry in spite of the noise in depth inputs. We emphasize that we do not make geometric assumptions nor use any priors such as symmetry or mean templates when reconstructing objects. MCC learns only from data.
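As a reference for how this supervision is constructed, the training-time query sampling and labeling rule of §3.3 (uniform queries in [−3, 3]^3, positives within radius τ = 0.1 of the fused multi-view point cloud) can be sketched in a few lines. This is an illustrative NumPy version, not the released code.

```python
import numpy as np

def sample_training_queries(gt_points, n_queries=550, tau=0.1, world_range=3.0,
                            rng=None):
    """Sample queries uniformly in [-world_range, world_range]^3 and label each
    one "occupied" if it lies within radius tau of any ground-truth point (the
    union of unprojected points from all posed RGB-D views), per Sec. 3.3.
    Brute-force distances for clarity; a KD-tree would be used at scale.
    """
    rng = rng if rng is not None else np.random.default_rng()
    queries = rng.uniform(-world_range, world_range,
                          size=(n_queries, 3)).astype(np.float32)
    d = np.linalg.norm(queries[:, None, :] - gt_points[None, :, :], axis=-1)
    occupied = d.min(axis=1) < tau
    return queries, occupied

# Toy example: a random stand-in for the fused multi-view point cloud.
gt = np.random.rand(1000, 3).astype(np.float32)
queries, labels = sample_training_queries(gt)
print(queries.shape, labels.mean())   # (550, 3) and the fraction of positives
```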
[Figure 4 panels: for each example, the input image, the seen 3D points, and the output reconstruction.]
Figure 4. Predictions on CO3D-v2 Novel Categories. For each example, we show the input image (left), the unprojected seen 3D points (top), and our reconstruction (bottom). We show results for a variety of object types, shapes, textures and occlusion patterns. We emphasize that we do not use any shape priors such as symmetries, canonical views, or mean shapes. See project page for animations.

(a) Encoder Structure
                        Acc   Cmp   F1
  Shared                42.6  77.0  52.5
  Decoupled (ours)      47.5  76.0  56.7

(b) E^XYZ Design
                        Acc   Cmp   F1
  MLP                   43.4  79.8  54.5
  PointNet [45]         45.6  80.3  56.6
  Transformer (ours)    47.5  76.0  56.7

(c) Training Query Sampling
                        Acc   Cmp   F1
  Contrastive           45.0  78.7  55.6
  Uniform (ours)        47.5  76.0  56.7

(d) Feature Conditioning
                        Acc   Cmp   F1
  Loc-pooled            49.2  22.6  28.2
  Global                44.7  77.1  54.5
  Detailed (ours)       47.5  76.0  56.7

(e) Decoder Design
                        Acc   Cmp   F1
  Loc+MLP               49.2  22.6  28.2
  Cross-attn            42.3  49.5  43.7
  Concat+attn (ours)    47.5  76.0  56.7

(f) Comparison to Prior Work with Explicit Design
                        Acc   Cmp   F1    CD (↓)
  PoinTr [79]           79.6  27.1  39.7  0.065
  MCC (w/o RGB)         46.5  70.8  53.9  0.047
  MCC                   47.5  76.0  56.7  0.040

Table 1. Ablations on CO3D-v2, which validate MCC's design choices. We highlight ablation (e), which shows that an attention-based decoder outperforms an MLP, and (f), where we find that MCC's queriable decoder performs better than an explicit design [79]. Higher is better for Accuracy (Acc), Completeness (Cmp), and F1. Lower is better for Chamfer distance (CD).

4.2. Ablation Study

Encoder Structure. In Table 1a, we ablate our encoder design, which models I and P with two separate transformers (decoupled), and compare to a shared transformer which models the fused (sum) patch embeddings of I and P (shared). Our decoupled design performs slightly better.

E^XYZ Design. Table 1b compares our transformer to an MLP and PointNet [45] for the E^XYZ encoder. PointNet and our transformer, which model point interactions, work slightly better than an MLP, though not critically.

Training Query Sampling. In Table 1c, we compare our uniform sampling strategy with a contrastive-style sampling, where each example samples a fixed number of positives and negatives. Both work similarly. We choose uniform sampling because of its simplicity.

Feature Conditioning. Our input encoding R uses all N^enc tokens from the appearance I and geometry P encodings. We call this detailed conditioning and compare it with two popular choices: one where a globally average-pooled vector is used, as in [37, 42], and one where the feature vector is bilinearly interpolated at the projected location in the feature map, as in [21, 47, 78]. Table 1d validates our choice.

Decoder Design. As described in §3, MCC's decoder concatenates queries to the input encoding R in the token dimension, and a transformer models their interactions (concat+attn). We compare this design with two popular ones. Recent works on image-conditioned NeRF [21, 47, 78] condition points on their projected location in the feature map followed by an MLP (loc+MLP) – this comparison was also presented in the context of feature conditioning strategies. Another approach is cross-attention (cross-attn), where the encoded input R only serves as keys/values but not as queries to a transformer, e.g., in Perceiver models [23, 24]. Table 1e shows that our decoder is critical for performance.

Comparison to Prior Work with an Explicit Design. Finally, we compare MCC and its queriable 3D decoder with a state-of-the-art 3D point completion method, PoinTr [79]. PoinTr inputs an incomplete point cloud and predicts a fixed-resolution output using a transformer which models explicit geometric point relations (via nearest neighbors). We train PoinTr on CO3D, where it inputs the set of seen points P. For a fair comparison, we implement PoinTr with the same 12-layer architecture as ours, which is stronger than their 6-layer one. Since PoinTr does not use RGB, we compare with an MCC variant that ignores texture by encoding P but not I. We additionally report chamfer distance (CD), as in [79], and use the same number of points for a fair comparison. Table 1f shows that MCC outperforms PoinTr by a large margin. Fig. 5 presents a qualitative comparison. In §4.5, we also compare to NeRF-based methods.
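The comparisons in Table 1 (and Table 3 later) use the accuracy, completeness, and F1 metrics of §4 at threshold ρ = 0.1, and Table 1f additionally reports chamfer distance. The following is a minimal NumPy sketch of these metrics; the exact chamfer variant follows [79] and may be scaled differently from the symmetric form used here.

```python
import numpy as np

def reconstruction_metrics(pred, gt, rho=0.1):
    """Accuracy, completeness, F1 (Sec. 4, Metrics) and a chamfer distance.

    acc: % of predicted points within rho of a ground-truth point.
    cmp: % of ground-truth points within rho of a predicted point.
    F1:  harmonic mean of acc and cmp.
    cd:  symmetric chamfer distance (one common definition).
    Brute-force pairwise distances for clarity.
    """
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    acc = 100.0 * (d.min(axis=1) < rho).mean()
    cmp = 100.0 * (d.min(axis=0) < rho).mean()
    f1 = 2.0 * acc * cmp / (acc + cmp + 1e-8)
    cd = d.min(axis=1).mean() + d.min(axis=0).mean()
    return acc, cmp, f1, cd

# Example with random point clouds.
print(reconstruction_metrics(np.random.rand(2000, 3), np.random.rand(3000, 3)))
```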
[Figure 5 panels: input image, seen points, PoinTr [79], MCC w/o RGB, MCC.]
Figure 5. Qualitative Comparison to PoinTr [79]. MCC predicts shape details while PoinTr tends to place points roughly around the object. For a fair comparison, MCC predicts the same number of points as PoinTr. Unlike PoinTr, MCC also predicts color.

Method            depth sup. [9]  depth in   seen categ.      unseen categ.
                                              Abs     MSE      Abs     MSE
NeRF-WCE [21]                                 8.43    175.5    10.1    156.4
NeRF-WCE [21]          ✓                      7.38    92.2     9.15    139.9
NeRF-WCE [21]                        ✓        7.46    156.3    8.30    119.4
NeRF-WCE [21]          ✓             ✓        2.75    78.4     2.79    30.5
NerFormer [51]                                2.02    70.4     2.00    20.6
NerFormer [51]         ✓                      2.19    72.8     2.18    23.5
NerFormer [51]                       ✓        2.20    72.1     2.17    22.5
NerFormer [51]         ✓             ✓        2.34    80.7     2.28    24.1
MCC                    ✓             ✓        1.46    38.8     1.17    13.6

Table 2. Comparison to the State-of-the-Art on CO3D-v2 [51]. For a fair comparison with MCC, we extend baselines [21, 51] with depth supervision [9] or using depth as input. MCC outperforms prior state of the art on CO3D-v2 for shape reconstruction.

[Figure 6: two plots of F1 on held-out categories (%) vs. (a) the percentage of training examples (same categories) and (b) the percentage of training categories, showing gains of +3.2% and +5.0%, respectively.]
Figure 6. Scaling Behavior Analysis. We train MCC on (a) a varying number of examples uniformly sampled from all training categories and (b) all examples from a varying number of training categories. All models are evaluated on the same held-out set of novel categories. We see clear performance gains from scaling training data, especially when expanding the number of categories. This supports that category-agnostic models and large-scale training are promising for 3D reconstruction.

4.3. Scaling Behavior Analysis

MCC's strength is that it only requires points for training and does not rely on any shape priors. As a result, MCC can train on a large number of examples. We analyze our model's performance as a function of data size. Fig. 6 shows that scaling the training data leads to steady performance improvements. Furthermore, if we increase the number of categories, and thus the visual diversity of our training data, the improvements are even larger. This suggests two things. First, building category-agnostic scalable models like MCC is a promising direction towards general-purpose 3D reconstruction. Second, expanding the datasets, and especially the set of categories, is promising.

4.4. Zero-Shot Generalization In-the-Wild

In §4.1, we show generalization to novel categories from the CO3D dataset. Now, we turn to in-the-wild settings and show MCC reconstructions on ImageNet [8], iPhone captures, and AI-generated images [48].

iPhone Captures. This is arguably the most popular in-the-wild setting — our personal use of an off-the-shelf smartphone for capturing everyday objects. Specifically, we use iPhones and their depth sensor to take RGB-D images of a diverse set of objects in two of the coauthors' homes (using an iPhone 12 and an iPhone 14 Pro). This is a challenging setting due to the domain shift from the training data and the difference in the depth estimation pipeline (COLMAP in CO3D vs. the sensor on the iPhone). Fig. 7a shows our results. Examples such as the vacuum or the VR headset in Fig. 1b stand out as they deviate from our training set. Fig. 7a demonstrates MCC's ability to learn general shape priors, instead of memorizing the training set.

ImageNet. We turn to ImageNet [8], which contains highly diverse Internet photos, ranging from bears and elephants in their natural habitat to Japanese mailboxes, drastically different than the staged CO3D objects. For depth, we use an off-the-shelf model from Ranftl et al. [49], which differs from CO3D's COLMAP output. Fig. 7b shows results on ImageNet images of diverse objects.

AI-generated Images. We test MCC on DALL·E 2, which generates images of imaginary objects. Fig. 7c shows MCC reconstructions, including the Internet-famous avocado chair and a cat-shaped marshmallow with a mustache!
[Figure 7 panels: for each example, the input image, the seen points, and the output reconstruction, grouped by source: (a) iPhone, (b) ImageNet, (c) DALL·E 2.]
Figure 7. Zero-Shot Generalization. We test MCC, trained on CO3D-v2 [51], on three challenging settings: (a) iPhone captures with LiDAR sensor of everyday objects, (b) Web images (from ImageNet) of in-the-wild objects with depth estimated by an off-the-shelf model [49], (c) AI-generated images (by DALL·E 2) of imaginary objects with depth estimated by [49]. These examples are challenging as they demonstrate variance in object types (e.g., novel, imaginary objects), image styles (e.g., digital arts, natural), depth systems (e.g., depth sensor, off-the-shelf predictors), and visual context (e.g., safari, street scene). See project page for animations.
4.5. Comparison to Image-Conditioned NeRF

A recent successful line of work for 3D reconstruction extends NeRF [38] to cross-scene generalization from one or few views by conditioning on image embeddings [21, 47, 51, 78]. We compare to two recent best-performing methods on CO3D from this family, NeRF-WCE [21] and NerFormer [51]. We evaluate for shape reconstruction using the official CO3D novel view depth metrics [51]: absolute (Abs) and mean-squared error (MSE) on the official CO3D challenge evaluation frames. This puts MCC at a disadvantage as it is not designed for synthesis via rendering. Since MCC uses RGB-D as input, we extend both methods, which originally use posed RGB views, to take depth as input or supervision. For depth supervision we follow Deng et al. [9], which shows strong results by supervising NeRF models with depth. To input depth, we fuse the XYZ input encoding, i.e., E^XYZ(P), to the input image features. Table 2 shows that the baselines benefit from depth, as expected; MCC outperforms them by a clear margin. Fig. 8 qualitatively compares MCC to the best baseline, NerFormer [51]. NerFormer captures texture but struggles with geometry under the challenging single-view novel-category setting, thus rendering relatively blurry novel views. Admittedly, these methods tend to work better with more (5-10) input views. MCC predicts more accurate shapes from just a single view.

[Figure 8 panels: seen points, NerFormer, MCC, and ground truth for two examples.]
Figure 8. Qualitative Comparison between MCC and NerFormer [51]. NerFormer captures texture but struggles with geometry; MCC reconstructs shapes more accurately.

5. Scene Reconstruction Experiments

MCC naturally handles single objects and scenes without modifications to its design. So, now we turn to scenes.

Task. We test 3D scene reconstruction from a single RGB-D image. Formally, we aim to reconstruct everything in front of the camera (z > 0 in the camera coordinate system) up to a certain range. Note that this includes areas outside the camera frustum, which increases the complexity of the task.
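At inference MCC evaluates a regular grid of queries and keeps the points whose occupancy score exceeds 0.1 (§3.3). For objects the grid covers [−3, 3]^3 at granularity 0.1, while for scenes queries are placed in camera coordinates in front of the camera (see Appendix A.4). A minimal sketch of this step, where predict_occupancy is a stand-in for the trained model:

```python
import numpy as np

def make_query_grid(lo=-3.0, hi=3.0, step=0.1):
    """Regular 3D grid of query points covering [lo, hi]^3 with the given step
    (the object setting; scene grids would instead live in camera coordinates
    with z > 0, up to 6 units from the camera origin per Appendix A.4)."""
    axis = np.arange(lo, hi + 1e-6, step, dtype=np.float32)
    xx, yy, zz = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([xx, yy, zz], axis=-1).reshape(-1, 3)

def reconstruct(predict_occupancy, threshold=0.1, **grid_kwargs):
    """Keep grid points whose predicted occupancy score exceeds the threshold."""
    queries = make_query_grid(**grid_kwargs)
    scores = predict_occupancy(queries)      # stand-in for the trained decoder
    return queries[scores > threshold]

# Dummy "model": occupancy 1 inside a unit sphere, 0 outside.
dummy = lambda q: (np.linalg.norm(q, axis=-1) < 1.0).astype(np.float32)
print(reconstruct(dummy).shape)              # roughly (4/3)*pi / 0.1^3 points
```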
[Figure 9 panels: input image, seen points, and output reconstruction for (a) held-out Hypersim scenes and (b) zero-shot generalization to the Taskonomy dataset.]
Figure 9. Scene Reconstructions. With a model trained on Hypersim, we show reconstructions on (a) held-out Hypersim scenes, and (b) novel scenes from Taskonomy. From a single RGB-D image, MCC reconstructs furniture, walls, floors, and ceilings, even outside the view frustum. Capturing fine scene details is hard, but more data can help as our analysis in §4.3 suggests. See project page for animations.

                        Acc   Cmp   F1
DRDF [31]               54.4  1.0   2.0
DRDF (our arch) [31]    54.2  1.4   2.7
MCC                     66.3  1.5   2.8

Table 3. Comparison to DRDF on Hypersim. MCC outperforms the state-of-the-art scene reconstruction approach, DRDF [31], extended to input depth, with both its original and our architecture.

Dataset. We experiment on the Hypersim dataset [52], which contains complex, diverse scenes, such as warehouses, lofts, restaurants, churches, etc., with over 77k images. We split the dataset into 365 scenes for training and 46 scenes for testing. We use images along with the associated depth as ground truth for training. Since 3D meshes are available, we use them for evaluation and report the metrics from §4.

5.1. Hypersim Scene Reconstruction

Qualitative Results. Fig. 9a shows qualitative results on Hypersim [52]. While MCC only sees the scene within the view frustum, it is able to complete furniture, walls, floors, and ceilings. For instance, in the left example, MCC predicts the space behind the kitchen, including the floors, which are almost entirely occluded in the input view. In the right example, MCC predicts the wall on the left, which is entirely outside of the view frustum. Scene reconstruction from a single view is hard; while MCC reconstructs the room geometry, it fails to capture fine details in both shape and texture. We expect more data to significantly improve performance, as suggested by our scaling analysis in §4.3.

Quantitative Evaluation. We compare to the recent state of the art on scene reconstruction, DRDF [31], which we extend to take RGB-D inputs like MCC. Table 3 shows that MCC outperforms DRDF across all metrics. We also extend DRDF to use MCC's architecture but keep its original loss and ray-based inference. This variant performs better than the original DRDF but still worse than MCC.

5.2. Zero-Shot Generalization to Taskonomy

Finally, we deploy MCC, trained on Hypersim, on novel scenes from Taskonomy [81]. While photorealistic, Hypersim is synthetic, while Taskonomy is real. So, we test both generalization to novel scenes and the "sim-to-real" transfer. Fig. 9b shows MCC's reconstructions, which demonstrate that our model is able to reconstruct the room layout (floors, walls, ceilings) in this challenging setting.

6. Failure Cases

While MCC has demonstrated promising results, we observe three error modes: (1) Sensitivity to depth input. MCC can recover from noisy depth inputs, but if the depth is largely incorrect, it will fail to reconstruct accurate 3D geometry. (2) Distribution shifts. For targets far from the training distribution, we see errors in texture and geometry (e.g., Rubik's cubes). (3) High-fidelity texture. Detailed texture predictions from a single view are difficult and MCC often omits details (e.g., the text on the volleyball in Fig. 4).

7. Conclusions

We present MCC, a general-purpose 3D reconstruction model that works for both objects and scenes. We show generalization to challenging settings, including in-the-wild captures and AI-generated images of imagined objects. Our results show that a simple point-based method coupled with category-agnostic large-scale training is effective. We hope this is a step towards building a general vision system for 3D understanding. Models and code are available online.

From an ethics standpoint, as with all data-driven methods, MCC can potentially inherit the bias (if any) in data. In this project, we solely train on inanimate objects and scenes to minimize the risk. We do not foresee immediate negative repercussions with the model, but caution against future use without paying careful attention to the training dataset.
A. Appendix

A.1. Animations and Interactive Visualizations

We provide 360°-view animations and interactive 3D visualizations for all qualitative results, in Figures 4, 7 and 9, and more on our project page. Our video animations are shown in the main window and interactive 3D visualizations are available by clicking on the 3D icon, per the instructions on the webpage.

A.2. Architecture Specifications

Table 4 describes in detail the MCC architecture for the E^RGB and E^XYZ encoders and the decoder.

The E^RGB and E^XYZ encoders follow the "ViT-Base" transformer architecture by Dosovitskiy et al. [11, 65]. The transformer architecture is composed of 12 layers of a 768-dimensional self-attention operator with 12 heads, referred to as multi-head attention (MHA), followed by a 3072-dimensional 2-layer MLP. The input image size is 224×224. The RGB image I, input to the E^RGB encoder, is embedded via a single convolutional layer, with a 16×16-sized kernel and a 16×16 stride, to produce N^enc = 196 tokens. The (seen) points P, input to the E^XYZ encoder, are first linearly projected to a 768-dimensional representation and then embedded via a single transformer layer which operates on 16×16 non-overlapping patches, as explained in Section 3.4 of the main paper and further described in Table 4, resulting also in N^enc = 196 tokens. The single transformer layer used for the patch embeddings defines a [cls] token whose output is the embedding for each patch, as is commonly used in [10, 11] and referred to as a readout token.

Our decoder follows the decoder design from MAE [20]. It is composed of 8 layers of a 512-dimensional self-attention operator with 16 heads followed by a 2048-dimensional 2-layer MLP. The input to the decoder is: (a) the N^q = 550 3D point queries, which are linearly projected to a 768-dimensional vector, and (b) the input R, which concatenates the N^enc output tokens from E^RGB and E^XYZ in the channel dimension and then linearly projects each to a 768-dimensional vector. This results in a 768 × (N^q + N^enc) = 768 × 746 input to the decoder. Our decoder additionally defines a global [cls] token whose role is to "summarize" all inputs in the transformer and can be attended freely by other tokens.

LayerNorm [1] is used in all self-attention and MLP layers following standard practice [11, 20, 65].

(a) Encoder E^RGB
  Stage               Operators                               Output sizes
  Input I             -                                       3×224×224
  Patch embed         Conv 16×16, 768 (stride 16×16)          768×196
  Transformer layers  [MHA(768); MLP(3072)] ×12               768×196

(b) Encoder E^XYZ
  Stage               Operators                               Output sizes
  Input P             -                                       3×224×224
  Embed               Linear, 768                             768×224×224
  Patch embed         [MHA(768); MLP(1536); [cls] readout] ×1
                      (on each 16×16 patch)                   768×196
  Transformer layers  [MHA(768); MLP(3072)] ×12               768×196

(c) Fusion Module f
  Stage               Operators                               Output sizes
  Input encodings     -                                       768×196, 768×196
  Concat              Concat                                  1536×196
  Linear              Linear, 768                             768×196

(d) Decoder Dec
  Stage               Operators                               Output sizes
  Input queries       -                                       3×550
  Embed               Linear, 768                             768×550
  Concat with R       Concat                                  768×746
  Transformer layers  [MHA(512); MLP(2048)] ×8                768×746

Table 4. Architecture specification for each part of the MCC model. MHA(c): Multi-Head Attention [65] with c channels. MLP(c′): MultiLayer Perceptron with c′ channels. [cls] readout: feature readout with the [cls] token [10, 11]. Here, we use the default choice of N^q = 550 queries. We omit the optional [cls] token in the outputs of the transformers for clarity.

A.3. Held-Out CO3D Categories

In our experiments, we hold out 10 randomly selected categories which we use as our test set. The 10 randomly selected held-out categories are: {apple, ball, baseballglove, book, bowl, carrot, cup, handbag, suitcase, toyplane}. They have 8,453 example videos in total. Please see the original CO3D paper for more information about CO3D [51].

A.4. Additional Implementation Details for Scene Reconstruction Experiments

Similar to the object reconstruction experiments, we train MCC on Hypersim [52] with Adam [29] for 100k iterations with an effective batch size of 512 using 32 GPUs, a base learning rate of 5×10^−5 with a cosine schedule and a linear warm-up for the first 10% of iterations. Training takes ∼1.6 days. We normalize each scene to have zero mean and unit variance. At inference time, we predict points up to 6.0 units (i.e., 6× standard deviation) away from the camera origin. Since we aim at predicting the scene in front of the camera, we use the camera view coordinate system (the X-axis points top/down, the Y-axis points left/right, and the Z-axis points out from the image plane). We randomly scale-augment images by s ∈ [0.8, 1.2], as in the object reconstruction model, but do not perform rotation augmentation. Other implementation details follow the CO3D experiments.
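Section 3.4 and Table 4b describe the self-attention-based patch embedding of E^XYZ: a learned embedding for pixels with unknown depth, a linear projection for valid points, and a per-patch transformer with a [cls]-style readout. The following is a simplified PyTorch sketch of that stage; shapes follow Table 4, but the module itself is an assumption for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class XYZPatchEmbed(nn.Module):
    """Sketch of the E^XYZ patch embedding (Sec. 3.4, Table 4b).

    Each valid 3D point is linearly mapped to C dims; pixels with unknown depth
    get a learned C-dim embedding. A small transformer, shared across 16x16
    patches, summarizes each patch through a [cls]-style readout token.
    """

    def __init__(self, C=768, patch=16):
        super().__init__()
        self.patch = patch
        self.point_proj = nn.Linear(3, C)
        self.unknown = nn.Parameter(torch.zeros(C))    # embedding for missing depth
        self.cls = nn.Parameter(torch.zeros(1, 1, C))  # per-patch readout token
        layer = nn.TransformerEncoderLayer(C, nhead=8, dim_feedforward=1536,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, P):                               # P: (B, H, W, 3), NaN = unknown
        B, H, W, _ = P.shape
        valid = torch.isfinite(P).all(dim=-1, keepdim=True)
        x = self.point_proj(torch.nan_to_num(P))        # (B, H, W, C)
        x = torch.where(valid, x, self.unknown.expand_as(x))
        # Group pixels into non-overlapping 16x16 patches: (B*num_patches, 256, C).
        p = self.patch
        x = x.view(B, H // p, p, W // p, p, -1).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B * (H // p) * (W // p), p * p, -1)
        x = torch.cat([self.cls.expand(x.shape[0], -1, -1), x], dim=1)
        x = self.blocks(x)[:, 0]                        # readout token per patch
        return x.view(B, (H // p) * (W // p), -1)       # (B, H/16 * W/16, C)

# Example: a dummy 224x224 point map with some unknown-depth pixels.
P = torch.randn(1, 224, 224, 3)
P[:, :50, :50] = float("nan")
print(XYZPatchEmbed()(P).shape)                         # torch.Size([1, 196, 768])
```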
References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. In ICLR, 2022.
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020.
[4] Anh-Quan Cao and Raoul de Charette. MonoScene: Monocular 3D semantic scene completion. In CVPR, 2022.
[5] Jose A Castellanos, José MM Montiel, José Neira, and Juan D Tardós. The SPmap: A probabilistic framework for simultaneous localization and map building. IEEE Transactions on Robotics and Automation, 1999.
[6] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[7] Wenzheng Chen, Huan Ling, Jun Gao, Edward Smith, Jaakko Lehtinen, Alec Jacobson, and Sanja Fidler. Learning to predict 3D objects with an interpolation-based differentiable renderer. In NeurIPS, 2019.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised NeRF: Fewer views and faster training for free. In CVPR, 2022.
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[12] Carlos Hernández Esteban and Francis Schmitt. Silhouette and stereo fusion for 3D object modeling. Computer Vision and Image Understanding, 2004.
[13] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3D object reconstruction from a single image. In CVPR, 2017.
[14] Michael Firman, Oisin Mac Aodha, Simon Julier, and Gabriel J Brostow. Structured prediction of unobserved voxels from a single depth image. In CVPR, 2016.
[15] Andreas Geiger and Chaohui Wang. Joint 3D object and layout inference from a single RGB-D image. In German Conference on Pattern Recognition, 2015.
[16] Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In ECCV, 2016.
[17] Georgia Gkioxari, Jitendra Malik, and Justin Johnson. Mesh R-CNN. In ICCV, 2019.
[18] Shubham Goel, Angjoo Kanazawa, and Jitendra Malik. Shape and viewpoints without keypoints. In ECCV, 2020.
[19] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.
[20] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
[21] Philipp Henzler, Jeremy Reizenstein, Patrick Labatut, Roman Shapovalov, Tobias Ritschel, Andrea Vedaldi, and David Novotny. Unsupervised learning of 3D object categories from videos in the wild. In CVPR, 2021.
[22] Zitian Huang, Yikuan Yu, Jiawen Xu, Feng Ni, and Xinyi Le. PF-Net: Point fractal network for 3D point cloud completion. In CVPR, 2020.
[23] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs & outputs. In ICLR, 2022.
[24] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In ICML, 2021.
[25] James T Kajiya and Brian P Von Herzen. Ray tracing volume densities. SIGGRAPH, 1984.
[26] Angjoo Kanazawa, Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018.
[27] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3D mesh renderer. In CVPR, 2018.
[28] Young Min Kim, Niloy J Mitra, Dong-Ming Yan, and Leonidas Guibas. Acquiring 3D indoor environments with variability and repetition. TOG, 2012.
[29] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[30] Nilesh Kulkarni, Abhinav Gupta, David Fouhey, and Shubham Tulsiani. Articulation-aware canonical surface mapping. In CVPR, 2020.
[31] Nilesh Kulkarni, Justin Johnson, and David F. Fouhey. What's behind the couch? Directed ray distance functions for 3D scene reconstruction. In ECCV, 2022.
[32] Abhijit Kundu, Yin Li, and James M. Rehg. 3D-RCNN: Instance-level 3D object reconstruction via render-and-compare. In CVPR, 2018.
[33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[34] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. NeurIPS, 2020.
[35] Shichen Liu, Weikai Chen, Tianye Li, and Hao Li. Soft rasterizer: Differentiable rendering for unsupervised single-view mesh reconstruction. In ICCV, 2019.
[36] Donald Meagher. Geometric modeling using octree encoding. Computer Graphics and Image Processing, 1982.
[37] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In CVPR, 2019.
[38] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[39] Aron Monszpart, Nicolas Mellado, Gabriel J Brostow, and Niloy J Mitra. RAPter: Rebuilding man-made scenes with regular arrangements of planes. TOG, 2015.
[40] Liangliang Nan, Ke Xie, and Andrei Sharf. A search-classify approach for cluttered indoor scene understanding. TOG, 2012.
[41] Ramakant Nevatia and Thomas O Binford. Description and recognition of curved objects. Artificial Intelligence, 1977.
[42] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In CVPR, 2020.
[43] David Novotny, Nikhila Ravi, Benjamin Graham, Natalia Neverova, and Andrea Vedaldi. C3DPO: Canonical 3D pose networks for non-rigid structure from motion. In ICCV, 2019.
[44] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In CVPR, 2019.
[45] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[46] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[47] Amit Raj, Michael Zollhofer, Tomas Simon, Jason Saragih, Shunsuke Saito, James Hays, and Stephen Lombardi. Pixel-aligned volumetric avatars. In CVPR, 2021.
[48] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[49] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021.
[50] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3D deep learning with PyTorch3D. arXiv preprint arXiv:2007.08501, 2020.
[51] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In ICCV, 2021.
[52] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, 2021.
[53] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, 2019.
[54] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 2002.
[55] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
[56] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016.
[57] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. DeepVoxels: Learning persistent 3D feature embeddings. In CVPR, 2019.
[58] Randall Smith, Matthew Self, and Peter Cheeseman. Estimating uncertain spatial relationships in robotics. In Autonomous Robot Vehicles, 1990.
[59] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
[60] Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B Tenenbaum, and William T Freeman. Pix3D: Dataset and methods for single-image 3D shape modeling. In CVPR, 2018.
[61] Richard Szeliski. Computer vision: Algorithms and applications. Springer Nature, 2022.
[62] Carlo Tomasi and Takeo Kanade. Shape and motion from image streams under orthography: A factorization method. IJCV, 1992.
[63] Lorenzo Torresani, Aaron Hertzmann, and Chris Bregler. Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors. PAMI, 2008.
[64] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. NeurIPS, 2016.
[65] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[66] Bram Wallace and Bharath Hariharan. Few-shot generalization for single-image 3D reconstruction via priors. In ICCV, 2019.
[67] Dan Wang, Xinrui Cui, Xun Chen, Zhengxia Zou, Tianyang Shi, Septimiu Salcudean, Z Jane Wang, and Rabab Ward. Multi-view 3D reconstruction with transformers. In ICCV, 2021.
[68] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In ECCV, 2018.
[69] Chao Wen, Yinda Zhang, Zhuwen Li, and Yanwei Fu. Pixel2Mesh++: Multi-view 3D mesh generation via deformation. In ICCV, 2019.
[70] Charles Wheatstone. Contributions to the physiology of vision—Part the first: On some remarkable, and hitherto unobserved, phenomena of binocular vision. Philosophical Transactions of the Royal Society of London, 1838.
[71] Markus Worchel, Rodrigo Diaz, Weiwen Hu, Oliver Schreer, Ingo Feldmann, and Peter Eisert. Multi-view mesh reconstruction with neural deferred shading. In CVPR, 2022.
[72] Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, Bill Freeman, and Josh Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. NeurIPS, 2017.
[73] Shangzhe Wu, Christian Rupprecht, and Andrea Vedaldi. Unsupervised learning of probably symmetric deformable 3D objects from images in the wild. In CVPR, 2020.
[74] Haozhe Xie, Hongxun Yao, Xiaoshuai Sun, Shangchen Zhou, and Shengping Zhang. Pix2Vox: Context-aware 3D reconstruction from single and multi-view images. In ICCV, 2019.
[75] Qiangeng Xu, Weiyue Wang, Duygu Ceylan, Radomir Mech, and Ulrich Neumann. DISN: Deep implicit surface network for high-quality single-view 3D reconstruction. NeurIPS, 2019.
[76] Xingguang Yan, Liqiang Lin, Niloy J Mitra, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. ShapeFormer: Transformer-based shape completion via sparse representation. In CVPR, 2022.
[77] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. NeurIPS, 2016.
[78] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. PixelNeRF: Neural radiance fields from one or few images. In CVPR, 2021.
[79] Xumin Yu, Yongming Rao, Ziyi Wang, Zuyan Liu, Jiwen Lu, and Jie Zhou. PoinTr: Diverse point cloud completion with geometry-aware transformers. In ICCV, 2021.
[80] Wentao Yuan, Tejas Khot, David Held, Christoph Mertz, and Martial Hebert. PCN: Point completion network. In 3DV, 2018.
[81] Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In CVPR, 2018.
[82] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
[83] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.
[84] Linqi Zhou, Yilun Du, and Jiajun Wu. 3D shape generation and completion through point-voxel diffusion. In ICCV, 2021.
