

Progressive Learning of 3D Reconstruction Network from 2D GAN Data

Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro

arXiv:2305.11102v1 [cs.CV] 18 May 2023

Abstract—This paper presents a method to reconstruct high-quality textured 3D models from single images. Current methods rely on datasets with expensive annotations: multi-view images and their camera parameters. Our method instead relies on GAN-generated multi-view image datasets, which have a negligible annotation cost. However, they are not strictly multi-view consistent, and GANs sometimes output distorted images, which degrades reconstruction quality. In this work, to overcome these limitations of generated datasets, we make two main contributions that lead us to achieve state-of-the-art results on challenging objects: 1) a robust multi-stage learning scheme that gradually relies more on the model's own predictions when calculating losses, and 2) a novel adversarial learning pipeline with online pseudo-ground-truth generation to achieve fine details. Our work provides a bridge from 2D supervision of GAN models to 3D reconstruction models and removes the expensive annotation effort. We show significant improvements over previous methods, whether they were trained on GAN-generated multi-view images or on real images with expensive annotations. Please visit our web-page for 3D visuals: https://research.nvidia.com/labs/adlr/progressive-3d-learning.

Index Terms—3D Texture Learning, 3D Reconstruction, Single-image Inference, Generative Adversarial Networks.

1 INTRODUCTION

GAN based models achieve realistic image synthesis on various objects [19], [20], [55] and find applications in image editing, conditional image generation [9], [27], [39], and video generation [49] tasks. They are also found to be useful for dataset generation with automatic part segmentation annotations [44], [59]. There is further interest to deploy this technology for gaming, robotics, architectural design, and AR/VR applications. However, such applications also require controllability of the viewpoint, which in turn requires generation in 3D representations. On the other hand, the realism of 3D image generation and reconstruction results is not on par with GAN-generated 2D images [2], [3], [10], [36], [53]. In this work, we are interested in closing this gap.

To reconstruct high-quality 3D models, current state-of-the-art (SOTA) methods rely on 3D annotations. Such data is expensive to collect, requires special hardware, and is usually collected in constrained lab environments. Due to the difficulty of collecting such annotations, efforts have been limited to a few objects such as faces [11] and human bodies [23], [56]. A cheaper alternative is finely curated multi-view datasets. They can be collected with a camera without expensive hardware requirements. However, they are still difficult to annotate for their camera parameters. For that reason, synthetic images are used instead of real images to train 3D reconstruction models [5], [6], [7], [53]. While these models learn to reconstruct synthetic objects, they fall short in their ability to recover the 3D properties of real images due to the domain gap between synthetic and real images.

Single-view image collections are also explored to learn 3D reconstruction models [2], [17]. However, with single-view images during training, models receive limited supervision: only for the visible parts. Various constraints and regularizers are proposed to obtain plausible results, such as losses which limit the deformation from a mean template [6], [17], rotation and swap consistency losses [2], [36], and semantic consistency constraints [24]. Still, the results are not realistic. Another way to use single-view image collections to learn 3D reconstruction models is to train a GAN model on them and generate multi-view datasets with the trained GAN models [26], [58]. This approach becomes possible because recent generative models of images, especially StyleGANs, are shown to learn an implicit 3D representation with latent codes that can be manipulated to change the viewpoint of a scene [19]. The latent codes of the StyleGAN are controlled to generate consistent objects from different viewpoints. A few selected viewpoints are labeled for camera parameters, which only takes a minute to annotate, and an unlimited number of samples can be generated [58] for those viewpoints. However, one issue with these datasets is that they are not perfect, especially in preserving realistic details across views. This is because StyleGAN does not have strict disentanglement of shape, texture, and camera parameters. Therefore, one cannot change the camera parameters while strictly preserving the identity. Another issue is distorted image generations, which sometimes appear as missing parts in objects (cf. Fig. 2).

In this work, our goal is to learn accurate 3D reconstruction models from GAN-generated multi-view images. As our first contribution, we propose a framework that is robust to the noise in the training data. We achieve this with a multi-stage learning scheme that gradually relies more on the model's own predictions when calculating losses. Secondly, we propose a novel adversarial learning pipeline with online pseudo-ground-truth generation to train a discriminator. With the discriminator, our model learns to output fine details. Our model shows significant improvements over previous methods, whether they were trained on GAN-generated multi-view images or on real images with expensive data collection and annotation pipelines.

• A. Dundar, J. Gao, A. Tao, B. Catanzaro are with NVIDIA, CA, USA.
• A. Dundar is with the Department of Computer Science, Bilkent University, Ankara, Turkey.
• J. Gao is with the Department of Computer Science, University of Toronto, Canada.

Fig. 1: Given a single 2D image input, our method outputs high quality textured 3D models. We achieve these results
by learning from StyleGAN generated datasets via a robust multi-stage training scheme and a novel adversarial learning
pipeline.

In summary, our main contributions are:
• A robust multi-stage learning scheme that relies more on the model's predictions at each step. Our model is not affected by missing parts in the images or inconsistencies across views.
• A novel adversarial learning pipeline to increase the realism of textured 3D predictions. We generate pseudo-ground truth during training and employ a multi-view conditional discriminator for learning to generate fine details.
• High-fidelity textured 3D model synthesis, both qualitatively and quantitatively, on three challenging objects. Examples are shown in Fig. 1.

2 RELATED WORK

Style-based GAN models [19], [20] achieve high-quality synthesis of various objects that is quite indistinguishable from real images, and they are shown to learn implicit 3D knowledge of objects without supervision. One can control the viewpoint of the synthesized object by its latent codes. This makes pretrained GANs a promising technology for controllable generation [1], [41], [48]. However, in these models, the disentanglement of 3D shape and appearance is not strict, and therefore the appearance of objects changes as the viewpoint is manipulated. Recently, 3D-aware generative models have been proposed with impressive results, but they either do not guarantee strict 3D consistency [4], [13], [37], [38] or are computationally expensive [3], and overall they are not on par with 2D StyleGAN results [10]. We are interested in single-view image inference, so our work is more related to image inversion methods that project images into these 3D-aware GANs' latent space [22], [29], [52]. Even though significant progress has been achieved for image inversion [22], [29], [52], these methods require run-time optimization and suffer from lower-quality novel view predictions.

There have been many works that learn textured 3D mesh models from images with differentiable renderers [6], [21], [28], [30], [43]. Deep neural networks are coupled with the renderers and trained to predict 3D mesh representations and texture maps of input images via reconstruction losses [6], [12], [15], [17], [24]. However, inferring these 3D attributes from single 2D images is an inherently ill-posed problem, given that the invisible mesh and texture predictions receive no gradients during training [6], [12], [15], [17], [24]. These algorithms that learn from single-view images output results that look unrealistic, especially when viewed from a different viewpoint.

Multi-view image datasets provide a solution to the limited supervision problem of single-view image datasets. However, due to the expensive annotation of multi-view image datasets for their 2D keypoints or camera pose, they are small in scale. There have been methods that use sequences of images to optimize a mesh and texture model [8]; however, they learn a new network for each sequence. Recently, these sequence-of-image datasets have also been extensively explored with a method called Neural Radiance Fields (NeRF) to learn implicit geometry [31], [34], [50]. These models overfit to a sequence and cannot be used for single-image inference. PixelNeRF [53] is an extension of these models that achieves single-image inference; however, as we show in our experiments, the results are not good.

Another promising direction with NeRF-based models is the optimization of 3D representations with well-trained diffusion models. These models can stylize meshes or generate 3D geometry representations from scratch for given text prompts [25], [32], [33], [42]. However, these models require run-time optimization, and control over the generation is limited. In our work, we are interested in a different application, single-view image reconstruction, where the generation is conditioned on an input image.

In our work, we are interested in mesh representations due to their efficiency in rendering. To infer mesh representations, multi-view datasets are also shown to be beneficial based on experiments with synthetic datasets [5], [45], [46]. However, those results do not translate well to real-image inference because of the domain gap between synthetic and real images. To generate a realistic multi-view dataset with cheap labor cost, Zhang et al. [58] use a generative adversarial network by controlling the latent codes and generate coarsely consistent objects from different viewpoints. We also use these datasets but achieve significantly better results than the state-of-the-art and Zhang et al. [58], thanks to the robust learning scheme (which also allows us to remove regularizers that limit the deformations) and the adversarial learning pipeline.

Fig. 2: Overview of the dataset generation (a) and the multi-stage training scheme of the reconstruction network (b). The generator network takes an input image and outputs mesh and texture predictions. In the first stage, the output is rendered from a view other than the input image, and losses are calculated on this novel view. This way, the model is not affected by the missing parts in the images, nor by the unrealistic segmentation maps that result from these images. In the second stage, an additional reconstruction loss is added from the same view. The rendered and ground-truth images are masked based on the silhouette predictions of the model. Lastly, to achieve sharp and realistic predictions, we add adversarial training in the third stage. The GAN training pipeline is given in Fig. 3.

3 METHOD

In Section 3.1, we describe the motivation of our approach. The multi-stage training scheme and the adversarial learning pipeline are presented in Sections 3.2 and 3.3, respectively.

3.1 Motivation

Differentiable rendering enables training neural networks to perform 3D inference, such as predicting 3D mesh geometry and textures from images [6]. However, it requires multi-view images, camera parameters, and object silhouettes to achieve high-performance models. Such data is expensive to obtain. StyleGAN-generated datasets remove the expensive labeling effort via the latent codes that control the camera viewpoints. When a few viewpoints are selected and annotated, multi-view images can be generated in infinite numbers for those viewpoints. The annotation requires 1 minute [58] no matter how many images are generated, because they are all aligned across different examples. As for the segmentation masks the renderers utilize during training, they can be obtained with off-the-shelf instance segmentation models [14]. However, learning a high-performing model from these datasets remains a challenge, since the generated images lack precise multi-view consistency, as shown in Fig. 2 by red rectangles. Additionally, some examples have missing parts, as shown in Fig. 2 by blue rectangles: the head of the horse is not generated in good quality, which also transfers to the instance segmentation mask (fourth row). In this work, we address these challenges by proposing a robust multi-stage training pipeline and an adversarial learning set-up.

3.2 Multi-stage Training Pipeline

We train the generator that outputs mesh and texture predictions with a multi-stage training pipeline, to be robust to the errors in annotations and to multi-view inconsistencies. At each stage of our pipeline, results improve progressively.

First Stage. In the first stage, our model outputs 3D mesh and texture predictions from an input image that is from camera view-1, $I^{g}_{v1}$. We render the image from these 3D predictions with the target view, view-2, and output the image $I^{r}_{v2}$. We calculate the losses on these novel-view predictions. The target view is randomly selected among the sequence. The motivation of the first stage is to capture reliable 3D mesh predictions and reasonable texture estimations. At this stage, we do not expect a high-quality texture estimation, given the inconsistencies across views. Our experiments show that when the network is guided with losses from the same view as the input ($I^{r}_{v1}$ vs. $I^{g}_{v1}$), the errors in missing parts, and hence the errors in the segmentation mask annotations, propagate to the mesh predictions. This destabilizes the training even when the model is trained with multi-view consistency, e.g. objectives calculated from both $I^{r}_{v1}$ and $I^{r}_{v2}$. Therefore, in the first stage, we learn to reconstruct an object from an image of the object taken from a different view. This way, the model does not overfit to the errors of the given view, since it receives feedback from a novel view. Note that the novel view may, and does, also have errors, but since a novel view is randomly sampled from a sequence and errors are not consistent among the views, the model outputs the most plausible 3D model to minimize losses, in a sense similar to majority voting.

The training losses at this stage are calculated as follows. We use a perceptual image reconstruction loss between the ground-truth image of the target view ($I^{g}_{v2}$) and the rendered image ($I^{r}_{v2}$). We mask the images with the ground-truth silhouette (mask) predictions, $S^{g}_{v2}$, so that the reconstruction loss is only calculated on the object.
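Before detailing the individual loss terms (Eqs. 1-3 below), the following minimal sketch illustrates how one Stage-I iteration could look in PyTorch-style code. All names (generator, render, perceptual_loss, iou_loss, laplacian_loss, sequence) are hypothetical placeholders assumed for illustration; the paper uses DIB-R as the renderer, but these exact interfaces are not part of the text.

```python
# Illustrative sketch (not the authors' code) of one Stage-I iteration, assuming a
# generator that maps an image to (mesh, texture) and a DIB-R-style differentiable
# renderer exposed as `render(mesh, texture, camera) -> (image, silhouette)`.
import random
import torch

def stage_one_step(generator, render, perceptual_loss, iou_loss, laplacian_loss,
                   sequence, optimizer, lambdas=(1.0, 1.0, 0.5)):
    """One training step that supervises only a randomly chosen novel view."""
    lam_pn, lam_s, lam_lap = lambdas

    # Pick an input view and a different, randomly selected target view
    # from the same generated multi-view sequence.
    v1, v2 = random.sample(range(len(sequence)), 2)
    img_v1, _, _ = sequence[v1]                      # I^g_v1 (input view)
    img_v2, sil_v2, cam_v2 = sequence[v2]            # I^g_v2, S^g_v2, camera of view 2

    mesh, texture = generator(img_v1)                # 3D predictions from view 1
    rend_v2, rend_sil_v2 = render(mesh, texture, cam_v2)   # I^r_v2, S^r_v2

    # Losses are computed on the novel view only; images are masked with the
    # ground-truth silhouette so that only the object region is penalized.
    loss = (lam_pn * perceptual_loss(rend_v2 * sil_v2, img_v2 * sil_v2)
            + lam_s * iou_loss(rend_sil_v2, sil_v2)
            + lam_lap * laplacian_loss(mesh))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```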
Fig. 3: GAN training pipeline. The discriminator is trained with fake and real pairs and a conditioning pair. First, texture and mesh predictions are output by the generator. With the estimated mesh and the given camera parameters of a second-view image, texture and visibility maps are projected from the second-view image. For the fake pair, the estimated texture is partially erased by the visibility map. Additionally, the conditioning input pair is obtained by projecting from the first-view image with the estimated mesh predictions and the given camera parameters of the first view.

As the reconstruction loss, we use perceptual losses from AlexNet ($\Phi$) at different feature layers ($j$) between these images, with the loss objective given in Eq. 1:

$L_{p-nv} = \|\Phi_j(I^{g}_{v2} * S^{g}_{v2}) - \Phi_j(I^{r}_{v2} * S^{g}_{v2})\|_2$  (1)

For shapes, we use an IoU loss between the rendered silhouette ($S^{r}_{v2}$) and the silhouette ($S^{g}_{v2}$) of the input object:

$L_{sil} = 1 - \frac{\|S^{g}_{v2} S^{r}_{v2}\|_1}{\|S^{g}_{v2} + S^{r}_{v2} - S^{g}_{v2} S^{r}_{v2}\|}$  (2)

Similar to [6], [28], we also regularize the predicted mesh using a Laplacian loss ($L_{lap}$) constraining neighboring mesh triangles to have similar normals. The following are our base losses:

$L_{first} = \lambda_{pn} L_{p-nv} + \lambda_{s} L_{sil} + \lambda_{lap} L_{lap}$  (3)

The model outputs reliable 3D mesh predictions since it gets feedback from different views. Note that we do not use many regularizers, such as a mean template and penalties on deformation vertices, as previous works do [2], [6], [58], and we still achieve stable training with the first-stage objectives.
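A plausible PyTorch rendering of Eqs. 1-3 is sketched below; it could serve as the loss callables assumed in the earlier training-step sketch. The choice of AlexNet feature taps, the uniform-Laplacian formulation, and the tensor layouts are assumptions made for illustration rather than details taken from the paper.

```python
# Illustrative implementations of Eqs. 1-3 (not the authors' code).
# Assumptions: images are (B,3,H,W) in [0,1], silhouettes are (B,1,H,W) in [0,1];
# the Laplacian term uses uniform weights over a precomputed (V,K) neighbor index.
import torch
import torch.nn.functional as F
import torchvision

# AlexNet feature extractor Phi; tapping the ReLU layers is an assumption
# (the paper only states "different feature layers"); ImageNet input
# normalization is omitted for brevity.
_alexnet = torchvision.models.alexnet(weights="DEFAULT").features.eval()
_alexnet.requires_grad_(False)
_TAPS = {1, 4, 7, 9, 11}

def perceptual_loss(x, y):
    """Eq. 1 / Eq. 4 style perceptual distance between two masked images."""
    loss, fx, fy = 0.0, x, y
    for i, layer in enumerate(_alexnet):
        fx, fy = layer(fx), layer(fy)
        if i in _TAPS:
            loss = loss + F.mse_loss(fx, fy)
    return loss

def iou_loss(pred_sil, gt_sil, eps=1e-6):
    """Eq. 2: 1 - soft intersection-over-union of silhouettes."""
    inter = (pred_sil * gt_sil).sum(dim=(1, 2, 3))
    union = (pred_sil + gt_sil - pred_sil * gt_sil).sum(dim=(1, 2, 3))
    return (1.0 - inter / (union + eps)).mean()

def laplacian_loss(verts, neighbor_idx):
    """Uniform Laplacian smoothness: each vertex stays close to the centroid of
    its neighbors (a stand-in for the normal-consistency regularizer)."""
    neighbor_centroid = verts[:, neighbor_idx].mean(dim=2)   # (B, V, 3)
    return ((verts - neighbor_centroid) ** 2).sum(dim=-1).mean()

def first_stage_loss(rend_img, gt_img, rend_sil, gt_sil, verts, neighbor_idx,
                     lam_pn=1.0, lam_s=1.0, lam_lap=0.5):
    """Eq. 3: weighted sum of novel-view perceptual, silhouette IoU, and Laplacian terms."""
    l_pnv = perceptual_loss(rend_img * gt_sil, gt_img * gt_sil)
    return (lam_pn * l_pnv + lam_s * iou_loss(rend_sil, gt_sil)
            + lam_lap * laplacian_loss(verts, neighbor_idx))
```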
Second Stage. In the second stage, we rely on our 3D mesh predictions and introduce additional losses between $I^{r}_{v1}$ and $I^{g}_{v1}$, i.e. the same view as the input image. In 3D inference, we expect the model to output predictions that faithfully match the object for the input view. Therefore, in the second stage, we add additional reconstruction losses from the input view. However, we do not add a silhouette loss for the input view because there is noise in the segmentation masks. Furthermore, for the reconstruction loss, we do not mask the input image and the rendered output with the ground-truth mask, since it is noisy. We rely on the 3D prediction of our model and mask the reconstruction loss based on the projected mesh prediction.

The rendered and ground-truth images are masked based on the silhouette predictions of the model, and a perceptual loss is calculated as given in Eq. 4. This way, we rely on the first-stage training for the mesh prediction and learn improved textures with the same-view training for the visible parts. Mesh predictions can still improve via the reconstruction losses, since they still receive feedback via image reconstruction, but we do not guide them directly with the silhouettes.

$L_{p-sv} = \|\Phi_j(I^{g}_{v1} * S^{r}_{v1}) - \Phi_j(I^{r}_{v1} * S^{r}_{v1})\|$  (4)

We use the following loss in the additional iterations:

$L_{second} = L_{first} + \lambda_{ps} L_{p-sv}$  (5)

The second-stage training starts after the first-stage training converges. That is because we rely on the model's mesh predictions in the newly introduced losses. Results improve significantly, but they are not as sharp as the training data.
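The second-stage objective thus only adds a same-view perceptual term on top of the first-stage loss, masked with the model's own rendered silhouette. A minimal sketch under the same assumptions as the earlier ones:

```python
# Sketch of the Stage-II objective (Eq. 5), assuming the same hypothetical
# helpers as in the earlier sketches (perceptual_loss, first-stage loss value).
# The same-view term masks both images with the *predicted* silhouette S^r_v1,
# not the noisy ground-truth mask, and no silhouette loss is added for view 1.
def second_stage_loss(first_stage_loss_value, perceptual_loss,
                      rend_v1, img_v1, rend_sil_v1, lam_ps=1.0):
    l_p_sv = perceptual_loss(img_v1 * rend_sil_v1, rend_v1 * rend_sil_v1)  # Eq. 4
    return first_stage_loss_value + lam_ps * l_p_sv                        # Eq. 5
```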
Third Stage. After learning a reliable mesh representation and high-quality texture predictions, we use a generative learning pipeline to improve the realism of our predictions. In this stage, we rely on our predictions to generate pseudo ground-truths to enable adversarial learning, which is explained in the next section.

3.3 Adversarial Learning

Training the model with an adversarial loss applied to the rendered images does not improve the results, due to the shortcomings of renderers [40]. Therefore, we convert texture learning into a 2D image synthesis task. Texture learning in UV space has previously been explored with successful results [8], [11], [40]. However, our set-up is different: we learn a single-view image inference model, the texture projection and pseudo-ground-truth generations are online during our training, we do not train a GAN from scratch for texture generation but rather tune our 3D reconstruction network, and we propose a multi-view training scheme for our discriminator.

While previous methods use different networks for texture projection and texture generation, we achieve both with the same architecture. This enables us to improve the generator further and achieve state-of-the-art results.

As shown in Fig. 3, during our training, we obtain projected texture maps for the input view (v1) and a different view (v2). We obtain those by first predicting 3D meshes from the input view (v1) via our generator. The input images are projected onto the UV map of the predicted mesh template based on the camera parameters of each image via inverse rendering. In this process, mesh predictions are transformed onto the 2D screen by projection with the camera parameters. Then the transformed mesh coordinates and the UV map coordinates are used in the reverse direction, and real images are projected onto the UV map with the renderer. Visibility masks are also obtained in this set-up. With this set-up, we obtain a real partial texture (from v2) and a conditioning texture map (from v1) to train our discriminator. In this set-up, it is important for the mesh predictions to be accurate for a correct inverse rendering. That is why we leave the GAN training to the third stage.
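As an illustration of this online projection step, the sketch below scatters screen-space pixels of a real view into UV space and records a visibility mask. It assumes a rasterization pass has already produced per-pixel UV coordinates and a foreground mask for the predicted mesh; these inputs and all names are assumptions made for the sketch, not the paper's interface.

```python
# Illustrative sketch of projecting a real image into a partial UV texture plus
# a visibility mask (the "inverse rendering" described above).
import torch

def project_image_to_uv(image, uv_image, fg_mask, tex_size=512):
    """Scatter screen-space colors into UV space and record visibility.

    image:    (3, H, W) input view in [0, 1]
    uv_image: (2, H, W) per-pixel UV coords of the visible surface, in [0, 1]
    fg_mask:  (1, H, W) 1 where a mesh face covers the pixel
    Returns (partial_texture (3, T, T), visibility (1, T, T)).
    """
    device = image.device
    partial_tex = torch.zeros(3, tex_size, tex_size, device=device)
    weight = torch.zeros(1, tex_size, tex_size, device=device)

    fg = fg_mask[0] > 0.5
    colors = image[:, fg]                                  # (3, N) covered pixels
    uv = uv_image[:, fg]                                   # (2, N) their UV coords
    u = (uv[0] * (tex_size - 1)).round().long().clamp(0, tex_size - 1)
    v = (uv[1] * (tex_size - 1)).round().long().clamp(0, tex_size - 1)
    flat = v * tex_size + u                                # texel index per pixel

    # Average all screen pixels that land on the same texel.
    partial_tex.view(3, -1).index_add_(1, flat, colors)
    weight.view(1, -1).index_add_(
        1, flat, torch.ones(1, flat.numel(), dtype=image.dtype, device=device))
    visibility = (weight > 0).to(image.dtype)
    partial_tex = partial_tex / weight.clamp(min=1.0)
    return partial_tex, visibility
```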
Discriminator. Finally, to provide adversarial feedback, we train a conditional discriminator. The discriminator is conditioned on the partial texture of view 1. The partial texture of view 2 is a real example, and the generated texture is a fake one. For the fake example, we mask the generated texture with the visibility mask from the real example to prevent a distribution mismatch.

In traditional image-to-image translation algorithms, the conditional input and target images are concatenated and fed to the discriminators. However, in our case, the images are partially missing, and aligning them in the input via concatenation does not provide useful signals. Instead, we use a projection-based discriminator [35], where we process the conditioning input via convolutional layers and global pooling until the spatial dimension decreases to 1×1. Again, the reason to decrease the dimension is that the input is partially missing; therefore we want a full receptive field of the input image while conditioning on the patches. We take the dot product of the embedded conditional input and the discriminator outputs. This score is added to the final discriminator score. With the multi-view conditioning, the discriminator does not only consider whether the patch is realistic but also whether the predicted texture is consistent with its input pair. The GAN training is especially important in our set-up since we do not have consistent multi-view images. The overall objective for the third stage includes the following min-max optimization:

$\min_{\theta_g} \max_{\theta_d} \; L_{second} + \lambda_{adv} L_{adv}(\theta_g, \theta_d)$  (6)

where $\theta_d$ and $\theta_g$ refer to the parameters of the discriminator and the generator, respectively.
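A minimal, single-scale sketch of such a projection-conditioned patch discriminator is given below, in the spirit of [35]. The channel sizes and depth are simplifying assumptions, and the paper's discriminator is additionally multi-scale and uses learnable positional embeddings, so this is an approximation rather than the exact architecture.

```python
# Minimal sketch of a projection-conditioned patch discriminator (illustrative).
import torch
import torch.nn as nn

def conv_block(cin, cout, stride):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1), nn.LeakyReLU(0.2))

class ProjectionDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # Main pathway: scores texture patches.
        self.main = nn.Sequential(
            conv_block(3, 64, 2), conv_block(64, 128, 2),
            conv_block(128, 256, 2), conv_block(256, 512, 2))
        self.patch_score = nn.Conv2d(512, 1, 3, 1, 1)
        # Conditioning pathway: embeds the partial view-1 texture to a vector
        # (global pooling gives a full receptive field over the partial input).
        self.embed = nn.Sequential(
            conv_block(3, 64, 2), conv_block(64, 128, 2),
            conv_block(128, 256, 2), conv_block(256, 512, 2),
            nn.AdaptiveAvgPool2d(1))

    def forward(self, texture, cond_texture):
        h = self.main(texture)                      # (B, 512, h, w)
        out = self.patch_score(h)                   # unconditional patch scores
        e = self.embed(cond_texture)                # (B, 512, 1, 1)
        proj = (h * e).sum(dim=1, keepdim=True)     # inner product per patch
        return out + proj                           # conditioned score map

# Usage sketch: the real pair is (partial texture from view 2, conditioning
# texture from view 1); the fake pair is (predicted texture * visibility mask
# of view 2, same conditioning texture). A standard hinge or non-saturating
# GAN loss can then be applied to the returned score maps.
```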
from the deformation map for the corresponding vertex
locations. We also apply symmetry on the predicted UV de-
4 E XPERIMENTS formation. We use DIB-R [6] as our differentiable renderer.
Datasets. First to generate datasets, we use three category- The renderer takes mesh and texture predictions and output
specific StyleGAN models, one representing a rigid object images for target camera parameters.
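For concreteness, the compact sketch below follows the stated channel counts and the predict-half-then-flip symmetry trick, but simplifies several details (plain normalization instead of adaptive normalization, no per-block skip connections, no latent-code input), so it should be read as an approximation of the described generator rather than a reproduction.

```python
# Approximate sketch of the encoder-decoder generator described above.
import torch
import torch.nn as nn

def enc_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 2, 1), nn.ReLU())

def dec_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, 1, 1), nn.BatchNorm2d(cout), nn.LeakyReLU(0.2),
        nn.Conv2d(cout, cout, 3, 1, 1), nn.BatchNorm2d(cout), nn.LeakyReLU(0.2),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False))

class ReconstructionGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 256, 256, 128, 128]        # 7 stride-2 encoder blocks
        self.encoder = nn.Sequential(*[enc_block(a, b) for a, b in zip(chans, chans[1:])])
        self.fc = nn.Linear(128 * 4 * 4, 512 * 4 * 8)        # bottleneck -> 512 x (4x8)
        self.shared = nn.Sequential(dec_block(512, 512), dec_block(512, 256))  # -> 16x32
        self.mesh_head = nn.Sequential(dec_block(256, 64), nn.Conv2d(64, 3, 3, 1, 1))
        self.tex_trunk = nn.Sequential(dec_block(256, 256), dec_block(256, 256),
                                       dec_block(256, 128), dec_block(128, 128))  # -> 256x512
        self.tex_head = nn.Sequential(nn.Conv2d(128, 64, 3, 1, 1), nn.LeakyReLU(0.2),
                                      nn.Conv2d(64, 3, 3, 1, 1))

    def forward(self, img):                                   # img: (B, 3, 512, 512)
        z = self.encoder(img).flatten(1)                      # (B, 128*4*4)
        feat = self.fc(z).view(-1, 512, 4, 8)                 # half map (height 4, width 8)
        feat = self.shared(feat)
        deform_half = self.mesh_head(feat)                    # (B, 3, 32, 64)
        tex_half = self.tex_trunk(feat)                       # (B, 128, 256, 512)
        # Reflection symmetry: flip along the half dimension and concatenate.
        deform = torch.cat([deform_half, deform_half.flip(-2)], dim=-2)
        tex = torch.cat([tex_half, tex_half.flip(-2)], dim=-2)   # (B, 128, 512, 512)
        texture = self.tex_head(tex)                          # (B, 3, 512, 512)
        return deform, texture
```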

Fig. 4: Encoder-decoder based generator architecture that takes the input image and a sampled z to output texture and 2D deformation map predictions. Each block represents a convolution layer with its channel size. The encoder and decoder are connected with fully connected layers.

Fig. 5: Discriminator architecture. To provide adversarial feedback, we train a projection-based conditional discriminator.

We additionally sample a latent vector from a normal distribution to provide diversity in our predictions. The sampled latent vector is concatenated with the encoded features via a linear layer. The output is fed to the adaptive normalization layers in the convolutional blocks.

Architectural Details of Discriminator. As for the architecture of the discriminator, we use a projection-based discriminator [35]. The discriminator takes the generated and real 3×512×512 texture maps, and the pseudo-ground-truth visibility mask, as shown in Fig. 5. Generated textures are also multiplied with the masks to prevent a mismatch between the real and fake data distributions. We concatenate the input with learnable positional embeddings on both scales, which is omitted from the figure [8]. The discriminator adopts a multi-scale architecture with two scales: one operates on 32×32 patches, the other on 16×16. We also have a conditioning pathway, shown as an embedding network. The embedding network processes the conditioning input via convolutional layers and global pooling until the spatial dimension decreases to 1×1. We take the dot product of the conditional input embedded to 1×1 resolution and the discriminator outputs. This score is added to the final discriminator score. This happens for both scales of the discriminator.

Training parameters. We train our model on 8 GPUs with a batch size of 4 per GPU, for 100 epochs in total, with a learning rate of $10^{-4}$. In our loss functions, we use $\lambda_{lap} = 0.5$, $\lambda_{p-sv} = 1$, $\lambda_{p-nv} = 1$, $\lambda_{adv} = 1$. The discriminator's learning rate is also set to $10^{-4}$. We use the Adam optimizer for updating both the 3D inference model and the discriminator.
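The stated hyperparameters could be wired up as follows; the generator/discriminator objects and any multi-GPU wrapping are placeholders, and only the numbers come from the text.

```python
# Sketch of the stated training configuration (8 GPUs x batch 4, Adam, the
# published loss weights). Model objects are placeholders.
import torch

LOSS_WEIGHTS = dict(lambda_lap=0.5, lambda_p_sv=1.0, lambda_p_nv=1.0, lambda_adv=1.0)
BATCH_PER_GPU = 4
EPOCHS = 100

def build_optimizers(generator, discriminator, lr=1e-4):
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)
    return opt_g, opt_d
```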
Evaluation. We report various metrics on the validation datasets. Since we have multi-view data, we report the scores for both the same view and a novel view. Same-view results are obtained by rendering the predictions from the same view given as the input, whereas for the novel view, we render the predictions from a camera view different from the input. The same view looks at the fidelity to the given input, whereas the novel view measures whether the model estimates the invisible texture and geometry, which is a more difficult task. For both views, we report the Frechet Inception Distance (FID) metric [16], which looks at realism by comparing the target distribution and the rendered images; Learned Perceptual Image Patch Similarity (LPIPS) [57], which compares the target and rendered output pairs at the feature level of a pretrained deep network; and Structural Similarity Index Measure (SSIM) and Mean Squared Error (MSE), which compare the pairs in pixel-level similarity. We also report the intersection-over-union (IoU) between the target silhouette and the projected silhouette of the predicted geometry.
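A minimal sketch of the paired same-view/novel-view measurements is shown below: MSE and IoU are computed directly, LPIPS uses the publicly available lpips package, and FID and SSIM (computed over the full validation set with standard implementations) are omitted from the sketch.

```python
# Minimal sketch of paired metrics for a batch of rendered/target images.
import torch
import lpips

_lpips_fn = lpips.LPIPS(net="alex")  # expects inputs scaled to [-1, 1]

def pair_metrics(rendered, target, rendered_sil, target_sil):
    """rendered/target: (B,3,H,W) in [0,1]; silhouettes: (B,1,H,W) in [0,1]."""
    mse = ((rendered - target) ** 2).mean().item()
    inter = ((rendered_sil > 0.5) & (target_sil > 0.5)).float().sum()
    union = ((rendered_sil > 0.5) | (target_sil > 0.5)).float().sum()
    iou = (inter / union.clamp(min=1)).item()
    with torch.no_grad():
        lp = _lpips_fn(rendered * 2 - 1, target * 2 - 1).mean().item()
    return dict(mse=mse, iou=iou, lpips=lp)
```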
Results. We provide quantitative results in Table 1. We provide comparisons with GanVerse [58], which is trained on the same data as ours. We observe quantitative improvements at each stage for all three classes, as given in Table 1, as well as large improvements over the GanVerse model, especially on the FID metric, which measures the quality of generations. We qualitatively compare with other methods; since they all use different camera set-ups, it is not possible to do an accurate quantitative comparison.

Fig. 6: Given input images (1st column), we predict 3D shape and texture and render them into the same viewpoint and novel viewpoints for our model. We also show renderings of state-of-the-art models that are trained on real and synthetic images. Since the models use different camera parameters, we did not align the results. However, the results, shown from similar viewpoints, capture the behaviour of each model. Unicorn outputs similar 3D shapes for different inputs. PixelNeRF achieves high-quality same-view results, but the results are poor from novel views. DIB-R also suffers from the same issue. GanVerse does not output realistic details. Our model achieves significantly better results than the previous works while being trained on a synthetic (StyleGAN-generated) dataset.

TABLE 1: We report results for the same view (input and target have the same view) and a novel view (input and target have different views). We provide FID, LPIPS, MSE, SSIM, and 2D IoU scores between predictions and GT. We compare with GanVerse since it is trained on the same dataset. We also provide results of each stage, showcasing the improvements of progressive training. In the table, the first five metric columns are for the same view and the last five are for the novel view.

| Class | Method | FID⇓ | LPIPS⇓ | MSE⇓ | SSIM⇑ | IoU⇑ | FID⇓ | LPIPS⇓ | MSE⇓ | SSIM⇑ | IoU⇑ |
|-------|--------|------|--------|------|-------|------|------|--------|------|-------|------|
| Car | GanVerse [58] | 28.04 | 0.1238 | 0.0060 | 0.8683 | 0.92 | 29.59 | 0.1333 | 0.0075 | 0.8599 | 0.93 |
| Car | Ours - Stage I | 10.75 | 0.1011 | 0.0060 | 0.8695 | 0.94 | 11.98 | 0.1091 | 0.0074 | 0.8582 | 0.92 |
| Car | Ours - Stage II | 6.24 | 0.0737 | 0.0039 | 0.9027 | 0.94 | 9.05 | 0.1012 | 0.0072 | 0.8651 | 0.93 |
| Car | Ours - Stage III | 4.56 | 0.0696 | 0.0040 | 0.9039 | 0.94 | 6.92 | 0.0965 | 0.0076 | 0.8632 | 0.93 |
| Bird | GanVerse [58] | 69.32 | 0.0742 | 0.0034 | 0.9230 | 0.82 | 63.89 | 0.0782 | 0.0037 | 0.9202 | 0.80 |
| Bird | Ours - Stage I | 69.58 | 0.0763 | 0.0035 | 0.9222 | 0.82 | 63.76 | 0.0764 | 0.0036 | 0.9222 | 0.81 |
| Bird | Ours - Stage II | 64.67 | 0.0720 | 0.0030 | 0.9258 | 0.82 | 60.97 | 0.0749 | 0.0036 | 0.9226 | 0.81 |
| Bird | Ours - Stage III | 61.21 | 0.0689 | 0.0030 | 0.9231 | 0.83 | 59.08 | 0.0712 | 0.0035 | 0.9201 | 0.82 |
| Horse | GanVerse [58] | 62.38 | 0.1272 | 0.0060 | 0.8852 | 0.78 | 83.82 | 0.1531 | 0.0100 | 0.8642 | 0.77 |
| Horse | Ours - Stage I | 83.66 | 0.1395 | 0.0088 | 0.8727 | 0.78 | 76.64 | 0.1442 | 0.0100 | 0.8669 | 0.78 |
| Horse | Ours - Stage II | 57.07 | 0.1037 | 0.0059 | 0.8971 | 0.79 | 67.77 | 0.1368 | 0.0101 | 0.8676 | 0.78 |
| Horse | Ours - Stage III | 56.83 | 0.1024 | 0.0055 | 0.9017 | 0.79 | 67.30 | 0.1367 | 0.0099 | 0.8684 | 0.78 |

Fig. 7: Qualitative results of our final model on the Bird and Horse classes. Given input images (1st row), we predict 3D shape and texture and render them into the same viewpoint and novel viewpoints.

However, in our qualitative comparisons (Fig. 6), it is clear that our method achieves significantly better results.

In our qualitative comparisons, we compare with Unicorn [36], which learns a 3D inference model in an unsupervised way on the Pascal3D+ Car dataset [51]. The model only uses the bounding box annotation and is trained on 5000 training images. As can be seen in Fig. 6, impressive results are achieved given that the model is learned in an unsupervised way. On the other hand, the results lack details and diversity in the shapes, and they are significantly worse than ours. Second, we compare with PixelNeRF [53], which predicts a continuous neural scene representation conditioned on a single-view image. While neural radiance fields [34] optimize the representation for every scene independently, PixelNeRF trains across multiple scenes to learn a scene prior and is able to perform novel-view synthesis given an input image. PixelNeRF is trained on a synthetic dataset, the ShapeNet dataset [5]. It is also showcased on real-image reconstruction for car classes. In our results, PixelNeRF is very good at predicting the same view but not as successful on the novel-view predictions. Next, we compare with the DIB-R model [6], which is trained on the Pascal3D+ Car dataset with ground-truth silhouettes and camera parameters. DIB-R outputs reasonable results on the same-view predictions of the input image. However, its textures and meshes do not generalize across views, which results in unrealistic predictions from novel views, even though the model is trained with expensive annotations.

Fig. 8: 3D model predictions of our model.

Last, we compare with the GanVerse model, which is trained on the same StyleGAN-generated dataset as our method. As shown in Table 1, the quantitative results of GanVerse are even worse than our single-stage results. The GanVerse model is trained with same-view and other-view reconstructions simultaneously. One difference is that the GanVerse model learns a mean shape and additional deformation vertices for each image.

The additional deformation is penalized for each image for stable training. Since we rely on other-view training for mesh prediction, we do not put such a constraint on the vertices. Their architecture is based on a U-Net, and they do not employ GAN training. Their results lack details and do not look as realistic compared to our model. Note that the StyleGAN-generated dataset also removes the expensive labeling effort, and models that train on this dataset have the same motivation as models that learn without annotations, such as Unicorn. Generating the StyleGAN dataset with annotations requires 1 minute, whereas the Pascal3D+ dataset requires 200h-350h of work time for the annotations [58]. Therefore, the comparisons provided in Fig. 6 cover models learned from a broad range of dataset set-ups with different levels of annotation effort. It includes unsupervised training data (Unicorn with Pascal3D+ images), StyleGAN-generated data which adds a minute-long annotation effort (our method), much more expensive data with real images and keypoint annotations (DIB-R on labeled Pascal3D+ images), and synthetic data (PixelNeRF with the ShapeNet dataset) with perfect annotations.

Lastly, we show the final results of our method on the bird and horse classes in Fig. 7. 3D models of these categories without textures are also shown in Fig. 8. Our method achieves realistic 3D predictions for these classes as well, even though StyleGAN-generated datasets have inconsistencies across views.

Fig. 9: Validation FID curve on the Car class with respect to the number of iterations. As shown, the improvements are not coming from longer iterations but from the progressive training.

Fig. 10: Qualitative results of our models from the first-, second-, and third-stage trainings on the Car class. At each stage, results improve significantly. For example, zooming into the tires in the first stage, we see duplicated features. The second stage mostly solves that problem, but the results are not as sharp as the third-stage results.

Ablation Study. First, we analyze the role of each stage in our training pipeline. As given in Table 1, additional training at each stage improves the metrics consistently, especially FID and LPIPS, the metrics that are shown to closely correlate with human perception. In Fig. 9, we provide the training curves of each stage. As can be seen from the figure, the improvements are not coming from longer training but from the progressive learning. We provide qualitative comparisons of each stage's output renderings in Fig. 10. The first-stage renderings have a reliable geometry. However, the texture is not realistic; in particular, the tires have duplicated features, which is understandable given that the model is minimizing the reconstruction loss from inconsistent multi-view images. In the second stage, texture improves significantly over the first stage. Finally, with GAN training in the last stage, the colors look more realistic and sharp, and fine details are generated.

We provide an additional ablation study in Table 2, conducted on the Car dataset. We also provide our per-stage scores in the first block to compare the results easily. In the second block, we first experiment with no multi-stage training (No Multi-Stage). This set-up refers to training the model from scratch with the final proposed loss function. This training results in poor results in all metrics, even worse than our first-stage training results. In particular, adversarial training makes the training less stable when the model has not yet learned reliable predictions. It shows the importance of our multi-stage training pipeline.

Next, we train with the Same-View Training objective. This set-up is used when multi-view images are not available and models have to be trained on single-view images. We use a perceptual objective similar to Eq. 4 but with the ground-truth silhouettes, as given in Eq. 7. We also add the silhouette loss from the same view to guide the geometry predictions. Here the input and target images share the same camera parameters.

$L_{p-sv-sil} = \|\Phi_j(I^{g}_{v1} * S^{g}_{v1}) - \Phi_j(I^{r}_{v1} * S^{g}_{v1})\|$  (7)

$L_{sil-sv} = 1 - \frac{\|S^{g}_{v1} S^{r}_{v1}\|_1}{\|S^{g}_{v1} + S^{r}_{v1} - S^{g}_{v1} S^{r}_{v1}\|}$  (8)

$L_{sv} = \lambda_{pn} L_{p-sv-sil} + \lambda_{s} L_{sil-sv} + \lambda_{lap} L_{lap}$  (9)

The same-view training results (model trained with the objective from Eq. 9) are given in Table 2. With this set-up, the same-view results look good quantitatively because the network learns to reconstruct the input view. In particular, the IoU of the same view is better than in the other set-ups because the model learns the missing parts of the input images and their corresponding silhouettes and makes similar predictions on the validation dataset that match the ground truth.

TABLE 2: Ablation study showing results without the multi-stage pipeline, without the multi-view discriminator, and without the Gaussian sampling in the generator, on the Car dataset. Results are given for same and novel views and contain FID, LPIPS, MSE, SSIM, and 2D IoU scores between predictions and GT. The first five metric columns are for the same view and the last five are for the novel view.

| Method | FID⇓ | LPIPS⇓ | MSE⇓ | SSIM⇑ | IoU⇑ | FID⇓ | LPIPS⇓ | MSE⇓ | SSIM⇑ | IoU⇑ |
|--------|------|--------|------|-------|------|------|--------|------|-------|------|
| Ours - Stage I | 10.75 | 0.1011 | 0.0060 | 0.8695 | 0.94 | 11.98 | 0.1091 | 0.0074 | 0.8582 | 0.92 |
| Ours - Stage II | 6.24 | 0.0737 | 0.0039 | 0.9027 | 0.94 | 9.05 | 0.1012 | 0.0072 | 0.8651 | 0.93 |
| Ours - Stage III | 4.56 | 0.0696 | 0.0040 | 0.9039 | 0.94 | 6.92 | 0.0965 | 0.0076 | 0.8632 | 0.93 |
| No Multi-Stage | 66.59 | 0.1424 | 0.0118 | 0.8276 | 0.91 | 66.71 | 0.1485 | 0.0136 | 0.8200 | 0.90 |
| Same-View Training | 4.88 | 0.0759 | 0.0037 | 0.9034 | 0.96 | 42.24 | 0.1256 | 0.0105 | 0.8469 | 0.88 |
| Multi-View Training | 6.60 | 0.0741 | 0.0039 | 0.9026 | 0.94 | 10.00 | 0.1017 | 0.0071 | 0.8657 | 0.94 |
| No Multi-View disc. | 4.65 | 0.0702 | 0.0039 | 0.9042 | 0.94 | 7.02 | 0.0965 | 0.0076 | 0.8633 | 0.93 |
| No Gaussian samp. | 4.67 | 0.0696 | 0.0039 | 0.9043 | 0.94 | 7.05 | 0.0986 | 0.0075 | 0.8625 | 0.93 |

On the other hand, the IoU for the novel view is the worst among all set-ups. Models trained with single-view objectives struggle to generate realistic novel views, as can be seen in the FIDs: 42.24 novel-view FID versus 4.88 same-view FID.

We also experiment with the Multi-View Training set-up. This refers to training the model from scratch with the second-stage objective. We compare those results with our second-stage training results. We see that better results are achieved with the progressive learning.

In the last block, we run experiments where we train a discriminator without the multi-view conditioning. For this experiment, we only update the stage-three training. The discriminator only has the main pipeline, without the projection-based other-view conditioning. This also results in worse results than our proposed multi-view conditional discriminator, especially in FIDs. Since the multi-view conditional discriminator receives guidance from a given view, it propagates better signals to the generator.

We also experiment with the setting which does not have the sampling from a normal distribution. This change converts the setting to a deterministic model. We observe that sampling provides a slight diversity in the colors and improves the metrics slightly, so we decide to keep it. The diversity results are given in Fig. 11. The model does not achieve visible diversity; however, when we take the difference of two images, they are slightly different. The improvements are not significant but are consistent across all metrics.

Fig. 11: Texture predictions with different sampled codes and the same input image. The model does not achieve visible diversity. However, when we take the difference of two images, they are slightly different.

5 CONCLUSION

In this work, we present a method to reconstruct high-quality textured 3D models from single images. Our method learns from GAN-generated images and bypasses the reliance on labeled multi-view datasets or expensive 3D scans. The GAN-generated dataset is labeled in mass and requires a total of 1 minute of human labor. Because the GAN-generated dataset is noisy and not strictly consistent across views, we propose a novel multi-stage training pipeline and adversarial training set-up. We achieve significant improvements over previous methods, whether they were trained on GAN-generated images or on real images.

Limitations. The first limitation of our work is that we deform our final meshes from a sphere and cannot handle objects with holes, similar to previous works that build on mesh representations obtained by deforming from spheres [17], [40], [58]. Another limitation we observe is the different 3D model qualities we obtain across different categories. Specifically, our generated 3D models are of better quality for the car class than for the bird class. We also observe the same in the StyleGAN image generation results between the car and bird classes. One reason for that is that the StyleGAN model is trained on 5.7M car images, whereas for the bird category it is only trained on 48k bird images. We acknowledge that the performance of our model is correlated with the performance of the StyleGAN model it learns from. Even though our model does not need annotated images, the StyleGAN model requires a large amount of unlabeled data. Learning GAN models on limited data is an important future direction for this work [18].

REFERENCES

[1] Y. Alaluf, O. Tov, R. Mokady, R. Gal, and A. Bermano. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18511–18521, 2022. 2

[2] A. Bhattad, A. Dundar, G. Liu, A. Tao, and B. Catanzaro. View generalization for single image textured 3d models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6081–6090, June 2021. 1, 4, 5
[3] E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. De Mello, O. Gallo, L. J. Guibas, J. Tremblay, S. Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16123–16133, 2022. 1, 2
[4] E. R. Chan, M. Monteiro, P. Kellnhofer, J. Wu, and G. Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5799–5809, 2021. 2
[5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015. 1, 2, 8
[6] W. Chen, H. Ling, J. Gao, E. Smith, J. Lehtinen, A. Jacobson, and S. Fidler. Learning to predict 3d objects with an interpolation-based differentiable renderer. In Advances in Neural Information Processing Systems, pages 9609–9619, 2019. 1, 2, 3, 4, 5, 8
[7] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European Conference on Computer Vision, pages 628–644. Springer, 2016. 1
[8] A. Dundar, J. Gao, A. Tao, and B. Catanzaro. Fine detailed texture learning for 3d meshes with generative models. arXiv preprint arXiv:2203.09362, 2022. 2, 4, 5, 6
[9] A. Dundar, K. Sapra, G. Liu, A. Tao, and B. Catanzaro. Panoptic-based image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8070–8079, 2020. 1
[10] J. Gao, T. Shen, Z. Wang, W. Chen, K. Yin, D. Li, O. Litany, Z. Gojcic, and S. Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. arXiv preprint arXiv:2209.11163, 2022. 1, 2
[11] B. Gecer, S. Ploumpis, I. Kotsia, and S. Zafeiriou. Ganfit: Generative adversarial network fitting for high fidelity 3d face reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1155–1164, 2019. 1, 4
[12] S. Goel, A. Kanazawa, and J. Malik. Shape and viewpoint without keypoints. arXiv preprint arXiv:2007.10982, 2020. 2
[13] J. Gu, L. Liu, P. Wang, and C. Theobalt. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985, 2021. 2
[14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017. 3
[15] P. Henderson, V. Tsiminaki, and C. H. Lampert. Leveraging 2d data to learn textured 3d mesh generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7498–7507, 2020. 2
[16] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017. 6
[17] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In Proceedings of the European Conference on Computer Vision (ECCV), pages 371–386, 2018. 1, 2, 10
[18] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 33:12104–12114, 2020. 10
[19] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019. 1, 2, 5
[20] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020. 1, 2
[21] H. Kato, Y. Ushiku, and T. Harada. Neural 3d mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3907–3916, 2018. 2
[22] J. Ko, K. Cho, D. Choi, K. Ryoo, and S. Kim. 3d gan inversion with pose optimization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2967–2976, 2023. 2
[23] A. Lattas, S. Moschoglou, B. Gecer, S. Ploumpis, V. Triantafyllou, A. Ghosh, and S. Zafeiriou. Avatarme: Realistically renderable 3d facial reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 760–769, 2020. 1
[24] X. Li, S. Liu, K. Kim, S. De Mello, V. Jampani, M.-H. Yang, and J. Kautz. Self-supervised single-view 3d reconstruction via semantic consistency. arXiv preprint arXiv:2003.06473, 2020. 1, 2
[25] C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y. Lin. Magic3d: High-resolution text-to-3d content creation. arXiv preprint arXiv:2211.10440, 2022. 2
[26] F. Liu and X. Liu. 2d gans meet unsupervised single-view 3d reconstruction. arXiv preprint arXiv:2207.10183, 2022. 1
[27] G. Liu, A. Dundar, K. J. Shih, T.-C. Wang, F. A. Reda, K. Sapra, Z. Yu, X. Yang, A. Tao, and B. Catanzaro. Partial convolution for padding, inpainting, and image synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. 1
[28] S. Liu, T. Li, W. Chen, and H. Li. Soft rasterizer: A differentiable renderer for image-based 3d reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pages 7708–7717, 2019. 2, 4
[29] Y. Liu, Z. Shu, Y. Li, Z. Lin, R. Zhang, and S. Kung. 3d-fm gan: Towards 3d-controllable face manipulation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XV, pages 107–125. Springer, 2022. 2
[30] M. M. Loper and M. J. Black. Opendr: An approximate differentiable renderer. In European Conference on Computer Vision, pages 154–169. Springer, 2014. 2
[31] R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7210–7219, 2021. 2
[32] G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or. Latent-nerf for shape-guided generation of 3d shapes and textures. arXiv preprint arXiv:2211.07600, 2022. 2
[33] O. Michel, R. Bar-On, R. Liu, S. Benaim, and R. Hanocka. Text2mesh: Text-driven neural stylization for meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13492–13502, 2022. 2
[34] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pages 405–421. Springer, 2020. 2, 8
[35] T. Miyato and M. Koyama. cGANs with projection discriminator. In International Conference on Learning Representations, 2018. 5, 6
[36] T. Monnier, M. Fisher, A. A. Efros, and M. Aubry. Share with thy neighbors: Single-view reconstruction by cross-instance consistency. arXiv preprint arXiv:2204.10310, 2022. 1, 8
[37] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y.-L. Yang. Hologan: Unsupervised learning of 3d representations from natural images. In Proceedings of the IEEE International Conference on Computer Vision, pages 7588–7597, 2019. 2
[38] M. Niemeyer and A. Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11453–11464, 2021. 2
[39] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019. 1
[40] D. Pavllo, G. Spinks, T. Hofmann, M.-F. Moens, and A. Lucchi. Convolutional generation of textured 3d meshes. arXiv preprint arXiv:2006.07660, 2020. 4, 5, 10
[41] H. Pehlivan, Y. Dalva, and A. Dundar. Styleres: Transforming the residuals for real image editing with stylegan. arXiv preprint arXiv:2212.14359, 2022. 2
[42] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022. 2
[43] N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W.-Y. Lo, J. Johnson, and G. Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv preprint arXiv:2007.08501, 2020. 2
[44] N. Tritrong, P. Rewatbowornwong, and S. Suwajanakorn. Repurposing gans for one-shot semantic part segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4475–4485, 2021. 1
[45] S. Tulsiani, A. A. Efros, and J. Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2897–2905, 2018. 2

[46] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2626–2634, 2017. 2
[47] G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 595–604, 2015. 5
[48] T. Wang, Y. Zhang, Y. Fan, J. Wang, and Q. Chen. High-fidelity gan inversion for image attribute editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11379–11388, 2022. 2
[49] T.-C. Wang, A. Mallya, and M.-Y. Liu. One-shot free-view neural talking-head synthesis for video conferencing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10039–10049, 2021. 1
[50] Z. Wang, S. Wu, W. Xie, M. Chen, and V. A. Prisacariu. Nerf–: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021. 2
[51] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision, pages 75–82. IEEE, 2014. 8
[52] F. Yin, Y. Zhang, X. Wang, T. Wang, X. Li, Y. Gong, Y. Fan, X. Cun, Y. Shan, C. Oztireli, et al. 3d gan inversion with facial symmetry prior. arXiv preprint arXiv:2211.16927, 2022. 2
[53] A. Yu, V. Ye, M. Tancik, and A. Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021. 1, 2, 8
[54] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015. 5
[55] N. Yu, G. Liu, A. Dundar, A. Tao, B. Catanzaro, L. S. Davis, and M. Fritz. Dual contrastive loss and attention for gans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6731–6742, 2021. 1
[56] J. Y. Zhang, P. Felsen, A. Kanazawa, and J. Malik. Predicting 3d human dynamics from video. In Proceedings of the IEEE International Conference on Computer Vision, pages 7114–7123, 2019. 1
[57] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018. 6
[58] Y. Zhang, W. Chen, H. Ling, J. Gao, Y. Zhang, A. Torralba, and S. Fidler. Image gans meet differentiable rendering for inverse graphics and interpretable 3d neural rendering. International Conference on Learning Representations, 2020. 1, 2, 3, 4, 5, 6, 8, 9, 10
[59] Y. Zhang, H. Ling, J. Gao, K. Yin, J.-F. Lafleche, A. Barriuso, A. Torralba, and S. Fidler. Datasetgan: Efficient labeled data factory with minimal human effort. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10145–10155, 2021. 1
