Progressive Learning of 3D Reconstruction Network From 2D GAN Data
Abstract—This paper presents a method to reconstruct high-quality textured 3D models from single images. Current methods rely on
datasets with expensive annotations: multi-view images and their camera parameters. Our method instead relies on GAN-generated multi-view
image datasets, which have a negligible annotation cost. However, such datasets are not strictly multi-view consistent, and GANs sometimes
output distorted images, which degrades reconstruction quality. In this work, to overcome these limitations of generated
datasets, we make two main contributions that lead to state-of-the-art results on challenging objects: 1) a robust
multi-stage learning scheme that gradually relies more on the model's own predictions when calculating losses, and 2) a novel adversarial
learning pipeline with online pseudo-ground-truth generation to achieve fine details. Our work provides a bridge from the 2D supervision
of GAN models to 3D reconstruction models and removes expensive annotation efforts. We show significant improvements over
previous methods, whether they were trained on GAN-generated multi-view images or on real images with expensive annotations.
Please visit our web-page for 3D visuals: https://research.nvidia.com/labs/adlr/progressive-3d-learning.
Index Terms—3D Texture Learning, 3D Reconstruction, Single-image Inference, Generative Adversarial Networks.
Fig. 1: Given a single 2D image input, our method outputs high-quality textured 3D models. We achieve these results
by learning from StyleGAN-generated datasets via a robust multi-stage training scheme and a novel adversarial learning
pipeline.
train a discriminator. With the discriminator, our model learns to output fine details. Our model shows significant improvements over previous methods, whether they were trained on GAN-generated multi-view images or on real images with expensive data collection/annotation pipelines.
In summary, our main contributions are:
• A robust multi-stage learning scheme that relies more on the model's predictions at each step. Our model is not affected by missing parts in the images or inconsistencies across views.
• A novel adversarial learning pipeline to increase the realism of textured 3D predictions. We generate pseudo-ground truth during training and employ a multi-view conditional discriminator to learn to generate fine details.
• High-fidelity textured 3D model synthesis, both qualitatively and quantitatively, on three challenging objects. Examples are shown in Fig. 1.

2 RELATED WORK
Style-based GAN models [19], [20] achieve high-quality synthesis of various objects that is quite indistinguishable from real images, and they are shown to learn implicit 3D knowledge of objects without supervision. One can control the viewpoint of the synthesized object via its latent codes. This makes pretrained GANs a promising technology for controllable generation [1], [41], [48]. However, in these models the disentanglement of 3D shape and appearance is not strict, and therefore the appearance of objects changes as the viewpoint is manipulated. Recently, 3D-aware generative models have been proposed with impressive results, but they either do not guarantee strict 3D consistency [4], [13], [37], [38] or are computationally expensive [3], and overall they are not on par with 2D StyleGAN results [10]. We are interested in single-view image inference, so our work is more related to image inversion methods that project images into these 3D-aware GANs' latent spaces [22], [29], [52]. Even though significant progress has been achieved for image inversion [22], [29], [52], these methods require run-time optimization and suffer from lower-quality novel view predictions.
There have been many works that learn textured 3D mesh models from images with differentiable renderers [6], [21], [28], [30], [43]. Deep neural networks are coupled with the renderers and trained to predict 3D mesh representations and texture maps of input images via reconstruction losses [6], [12], [15], [17], [24]. However, inferring these 3D attributes from single 2D images is an inherently ill-posed problem, given that the invisible mesh and texture predictions receive no gradients during training [6], [12], [15], [17], [24]. Algorithms that learn from single-view images output results that look unrealistic, especially when viewed from a different viewpoint.
Multi-view image datasets provide a solution to the limited supervision problem of single-view image datasets. However, due to the expensive annotation of multi-view image datasets for their 2D keypoints or camera poses, they are small in scale. There have been methods that use a sequence of images to optimize a mesh and texture model [8]. However, they learn a new network for each sequence. Recently, such image sequence datasets have also been extensively explored with Neural Radiance Fields (NeRF) to learn implicit geometry [31], [34], [50]. These models overfit to a sequence and cannot be used for single-image inference. PixelNeRF [53] is an extension of these models that achieves single-image inference; however, as we show in our experiments, its results are not good.
Another promising direction with NeRF-based models is the optimization of 3D representations with well-trained diffusion models. These models can stylize meshes or generate 3D geometry representations from scratch for given text prompts [25], [32], [33], [42]. However, these models require run-time optimization, and control over the generation is limited. In our work, we are interested in a different application, single-view image reconstruction, where the generation is conditioned on an input image.
In our work, we are interested in mesh representations due to their efficiency in rendering. To infer mesh representations, multi-view datasets have also been shown to be beneficial, based on experiments with synthetic datasets [5], [45], [46]. However, those results do not translate well to real-image inference because of the domain gap between synthetic and real images. To generate a realistic multi-view dataset with cheap labor cost, Zhang et al. [58] use a generative adversarial network, controlling the latent codes to generate coarsely consistent objects from different viewpoints. We also use these datasets but achieve significantly better results than the state of the art and Zhang et al. [58], thanks to the robust learning scheme (which also allows us to remove regularizers that limit the deformations) and the adversarial learning pipeline.
Fig. 2: Overview of the dataset generation (a) and the multi-stage training scheme of the reconstruction network (b). The generator network takes an input image and outputs mesh and texture predictions. In the first stage, the output is rendered from a view other than the input view, and losses are calculated on this novel view. This way, the model is not affected by missing parts in the images or by the unrealistic segmentation maps resulting from them. In the second stage, an additional reconstruction loss is added from the same view. The rendered and ground-truth images are masked based on the silhouette predictions of the model. Lastly, to achieve sharp and realistic predictions, we add adversarial training in the third stage. The GAN training pipeline is given in Fig. 3.
3 METHOD
In Section 3.1, we describe the motivation of our approach. The multi-stage training scheme and the adversarial learning pipeline are presented in Sections 3.2 and 3.3, respectively.

3.1 Motivation
Differentiable rendering enables training neural networks to perform 3D inference, such as predicting 3D mesh geometry and textures from images [6]. However, it requires multi-view images, camera parameters, and object silhouettes to achieve high-performance models. Such data is expensive to obtain. StyleGAN-generated datasets remove the expensive labeling effort via the latent codes that control the camera viewpoints. When a few viewpoints are selected and annotated, multi-view images can be generated in unlimited numbers for those viewpoints. The annotation requires 1 minute [58] no matter how many images are generated, because the images are all aligned across different examples. As for the segmentation masks the renderers utilize during training, they can be obtained with off-the-shelf instance segmentation models [14]. However, learning a high-performing model from these datasets remains a challenge, since the generated images lack precise multi-view consistency, as shown in Fig. 2 by red rectangles. Additionally, some examples have missing parts, as shown in Fig. 2 by blue rectangles: the head of the horse is not generated in good quality, which also transfers to the instance segmentation mask (fourth row). In this work, we address these challenges by proposing a robust multi-stage training pipeline and an adversarial learning set-up.
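To make the data pipeline in Fig. 2(a) concrete, the following is a minimal sketch of how such a multi-view dataset could be assembled. The callables `stylegan_generator`, `view_latents`, and `segment` are placeholders for a pretrained StyleGAN2 generator, the handful of annotated view-control latents, and an off-the-shelf instance segmenter such as Mask R-CNN; they are illustrative assumptions, not part of the released pipeline.

```python
import torch

def build_multiview_dataset(stylegan_generator, view_latents, cameras, segment,
                            num_objects=1000, device="cpu"):
    """Hypothetical sketch: render each sampled object identity from every
    annotated viewpoint and segment it to obtain silhouettes.

    view_latents: one latent code per annotated viewpoint.
    cameras:      matching camera parameters, annotated once for all objects.
    """
    dataset = []
    for _ in range(num_objects):
        # One identity latent shared across views so the object is (coarsely) consistent.
        w_identity = torch.randn(1, 512, device=device)
        views = []
        for w_view, cam in zip(view_latents, cameras):
            image = stylegan_generator(w_identity, w_view)  # (1, 3, H, W), assumed in [-1, 1]
            mask = segment(image)                           # (1, 1, H, W) silhouette
            views.append({"image": image, "mask": mask, "camera": cam})
        dataset.append(views)
    return dataset
```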
3.2 Multi-stage Training Pipeline
We train the generator that outputs mesh and texture predictions with a multi-stage training pipeline to be robust to the errors in annotations and to multi-view inconsistencies. At each stage of our pipeline, results improve progressively.
First Stage. In the first stage, our model outputs 3D mesh and texture predictions from an input image $I^g_{v1}$ taken from camera view 1. We render an image from these 3D predictions at the target view, view 2, and obtain the image $I^r_{v2}$. We calculate the losses on these novel-view predictions. The target view is randomly selected from the sequence. The motivation of the first stage is to capture reliable 3D mesh predictions and reasonable texture estimations. At this stage, we do not expect a high-quality texture estimation, given the inconsistencies across views. Our experiments show that when the network is guided with losses from the same view as the input ($I^r_{v1}$ vs. $I^g_{v1}$), the errors in missing parts, and hence the errors in the segmentation mask annotations, propagate to the mesh predictions. This destabilizes the training even when the model is trained with multi-view consistency, e.g., with objectives calculated from both $I^r_{v1}$ and $I^r_{v2}$. Therefore, in the first stage, we learn to reconstruct an object from an image of the object taken from a different view. This way, the model does not overfit to the errors of the given view, since it receives feedback from a novel view. Note that the novel view may, and does, also have errors, but since a novel view is randomly sampled from a sequence and the errors are not consistent among the views, the model outputs the most plausible 3D model to minimize the losses, in a sense similar to majority voting.
The training losses at this stage are calculated as follows. We use a perceptual image reconstruction loss between the ground-truth image of the target view ($I^g_{v2}$) and the rendered image ($I^r_{v2}$). We mask the images with the ground-truth silhouettes (masks), $S^g_{v2}$. This way, the reconstruction loss is only calculated on the object. As the reconstruction loss, we use perceptual losses computed from AlexNet features ($\Phi$) at different feature layers ($j$) between these images, as given in Eq. 1.

L_{p-nv} = \|\Phi_j(I^g_{v2} \ast S^g_{v2}) - \Phi_j(I^r_{v2} \ast S^g_{v2})\|_2    (1)

For shapes, we use an IoU loss between the rendered silhouette ($S^r_{v2}$) and the ground-truth silhouette ($S^g_{v2}$) of the object.

L_{sil} = 1 - \frac{\|S^g_{v2} \odot S^r_{v2}\|_1}{\|S^g_{v2} + S^r_{v2} - S^g_{v2} \odot S^r_{v2}\|_1}    (2)

Similar to [6], [28], we also regularize the predicted mesh using a Laplacian loss ($L_{lap}$) constraining neighboring mesh triangles to have similar normals. The following are our base losses:

L_{first} = \lambda_{pn} L_{p-nv} + \lambda_{s} L_{sil} + \lambda_{lap} L_{lap}    (3)

The model outputs reliable 3D mesh predictions since it gets feedback from different views. Note that we do not use many regularizers, such as a mean template or penalizing deformation vertices, as previous works do [2], [6], [58], and we still achieve stable training with the first-stage objectives.
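As a concrete reference for Eqs. 1-3, below is a minimal PyTorch sketch of the first-stage objective. The feature extractor `phi`, the Laplacian term, and the loss weights are placeholders standing in for the AlexNet perceptual layers, the mesh regularizer, and hyperparameters that are not specified here; this is a sketch under those assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def silhouette_iou_loss(s_render, s_gt, eps=1e-6):
    # Eq. 2: 1 - soft IoU between rendered and ground-truth silhouettes.
    inter = (s_render * s_gt).sum(dim=(1, 2, 3))
    union = (s_render + s_gt - s_render * s_gt).sum(dim=(1, 2, 3))
    return (1.0 - inter / (union + eps)).mean()

def masked_perceptual_loss(phi, img_render, img_gt, mask):
    # Eq. 1: L2 distance between features of silhouette-masked images,
    # summed over the selected feature layers j.
    feats_r = phi(img_render * mask)
    feats_g = phi(img_gt * mask)
    return sum(F.mse_loss(fr, fg) for fr, fg in zip(feats_r, feats_g))

def first_stage_loss(phi, render_v2, gt_v2, sil_render_v2, sil_gt_v2, laplacian,
                     w_pn=1.0, w_s=1.0, w_lap=0.1):  # weights are illustrative only
    # Eq. 3: novel-view perceptual term + silhouette IoU + Laplacian smoothness.
    l_pnv = masked_perceptual_loss(phi, render_v2, gt_v2, sil_gt_v2)
    l_sil = silhouette_iou_loss(sil_render_v2, sil_gt_v2)
    return w_pn * l_pnv + w_s * l_sil + w_lap * laplacian

# Toy usage with a stand-in two-layer feature extractor.
phi = lambda x: [F.avg_pool2d(x, 4), F.avg_pool2d(x, 8)]
img_r, img_g = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
sil_r, sil_g = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
loss = first_stage_loss(phi, img_r, img_g, sil_r, sil_g, laplacian=torch.tensor(0.0))
```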
Fig. 3: GAN training pipeline. The discriminator is trained with fake and real pairs and a conditioning pair. First, texture and mesh predictions are output by the generator. With the estimated mesh and the given camera parameters of a second-view image, texture and visibility maps are projected from the second-view image. For the fake pair, the estimated texture is partially erased by the visibility map. Additionally, the conditioning input pair is obtained by projecting from the first-view image with the estimated mesh predictions and the given camera parameters of the first view.
Second Stage. In the second stage, we rely on our 3D mesh predictions and introduce additional losses between $I^r_{v1}$ and $I^g_{v1}$, the same view as the input image. In 3D inference, we expect the model to output predictions that faithfully match the object for the input view. Therefore, in the second stage, we add additional reconstruction losses from the input view. However, we do not add a silhouette loss for the input view, because there is noise in the segmentation masks. Furthermore, for the reconstruction loss, we do not mask the input image and rendered output with the ground-truth mask, since it is noisy. We rely on the 3D prediction of our model and mask the reconstruction loss based on the projected mesh prediction.
The rendered and ground-truth images are masked based on the silhouette predictions of the model, and the perceptual loss is calculated as given in Eq. 4. This way, we rely on the first-stage training for the mesh prediction and learn improved textures with the same-view training for the visible parts. Mesh predictions can still improve via the reconstruction losses, since they still receive feedback via image reconstruction, but we do not guide them directly with the silhouettes.

L_{p-sv} = \|\Phi_j(I^g_{v1} \ast S^r_{v1}) - \Phi_j(I^r_{v1} \ast S^r_{v1})\|    (4)

We use the following loss in the additional iterations:

L_{second} = L_{first} + \lambda_{ps} L_{p-sv}    (5)

The second-stage training starts after the first-stage training converges. That is because we rely on the model's mesh predictions in the newly introduced losses. Results improve significantly, but they are not as sharp as the training data.
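A minimal sketch of the second-stage objective (Eqs. 4-5) follows. Both images are masked with the model's own rendered silhouette rather than the ground-truth mask; `phi`, `first_stage_term`, and the weight `w_ps` are illustrative placeholders of the same kind as in the previous snippet.

```python
import torch.nn.functional as F

def masked_perceptual_loss(phi, img_render, img_gt, mask):
    # Same perceptual term as Eq. 1, masked here by the *predicted* silhouette.
    return sum(F.mse_loss(fr, fg)
               for fr, fg in zip(phi(img_render * mask), phi(img_gt * mask)))

def second_stage_loss(phi, first_stage_term, render_v1, gt_v1, sil_render_v1,
                      w_ps=1.0):  # weight is illustrative only
    # Eq. 5: keep the first-stage objective and add the same-view term of Eq. 4,
    # where both images are masked with the rendered silhouette S^r_v1.
    l_psv = masked_perceptual_loss(phi, render_v1, gt_v1, sil_render_v1)
    return first_stage_term + w_ps * l_psv
```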
Third Stage. After learning reliable mesh representations and high-quality texture predictions, we use a generative learning pipeline to improve the realism of our predictions. In this stage, we rely on our predictions to generate pseudo ground-truths to enable adversarial learning, which is explained in the next section.

3.3 Adversarial Learning
Training the model with an adversarial loss applied on the rendered images does not improve the results, due to the shortcomings of renderers [40]. Therefore, we convert texture learning into a 2D image synthesis task. Texture learning in UV space has previously been explored with successful results [8], [11], [40]. However, our set-up is different: we learn single-view image inference, the texture projection and
pseudo-ground-truth generation are online in our training, we do not train a GAN from scratch for texture generation but rather tune our 3D reconstruction network, and we propose multi-view training in our discriminator. While previous methods use different networks for texture projection and texture generation, we achieve both with the same architecture. This enables us to improve the generator further and achieve state-of-the-art results.
As shown in Fig. 3, during our training we obtain projected texture maps for the input view (v1) and a different view (v2). We obtain those by first predicting 3D meshes from an input view (v1) via our generator. The input images are projected onto the UV map of the predicted mesh template based on the camera parameters of each image via inverse rendering. In this process, the mesh predictions are transformed onto the 2D screen by projection with the camera parameters. Then the transformed mesh coordinates and UV map coordinates are used in the reverse way, and the real images are projected onto the UV map with the renderer. Visibility masks are also obtained in this set-up. With this set-up, we obtain a real partial texture (from v2) and a conditioning texture map (from v1) to train our discriminator. In this set-up, it is important for the mesh predictions to be accurate for correct inverse rendering. That is why we leave the GAN training to the third stage.
Discriminator. Finally, to provide adversarial feedback, we train a conditional discriminator. The discriminator is conditioned on the partial texture of view 1. The partial texture of view 2 is a real example, and the generated texture is a fake one. For the fake example, we mask the generated texture with the visibility mask of the real example to prevent a distribution mismatch.
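The following sketch shows how the real, fake, and conditioning pairs of Fig. 3 could be assembled from the generator's texture prediction and the textures and visibility masks obtained by inverse rendering. The `project_to_uv` callable is a placeholder for the renderer-based projection described above, so this is an assumed interface rather than the actual codebase.

```python
def build_discriminator_pairs(pred_texture, image_v1, image_v2, mesh,
                              cam_v1, cam_v2, project_to_uv):
    """Hypothetical helper returning (fake_pair, real_pair) for the conditional GAN.

    project_to_uv(image, mesh, camera) -> (partial_texture, visibility_mask),
    i.e. the inverse rendering of a real image onto the predicted mesh's UV map.
    """
    # Conditioning input: partial texture projected from the input view (v1).
    cond_tex_v1, vis_v1 = project_to_uv(image_v1, mesh, cam_v1)
    # Real example: partial texture projected from the second view (v2).
    real_tex_v2, vis_v2 = project_to_uv(image_v2, mesh, cam_v2)
    # Fake example: the generated texture, erased outside the v2 visibility mask
    # so fake and real share the same pattern of missing regions.
    fake_tex_v2 = pred_texture * vis_v2
    fake_pair = (fake_tex_v2, cond_tex_v1)
    real_pair = (real_tex_v2, cond_tex_v1)
    return fake_pair, real_pair
```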
In traditional image-to-image translation algorithms, the conditional input and target images are concatenated and fed to the discriminators. However, in our case the images are partially missing, and aligning them in the input via concatenation does not provide useful signals. Instead, we use a projection-based discriminator [35], where we process the conditioning input via convolutional layers and global pooling until the spatial dimension decreases to 1×1. Again, the reason to decrease the dimension is that the input is partially missing; therefore, we want a full receptive field of the input image while conditioning on the patches.
We take the dot product of the embedded conditional input and the discriminator features. This score is added to the final discriminator score. With the multi-view conditioning, the discriminator does not only consider whether the patch is realistic but also whether the predicted texture is consistent with its input pair. The GAN training is especially important in our set-up, since we do not have consistent multi-view images. The overall objective for the third stage includes the following min-max optimization:

\min_{\theta_g} \max_{\theta_d} \; L_{second} + \lambda_{adv} L_{adv}(\theta_g, \theta_d)    (6)

where $\theta_d$ and $\theta_g$ refer to the parameters of the discriminator and the generator, respectively.
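Below is a minimal sketch of such a projection-style conditional discriminator in the spirit of [35]. The channel widths, patch resolution, and the per-location form of the projection term are illustrative choices and do not reproduce the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class ProjectionConditionalD(nn.Module):
    """Sketch of a projection-based conditional discriminator: a patch score
    plus an inner product between the pooled condition embedding and the
    discriminator features. Channel sizes are illustrative."""

    def __init__(self, ch=64):
        super().__init__()
        def block(cin, cout, stride):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1), nn.LeakyReLU(0.2))
        # Main branch over the (visibility-masked) texture under scrutiny.
        self.features = nn.Sequential(
            block(3, ch, 2), block(ch, ch * 2, 2), block(ch * 2, ch * 4, 2))
        self.patch_score = nn.Conv2d(ch * 4, 1, 3, 1, 1)
        # Condition branch: encode the partial v1 texture and pool it to a vector.
        self.cond_embed = nn.Sequential(
            block(3, ch, 2), block(ch, ch * 2, 2), block(ch * 2, ch * 4, 2),
            nn.AdaptiveAvgPool2d(1))

    def forward(self, texture, cond_texture):
        f = self.features(texture)               # (B, C, H', W')
        score = self.patch_score(f)              # (B, 1, H', W') patch realism
        e = self.cond_embed(cond_texture)        # (B, C, 1, 1) condition embedding
        proj = (f * e).sum(dim=1, keepdim=True)  # inner product per location
        return score + proj                      # conditioned patch scores

# d = ProjectionConditionalD()
# out = d(torch.rand(2, 3, 256, 256), torch.rand(2, 3, 256, 256))
```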
4 EXPERIMENTS
Datasets. First, to generate the datasets, we use three category-specific StyleGAN models, one representing a rigid object class and two representing articulated classes. For the car and horse datasets, the official models from the StyleGAN2 repository ([19]) are used. These models are trained on the LSUN Car dataset with 5.7M images and the LSUN Horse dataset with 2M images [54]. We also use a model trained on a bird class on the NABirds dataset [47] with 48k images. The StyleGAN-generated images are aligned for a few viewpoints, and those views are annotated on one example, which takes 1 minute in total. Please refer to Zhang et al. [58] for more details on the dataset generation pipeline.
Architectural Details of the Generator. Our generator has an encoder-decoder architecture, as shown in Fig. 4. In the encoder, for predicting the deformation and texture maps, the encoder receives a 512 × 512 image. It has 7 convolutional blocks, each convolution layer with 3 × 3 filters and a stride of 2. The numbers of channels of the convolution layers are (64, 128, 256, 256, 256, 128, 128). There is a ReLU non-linearity after each convolution layer. The encoder decreases the spatial size to 4 × 4. The encoded features (128 × 4 × 4) go through a fully connected layer, and after reshaping, we output features with dimensions of 512 × 8 × 4. Note that we start with a width of 8 and a height of 4. While generating the texture and mesh predictions, we only predict half of the maps in the height dimension, since they are expected to be mostly symmetric in the y axis. Later, we flip the predictions and concatenate them with themselves to expand the height.
The texture and mesh predictions first go through two shared blocks of convolutional layers. In the decoder, each block has two convolutional layers. There is an adaptive normalization and a leaky ReLU after each convolutional layer. There is also a skip connection from the input to the output of each block. After each block, there is a bilinear interpolation layer to upsample the feature maps. The first two blocks have channels of (512, 256), which bring the feature maps to a spatial dimension of 32 × 16. After that, the mesh prediction branches out, and there is another block with a channel size of 64 for the mesh prediction branch. After that, there is a single convolutional layer with 3 channels. The output represents the deformation (x, y, z) coordinates.
For the texture prediction, there are 4 more convolution blocks with channel sizes of (256, 256, 128, 128). After these layers, the spatial resolution becomes 512 × 256. A reflection symmetry is applied, as we flip the texture predictions in the y axis and concatenate them with the original texture predictions. This results in a spatial resolution of 512 × 512. After that, there is one more convolution block with a channel size of 64 and a final convolution layer that decreases the number of channels to 3. These represent the (R, G, B) channels of the texture prediction. The convolutional layers after the symmetry operation relax the symmetry constraint, as we do not expect a perfect symmetry in the texture.
Note that the mesh predictions are also estimated in a convolutional way, as a UV deformation map [2], [8], [40]. Our deformation is a representation on the function of the sphere directly, with a fixed surface topology. We sample from the deformation map at the corresponding vertex locations. We also apply symmetry on the predicted UV deformation. We use DIB-R [6] as our differentiable renderer. The renderer takes the mesh and texture predictions and outputs images for target camera parameters.
Fig. 6: Given input images (1st column), we predict 3D shape and texture and render them into the same viewpoint and novel viewpoints for our model. We also show renderings of state-of-the-art models that are trained on real and synthetic images. Since the models use different camera parameters, we did not align the results. However, the results shown from similar viewpoints capture the behaviour of each model. Unicorn outputs similar 3D shapes for different inputs. PixelNeRF achieves high-quality same-view results, but its results are poor from novel views. DIB-R also suffers from the same issue. GanVerse does not output realistic details. Our model achieves significantly better results than the previous works while being trained on a synthetic (StyleGAN-generated) dataset.
TABLE 1: We report results for the same view (input and target share the same view) and a novel view (input and target have different views). We provide FID, LPIPS, MSE, SSIM, and 2D mIoU scores between predictions and GT. We compare with GanVerse since it is trained on the same dataset. We also provide results of each stage, showcasing the improvements from progressive training.
                               Same View                                Novel View
Method               FID⇓   LPIPS⇓  MSE⇓    SSIM⇑   IoU⇑      FID⇓   LPIPS⇓  MSE⇓    SSIM⇑   IoU⇑
Car
  GanVerse [58]      28.04  0.1238  0.0060  0.8683  0.92      29.59  0.1333  0.0075  0.8599  0.93
  Ours - Stage I     10.75  0.1011  0.0060  0.8695  0.94      11.98  0.1091  0.0074  0.8582  0.92
  Ours - Stage II     6.24  0.0737  0.0039  0.9027  0.94       9.05  0.1012  0.0072  0.8651  0.93
  Ours - Stage III    4.56  0.0696  0.0040  0.9039  0.94       6.92  0.0965  0.0076  0.8632  0.93
Bird
  GanVerse [58]      69.32  0.0742  0.0034  0.9230  0.82      63.89  0.0782  0.0037  0.9202  0.80
  Ours - Stage I     69.58  0.0763  0.0035  0.9222  0.82      63.76  0.0764  0.0036  0.9222  0.81
  Ours - Stage II    64.67  0.0720  0.0030  0.9258  0.82      60.97  0.0749  0.0036  0.9226  0.81
  Ours - Stage III   61.21  0.0689  0.0030  0.9231  0.83      59.08  0.0712  0.0035  0.9201  0.82
Horse
  GanVerse [58]      62.38  0.1272  0.0060  0.8852  0.78      83.82  0.1531  0.0100  0.8642  0.77
  Ours - Stage I     83.66  0.1395  0.0088  0.8727  0.78      76.64  0.1442  0.0100  0.8669  0.78
  Ours - Stage II    57.07  0.1037  0.0059  0.8971  0.79      67.77  0.1368  0.0101  0.8676  0.78
  Ours - Stage III   56.83  0.1024  0.0055  0.9017  0.79      67.30  0.1367  0.0099  0.8684  0.78
Fig. 7: Qualitative results of our final model on the Bird and Horse classes. Given input images (1st row), we predict 3D shape and texture and render them into the same viewpoint and novel viewpoints.
Fig. 9: Validation FID curve on the Car class with respect to the number of iterations. As shown, the improvements are not coming from longer training but from the progressive training.

Fig. 10: Qualitative results of our models from the first-, second-, and third-stage trainings on the Car class. At each stage, results improve significantly. For example, zooming into the tires in the first stage, we see duplicated features. The second stage mostly solves that problem, but the results are not as sharp as the third-stage results.

for each image. The additional deformation is penalized for each image for stable training. Since we rely on other-view training for the mesh prediction, we do not put such a constraint on the vertices. Their architecture is based on a U-Net, and they do not employ GAN training. Their results lack details and do not look as realistic compared to our model. Note that the StyleGAN-generated dataset also removes the expensive labeling effort, and models that train on this dataset have the same motivation as models that learn without annotations, such as Unicorn. Generating the StyleGAN dataset with annotations requires 1 minute, whereas the Pascal3D+ dataset requires 200-350 hours of work time for the annotations [58]. Therefore, our comparisons provided in Fig. 6 cover models learned from a broad range of dataset set-ups with different levels of annotation effort. They include unsupervised training data (Unicorn with Pascal3D+ images), StyleGAN-generated data which adds a minute-long annotation effort (our method), much more expensive data with real images and keypoint annotations (DIB-R on labeled Pascal3D+ images), and synthetic data (PixelNeRF with the ShapeNet dataset) with perfect annotations.
Lastly, we show the final results of our method on the bird and horse classes in Fig. 7. 3D models of these categories without textures are also shown in Fig. 8. Our method achieves realistic 3D predictions for these classes as well, even though the StyleGAN-generated datasets have inconsistencies across views.
Ablation Study. First, we analyze the role of each stage in our training pipeline. As given in Table 1, additional training at each stage improves the metrics consistently, especially FID and LPIPS, the metrics that are shown to closely correlate with human perception. In Fig. 9, we provide the training curves of each stage. As can be seen from the figure, the improvements are not coming from longer training but from the progressive learning. We provide qualitative comparisons of each stage's output renderings in Fig. 10. The first-stage renderings have reliable geometry. However, the texture is not realistic; in particular, the tires have duplicated features, which is understandable given that the model is minimizing the reconstruction loss from inconsistent multi-view images. In the second stage, the texture improves significantly over the first stage. Finally, with GAN training in the last stage, the colors look more realistic and sharp, and fine details are generated.
We provide an additional ablation study in Table 2, conducted on the Car dataset. We also provide our per-stage scores in the first block to compare the results easily. In the second block, we first experiment with no multi-stage training (No Multi-Stage). This set-up refers to training the model from scratch with the final proposed loss function. This training results in poor scores on all metrics, even worse than our first-stage training results. In particular, adversarial training makes training less stable when the model has not yet learned reliable predictions. It shows the importance of our multi-stage training pipeline.
Next, we train with same-view training objectives (Same-View Training). This set-up is used when multi-view images are not available and models have to be trained on single-view images. We use a perceptual objective similar to Eq. 4 but with the ground-truth silhouettes, as given in Eq. 7. We also add the silhouette loss from the same view to guide the geometry predictions. Here the input and target images share the same camera parameters.

L_{p-sv-sil} = \|\Phi_j(I^g_{v1} \ast S^g_{v1}) - \Phi_j(I^r_{v1} \ast S^g_{v1})\|    (7)

L_{sil-sv} = 1 - \frac{\|S^g_{v1} \odot S^r_{v1}\|_1}{\|S^g_{v1} + S^r_{v1} - S^g_{v1} \odot S^r_{v1}\|_1}    (8)

L_{sv} = \lambda_{pn} L_{p-sv-sil} + \lambda_{s} L_{sil-sv} + \lambda_{lap} L_{lap}    (9)
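For completeness, a compact sketch of this same-view baseline objective (Eqs. 7-9); as before, `phi` and the loss weights are illustrative placeholders rather than the exact settings used in the ablation.

```python
import torch.nn.functional as F

def same_view_baseline_loss(phi, render_v1, gt_v1, sil_render_v1, sil_gt_v1,
                            laplacian, w_pn=1.0, w_s=1.0, w_lap=0.1, eps=1e-6):
    # Eq. 7: perceptual term masked by the *ground-truth* silhouette of the input view.
    l_p = sum(F.mse_loss(fr, fg)
              for fr, fg in zip(phi(render_v1 * sil_gt_v1), phi(gt_v1 * sil_gt_v1)))
    # Eq. 8: silhouette IoU loss on the same (input) view.
    inter = (sil_gt_v1 * sil_render_v1).sum(dim=(1, 2, 3))
    union = (sil_gt_v1 + sil_render_v1 - sil_gt_v1 * sil_render_v1).sum(dim=(1, 2, 3))
    l_s = (1.0 - inter / (union + eps)).mean()
    # Eq. 9: combined single-view objective used in the ablation.
    return w_pn * l_p + w_s * l_s + w_lap * laplacian
```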
The results of the same-view training (the model trained with the objective of Eq. 9) are given in Table 2. With this set-up, the same-view results look good quantitatively, because the network learns to reconstruct the input view. Especially IoU
TABLE 2: Ablation study showing results without the multi-stage pipeline, without the multi-view discriminator, and without the Gaussian sampling in the generator, on the Car dataset. Results are given for same and novel views and contain FID, LPIPS, MSE, SSIM, and 2D mIoU scores between predictions and GT.
                               Same View                                Novel View
Method                FID⇓   LPIPS⇓  MSE⇓    SSIM⇑   IoU⇑      FID⇓   LPIPS⇓  MSE⇓    SSIM⇑   IoU⇑
Ours - Stage I        10.75  0.1011  0.0060  0.8695  0.94      11.98  0.1091  0.0074  0.8582  0.92
Ours - Stage II        6.24  0.0737  0.0039  0.9027  0.94       9.05  0.1012  0.0072  0.8651  0.93
Ours - Stage III       4.56  0.0696  0.0040  0.9039  0.94       6.92  0.0965  0.0076  0.8632  0.93
No Multi-Stage        66.59  0.1424  0.0118  0.8276  0.91      66.71  0.1485  0.0136  0.8200  0.90
Same-View Training     4.88  0.0759  0.0037  0.9034  0.96      42.24  0.1256  0.0105  0.8469  0.88
Multi-View Training    6.60  0.0741  0.0039  0.9026  0.94      10.00  0.1017  0.0071  0.8657  0.94
No Multi-View disc.    4.65  0.0702  0.0039  0.9042  0.94       7.02  0.0965  0.0076  0.8633  0.93
No Gaussian samp.      4.67  0.0696  0.0039  0.9043  0.94       7.05  0.0986  0.0075  0.8625  0.93
REFERENCES
[2] A. Bhattad, A. Dundar, G. Liu, A. Tao, and B. Catanzaro. View generalization for single image textured 3d models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6081–6090, June 2021.
[3] E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. De Mello, O. Gallo, L. J. Guibas, J. Tremblay, S. Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16123–16133, 2022.
[4] E. R. Chan, M. Monteiro, P. Kellnhofer, J. Wu, and G. Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5799–5809, 2021.
[5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[6] W. Chen, H. Ling, J. Gao, E. Smith, J. Lehtinen, A. Jacobson, and S. Fidler. Learning to predict 3d objects with an interpolation-based differentiable renderer. In Advances in Neural Information Processing Systems, pages 9609–9619, 2019.
[7] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European Conference on Computer Vision, pages 628–644. Springer, 2016.
[8] A. Dundar, J. Gao, A. Tao, and B. Catanzaro. Fine detailed texture learning for 3d meshes with generative models. arXiv preprint arXiv:2203.09362, 2022.
[9] A. Dundar, K. Sapra, G. Liu, A. Tao, and B. Catanzaro. Panoptic-based image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8070–8079, 2020.
[10] J. Gao, T. Shen, Z. Wang, W. Chen, K. Yin, D. Li, O. Litany, Z. Gojcic, and S. Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. arXiv preprint arXiv:2209.11163, 2022.
[11] B. Gecer, S. Ploumpis, I. Kotsia, and S. Zafeiriou. Ganfit: Generative adversarial network fitting for high fidelity 3d face reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1155–1164, 2019.
[12] S. Goel, A. Kanazawa, and J. Malik. Shape and viewpoint without keypoints. arXiv preprint arXiv:2007.10982, 2020.
[13] J. Gu, L. Liu, P. Wang, and C. Theobalt. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985, 2021.
[14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[15] P. Henderson, V. Tsiminaki, and C. H. Lampert. Leveraging 2d data to learn textured 3d mesh generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7498–7507, 2020.
[16] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[17] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In Proceedings of the European Conference on Computer Vision (ECCV), pages 371–386, 2018.
[18] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 33:12104–12114, 2020.
[19] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[20] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
[21] H. Kato, Y. Ushiku, and T. Harada. Neural 3d mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3907–3916, 2018.
[22] J. Ko, K. Cho, D. Choi, K. Ryoo, and S. Kim. 3d gan inversion with pose optimization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2967–2976, 2023.
[23] A. Lattas, S. Moschoglou, B. Gecer, S. Ploumpis, V. Triantafyllou, A. Ghosh, and S. Zafeiriou. Avatarme: Realistically renderable 3d facial reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 760–769, 2020.
[24] X. Li, S. Liu, K. Kim, S. De Mello, V. Jampani, M.-H. Yang, and J. Kautz. Self-supervised single-view 3d reconstruction via semantic consistency. arXiv preprint arXiv:2003.06473, 2020.
[25] C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y. Lin. Magic3d: High-resolution text-to-3d content creation. arXiv preprint arXiv:2211.10440, 2022.
[26] F. Liu and X. Liu. 2d gans meet unsupervised single-view 3d reconstruction. arXiv preprint arXiv:2207.10183, 2022.
[27] G. Liu, A. Dundar, K. J. Shih, T.-C. Wang, F. A. Reda, K. Sapra, Z. Yu, X. Yang, A. Tao, and B. Catanzaro. Partial convolution for padding, inpainting, and image synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[28] S. Liu, T. Li, W. Chen, and H. Li. Soft rasterizer: A differentiable renderer for image-based 3d reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pages 7708–7717, 2019.
[29] Y. Liu, Z. Shu, Y. Li, Z. Lin, R. Zhang, and S. Kung. 3d-fm gan: Towards 3d-controllable face manipulation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XV, pages 107–125. Springer, 2022.
[30] M. M. Loper and M. J. Black. Opendr: An approximate differentiable renderer. In European Conference on Computer Vision, pages 154–169. Springer, 2014.
[31] R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7210–7219, 2021.
[32] G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or. Latent-nerf for shape-guided generation of 3d shapes and textures. arXiv preprint arXiv:2211.07600, 2022.
[33] O. Michel, R. Bar-On, R. Liu, S. Benaim, and R. Hanocka. Text2mesh: Text-driven neural stylization for meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13492–13502, 2022.
[34] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pages 405–421. Springer, 2020.
[35] T. Miyato and M. Koyama. cgans with projection discriminator. In International Conference on Learning Representations, 2018.
[36] T. Monnier, M. Fisher, A. A. Efros, and M. Aubry. Share with thy neighbors: Single-view reconstruction by cross-instance consistency. arXiv preprint arXiv:2204.10310, 2022.
[37] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y.-L. Yang. Hologan: Unsupervised learning of 3d representations from natural images. In Proceedings of the IEEE International Conference on Computer Vision, pages 7588–7597, 2019.
[38] M. Niemeyer and A. Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11453–11464, 2021.
[39] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
[40] D. Pavllo, G. Spinks, T. Hofmann, M.-F. Moens, and A. Lucchi. Convolutional generation of textured 3d meshes. arXiv preprint arXiv:2006.07660, 2020.
[41] H. Pehlivan, Y. Dalva, and A. Dundar. Styleres: Transforming the residuals for real image editing with stylegan. arXiv preprint arXiv:2212.14359, 2022.
[42] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
[43] N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W.-Y. Lo, J. Johnson, and G. Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv preprint arXiv:2007.08501, 2020.
[44] N. Tritrong, P. Rewatbowornwong, and S. Suwajanakorn. Repurposing gans for one-shot semantic part segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4475–4485, 2021.
[45] S. Tulsiani, A. A. Efros, and J. Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2897–2905, 2018.