Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision

Ayush Tewari*1  Tianwei Yin*1  George Cazenavette1  Semon Rezchikov4
Joshua B. Tenenbaum1,2,3  Frédo Durand1  William T. Freeman1  Vincent Sitzmann1

1 MIT CSAIL  2 MIT BCS  3 MIT CBMM  4 Princeton IAS
Abstract
1 Introduction
Consider the problem of reconstructing a 3D scene from a single picture. Because of the 3D-to-2D projection, occlusion, and a limited field of view, much of the 3D scene is unobserved, and infinitely many 3D scenes could have produced the image. Given the ill-posedness of this problem, a reconstruction algorithm should be able to sample from the distribution over all plausible 3D scenes that are consistent with the 2D image, generating the unseen parts in plausible ways. Previous data-completion methods, such as in-painting in 2D images, are trained on large sets of ground-truth output images along with their incomplete (input) counterparts. Such techniques do not easily extend to 3D scene completion, since curating a large dataset of ground-truth 3D scene representations is very challenging.
This 3D scene completion problem, known as inverse graphics, is just one instance of a broad
class of problems often referred to as Stochastic Inverse Problems, which arise across scientific
disciplines whenever we capture partial observations of the world through a sensor. In this paper, we
introduce a diffusion-based framework that can tackle this problem class, enabling us to sample from
a distribution of signals that are consistent with a set of partial observations that are generated from
the signal by a non-invertible, generally nonlinear, forward model. For instance, in inverse graphics,
we learn to sample 3D scenes given an image, yet never observe paired observations of images and
3D scenes at training time, nor observe 3D scenes directly.
* Equal contribution. Project page: diffusion-with-forward-models.github.io
Our contributions are as follows:
1. We propose a new method that integrates differentiable forward models with conditional diffusion models, replacing prior two-step approaches with a conditional generative model trained end-to-end.
2. We apply our framework to build the first conditional diffusion model that learns to sample from
the distribution of 3D scenes trained only on 2D images. In contrast to prior work, we directly
learn image-conditional 3D radiance field generation, instead of sampling from the distribution
of novel views conditioned on a context view. Our treatment of inverse graphics exceeds a mere
application of the proposed framework, contributing a novel, 3D-structured denoising step that
leverages differentiable rendering both for conditioning and for the differentiable forward model.
3. We formally prove that under natural assumptions, as the number of observations of each signal
in the training set goes to infinity, the proposed model maximizes not only the likelihood of
observations, but also the likelihood of the unobserved signals.
4. We demonstrate the efficacy of our model for two more downstream tasks with structured forward
models: single-image motion prediction, where the forward model is a warping operation, and
GAN inversion, where the forward model is a pretrained StyleGAN [8] generator.
2 Method
Consider observations (O_ij, ϕ_ij) that are generated from underlying signals S_j according to a known forward model forward(), i.e., O_ij = forward(S_j, ϕ_ij), where ϕ_ij are the parameters of the forward model corresponding to observation O_ij. Each observation can be partial: given a single observation, there is an infinite number of signals that could have generated it. However, we assume that given a hypothetical set of all possible observations, the signal is fully determined. In the case of inverse graphics, O_ij are image observations of 3D scenes S_j and ϕ_ij are the camera parameters, where we index scenes with j and observations of the j-th scene with i; forward() is the rendering function. Note that if we were to capture every possible image of a 3D scene, the 3D scene would be uniquely determined, but given a single image, there are infinitely many 3D scenes that could have generated that image, both because rendering is a projection from 3D to 2D and because a single image only constrains the visible part of the 3D scene. We will drop the subscript j in the following, leaving it implied that we always consider many observations generated from many signals. Fig. 1 provides an illustration of the data.

Figure 1: Overview of our proposed method. (a) We assume a dataset of tuples of observations (O, ϕ)_i, generated from unobserved signals S via a differentiable forward model. (b) We propose to integrate the forward model directly into the denoising step of a diffusion model: given a pair of observations of the same signal, we designate a context O^ctxt and a target O^trgt. We add noise to O^trgt, then feed (O^ctxt, ϕ^ctxt, O^trgt_t, ϕ^trgt) to a neural network denoise to estimate the signal S_{t−1}. We then apply the forward model to obtain an estimate of the clean target observation, Ô^trgt_{t−1}. (c) The graphical model of the diffusion process.
We are now interested in training a model that, at test time, allows us to sample from the distribution
of signals that are consistent with a previously unseen observation O. Formally, we aim to model the
conditional distribution p(S|O, ϕ). We make the following assumptions:
• We have access to a differentiable implementation of forward().
• We have access to a large dataset of observations and corresponding parameters of the forward model, {(O_i, ϕ_i)}_{i=1}^N.
• In our training set, we have access to several observations per signal.
Crucially, we do not assume that we have direct access to the underlying signal that gave rise to a
particular observation, i.e., we do not assume access to tuples of (O, ϕ, S). Further, we also do not
assume that we have access to any prior distribution over the signal of interest, i.e., we never observe
a dataset of signals of the form {Sj }j , and thus cannot train a generative model to sample from an
unconditional distribution over signals.
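To make this setup concrete, the following is a minimal PyTorch-style sketch of the assumed training data and forward-model interface; all names are hypothetical and the snippet is illustrative rather than a description of our released code.

```python
# Hypothetical sketch of the training data: each signal j contributes several observations
# (O_ij, phi_ij); the signal S_j itself is never stored or observed.
from dataclasses import dataclass
from typing import Callable, List

import torch


@dataclass
class Observation:
    O: torch.Tensor    # e.g., an image of shape (3, H, W)
    phi: torch.Tensor  # forward-model parameters, e.g., camera pose, warp magnitude, patch coords


@dataclass
class SignalObservations:
    observations: List[Observation]  # >= 2 observations of the same (unknown) signal


# The forward model is known and differentiable, e.g., render / warp / synthesize:
ForwardModel = Callable[[torch.Tensor, torch.Tensor], torch.Tensor]  # forward(S, phi) -> O


def sample_context_target(group: SignalObservations):
    """Pick a random (context, target) pair of observations of the same signal."""
    i, j = torch.randperm(len(group.observations))[:2].tolist()
    return group.observations[i], group.observations[j]
```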
Recent advances in deep-learning-based generative modeling have seen the emergence of denoising
diffusion models as powerful generative models that can be trained to generate highly diverse samples
from complex, multi-modal distributions. We are thus motivated to leverage denoising diffusion
probabilistic models to model p(S|O, ϕ). However, existing approaches cannot be trained if we do
not have access to signals S. In the following, we give background on denoising diffusion models
and discuss this limitation.
Denoising diffusion probabilistic models are a class of generative models that learn to sample from a distribution by learning to iteratively denoise samples. Consider the problem of modeling the distribution p_θ(x) over samples x. A forward Markovian process q(x_{0:T}) adds noise to the data as

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I).   (1)

Here, β_t, t ∈ {1, ..., T} are hyperparameters that control the variance schedule. A denoising diffusion model learns the reverse process, where samples from a distribution p(x_T) = N(0, I) are transformed incrementally onto the data manifold as p_θ(x_{0:T}) = p(x_T) ∏_{t=1}^T p_θ(x_{t−1} | x_t), where

p_θ(x_{t−1} | x_t) = N(x_{t−1}; µ(x_t, t), Σ(x_t, t)).   (2)
A neural network denoise_θ() with learnable parameters θ learns to reverse the diffusion process. It is also possible to model conditional distributions p_θ(x_{0:T} | c), where the output is computed as denoise_θ(x_t, t, c). The forward process does not change in this case; in practice, we merely add the conditioning signal as input to the denoising model.
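For reference, here is a minimal sketch of the standard conditional DDPM training objective in the ε-prediction parameterization (the schedule values are illustrative). Note that it requires clean samples x_0, which is exactly what is unavailable for the signals S, as discussed next.

```python
# Minimal sketch of a standard conditional DDPM objective (illustrative schedule values).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # variance schedule beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t = prod_s (1 - beta_s)


def q_sample(x0, t, noise):
    """Forward process in closed form: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) * eps."""
    ab = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise


def ddpm_loss(denoise_net, x0, cond):
    """Train denoise_theta(x_t, t, c) to predict the noise added to a *clean* sample x_0."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)
    return torch.nn.functional.mse_loss(denoise_net(x_t, t, cond), eps)
```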
Unfortunately, we cannot train existing denoising diffusion models to sample from p(S | O, ϕ), or, in fact, even from an unconditional distribution p(S). This would require computing the Markovian forward process in Eq. 1. However, recall that we do not have access to any signals {S_j}_j; we thus cannot add noise to signals in order to train a denoising neural network. In other words, since no S is directly observed, we cannot compute q(S_t | S_{t−1}).
We now introduce a class of denoising diffusion models that we train to directly model the distribution
p(S | Octxt ; ϕctxt ) over signals S given an observation (Octxt , ϕctxt ). Our key contribution is to
directly integrate the differentiable forward model forward() into the iterative conditional denoising
process. This enables us to add noise to and denoise the observations, while nevertheless sampling
the underlying signal that generates that observation.
Our model is trained on pairs of “context” and “target” observations of the same signal, denoted O^ctxt and O^trgt. As in conventional diffusion models, the forward process is q(O^trgt_t | O^trgt_{t−1}) = N(O^trgt_t; √(1 − β_t) O^trgt_{t−1}, β_t I). In the reverse process, we similarly denoise O^trgt conditional on O^ctxt:

p_θ(O^trgt_{0:T} | O^ctxt; ϕ^ctxt, ϕ^trgt) = p(O^trgt_T) ∏_{t=1}^T p_θ(O^trgt_{t−1} | O^trgt_t, O^ctxt; ϕ^ctxt, ϕ^trgt),   (3)
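The following PyTorch-style sketch summarizes one training step of this scheme. It is a simplified illustration rather than our exact implementation: it shows only a reconstruction term in the spirit of L^trgt (the full losses of Eqs. 8 and 9 also supervise renderings from novel poses) and predicts the clean signal directly rather than the noise.

```python
# Simplified sketch of one training step: noise the target *observation*, estimate the
# underlying *signal*, and map the estimate back to observation space via the known
# differentiable forward model for supervision.
import torch


def training_step(denoise_net, forward_model, O_ctxt, phi_ctxt, O_trgt, phi_trgt, alphas_bar):
    B = O_trgt.shape[0]
    t = torch.randint(0, len(alphas_bar), (B,))
    ab = alphas_bar[t].view(B, *([1] * (O_trgt.dim() - 1)))
    eps = torch.randn_like(O_trgt)
    O_trgt_t = ab.sqrt() * O_trgt + (1.0 - ab).sqrt() * eps  # noisy target observation

    # The denoising network sees the clean context and the noisy target, and estimates
    # the clean signal; the signal itself is never observed during training.
    S_est = denoise_net(O_ctxt, phi_ctxt, O_trgt_t, phi_trgt, t)

    # The known forward model maps the signal estimate back to observation space,
    # where it can be compared against the clean target observation.
    O_hat = forward_model(S_est, phi_trgt)
    return torch.nn.functional.mse_loss(O_hat, O_trgt)
```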
Figure 2: Overview of the denoising procedure for inverse graphics: (1) render a deterministic conditioning input from the context view; (2) generate the 3D scene and target view using the encoders enc_{t=0} and enc_t, an MLP-parameterized radiance field, and volume rendering from the target pose.
Proposition 1. Suppose that any signal S can be reconstructed from the set of all possible observations of S. Under this assumption, if in the limit as the number of known observations per signal goes to infinity, there are parameters θ such that L^trgt_θ + L^novel_θ is minimized, then the conditional probability distribution over signals discovered by our model, p(S | O^ctxt; ϕ^ctxt), agrees with the true distribution p_true(S | O^ctxt; ϕ^ctxt).
The proof follows by showing that our losses implicitly minimize a diffusion model loss over total
observations, which are collections of all possible observations of our signal. As such, when the
observations suffice to completely reconstruct the signal, the correctness of the estimated distribution
over total observations forces the estimated distribution over signals to be correct, as well.
4 Applications
We now apply our framework to three stochastic inverse problems. We focus on applications in
computer vision, where we tackle the problems of inverse graphics, single-image motion prediction,
and GAN inversion. For each application, we give a detailed description of the forward model, the
dataset and baselines, as well as a brief description of prior work.
Figure 3: Sample diversity. We illustrate different 3D scenes sampled from the same context image for the RealEstate10k and Co3D datasets. Unlike deterministic methods such as pixelNeRF [9], our method generates diverse and distinct 3D scenes that all align with the context image. Co3D results are generated using autoregressive sampling, where a 360-degree trajectory can be generated by iteratively sampling target images. Note the photorealism and diversity of the generated structures for the indoor scene, such as doors and cabinets. Also note the high-fidelity geometry of the occluded parts of the hydrant and the diverse background appearance.
4.1 Inverse Graphics

We seek to learn a model that, given a single image of a 3D scene, enables us to sample from the distribution over 3D scenes that are consistent with the observation. We expect 3D regions visible in the image to be reconstructed faithfully, while unobserved parts are generated plausibly; every time we sample, we expect a different plausible 3D scene. Signals S are 3D scenes, and observations are 2D images O together with their camera parameters ϕ. At training time, we assume access to at least two image observations and their camera parameters per scene, so that we can assemble tuples (O^ctxt, ϕ^ctxt, O^trgt, ϕ^trgt) with 2D images O^ctxt, O^trgt and camera parameters ϕ^ctxt, ϕ^trgt.
Scope. We note that our treatment of inverse graphics exceeds a mere application of the presented framework: we not only integrate the differentiable rendering forward function, but further propose a novel 3D-structured denoise function. Together, these enable state-of-the-art conditional generation of complex, real-world 3D scenes.
Related Work. Few-shot reconstruction of 3D scene representations via differentiable rendering was
pioneered by deterministic methods [9, 31–41] that blur regions of the 3D scene unobserved
in the context observations. Probabilistic methods have been proposed that can sample from the
distribution of novel views trained only on images [4, 5, 42–45]. While results are impressive, these
methods do not allow sampling from the distribution of 3D scenes, but only from the distribution
of novel views. Generations are not multi-view consistent. Obtaining a 3D scene requires costly
post-processing via score distillation [6]. Several approaches [2, 3] use a two-stage design: they first
reconstruct a dataset of 3D scenes, and then train a 3D diffusion model. However, pre-computing large
3D datasets is expensive. Further, to obtain high-quality results, dense observations are required per
scene. RenderDiffusion [7] and HoloDiffusion [20] integrate differentiable forward rendering with
an unconditional diffusion model, enabling unconditional sampling of simple, single-object scenes.
Similar to us, RenderDiffusion performs denoising in the image space, while HoloDiffusion uses a
3D denoising architecture. Other methods use priors learned by text-conditioned image diffusion
models to optimize 3D scenes [46–48]. Here, the generative model does not have explicit knowledge
about the 3D information of scenes. These methods often suffer from geometric artifacts.
Structure of S and forward model render. We can afford only an abridged discussion here; please see the supplement for a more detailed description. We use NeRF [49] as the parameterization of 3D scenes, such that S is a function that maps a 3D coordinate p to a color c and density σ, S(p) = (σ, c). We require a generalizable NeRF that is predicted in a feed-forward pass by an encoder that takes a set of M context images and corresponding camera poses {(O_i, ϕ_i)}_{i=1}^M as input.
Figure 4: Qualitative comparison for the inverse graphics application. We benchmark against SparseFusion [5] and the deterministic pixelNeRF [9]. SparseFusion samples 2D novel views conditioned on a deterministic rendering (Diffusion Out.), and generates multi-view-consistent 3D scenes only after score distillation. Our method consistently generates higher-quality scenes, while directly sampling 3D scenes.
We base our model on pixelNeRF [9]. pixelNeRF first extracts image features {F_i}_i from each context observation via an encoder enc as F_i = enc(O_i). Given a 3D point p, it obtains the pixel coordinates p^pix_i = π(p, ϕ_i) in each context view via the projection operator π, and recovers a corresponding feature f_i = F_i(p^pix_i) by sampling the feature map at pixel coordinate p^pix_i. It then parameterizes S via an MLP as

S(p) = (σ(p), c(p)) = MLP({f_i ⊕ p_i}_{i=1}^M),   (10)

where ⊕ denotes concatenation and p_i is the 3D point p transformed into the camera coordinates of observation i. The number of context images M is flexible, and we may condition S on a single or several observations. It will be convenient to refer to a pixelNeRF that is reconstructed from context and target observations (O^ctxt, ϕ^ctxt) and (O^trgt, ϕ^trgt) as

S(· | enc(O^ctxt), enc(O^trgt)),   (11)

where we make the pixelNeRF encoder enc explicit and drop the poses ϕ^trgt and ϕ^ctxt. We leverage differentiable volume rendering [49] as the forward model, such that

O = render(S, ϕ),   (12)

where S is rendered from a camera with parameters ϕ.
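As an illustration, here is a simplified sketch of the pixelNeRF-style scene function of Eq. 10 together with the projection-and-sampling step; the shapes, the standard pinhole projection, and the concatenation of a fixed number of views are simplifying assumptions (the actual model aggregates over a variable number of views). Volume rendering of the resulting field then implements the forward model of Eq. 12.

```python
# Simplified sketch of the pixel-aligned conditioning of Eq. 10 (hypothetical shapes).
import torch
import torch.nn.functional as F


def scene_function(points_world, feat_maps, cam_poses, intrinsics, mlp):
    """
    points_world: (P, 3) 3D query points
    feat_maps:    (M, C, H, W) feature maps F_i = enc(O_i), one per context view
    cam_poses:    (M, 4, 4) world-to-camera transforms
    intrinsics:   (M, 3, 3) pinhole intrinsics
    mlp:          maps the concatenated per-view (feature ⊕ camera-space point) to (sigma, rgb)
    """
    M, _, H, W = feat_maps.shape
    per_view = []
    for i in range(M):
        # Transform query points into the camera coordinates of view i (p_i in Eq. 10).
        p_h = F.pad(points_world, (0, 1), value=1.0)            # (P, 4), homogeneous
        p_cam = (cam_poses[i] @ p_h.T).T[:, :3]                 # (P, 3)
        # Project to pixel coordinates via the operator pi, then normalize to [-1, 1].
        p_proj = (intrinsics[i] @ p_cam.T).T
        p_pix = p_proj[:, :2] / p_proj[:, 2:3].clamp(min=1e-6)
        grid = torch.stack([2 * p_pix[:, 0] / (W - 1) - 1,
                            2 * p_pix[:, 1] / (H - 1) - 1], dim=-1)
        # Sample the pixel-aligned feature f_i = F_i(p_pix) by bilinear interpolation.
        f_i = F.grid_sample(feat_maps[i:i + 1], grid.view(1, -1, 1, 2),
                            align_corners=True)[0, :, :, 0].T   # (P, C)
        per_view.append(torch.cat([f_i, p_cam], dim=-1))        # f_i ⊕ p_i
    sigma_rgb = mlp(torch.cat(per_view, dim=-1))                # (P, 1 + 3)
    return sigma_rgb[:, :1], sigma_rgb[:, 1:]                   # sigma(p), c(p)
```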
Implementation of denoise. Fig. 2 gives an overview of the denoising procedure. Following our framework, we obtain the denoised target observation Ô^trgt_{t−1} as

Ô^trgt_{t−1} = render(S_{t−1}, ϕ^trgt),   (13)
S_{t−1} = S(· | enc_{t=0}(O^ctxt), enc_t(O^trgt_t)),   (14)

where the image encoder enc_t is now conditioned on the timestep t. In other words, we generate the target view Ô^trgt_{t−1} by rendering the pixelNeRF conditioned on the context and the noisy target observations. However, feeding the noisy O^trgt_t directly to pixelNeRF is insufficient: the pixel-aligned features enc_t(O) are obtained from each view separately, so the features generated by enc_t(O^trgt_t) will be uninformative. To successfully generate a 3D scene, we have to augment O^trgt_t with information from the context view. We propose to generate conditioning information for O^trgt_t by rendering a deterministic estimate O^trgt_det = render(S(· | enc_{t=0}(O^ctxt)), ϕ^trgt), i.e., we condition pixelNeRF only on the context view and render an estimate of the target view via volume rendering.
However, in the extreme case of a completely uncertain target view, this results in a completely blurry image. We thus propose to additionally render high-dimensional features: recall that for any 3D point p, we have (σ(p), c(p)) = MLP_t(p). We modify MLP_t to also output a high-dimensional feature and render a deterministic feature map to augment O^trgt_t (only RGB is shown in the figure). We generate the final 3D scene as S_{t−1} = S(· | enc_{t=0}(O^ctxt), enc_t(O^trgt_det ⊕ O^trgt_t)), and the final denoised target view is then obtained by rendering according to Eq. 13 above.

Figure 5: Qualitative results for Single-Image Motion Prediction (left) and GAN Inversion (right).
Loss and Training. Our loss consists of simple least-squares terms on re-rendered views, identical to
the general loss terms presented in Eqs. 8 and 9, in addition to regularizers that penalize degenerate
3D scenes. We discuss these regularizers, as well as training details, in the supplement.
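In summary, the 3D-structured denoising step of Eqs. 13 and 14 can be sketched as follows; the encoder, scene-builder, and renderer names are hypothetical placeholders, not our released modules.

```python
# Sketch of the 3D-structured denoising step (hypothetical module names).
import torch


def denoise_3d(enc_clean, enc_t, build_scene, render, O_ctxt, phi_ctxt, O_trgt_t, phi_trgt, t):
    """
    enc_clean:   encoder enc_{t=0}, applied to clean images (context, deterministic render)
    enc_t:       timestep-conditioned encoder enc_t, applied to the augmented noisy target
    build_scene: assembles the pixelNeRF-style scene S(. | features) from pixel-aligned features
    render:      differentiable volume renderer, render(S, phi) -> RGB (+ feature) image
    """
    # (1) Condition a pixelNeRF on the context view only and render a deterministic
    #     estimate (RGB and high-dimensional features) of the target view.
    S_ctxt_only = build_scene([enc_clean(O_ctxt)], [phi_ctxt])
    O_trgt_det = render(S_ctxt_only, phi_trgt)

    # (2) Augment the noisy target with this conditioning information before encoding it.
    trgt_feats = enc_t(torch.cat([O_trgt_det, O_trgt_t], dim=1), t)

    # (3) Build the final scene S_{t-1} from context and augmented noisy-target features,
    #     then render the denoised target view (Eq. 13).
    S_tm1 = build_scene([enc_clean(O_ctxt), trgt_feats], [phi_ctxt, phi_trgt])
    O_hat_trgt = render(S_tm1, phi_trgt)
    return S_tm1, O_hat_trgt
```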
4.1.1 Results
Datasets. We evaluate on two challenging real-world datasets. We use Co3D hydrants [50] to evaluate our method on object-centric scenes. For scene-level 3D synthesis, we use the challenging RealEstate10k dataset [51], consisting of indoor and outdoor videos of scenes.
Baselines. We compare our approach with state-of-the-art deterministic and probabilistic approaches to 3D scene completion. We use pixelNeRF as the representative deterministic method that takes a single image as input and deterministically reconstructs a 3D scene. Our method is the first to probabilistically reconstruct 3D scenes in an end-to-end manner. Nevertheless, we compare with the concurrent SparseFusion [5], which learns an image-space generative model over novel views of a 3D scene. Score distillation of this generative model is required every time we want to obtain a multi-view-consistent 3D scene, which is costly.
Qualitative Results. In Fig. 3, we show multiple 3D scenes sampled from a single monocular image. For the indoor scenes of RealEstate10k, there are large regions of uncertainty; we can sample from the distribution of valid 3D scenes, resulting in significantly different 3D scenes with plausible geometry and colors. For the object-centric Co3D scenes, the objects are faithfully reconstructed and the uncertainty in the scene is captured. We can sample larger 3D scenes and render longer trajectories by autoregressive sampling, i.e., we treat intermediate diffused images as additional context observations to sample another target observation (sketched below). The Co3D results in Fig. 3 were generated autoregressively for a complete 360-degree trajectory. In Fig. 4, we compare our results with pixelNeRF [9] and SparseFusion [5]. pixelNeRF is a deterministic method and thus produces very blurry results in uncertain regions. SparseFusion reconstructs scenes by score distillation over a 2D generative model; this optimization is very expensive and does not lead to natural-looking results.
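The autoregressive sampling loop can be summarized by the short sketch below, where sample_target_view is a hypothetical wrapper around the full reverse diffusion process for one target pose.

```python
# Sketch of autoregressive sampling: each sampled view becomes additional context.
def sample_trajectory(sample_target_view, O_ctxt0, phi_ctxt0, target_poses):
    """sample_target_view(contexts, phi) returns a sampled target image for pose phi,
    given a list of (image, pose) context observations."""
    contexts = [(O_ctxt0, phi_ctxt0)]
    trajectory = []
    for phi in target_poses:
        O_new = sample_target_view(contexts, phi)
        trajectory.append(O_new)
        contexts.append((O_new, phi))  # treat the sampled view as an additional context
    return trajectory
```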
Quantitative Results. For the object-centric Co3D dataset, we evaluate the accuracy of novel views
using PSNR and LPIPS [53] metrics. Note that PSNR/LPIPS are not meaningful metrics for large
scenes since the predictions have a large amount of uncertainty, i.e., a wide range of novel view images
can be consistent with any input image. Thus, we report FID [54] and KID [55] scores to evaluate
the realism of reconstructions in these cases. Our approach outperforms all baselines for LPIPS,
FID, and KID metrics, as our model achieves more realistic results. We achieve slightly lower PSNR
compared to pixelNeRF [9]. Note that PSNR favors mean estimates, and that we only evaluate our
model using a single randomly sampled scene for an input image due to computational constraints.
4.2 Single-Image Motion Prediction

Here, we seek to train a model that, given a single static image, allows us to sample from all possible motions of the pixels in the image. Given, for instance, an image of a person performing a task such as kicking a soccer ball, it is possible to predict potential future states. This is a stochastic problem, as there are multiple possible motions consistent with an image. We train on a dataset of natural videos [56]. We only observe RGB frames and never directly observe the underlying motion, i.e., the pixel correspondences over time are unavailable. We use tuples of two frames taken from videos within a small temporal window as our context and target observations for training.
3D Scene Completion
                   Co3D                            RealEstate10k
              PSNR↑   LPIPS↓   FID↓     KID↓       FID↓     KID↓
pixelNeRF     17.93   0.54     180.20   0.14       195.40   0.14
SparseFusion  12.06   0.63     252.13   0.16       99.44    0.04
Ours          17.47   0.42     84.63    0.05       42.84    0.01

GAN Inversion (FFHQ)
              FID↓    KID↓
Determ.       25.7    0.019
Ours          7.45    0.002

Table 1: Quantitative evaluation. (Left) We benchmark our 3D generative model against the state-of-the-art baselines pixelNeRF [9] and SparseFusion [5]. (Right) We benchmark against a deterministic baseline on GAN inversion, which we outperform by a large margin.
Related Work. Several papers tackle this problem, recovering motion in the form of optical flow [57–59], 2D trajectories [60, 61], or human motion [62, 63] from a static image; however, all these methods assume supervision over the underlying motion. Learning to reason about motion
requires the neural network to learn about the properties and behavior of the different objects in the
world. Thus, this serves as a useful proxy task for representation learning, and can be used as a
backbone for many downstream applications [60, 64].
Structure of S and forward model warp. Our signal S stores appearance and motion information in a 2D grid. At any pixel u, the signal is defined as S(u) = (S_c(u), S_m(u)), where S_c(u) ∈ R^3 is the color value and S_m(u) ∈ R^2 is a 2D motion vector. The forward model is a warping operator, such that warp(S, ϕ)(u + ϕ S_m(u)) = S_c(u), where ϕ is a scalar that scales the magnitude of the motion. We implement this function using a differentiable point-splatting operation [65].
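A minimal sketch of the warp forward model is given below; for illustration it scatters colors to the nearest destination pixel, whereas the actual implementation uses differentiable softmax splatting [65], and the assignment of S_m's two channels to x- and y-motion is an assumption of the sketch.

```python
# Simplified (nearest-neighbor) sketch of warp(S, phi); the real model uses softmax splatting.
import torch


def warp(S_c, S_m, phi):
    """S_c: (3, H, W) colors, S_m: (2, H, W) motion field, phi: scalar motion magnitude."""
    _, H, W = S_c.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Destination pixel u + phi * S_m(u), rounded to the nearest pixel and clamped to the image.
    x_dst = (xs + phi * S_m[0]).round().long().clamp(0, W - 1)  # assumes channel 0 = x-motion
    y_dst = (ys + phi * S_m[1]).round().long().clamp(0, H - 1)  # assumes channel 1 = y-motion
    out = torch.zeros_like(S_c)
    out[:, y_dst, x_dst] = S_c  # scatter colors to warped locations (collisions simply overwrite)
    return out
```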
Implementation of denoise. The inset figure illustrates our design. We use a 2D network that takes O^ctxt, O^trgt_t, and t as input and generates the motion map S_m as output. The signal is then reconstructed as S = (O^ctxt, S_m). The context and target frames correspond to parameters ϕ^ctxt = 0 and ϕ^trgt = 1, and can be reconstructed from the signal using warp.
Loss and Evaluation. Similar to inverse graphics, we use reconstruction and regularization losses.
The reconstruction losses are identical to Eqs. 8 and 9, and the regularization loss is a smoothness term that encourages natural motion of the scene; see the supplement for details. We show results in
Fig. 5 (left), where we can estimate a diverse set of possible motion flows from monocular images.
By smoothly interpolating ϕ, we can generate short video sequences, even though our model only
saw low-framerate video frames during training. We also train a deterministic baseline, which only
generates a single motion field. Due to the amount of uncertainty in this problem, the deterministic
estimate collapses to a near-zero motion field regardless of the input image, and thus, fails to learn
any meaningful features from images.
4.3 Probabilistic GAN Inversion

Projecting images onto the latent space of generative adversarial networks is a well-studied problem [8, 66] and enables interesting applications, as manipulating latents along known directions allows a user to effectively edit images [67–69]. Here, we solve the problem of projecting partial images: given a small visible patch of an image, our goal is to model the distribution of possible StyleGAN2 [8] latents that agree with the input patch. A diverse set of latents can correspond to the input observation, and we train our method without observing supervised (image, latent) pairs. Instead, we train on pairs of observations (O^ctxt, O^trgt), where O^ctxt are small patches of images and O^trgt are the full images.
Related Work. While most GAN inversion methods focus on inverting a complete image into
the generator’s latent space [70–78], some also reconstruct GAN latents from small patches via
supervised training. Inversion is not trivial, and papers often rely on regularization [77] or integrate
the inversion with editing tasks [79] for higher quality. We also integrate the inpainting task with
the inversion, and seek to model the uncertainty of the GAN inversion task given only a partial
observation (patch) of the target image.
Structure of S and forward model synthesize. Our signal S ∈ R^512 is a 512-dimensional latent code representing the “w” space of StyleGAN2 [8] trained on the FFHQ [80] dataset. The forward model synthesize(S, ϕ) = GAN(S)[ϕ] first reconstructs the image corresponding to S using a forward pass of the GAN, and then extracts a patch using the forward model's parameters ϕ, which encode the patch coordinates.
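A minimal sketch of this forward model is shown below; the exact encoding of the patch parameters ϕ and the patch size are assumptions made for illustration.

```python
# Sketch of synthesize(S, phi): run the pretrained generator, then crop the patch selected by phi.
import torch


def synthesize(S, phi, generator, patch_size=64):
    """
    S:         (B, 512) latent codes in StyleGAN2's "w" space
    phi:       (B, 2) top-left patch coordinates (y, x); encoding assumed for illustration
    generator: pretrained StyleGAN2 synthesis network mapping w to an image
    """
    images = generator(S)  # (B, 3, H, W), differentiable with respect to S
    patches = [images[b, :, int(y):int(y) + patch_size, int(x):int(x) + patch_size]
               for b, (y, x) in enumerate(phi.tolist())]
    return torch.stack(patches)
```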
Implementation of denoise, Loss, and Evaluation. Please see the inset figure for an illustration of the method. The denoising network receives O^ctxt, O^trgt_t, and timestep t as input, and generates an estimate of the StyleGAN latent w. The loss function is identical to Eq. 8 and compares the reconstructed sample with the ground truth.
We show results in Fig. 5 (right). We obtain diverse samples that are all consistent with the input
patch. We also compare with a deterministic baseline that minimizes the same loss but only produces
a single estimate. While this deterministic estimate also agrees with the input image, it does not
model the diversity of outputs. We consequently achieve significantly better FID [54] and KID [55]
scores than the deterministic baseline, reported in Tab. 1 (right).
5 Discussion
Limitations. While our method makes significant advances in generative modeling, it still has several
limitations. Sampling 3D scenes at test time can be very slow, due to the expensive nature of the
denoising process and the cost of volume rendering. We need multi-view observations of training
scenes for the inverse graphics application. Our models are not trained on very large-scale datasets,
and thus cannot generalize to out-of-distribution data.
Conclusion We have introduced a new method that tightly integrates differentiable forward models
and conditional diffusion models. Our model learns to sample from the distribution of signals trained
only using their observations. We demonstrate the efficacy of our approach on three challenging
computer vision problems. In inverse graphics, our method, in combination with a 3D-structured
conditioning method, enables us to directly sample from the distribution of real-world 3D scenes
consistent with a single image observation. We can then render multi-view consistent novel views
while obtaining diverse samples of 3D geometry and appearance in unobserved regions of the scene.
We further tackle single-image conditional motion synthesis, where we learn to sample from the
distribution of 2D motion conditioned on a single image, as well as GAN inversion, where we learn to
sample images from the latent space of a GAN that are consistent with a given patch. With this
work, we make contributions that broaden the applicability of state-of-the-art generative modeling
to a large range of scientifically relevant applications, and hope to inspire future research in this
direction.
Acknowledgements. This work was supported by the National Science Foundation under Grant No.
2211259, by the Singapore DSTA under DST00OECI20300823 (New Representations for Vision),
by the NSF award 1955864 (Occlusion and Directional Resolution in Computational Imaging), by
the ONR MURI grant N00014-22-1-2740, and by the Amazon Science Hub. We are grateful for
helpful conversations with members of the Scene Representation Group David Charatan, Cameron
Smith, and Boyuan Chen. We thank Zhizhuo Zhou for thoughtful discussions about the SparseFusion
baseline. This article solely reflects the opinions and conclusions of its authors and no other entity.
Author contributions. Ayush and Vincent conceived the idea of diffusion with forward models,
designed experiments, generated most figures, and wrote most of the paper. Ayush contributed the
key insight to integrate differentiable rendering with diffusion models by denoising in image space
while generating 3D scenes. Ayush and Vincent generalized this to general forward models, and
conceived the single-image motion application. Vincent contributed the 3D-structured conditioning
and generated the overview and methods figures. Ayush wrote all initial code and ran all initial
experiments. Ayush and Tianwei implemented the inverse graphics application and generated most of
the 3D results of our model, while George helped with the baseline 3D results. Ayush executed all
single-image motion experiments. George conceived, implemented, and executed all GAN inversion
experiments. Semon helped formalize the method and wrote the proposition and its proof. Frédo
and Bill were involved in regular meetings and gave valuable feedback on results and experiments.
Josh provided intriguing cognitive science perspectives and feedback on results and experiments,
and provided a significant part of the compute. Vincent’s Scene Representation Group provided a
significant part of the compute, and the project profited from code infrastructure developed by and
conversations with other members of the Scene Representation Group.
References
[1] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning
using nonequilibrium thermodynamics. In Proc. ICML, 2015. 2
[2] Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder, and Matthias
Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. Proc. CVPR, 2023. 2, 5, 6
[3] Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin
Rombach, Antonio Torralba, and Sanja Fidler. Neuralfield-ldm: Scene generation with hierarchical latent
diffusion models. Proc. CVPR, 2023. 2, 6
[4] Eric R Chan, Koki Nagano, Matthew A Chan, Alexander W Bergman, Jeong Joon Park, Axel Levy, Miika
Aittala, Shalini De Mello, Tero Karras, and Gordon Wetzstein. Generative novel view synthesis with
3d-aware diffusion models. arXiv preprint arXiv:2304.02602, 2023. 2, 6
[5] Zhizhuo Zhou and Shubham Tulsiani. Sparsefusion: Distilling view-conditioned diffusion for 3d recon-
struction. Proc. CVPR, 2023. 2, 6, 7, 8, 9
[6] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion.
Proc. ICLR, 2023. 2, 6
[7] Titas Anciukevičius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J Mitra, and Paul
Guerrero. Renderdiffusion: Image diffusion for 3d reconstruction, inpainting and generation. Proc. CVPR,
2023. 2, 5, 6
[8] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and
improving the image quality of stylegan. In Proc. CVPR, 2020. 2, 9
[9] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or
few images. In Proc. CVPR, 2021. 5, 6, 7, 8, 9
[10] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. Proc. ICLR, 2014. 5
[11] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approxi-
mate inference in deep generative models. In Proc. ICML, 2014. 5
[12] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proc. ICML, 2015.
5
[13] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan,
Yee Whye Teh, Danilo Rezende, and SM Ali Eslami. Conditional neural processes. In Proc. ICML, 2018.
5
[14] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals,
and Yee Whye Teh. Attentive neural processes. Proc. ICLR, 2019. 5
[15] Adam R Kosiorek, Heiko Strathmann, Daniel Zoran, Pol Moreno, Rosalia Schneider, Sona Mokrá, and
Danilo Jimenez Rezende. Nerf-vae: A geometry aware 3d scene generative model. In Proc. ICML, 2021. 5
[16] Pol Moreno, Adam R Kosiorek, Heiko Strathmann, Daniel Zoran, Rosalia G Schneider, Björn Winckler,
Larisa Markeeva, Théophane Weber, and Danilo J Rezende. Laser: Latent set representations for 3d
generative modeling. arXiv preprint arXiv:2301.05747, 2023. 5
[17] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit
generative adversarial networks for 3d-aware image synthesis. In Proc. CVPR, 2021. 5
[18] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic,
and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. Proc.
NeurIPS, 2022. 5
[19] Terrance DeVries, Miguel Angel Bautista, Nitish Srivastava, Graham W Taylor, and Joshua M Susskind.
Unconstrained scene generation with locally conditioned radiance fields. In Proc. ICCV, 2021. 5
[20] Animesh Karnewar, Andrea Vedaldi, David Novotny, and Niloy Mitra. Holodiffusion: Training a 3d
diffusion model using 2d images. Proc. CVPR, 2023. 5, 6
[21] Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye.
Diffusion posterior sampling for general noisy inverse problems. In Proc. ICLR, 2023. 5
[22] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning
method for denoising diffusion probabilistic models. Proc. ICCV, 2021. 5
[23] Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models.
In Proc. NeurIPS, 2022. 5
[24] Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for
inverse problems. In Proc. ICLR, 2023. 5
[25] Zahra Kadkhodaie and Eero Simoncelli. Stochastic solutions for linear inverse problems using the prior
implicit in a denoiser. Advances in Neural Information Processing Systems, 34:13242–13254, 2021. 5
[26] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole.
Score-based generative modeling through stochastic differential equations. In Proc. ICLR, 2021. 5
[27] Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. Solving inverse problems in medical imaging with
score-based generative models. Proc. ICLR, 2022. 5
[28] Johnathan M Bardsley. Mcmc-based image reconstruction with uncertainty quantification. SIAM Journal
on Scientific Computing, 34(3):A1316–A1332, 2012. 5
[29] Singanallur Venkatakrishnan, Charles A. Bouman, and Brendt Wohlberg. Plug-and-play priors for model
based reconstruction. 2013 IEEE Global Conference on Signal and Information Processing, pages 945–948,
2013. 5
[30] Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by
denoising (red). SIAM Journal on Imaging Sciences, 10(4):1804–1844, 2017. 5
[31] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous
3d-structure-aware neural scene representations. Proc. NeurIPS, 2019. 6
[32] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric
rendering: Learning implicit 3d representations without 3d supervision. In Proc. CVPR, 2020. 6
[33] Philipp Henzler, Jeremy Reizenstein, Patrick Labatut, Roman Shapovalov, Tobias Ritschel, Andrea Vedaldi,
and David Novotny. Unsupervised learning of 3d object categories from videos in the wild. In Proc. CVPR,
2021. 6
[34] Prafull Sharma, Ayush Tewari, Yilun Du, Sergey Zakharov, Rares Andrei Ambrus, Adrien Gaidon,
William T Freeman, Fredo Durand, Joshua B Tenenbaum, and Vincent Sitzmann. Neural groundplans:
Persistent neural scene representations from a single image. In Proc. ICLR. 6
[35] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf:
Fast generalizable radiance field reconstruction from multi-view stereo. In Proc. ICCV, 2021. 6
[36] Yilun Du, Cameron Smith, Ayush Tewari, and Vincent Sitzmann. Learning to render novel views from
wide-baseline stereo pairs. In Proc. CVPR, 2023. 6
[37] Alex Trevithick and Bo Yang. Grf: Learning a general radiance field for 3d representation and rendering.
In Proc. ICCV, 2021. 6
[38] Mohammed Suhail, Carlos Esteves, Leonid Sigal, and Ameesh Makadia. Generalizable patch-based neural
rendering. In Proc. ECCV, 2022. 6
[39] Julian Chibane, Aayush Bansal, Verica Lazova, and Gerard Pons-Moll. Stereo radiance fields (srf):
Learning view synthesis for sparse views of novel scenes. In Proc. CVPR, 2021. 6
[40] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron,
Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based
rendering. In Proc. CVPR, 2021. 6
[41] Shamit Lal, Mihir Prabhudesai, Ishita Mediratta, Adam W Harley, and Katerina Fragkiadaki. Coconets:
Continuous contrastive 3d scene representations. In Proc. CVPR, 2021. 6
[42] Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad
Norouzi. Novel view synthesis with diffusion models. Proc. ICLR, 2023. 6
[43] SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo,
Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and
rendering. Science, 360(6394):1204–1210, 2018. 6
[44] Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, and Johannes Kopf. Consistent
view synthesis with pose-guided diffusion models. arXiv preprint arXiv:2303.17598, 2023. 6
[45] Jiatao Gu, Qingzhe Gao, Shuangfei Zhai, Baoquan Chen, Lingjie Liu, and Josh Susskind. Learning
controllable 3d diffusion models from single-view images. arXiv preprint arXiv:2304.06700, 2023. 6
[46] Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and Andrea Vedaldi. Realfusion: 360°
reconstruction of any object from a single image. Proc. CVPR, 2023. 6
[47] Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2room: Extracting
textured 3d meshes from 2d text-to-image models. arXiv preprint arXiv:2303.11989, 2023. 6
[48] Rafail Fridman, Amit Abecasis, Yoni Kasten, and Tali Dekel. Scenescape: Text-driven consistent scene
generation. arXiv preprint arXiv:2302.01133, 2023. 6
[49] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren
Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In Proc. ECCV, 2020. 6, 7
[50] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David
Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction.
In Proc. ICCV, 2021. 8
[51] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification:
Learning view synthesis using multiplane images. ACM Trans. Graph. (Proc. SIGGRAPH), 37, 2018. 8
[52] Yi Ding, Alex Rich, Mason Wang, Noah Stier, Matthew Turk, Pradeep Sen, and Tobias Höllerer. Sparse
fusion for multimodal transformers. arXiv preprint arXiv:2111.11992, 2021. 8
[53] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable
effectiveness of deep features as a perceptual metric. In Proc. CVPR, 2018. 8
[54] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans
trained by a two time-scale update rule converge to a local nash equilibrium. Proc. NeurIPS, 2017. 8, 10
[55] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans.
arXiv preprint arXiv:1801.01401, 2018. 8, 10
[56] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with
task-oriented flow. International Journal of Computer Vision, 127:1106–1125, 2019. 8
[57] Ruohan Gao, Bo Xiong, and Kristen Grauman. Im2flow: Motion hallucination from static images for
action recognition. In Proc. CVPR, 2018. 9
[58] Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow prediction from a static image. In
Proc. ICCV, 2015. 9
[59] Silvia L Pintea, Jan C van Gemert, and Arnold WM Smeulders. Déja vu: Motion prediction in static
images. In Proc. ECCV, 2014. 9
[60] Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from
static images using variational autoencoders. In Proc. ECCV. Springer, 2016. 9
[61] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Flow-grounded
spatial-temporal video prediction from still images. In Proc. ICCV, 2018. 9
[62] Jacob Walker, Kenneth Marino, Abhinav Gupta, and Martial Hebert. The pose knows: Video forecasting
by generating pose futures. In Proc. ICCV, 2017. 9
[63] Angjoo Kanazawa, Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Learning category-specific
mesh reconstruction from image collections. In Proc. ECCV, 2018. 9
[64] Subhabrata Choudhury, Laurynas Karazija, Iro Laina, Andrea Vedaldi, and Christian Rupprecht. Guess
what moves: unsupervised video and image segmentation by anticipating motion. arXiv preprint
arXiv:2205.07844, 2022. 9
[65] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In Proc. CVPR, 2020. 9
[66] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative visual manipulation on
the natural image manifold. In Proc. ECCV, 2016. 9
[67] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable
gan controls. Proc. NeurIPS, 2020. 9
[68] Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez,
Michael Zollhofer, and Christian Theobalt. Stylerig: Rigging stylegan for 3d control over portrait images.
In Proc. CVPR, 2020. 9
[69] Yujun Shen, Ceyuan Yang, Xiaoou Tang, and Bolei Zhou. Interfacegan: Interpreting the disentangled
face representation learned by gans. IEEE transactions on pattern analysis and machine intelligence,
44(4):2004–2018, 2020. 9
[70] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan
latent space? In Proc. ICCV, 2019. 9
[71] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan++: How to edit the embedded images? In
Proc. CVPR, 2020. 9
[72] David Bau, Hendrik Strobelt, William Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, and Antonio
Torralba. Semantic photo manipulation with a generative image prior. arXiv preprint arXiv:2005.07727,
2020. 9
[73] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Restyle: A residual-based stylegan encoder via iterative
refinement. In Proc. ICCV, 2021. 9
[74] Shanyan Guan, Ying Tai, Bingbing Ni, Feida Zhu, Feiyue Huang, and Xiaokang Yang. Collaborative
learning for faster stylegan embedding. arXiv preprint arXiv:2007.01758, 2020. 9
[75] Stanislav Pidhorskyi, Donald A Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In
Proc. CVPR, 2020. 9
[76] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel
Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In Proc. CVPR, 2021. 9
[77] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for
stylegan image manipulation. ACM Transactions on Graphics (TOG), 40(4):1–14, 2021. 9
[78] Tengfei Wang, Yong Zhang, Yanbo Fan, Jue Wang, and Qifeng Chen. High-fidelity gan inversion for image
attribute editing. In Proc. CVPR, 2022. 9
[79] Ayush Tewari, Mohamed Elgharib, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer,
and Christian Theobalt. Pie: Portrait image embedding for semantic control. ACM Transactions on
Graphics (TOG), 39(6):1–14, 2020. 9
[80] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial
networks. In Proc. CVPR, 2019. 9