0% found this document useful (0 votes)

58 views

Diffusion Based Representation Learning

Uploaded by

markus.aurelius

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

58 views

Diffusion Based Representation Learning

Uploaded by

markus.aurelius

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

Diffusion Based Representation Learning

Sarthak Mittal * 1 2 Korbinian Abstreiter * 3 Stefan Bauer 4 5 Bernhard Schölkopf 6 Arash Mehrjou † 3 6

Abstract shapes (Cai et al., 2020), and audio (Chen et al., 2020b;
Diffusion-based methods, represented as stochas- Kong et al., 2021). Two promising approaches apply step-
tic differential equations on a continuous-time wise perturbations to samples of the data distribution until
domain, have recently proven successful as non- the perturbed distribution matches a known prior (Song &
adversarial generative models. Training such Ermon, 2019; Ho et al., 2020). A model is then trained to es-
models relies on denoising score matching, which timate the reverse process, which transforms samples of the
can be seen as multi-scale denoising autoencoders. prior to samples of the data distribution (Saremi et al., 2018).
Here, we augment the denoising score match- Diffusion models were further refined (Nichol & Dhariwal,
ing framework to enable representation learning 2021; Luhman & Luhman, 2021) and even achieved bet-
without any supervised signal. GANs and VAEs ter image sample quality than GANs (Dhariwal & Nichol,
learn representations by directly transforming la- 2021; Ho et al., 2021; Mehrjou et al., 2017). Further, Song
tent codes to data samples. In contrast, the in- et al. showed that these frameworks are discrete versions of
troduced diffusion-based representation learning continuous-time perturbations modeled by stochastic differ-
relies on a new formulation of the denoising score ential equations and proposed a diffusion-based generative
matching objective and thus encodes the infor- modeling framework on continuous time. Unlike generative
mation needed for denoising. We illustrate how models such as GANs and various forms of autoencoders,
this difference allows for manual control of the the original form of diffusion models does not come with a
level of details encoded in the representation. Us- fixed architectural module that captures the representations
ing the same approach, we propose to learn an of the data samples.
infinite-dimensional latent code that achieves im- Learning desirable representations has been an integral com-
provements on state-of-the-art models on semi- ponent of generative models such as GANs and VAEs (Ben-
supervised image classification. We also compare gio et al., 2013; Radford et al., 2016; Chen et al., 2016;
the quality of learned representations of diffusion van den Oord et al., 2017; Donahue & Simonyan, 2019;
score matching with other methods like autoen- Chen et al., 2020a; Schölkopf et al., 2021). Recent works
coder and contrastively trained systems through on visual representation learning achieve impressive perfor-
their performances on downstream tasks. Finally, mance on the downstream task of classification by applying
we also ablate with a different SDE formulation contrastive learning (Chen et al., 2020d; Grill et al., 2020;
for diffusion models and show that the benefits Chen & He, 2020; Caron et al., 2021; Chen et al., 2020c).
on downstream tasks are still present on changing However, contrastive learning requires additional supervi-
the underlying differential equation. sion of augmentations that preserve the content of the data,
and hence these approaches are not directly comparable to
representations learned through generative systems like Vari-
1. Introduction ational Autoencoders (Kingma & Welling, 2013; Rezende
et al., 2014) and the current work which are considered
Diffusion-based models have recently proven successful
fully unsupervised. Moreover, training the encoder to output
for generating images (Sohl-Dickstein et al., 2015; Song &
similar representation for different views of the same image
Ermon, 2020; Song et al., 2020), graphs (Niu et al., 2020),
removes information about the applied augmentations, thus
∗
Equal contribution, † Senior authorship, 1 Mila 2 Université de the performance benefits are limited to downstream tasks
Montréal 3 ETH Zürich 4 Helmholtz AI 5 Technical University of that do not depend on the augmentation, which has to be
Munich 6 Max Planck Institute for Intelligent Systems. Correspon- known beforehand. Hence our proposed algorithm does
dence to: Sarthak Mittal <[email protected]>, Arash Mehrjou not restrict the learned representations to specific down-
<[email protected]>.
stream tasks and solves a more general problem instead. We
Proceedings of the 40 th International Conference on Machine provide a summary of contrastive learning approaches in
Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright Appendix A. Similar to our approach, Denoising Autoen-
2023 by the author(s).

1
Diffusion Based Representation Learning

which is a bottleneck in diffusion-based generative

models compared with GANs and VAEs, without sac-
rificing image quality.

Denoising score matching Conditional score matching 1.1. Diffusion-based generative modeling
Representation learning

We first give a brief overview of the technical background

Figure 1. Conditional score matching with a parametrized latent for the framework of the diffusion-based generative model
code is representation learning. Denoising score matching esti- as described in (Song et al., 2021b). The forward diffusion
mates the score at each xt ; we add a latent representation z of the process of the data is modeled as an SDE on a continuous-
clean data x0 as additional input to the score estimator.
time domain t ∈ [0, T ]. Let x0 ∈ Rd denote a sample
coders (DAE) (Vincent et al., 2008) can be used to encode from the data distribution x0 ∼ p0 , where d is the data
representations that can be manually controlled by adjusting dimension. The trajectory (xt )t∈[0,T ] of data samples is a
the noise scale (Geras & Sutton, 2015; Chandra & Sharma, function of time determined by the diffusion process. The
2014; Zhang & Zhang, 2018). Note that, unlike DAEs, the SDE is chosen such that the distribution p0T (xT |x0 ) for
encoder in our approach does not receive noisy data as input, any sample x0 ∼ p0 can be approximated by a known prior
but instead extracts features based on the clean images. For distribution. Notice that the subscript 0T of p0T refers to the
example, this key difference allows DRL to be used to limit conditional distribution of the diffused data at time T given
the encoding to fine-grained features when focusing on low the data at time 0. For simplicity we limit the remainder of
noise levels, which is not possible with DAEs. this paper to the so-called Variance Exploding SDE (Song
et al., 2021b), that is,
Recently, there have been some works that rely on addi- r
tional encoders in the model architecture of diffusion based d[σ 2 (t)]
dx = f (x, t) dt + g(t) dw := dw, (1)
models (Preechakul et al., 2022; Mittal et al., 2021a; Sinha dt
et al., 2021). Sinha et al. (2021) considers an autoencoder
where w is the standard Wiener process. The perturbation
based setup with the diffusion model defining the prior
kernel of this diffusion process has a closed-form solution
whereas Pandey et al. (2022) considers the opposite where
being p0t (xt |x0 ) = N (xt ; x0 , [σ 2 (t) − σ 2 (0)]I). It was
a diffusion model is used to further improve the decoded
shown by Anderson (1982) that the reverse diffusion process
samples from a VAE. Preechakul et al. (2022) is a concur-
is the solution to the following SDE:
rent work that is closest to our setup, however, instead of
relying on time-conditioned encoder, they rely only on an dx = [f (x, t) − g 2 (t)∇x log pt (x)] dt + g(t) dw, (2)
unconditional encoder. Further, they concentrate more on
generation-based tasks while our approach focuses more on where w is the standard Wiener process when the time
evaluating the representations learned for downstream tasks. moves backwards. Thus, given the score function
∇x log pt (x) for all t ∈ [0, T ], we can generate samples
The main contributions of this work are
from the data distribution p0 (x). In order to learn the score
function, the simplest objective is Explicit Score Matching
• We present an alternative formulation of the denoising (ESM) (Hyvärinen & Dayan, 2005), that is,
score matching objective, showing that the objective
Ext ∥sθ (xt , t) − ∇xt log pt (xt )∥22 .

cannot be reduced to zero. We leverage this property (3)
to learn representations for downstream tasks.
Since the ground-truth score function ∇xt log pt (xt ) is gen-
• We introduce Diffusion-based Representation Learning erally not known, one can apply denoising score matching
(DRL), a novel framework for representation learning (DSM) (Vincent, 2011), which is defined as the following:
in diffusion-based generative models. We show how
this framework allows for manual control of the level JtDSM (θ) =Ex0 {Ext |x0 [∥sθ (xt , t)
(4)
of details encoded in the representation through an − ∇xt log p0t (xt |x0 )∥22 ]}.
infinite-dimensional code. We evaluate the proposed
approach on downstream tasks using the learned repre- The training objective over all t is augmented by Song et al.
sentations directly as well as using it as a pre-training (2021b) with a time-dependent positive
weighting
function
step for semi-supervised image classification, thereby λ(t), that is, J DSM (θ) = Et λ(t)JtDSM (θ) . One can
improving state-of-the-art approaches for the latter. also achieve class-conditional generation in diffusion-based
models by training an additional time-dependent classifier
• We evaluate the effect of the initial noise scale and pt (y|xt ) (Song et al., 2021b)). In particular, the condi-
achieve significant improvements in sampling speed, tional score for a fixed y can be expressed as the sum of

2
Diffusion Based Representation Learning

Figure 2. Results of proposed DRL models trained on MNIST and CIFAR-10 with point clouds visualizing the latent representation of
test samples, colored according to the digit class. The models are trained with Left: uniform sampling of t and Right: a focus on high
noise levels. Samples are generated from a grid of latent values ranging from -1 to 1.

the unconditional score and the score of the classifier, that Hence, the loss is invariant to the intermediate values of
is, ∇xt log pt (xt |y) = ∇xt log pt (xt ) + ∇xt log pt (y|xt ). the noise schedule. However, the weight functions λ(·) is
We take motivation from an alternative way to allow for still an important hyper-parameter whose choice might be
controllable generation, which, given supervised samples affected by the non-vanishing constant in Equation 6.
(x, y(x)), uses the following training objective for each time
To the best of our knowledge, there is no known theoretical
t
justification for the values of σ(t). While these hyperpa-
JtCSM (θ) = Ex0 {Ext |x0 [∥sθ (xt , t, y(x0 )) rameters could be optimized in ESM using gradient-based
(5) learning, this ability is severely limited by the non-vanishing
− ∇xt log p0t (xt |x0 )∥22 ]}.
constant in Equation 6.
The objective in Equation 5 is minimized if and only
Even though the non-vanishing constant in the denoising
if the model equals the conditional score function
score matching objective presents a burden in multiple ways
∇xt log pt (xt |y(x0 ) = ŷ) for all labels ŷ.
such as hyperparameter search and model evaluation, it
provides an opportunity for latent representation learning,
2. Diffusion-based Representation Learning which will be described in the following sections. We note
that this is different from Sinha et al. (2021); Mittal et al.
We begin this section by presenting an alternative formu-
(2021b) as they consider a Variational Autoencoder model
lation of the Denoising Score Matching (DSM) objective,
followed by diffusion in the latent space, where their repre-
which shows that this objective cannot be made arbitrarily
sentation learning objective is still guided by reconstruction.
small. Formally, the formula of the DSM objective can be
Contrary to this, our representation learning approach does
rearranged as
not utilize a variational autoencoder model and is guided by
JtDSM (θ) = Ex0 {Ext |x0 ∥sθ (xt , t) − ∇xt log pt (xt )∥22

denoising instead. Our approach is similar to Preechakul
et al. (2022) but we also condition the encoder system on the
+ ∥∇xt log p0t (xt |x0 ) − ∇xt log pt (xt )∥22 }.

time-step, thereby improving representation capacity and
(6)
leading to parameterized curve-based representations.
The above formulation holds, because the DSM objec-
tive in Equation 4 is minimized when ∀xt : sθ (xt , t) = 2.1. Learning latent representations
∇xt log pt (xt ), and differs from ESM in Equation 3 only
Since supervised data is limited and rarely available, we
by a constant (Vincent, 2011). Hence, the constant is equal
propose to learn a labeling function y(x0 ) at the same time
to the minimum achievable value of the DSM objective. A
as optimizing the conditional score matching objective in
detailed proof is included in the Appendix B.
Equation 5. In particular, we represent the labeling func-
It is noteworthy that the second term in the right-hand side tion as a trainable encoder Eϕ : Rd → Rc , where Eϕ (x0 )
of the Equation 6 does not depend on the learned score func- maps the data sample x0 to its corresponding code in the
tion of xt for every t ∈ [0, T ]. Rather, it is influenced by the c-dimensional latent space. The code is then used as ad-
diffusion process that generates xt from x0 . This observa- ditional input to the score model. Formally, the proposed
tion has not been emphasized previously, probably because learning objective for Diffusion-based Representation Learn-
it has no direct effect on the learning of the score func- ing (DRL) is the following:
tion, which is handled by the second term in the Equation
J DRL (θ, ϕ) = Et,x0 ,xt [λ(t)∥sθ (xt , t, Eϕ (x0 ))
6. However, the additional constant has major implications (7)
for finding other hyperparameters such as the function λ(t) − ∇xt log p0t (xt |x0 )∥22 + γ∥Eϕ (x0 )∥1 ]
and the choice of σ(t) in the forward SDE. As (Kingma
where we add a small amount of L1 regularization, con-
et al., 2021) shows, changing the integration variable from
trolled by γ, on the output of the trainable encoder.
time to signal-to-noise ratio (SNR) simplifies the diffusion
loss such that it only depends on the end values of SNR. To get a better idea of the above objective, we provide an

3
Diffusion Based Representation Learning

Figure 3. Results of proposed VDRL models trained on MNIST and CIFAR-10 with point clouds visualizing the latent representation of
test samples, colored according to the digit class. The models are trained with Left: uniform sampling of t and Right: a focus on high
noise levels. Samples are generated from a grid of latent values ranging from -2 to 2.

intuition for the role of Eϕ (x0 ) in the input of the model. formation channel that controls the amount of information
The model sθ (·, ·, ·) : Rd × R × Rc → Rd is a vector- that the diffusion model receives from the initial point of the
valued function whose output points to different directions diffusion process. With this perspective, any deterministic
based on the value of its third argument. In fact, Eϕ (x0 ) or stochastic function that can manipulate I(xt , x0 ), the
selects the direction that best recovers x0 from xt . Hence, mutual information between x0 and xt , can be used. This
when optimizing over ϕ, the encoder learns to extract the opens up the room for stochastic encoders similar to VAEs
information from x0 in a reduced-dimensional space that which we call Variational Diffusion-based Representation
helps recover x0 by denoising xt . Learning (VDRL). The formal objective of VDRL is
We show in the following that Equation 7 is a valid repre- J V DRL (θ, ϕ) = Et,x0 ,xt [Ez∼Eϕ (Z|x0 ) [λ(t)∥sθ (xt , t, z)
sentation learning objective. The score of the perturbation
kernel ∇xt log p0t (xt |x0 ) is a function of only t, xt and x0 . − ∇xt log p0t (xt |x0 )∥22 ] (8)
Thus, the objective can be reduced to zero if all information + DKL (Eϕ (Z|x0 )||N (Z; 0, I)]
about x0 is contained in the latent representation Eϕ (x0 ).
When Eϕ (x0 ) has no mutual information with x0 , the ob- 2.2. Infinite-dimensional representation of data
jective can only be reduced up to the constant in Equation
6. Hence, our proposed formulation takes advantage of the We now present an alternative version of DRL where the
non-zero lower-bound of Equation 6, which can only vanish representation is a function of time. Instead of emphasizing
when the encoder Eϕ (·) properly distills information from on different noise levels by weighting the training objective,
the unperturbed data into a latent code, which is an addi- as done in the previous section, we can provide the time t
tional input to the score model. These properties show that as input to the encoder. Formally, the new objective is
Equation 7 is a valid objective for representation learning.
Et,x0 ,xt [λ(t)∥sθ (xt , t, Eϕ (x0 , t))
Our proposed representation learning objective enjoys the
continuous nature of SDEs, a property that is not available − ∇xt log p0t (xt |x0 )∥22 + γ∥Eϕ (x0 , t)∥1 ] (9)
in many previous representation learning methods (Radford
where Eϕ (x0 ) in Equation 7 is replaced by Eϕ (x0 , t). In-
et al., 2016; Chen et al., 2016; Locatello et al., 2019). In
tuitively, it allows the encoder to extract the necessary in-
DRL, the encoder is trained to represent the information
formation of x0 required to denoise xt for any noise level.
needed to denoise x0 for different levels of noise σ(t). We
This leads to richer representation learning since normally
hypothesize that by adjusting the weighting function λ(t),
in autoencoders or other static representation learning meth-
we can manually control the granularity of the features en-
ods, the input data x0 ∈ Rd is mapped to a single point
coded in the representation and provide empirical evidence
z ∈ Rc in the latent space. In contrast, we propose a richer
as support. Note that t → T is associated with higher levels
representation where the input x0 is mapped to a curve in
of noise and the mutual information of xt and x0 starts to
Rc instead of a single point. Hence, the learned latent code
vanish. In this case, denoising requires all information about
is produced by the map x0 → (Eϕ (x0 , t))t∈[0,T ] where the
x0 to be contained in the code. In contrast, t → 0 corre-
infinite-dimensional object (Eϕ (x0 , t))t∈[0,T ] is the encod-
sponds to low noise levels and hence xt contains coarse-
ing for x0 .
grained features of x0 and only fine-grained properties may
have been washed out. Hence, the encoded representation Proposition 2.1. For any downstream task, the infinite-
learns to keep the information needed to recover these fine- dimensional code (Eϕ (x0 , t))t∈[0,T ] learned using the ob-
grained details. We provide empirical evidence to support jective in Equation 9 is at least as good as finite-dimensional
this hypothesis in Section 3. static codes learned by the reconstruction of x0 .
It is noteworthy that Eϕ does not need to be a determinis- Proof sketch. Let LD (z, y) be the per-sample loss for a
tic function and can be a probabilistic map similar to the supervised learning task calculated for the pair (z, y) where
encoder of VAEs. In principle, it can be viewed as an in- z = z(x, t) is the representation learned for the input x at

4
Diffusion Based Representation Learning

CIFAR10 CIFAR100 Mini ImageNet

80
70
60
Model
50 VDRL
DRL
Accuracy

40 AE
VAE
SimCLR
30 SimCLR-Gauss
DAE
20 CDAE
10
0
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Time Time Time

Figure 4. Comparing the performance of the proposed diffusion-based representations (DRL and VDRL) with the baselines that include
autoencoder (AE), variational autoencoder (VAE), simple contrastive learning (simCLR) and its restricted variant (simCLR-Gauss) which
exclude domain-specific data augmentation from the original simCLR algorithm.

time t and y is the label. The representation function is also classification in Section 3.2.
a function of the scalar t that takes values from a closed
subset U of R. For any value s ∈ U , it is obvious that 3. Results
mint∈U LD (z(x, t), y) < LD (z(x, s), y). (10) For all experiments, we use the same function σ(t), t ∈
[0, 1] as in Song et al. (2021b), which is σ(t) =
Taking into account the extra argument t, the representa- t
σmin (σmax /σmin ) , where σmin = 0.01 and σmax = 50.
tion function z(x, t) can be seen as an infinite dimensional Further, we use a 2d latent space for all qualitative experi-
representation. The argument t actually controls which rep- ments (Section 3.3) and 128 dimensional latent space for the
resentation of x has to be passed to the downstream task. downstream tasks (Section 3.1) and semi-supervised image
The conventional representation learning algorithms corre- classification (Section 3.2). We also set λ(t) = σ 2 (t), which
spond to choosing the t argument apriori and keep it fixed has been shown to yield the KL-Divergence objective (Song
independent of x. Here, by minimizing over t, the passed et al., 2021a). Our goal is not to produce state-of-the-art
representation cannot be worse than the results of conven- image quality, rather showcase the representation learning
tional representation learning methods. Note that LD (·, ·) method. Because of that and also limited computational
here can be any metric that we require, however gradient- resources, we did not carry out an extensive hyperparameter
based learning and optimization issues can still affect the sweep (check Appendix D for details). Note that all experi-
actual performance achieved . ments were conducted on a single RTX8000 GPU, taking
The score matching objective can be seen as a reconstruction up to 30 hours of wall-clock time, which only amounts to
objective of x0 conditioned on xt . The terminal time T 15% of the iterations proposed in (Song et al., 2021b).
is chosen large enough so that xT is independent of x0 ,
hence the objective for t = T is equal to a reconstruction 3.1. Downstream Classification
objective without conditioning. Therefore, there exists a
We directly evaluate the representations learned by different
t ∈ [0, T ] where the learned representation Eϕ (x0 , t) is the
algorithms on downstream classification tasks for CIFAR10,
same representation learned by the reconstruction objective
CIFAR100, and Mini-ImageNet datasets. The represen-
of a vanilla autoencoder. The full proof for Proposition 2.1
tation is first learned using the proposed diffusion-based
can be found in the Appendix C
method. Then, the encoder (either deterministic or proba-
A downstream task can leverage this rich encoding in var- bilistic) is frozen and a single-layered neural network is
ious ways, including the use of either the static code for a trained on top of it for the downstream prediction task.
fixed t, or the use of the whole trajectory (Eϕ (x0 , t))t∈[0,T ] For the baselines, we consider an Autoencoder (AE), a
as input. We posit the conjecture that the proposed rich Variational Autoencoder (VAE), two versions of Denois-
representation is helpful for downstream tasks when used ing Autoencoders (DAE and CDAE) and two verisons of
for pretraining, where the value of t could either be a model Contrastive Learning (SimCLR(Chen et al., 2020c) and
selection parameter or be jointly optimized with other pa- SimCLR-Gauss explained below) setup to compare with the
rameters during training. We leave investigations along proposed methods (DRL and VDRL). Figure 4 shows that
these directions as important future work. We show the DRL and VDRL outperforms autoencoder-styled baselines
performance of the proposed model on downstream tasks in as well as the restricted contrastive learning baseline.
Section 3.1 and also evaluate it on semi-supervised image

5
Diffusion Based Representation Learning

CIFAR10 CIFAR100 Mini ImageNet

80
70
60
Model
50 VDRL
DRL
Accuracy

40 AE
VAE
30 SimCLR
SimCLR-Gauss
20 DAE
CDAE
10
0
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Time Time Time

Figure 5. Comparing the performance of the proposed diffusion-based representations (DRL and VDRL) with the baselines that include
autoencoder (AE), variational autoencoder (VAE), simple contrastive learning (simCLR) and its restricted variant (simCLR-Gauss) which
exclude domain-specific data augmentation from the original simCLR algorithm.

Standard Autoencoders— Standard autoencoders (AE and stricted the transformations used by the simCLR method
VAE) rely on learning of representations of the input data to the additive pixel-wise Gaussian noise (SimCLR-Gauss)
using an encoder in such a way that it can be reconstructed as this was the only domain-agnostic transformation in the
back, using a decoder, solely based on the representation SimCLR pipeline. The original SimCLR expectedly out-
learned. Such systems can be trained without any regulariza- performs the other methods because it uses the privileged
tion on the representation space (AE), or in a probabilistic information injected by the employed data augmentation
fashion which relies on variational inference and ultimately methods. For example, random cropping is an inductive bias
leads to a KL-Divergence based regularization on the repre- that reflects the spatial regularity of the images. Even though
sentation space (VAE). Figure 4 shows that the time-axis is it is possible to strengthen our method and autoencoder-
not meaningful for such training, as expected. based baselines such as VAEs with such augmentation-based
strategies, it still doesn’t provide the additional inductive
Denoising Autoencoders— While the problem of reconstruc-
bias of preservation of high-level information in the pres-
tion is easily solved given a big enough network (i.e. capable
ence of these augmentations, which SimCLR directly uses.
of learning the identity mapping), this problem can be made
Thus, we restricted all baselines to the generic setting with-
harder by considering a noisy version of the data as input
out this inductive bias and leave the domain-specific im-
with the task of predicting its denoised version, as opposed
provements for future work.
to vanilla reconstruction in standard autoencoders. Such
approaches are referred to as Denoising Autoencoders, and It is seen that the DRL and VDRL methods significantly
we consider its two variants. In the first variant, DAE, a outperform the baselines on all the datasets at a number
noisy version of the image is given as input xt (higher t of different time-steps t. We further evaluate the infinite-
implying more noise) and the task of the model is to predict dimensional representation on few-shot image classification
the denoised version x0 . Since larger t implies learning of using the representation at different timescales as input.
representations from more noise, we can see a sharp decline The detailed results are shown in Appendix E. In summary,
in performance of DAE systems with increasing t in Fig- the representations of DRL and VDRL achieve significant
ure 4. The second variant, CDAE, considers xt as the noisy improvements as compared to an autoencoder or VAE for
input again, but predicts the denoised version based on a rep- several values of t .
resentation of xt combined with a learned time-conditioned
Overall the results align with the theoretical argument of
representation of the true input Eϕ (x0 , t), similar to the
Proposition 2.1 that the rich representation of DRL is at
DRL setups. This approach is arguably similar to DRL with
least as good as the static code learned using a recon-
the sole difference being that Eϕ (·, ·) in DRL had the incen-
struction objective. It further shows that in practice, the
tive of predicting the right score function, whereas in CDAE
infinite-dimensional code is superior to the static (finite-
the incentive is to denoise in a single step. As highlighted
dimensional) representation for downstream applications
in Figure 4, the performance increases with increasing t
such as image classification by a significant margin.
because the encoder Eϕ (·, ·) is useless in low-noise settings
(as all the data is already there in the input) but becomes As a further analysis, we consider the same experiments
increasingly meaningful as noise increases. when the DRL models are trained on the Variance Preserv-
Restricted SimCLR— While we compare against the stan-
dard SimCLR model, to obtain a fair comparison, we re-

6
Diffusion Based Representation Learning

LaplaceNet Ours
Pretraining None DRL VDRL
Mixup No Yes No Yes No
Dataset #labels
CIFAR-10 100 73.68 75.29 74.31 64.67 81.63
500 91.31 92.53 92.70 92.31 92.79
1000 92.59 93.13 93.24 93.42 93.60
2000 94.00 93.96 94.18 93.91 93.96
4000 94.73 94.97 94.75 95.22 95.00
CIFAR-100 1000 55.58 55.24 55.85 55.74 56.47
4000 67.07 67.25 67.22 67.47 67.54
10000 73.19 72.84 73.31 73.66 73.50
20000 75.80 76.07 76.46 76.88 76.64
Mini ImageNet 4000 58.40 58.84 58.95 59.29 59.14
10000 66.65 66.80 67.31 66.63 67.46

Table 1. Comparison of classifier accuracy in % for different pretraining settings. Scores better than the SOTA model (LaplaceNet) are in
bold. “DRL” pretraining is our proposed representation learning, and “VDRL” the respective version which uses a probabilistic encoder.

ing SDE formulation (Song et al., 2021b). of classification. Note that pretraining the classifier as part
1 p of an autoencoder did not yield any improvements (Table
dx = − β(t)x dt + β(t) dw, (11) 4 in the Appendix). Combining DRL with mixup yields
2
inconsistent improvements, results are reported in Table 5
Figure 5 shows that even in this formualtion, DRL and of the Appendix. In addition, DRL pretraining achieves
VDRL models outperform their autoencoder and denoising much better performances when only limited computational
autoencoder competitors and perform better than restricted resources are available (Tables 2, 3 in the Appendix).
constrastive learning, showing that this approach can be
easily adapted to various different diffusion models. 3.3. Qualitative Results

3.2. Semi-Supervised Image Classification We first train a DRL model with L1 -regularization on the
latent code on MNIST (LeCun & Cortes, 2010) and CIFAR-
The current state-of-the-art model for many semi-supervised 10. Figure 2 (left) shows samples from a grid over the latent
image classification benchmarks is LaplaceNet (Sellars space and a point cloud visualization of the latent values
et al., 2021). It alternates between assigning pseudo-labels z = Eϕ (x0 ). For MNIST, we can see that the value of
to samples and supervised training of a classifier. The key z1 controls the stroke width, while z2 weakly indicates the
idea is to assign pseudo-labels by minimizing the graphical class. The latent code of CIFAR-10 samples mostly encodes
Laplacian of the prediction matrix, where similarities of information about the background color, which is weakly
data samples are calculated on a hidden layer representation correlated to the class. The use of a probabilistic encoder
in the classifier. Note that LaplaceNet applies mixup (Zhang (VDRL) leads to similar representations, as seen in Fig.
et al., 2017) that changes the input distribution of the clas- 3 (left). We further want to point out that the generative
sifier. We evaluate our method with and without mixup on process using the reverse SDE involves randomness and thus
CIFAR-10 (Krizhevsky et al., a), CIFAR-100 (Krizhevsky generates different samples for a single latent representation.
et al., b) and MiniImageNet (Vinyals et al., 2016). The diversity of samples however steadily decreases with
In the following, we evaluate the infinite-dimensional repre- the dimensionality of the latent space, shown in Figure 7 of
sentation (Eϕ (x0 , t))t∈[0,T ] on semi-supervised image clas- the Appendix.
sification, where we use DRL and VDRL as pretraining Next, we analyze the behavior of the representation when
for the LaplaceNet classifier. Table 1 depicts the classifier adjusting the weighting function λ(t) to focus on higher
accuracy on test data for different pretraining settings. De- noise levels, which can be done by changing the sam-
tails for architecture and hyperparameters are described in pling distribution of t. To this end, we sample t ∈ [0, 1]
Appendix G. such that σ(t) is uniformly sampled from the interval
Our proposed pretraining using DRL significantly improves [σmin , σmax ] = [0.01, 50]. Figure 2 (right) shows the re-
the baseline and often surpasses the state-of-the-art perfor- sulting representation of DRL and Figure 3 (right) for the
mance of LaplaceNet. Most notable are the results of DRL VDRL results. As expected, the latent representation for
and VDRL without mixup, which achieve high accuracies MNIST encodes information about classes rather than fine-
without being specifically tailored to the downstream task grained features such as stroke width. This validates our

7
Diffusion Based Representation Learning

hypothesis of Section 2.1 that we can control the granu- representation uses the natural diffusion process that is em-
larity of features encoded in the latent space. For CIFAR- ployed in score-based models as a continuous obfuscation
10, the model again only encodes information about the of the information content. Moreover, unlike the loss func-
background, which contains the most information about the tion of the contrastive-based methods that are specifically
image class. A detailed analysis of class separation in the designed to learn the invariances of manually augmented
extreme case of training on single timescales is included in data, our method uses the same loss function that is used to
Appendix H. learn the score function for generative models. The repre-
sentation is learned based on a generic information-theoretic
Overall, the difference in the latent codes for varying λ(t)
concept which is an encoder (information channel) that con-
shows that we can control the granularity encoded in the
trols how much information of the input has to be passed
representation of DRL. This ability provides a significant
to the score function at each step of the diffusion process.
advantage when there exists some prior information about
We also provided theoretical motivation for this information
the level of detail that we intend to encode in the target
channel. The algorithm cannot ignore this source of infor-
representation. We further illustrate how the representa-
mation because it is the only way to reduce a non-negative
tion encodes information for the task of denoising in the
loss arbitrarily close to zero.
Appendix (Fig. 6).
Our experiments on diffusion-based representation learn-
We also provide further analysis into the impact of noise
ing methods highlight its benefits when compared to fully
scales on generation in Appendix I.
unsupervised models like autoencoders, variational or de-
noising. The proposed methodology does not rely on ad-
4. Conclusion ditional supervision regarding augmentations, and can be
easily adapted to any representation learning paradigm that
We presented Diffusion-based Representation Learning
previously relied on reconstruction-based autoencoder meth-
(DRL), a new objective for representation learning based
ods.
on conditional denoising score matching. In doing so, we
turned the original non-vanishing objective function into one
that can be reduced arbitrarily close to zero by the learned Acknowledgements
representation. We showed that the proposed method learns
SM would like to acknowledge the support of UNIQUE’s
interpretable features in the latent space. In contrast to some
and IVADO’s scholarships towards his research. This re-
of the previous approaches that required specialized architec-
search was enabled in part by compute resources provided
tural changes or data manipulations, denoising score match-
by Mila (mila.quebec).
ing comes with a natural ability to control the granularity
of features encoded in the representation. We demonstrated
that the encoder can learn to separate classes when focusing
on higher noise levels and encodes fine-grained features
such as stroke-width when mainly trained on smaller noise
variance. In addition, we proposed an infinite-dimensional
representation and demonstrated its effectiveness for down-
stream tasks such as few-shot classification. Using the rep-
resentation learning as pretraining for a classifier, we were
able to improve the results of LaplaceNet, a state-of-the-art
model on semi-supervised image classification.
Starting from a different origin but conceptually close, con-
trastive learning as a self-supervised approach could be com-
pared with our representation learning method. We should
emphasize that there are fundamental differences both at
theoretical and algorithmic levels between contrastive learn-
ing and our diffusion-based method. The generation of
positive and negative examples in contrastive learning re-
quires the domain knowledge of the applicable invariances.
This knowledge might be hard to obtain in scientific do-
mains such as genomics where the knowledge of invariance
amounts to the knowledge of the underlying biology which
in many cases is not known. However, our diffusion-based

8
Diffusion Based Representation Learning

References Dhariwal, P. and Nichol, A. Diffusion models beat gans on

image synthesis, 2021.
Anderson, B. D. Reverse-time diffusion equation
models. Stochastic Processes and their Applica- Donahue, J. and Simonyan, K. Large scale adversarial
tions, 12(3):313–326, 1982. ISSN 0304-4149. representation learning, 2019.
doi: https://doi.org/10.1016/0304-4149(82)90051-5.
URL https://www.sciencedirect.com/ Geras, K. J. and Sutton, C. Scheduled denoising autoen-
science/article/pii/0304414982900515. coders, 2015.

Bengio, Y., Courville, A., and Vincent, P. Representation Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond,
learning: A review and new perspectives. IEEE transac- P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo,
tions on pattern analysis and machine intelligence, 35(8): Z. D., Azar, M. G., Piot, B., Kavukcuoglu, K., Munos,
1798–1828, 2013. R., and Valko, M. Bootstrap your own latent: A new
approach to self-supervised learning, 2020.
Bromley, J., Bentz, J., Bottou, L., Guyon, I., Lecun, Y.,
Moore, C., Sackinger, E., and Shah, R. Signature verifica- Ho, J., Jain, A., and Abbeel, P. Denoising diffusion prob-
tion using a ”siamese” time delay neural network. Interna- abilistic models. CoRR, abs/2006.11239, 2020. URL
tional Journal of Pattern Recognition and Artificial Intelli- https://arxiv.org/abs/2006.11239.
gence, 7:25, 08 1993. doi: 10.1142/S0218001493000339. Ho, J., Saharia, C., Chan, W., Fleet, D. J., Norouzi, M., and
Cai, R., Yang, G., Averbuch-Elor, H., Hao, Z., Belongie, S., Salimans, T. Cascaded diffusion models for high fidelity
Snavely, N., and Hariharan, B. Learning gradient fields image generation. arXiv preprint arXiv:2106.15282,
for shape generation, 2020. 2021.

Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Hyvärinen, A. and Dayan, P. Estimation of non-normalized
and Joulin, A. Unsupervised learning of visual features statistical models by score matching. Journal of Machine
by contrasting cluster assignments, 2021. Learning Research, 6(4), 2005.
Kingma, D., Salimans, T., Poole, B., and Ho, J. Varia-
Chandra, B. and Sharma, R. Adaptive noise schedule
tional diffusion models. Advances in neural information
for denoising autoencoder. volume 8834, pp. 535–542,
processing systems, 34:21696–21707, 2021.
11 2014. ISBN 978-3-319-12636-4. doi: 10.1007/
978-3-319-12637-1 67. Kingma, D. P. and Welling, M. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.
Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan,
D., and Sutskever, I. Generative pretraining from pixels. Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B.
In International Conference on Machine Learning, pp. Diffwave: A versatile diffusion model for audio synthesis,
1691–1703. PMLR, 2020a. 2021.
Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Krizhevsky, A., Nair, V., and Hinton, G. Cifar-10 (canadian
Chan, W. Wavegrad: Estimating gradients for waveform institute for advanced research). a. URL http://www.
generation, 2020b. cs.toronto.edu/˜kriz/cifar.html.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A Krizhevsky, A., Nair, V., and Hinton, G. Cifar-100 (canadian
simple framework for contrastive learning of visual rep- institute for advanced research). b. URL http://www.
resentations. In International conference on machine cs.toronto.edu/˜kriz/cifar.html.
learning, pp. 1597–1607. PMLR, 2020c.
LeCun, Y. and Cortes, C. MNIST handwritten digit
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and database. 2010. URL http://yann.lecun.com/
Hinton, G. E. Big self-supervised models are strong semi- exdb/mnist/.
supervised learners. CoRR, abs/2006.10029, 2020d. URL
https://arxiv.org/abs/2006.10029. Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S.,
Schölkopf, B., and Bachem, O. Challenging common
Chen, X. and He, K. Exploring simple siamese representa- assumptions in the unsupervised learning of disentangled
tion learning, 2020. representations. In international conference on machine
learning, pp. 4114–4124. PMLR, 2019.
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever,
I., and Abbeel, P. Infogan: Interpretable representation Luhman, E. and Luhman, T. Knowledge distillation in
learning by information maximizing generative adversar- iterative generative models for improved sampling speed,
ial nets, 2016. 2021.

9
Diffusion Based Representation Learning

Mehrjou, A., Schölkopf, B., and Saremi, S. An- Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalch-
nealed generative adversarial networks. arXiv preprint brenner, N., Goyal, A., and Bengio, Y. Toward causal
arXiv:1705.07505, 2017. representation learning. Proceedings of the IEEE, 109(5):
612–634, 2021.
Mittal, G., Engel, J., Hawthorne, C., and Simon, I. Symbolic
music generation with diffusion models. arXiv preprint Sellars, P., Avilés-Rivero, A. I., and Schönlieb, C.
arXiv:2103.16091, 2021a. Laplacenet: A hybrid energy-neural model for deep semi-
supervised classification. CoRR, abs/2106.04527, 2021.
Mittal, G., Engel, J. H., Hawthorne, C., and Simon, I. Sym-
URL https://arxiv.org/abs/2106.04527.
bolic music generation with diffusion models. CoRR,
abs/2103.16091, 2021b. URL https://arxiv.org/ Sinha, A., Song, J., Meng, C., and Ermon, S. D2C: diffusion-
abs/2103.16091. denoising models for few-shot conditional generation.
Nichol, A. and Dhariwal, P. Improved denoising diffusion CoRR, abs/2106.06819, 2021. URL https://arxiv.
probabilistic models. CoRR, abs/2102.09672, 2021. URL org/abs/2106.06819.
https://arxiv.org/abs/2102.09672. Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and
Niu, C., Song, Y., Song, J., Zhao, S., Grover, A., and Ermon, Ganguli, S. Deep unsupervised learning using nonequi-
S. Permutation invariant graph generation via score-based librium thermodynamics, 2015.
generative modeling, 2020. Song, J., Meng, C., and Ermon, S. Denoising diffusion
Pandey, K., Mukherjee, A., Rai, P., and Kumar, A. Dif- implicit models, 2020.
fusevae: Efficient, controllable and high-fidelity gen-
Song, Y. and Ermon, S. Generative modeling by es-
eration from low-dimensional latents. arXiv preprint
timating gradients of the data distribution. CoRR,
arXiv:2201.00308, 2022.
abs/1907.05600, 2019. URL http://arxiv.org/
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., abs/1907.05600.
Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
Song, Y. and Ermon, S. Improved techniques for training
Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cour-
score-based generative models. CoRR, abs/2006.09011,
napeau, D., Brucher, M., Perrot, M., and Duchesnay, E.
2020. URL https://arxiv.org/abs/2006.
Scikit-learn: Machine learning in Python. Journal of
09011.
Machine Learning Research, 12:2825–2830, 2011.
Preechakul, K., Chatthee, N., Wizadwongsa, S., and Suwa- Song, Y., Durkan, C., Murray, I., and Ermon, S. Maxi-
janakorn, S. Diffusion autoencoders: Toward a meaning- mum likelihood training of score-based diffusion mod-
ful and decodable representation. In Proceedings of the els, 2021a. URL https://arxiv.org/pdf/2101.
IEEE/CVF Conference on Computer Vision and Pattern 09258v1.
Recognition, pp. 10619–10629, 2022. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Er-
Radford, A., Metz, L., and Chintala, S. Unsupervised rep- mon, S., and Poole, B. Score-based generative modeling
resentation learning with deep convolutional generative through stochastic differential equations, 2021b.
adversarial networks, 2016. van den Oord, A., Vinyals, O., and Kavukcuoglu,
Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic K. Neural discrete representation learning. CoRR,
backpropagation and approximate inference in deep gen- abs/1711.00937, 2017. URL http://arxiv.org/
erative models. In International conference on machine abs/1711.00937.
learning, pp. 1278–1286. PMLR, 2014.
Vincent, P. A connection between score matching and de-
Rousseeuw, P. J. Silhouettes: A graphical aid to noising autoencoders. Neural Computation, 23(7):1661–
the interpretation and validation of cluster anal- 1674, 2011. doi: 10.1162/NECO a 00142.
ysis. Journal of Computational and Applied
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-
Mathematics, 20:53–65, 1987. ISSN 0377-0427.
A. Extracting and composing robust features with de-
doi: https://doi.org/10.1016/0377-0427(87)90125-7.
noising autoencoders. In Proceedings of the 25th Inter-
URL https://www.sciencedirect.com/
national Conference on Machine Learning, ICML ’08,
science/article/pii/0377042787901257.
pp. 1096–1103, New York, NY, USA, 2008. Associa-
Saremi, S., Mehrjou, A., Schölkopf, B., and Hyvärinen, tion for Computing Machinery. ISBN 9781605582054.
A. Deep energy estimator networks. arXiv preprint doi: 10.1145/1390156.1390294. URL https://doi.
arXiv:1805.08306, 2018. org/10.1145/1390156.1390294.

10
Diffusion Based Representation Learning

Vinyals, O., Blundell, C., Lillicrap, T. P., Kavukcuoglu,

K., and Wierstra, D. Matching networks for one shot
learning. CoRR, abs/1606.04080, 2016. URL http:
//arxiv.org/abs/1606.04080.
Zhang, H., Cissé, M., Dauphin, Y. N., and Lopez-Paz, D.
mixup: Beyond empirical risk minimization. CoRR,
abs/1710.09412, 2017. URL http://arxiv.org/
abs/1710.09412.
Zhang, Q. and Zhang, L. Convolutional adaptive denoising
autoencoders for hierarchical feature extraction. Front.
Comput. Sci., 12(6):1140–1148, dec 2018. ISSN 2095-
2228. doi: 10.1007/s11704-016-6107-0. URL https:
//doi.org/10.1007/s11704-016-6107-0.

11
Diffusion Based Representation Learning

A. Related work on contrastive learning

The core idea of contrastive learning is to learn representations that are similar for different views of the same image and
distant for different images. In order to prevent the collapse of representations to a constant, various approaches have been
introduced. SimCLRv2 directly includes a loss term repulsing negative image pairs in addition to the attraction of different
views of positive pairs (Chen et al., 2020d)). In contrast, BYOL relies solely on positive pairs, preventing collapse by
enforcing similarity between the encoded representation of an image and the output of a momentum encoder applied to a
different view of the same image (Grill et al., 2020). An additional approach relies on online clustering and was proposed in
SwAV (Caron et al., 2021). Training in SwAV is based on enforcing consistency between cluster assignments produced for
different views of an image. Each of these methods relies on the foundation of Siamese networks (Bromley et al., 1993),
which were shown to be competitive for unsupervised pretraining for classification networks on its own when including a
stop-gradient operation on one of the branches (Chen & He, 2020).

B. Denoising Score Matching

The following is the proof for the new formulation of the denoising score matching objective in Equation 6.

Proof. It was shown by (Vincent, 2011) that Equation 4 is equal to explicit score matching up to a constant which is
independent of θ, that is,

Ex0 {Ext |x0 [∥sθ (xt , t) − ∇xt log p0t (xt |x0 )∥22 ]} (12)
= Ext ∥sθ (xt , t) − ∇xt log pt (xt )∥22 + c.

(13)

As a consequence, the objective is minimized when the model equals the ground-truth score function sθ (xt , t) =
∇x log pt (x). Hence we have:

Ex0 {Ext |x0 [∥∇xt log pt (xt ) − ∇xt log p0t (xt |x0 )∥22 ]} (14)
= Ext ∥∇xt log pt (xt ) − ∇xt log pt (xt )∥22 + c

(15)
= c. (16)

Combining these results leads to the claimed exact formulation of the Denoising Score Matching objective:

JtDSM (θ) = Ex0 {Ext |x0 [∥sθ (xt , t) − ∇xt log p0t (xt |x0 )∥22 ]} (17)
= Ext ∥sθ (xt , t) − ∇xt log pt (xt )∥22 + c

(18)
= Ext ∥sθ (xt , t) − ∇xt log pt (xt )∥22

(19)
+ Ex0 {Ext |x0 [∥∇xt log pt (xt ) − ∇xt log p0t (xt |x0 )∥22 ]}
=Ex0 {Ext |x0 [∥∇xt log p0t (xt |x0 ) − ∇xt log pt (xt )∥22
(20)
+ ∥sθ (xt , t) − ∇xt log pt (xt )∥22 ]}.

C. Representation learning
Here we present the proof for Proposition 2.1, stating that the infinite-dimensional code learned using DRL is at least as
good as a static code learned using a reconstruction objective.

Proof.
R We assume that the distribution of the diffused samples at time t = T matches a known prior pT (xT ). That is,
p(x0 )p0T (xT |x0 ) dx0 = pT (xT ). In practice T is chosen such that this assumption approximately holds.
Now consider the training objective in Equation 9 at time T , which can be transformed to a reconstruction objective in the

12
Diffusion Based Representation Learning

following way:
h i
2
λ(T )Ex0 ,xT ∥sθ (xT , T, Eϕ (x0 , T )) − ∇xT log p0T (xT |x0 )∥2 (21)
" #
2
x0 − xT
=λ(T )Ex0 ExT ∼pT (xT ) sθ (xT , T, Eϕ (x0 , T )) − 2 (22)
σ (T ) 2
h i
2
=λ(T )σ −4 (T )Ex0 ExT ∼pT (xT ) ∥Dθ (Eϕ (x0 , T )) − x0 ∥2 (23)
h i
2
=λ(T )σ −4 (T )Ex0 ∥Dθ (Eϕ (x0 , T )) − x0 ∥2 , (24)

0 D (E (x ,T ))−x
T
where we replaced the score model with a Decoder model sθ (xT , T, Eϕ (x0 , T )) = θ ϕσ2 (T ) and replaced the
x0 −xT
score function of the perturbation kernel ∇xT log p0T (xT |x0 ) with its known closed-form solution σ2 (T ) determined by
the Forward SDE in Equation 1. Hence the learned code at time t = T is equal to a code learned using a reconstruction
objective.
We model a downstream task as a minimization problem of a distance d : Ω × Ω → R in the feature space Ω between
the true feature extractor g : Rd → Ω which maps data samples x0 to a features space Ω and a model feature extractor
hψ : Rc → Ω doing the same given the code as input. The following shows that the infinite-dimensional representation is at
least as good as the static code:
inf min Ex0 [d(hψ (Eϕ (x0 , t)), g(x0 ))] ≤ min Ex0 [d(hψ (Eϕ (x0 , T )), g(x0 ))] (25)
t ψ ψ

Figure 6. Samples generated starting from xt (left column) using the diffusion model with the latent code of another x0 (top row) as input.
It shows that samples are denoised correctly only when conditioning on the latent code of the corresponding original image x0 .

D. Architecture and Hyperparameters

The model architecture we use for all experiments is based on “DDPM++ cont. (deep)” used for CIFAR-10 in (Song et al.,
2021b). It is composed of a downsampling and an upsampling block with residual blocks at multiple resolutions. We did

13
Diffusion Based Representation Learning

(a) 2-dimensional (b) 4-dimensional

(c) 8-dimensional (d) 16-dimensional

Figure 7. Samples generated using the same latent code for each generation, showing that the randomness of the code-conditional
generation of DRL reduces in higher dimensional latent spaces.

not change any of the hyperparameters of the optimizer. Depending on the dataset, we adjusted the number of resolutions,
number of channels per resolution, and the number of residual blocks per resolution in order to reduce training time.
For representation learning, we use an encoder with the same architecture as the downsampling block of the model, followed
by another three dense layers mapping to a low dimensional latent space. Another four dense layers map the latent code
back to a higher-dimensional representation. It is then given as input to the model in the same way as the time embedding.
That is, each channel is provided with a conditional bias determined by the representation and time embedding at multiple
stages of the downsampling and upsampling block.

Regularization of the latent space For both datasets, we use a regularization weight of 10−5 when applying L1-
regularization, and a weight of 10−7 when using a probabilistic encoder regularized with KL-Divergence.

MNIST hyperparameters Due to the simplicity of MNIST, we only use two resolutions of size 28 × 28 × 32 and
14 × 14 × 64, respectively. The number of residual blocks at each resolution is set to two. In each experiment, the model is
trained for 80k iterations. For a uniform sampling of σ we trained the models for an additional 80k iterations with a frozen
encoder and uniform sampling of t.

14
Diffusion Based Representation Learning

30 SM SM
VSM 38 VSM
AE AE
28 VAE 37 VAE
Accuracy in %

Accuracy in %
36
26
35
24 34
22 33
32
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
t t
(a) 100 labels (b) 1000 labels

Figure 8. Classifier accuracies for few shot learning on given 8-dimensional representations learned using DRL (SM), VDRL (VSM),
Autoencoder (AE) and Variational Autoencoder (VAE).

CIFAR-10 hyperparameters For the silhouette score analysis, we use three resolutions of size 32 × 32 × 32, 16 × 16 × 32,
and 8 × 8 × 32, again with only two residual blocks at each resolution. Each model is trained for 90k iterations.

CIFAR-10 (deep) hyperparameters While representation learning works for small models already, sample quality on
CIFAR-10 is poor for models of the size described above. Thus for models used to generate samples, we use eight residual
blocks per resolution and the following resolutions: 32 × 32 × 32, 16 × 16 × 64, 8 × 8 × 64, and 4 × 4 × 64. Each model
is trained for 300k iterations. Note that this number of iterations is not sufficient for convergence, however capable of
illustrating the representation learning with limited computational resources.

E. Evaluation of the infinite-dimensional representation

In order to evaluate our infinite-dimensional representation, we conduct an ablation study where we compare our proposed
method with Autoencoders (AE) and Variational Autoencoders (VAE) on CIFAR-10 images. We measure the accuracy of
an SVM provided by sklearn (Pedregosa et al., 2011) with default hyperparameters trained on the representation of 100
(resp. 1000) training samples and their class labels. For our time-dependent representation, this is done for fixed values of t
between 0.0 and 1.0 in steps of 0.1. This is done for both DRL and VDRL, where we use a probabilistic encoder regularized
by including an additional KL-Divergence term in the training objective. DRL and AE were regularized using L1-norm, and
the regularization weight was optimized for each model independently.
Results for few-shot learning with fixed representations are shown in Figure 8. As expected, the accuracies when training on
the score matching representations highly depend on the value of t. Overall our representation achieves much better scores
when using the best t, and performs comparable to AE and VAE for t = 1.0. This aligns with Proposition 2.1 claiming that
our representation learning method for t = 1.0 is similar to a static code learned using reconstruction objective. Note that
the shape of the time-dependent classifier accuracies resembles the one of the silhouette scores of CIFAR-10 in 12. This is
not surprising, since both training on single values of t and learning a time-dependent representation are both trained to find
the optimal representation for a given value of t. We further want to point out that representation learning through score
matching enjoys the training stability of diffusion-based generative models, which is often not the case in GANs and VAEs.

F. Downstream Image Classification

Architecture and Hyperparameters In all our experiments, we consider the small WideResNet model WRN-28-2 of
(Sellars et al., 2021) as the encoder module for all of the different settings: diffusion representation learning, autoencoder
and contrastive learning. We sample the time-steps at intervals of 0.1 from the range 0.0 − 1.0. Corresponding to each
time-step, we train a single layered non-linear MLP network for 50 epochs.

15
Diffusion Based Representation Learning

CIFAR10 CIFAR100 Mini ImageNet

18
60 20
16

50 14 Model
15 DRL
Accuracy

12 VDRL
40 10 AE
VAE
10 SimCLR
8 SimCLR-Gauss
30
6
5
20 4

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Time Time Time

Figure 9. Comparing the low-data regime (1000 labels) downstream performance of the proposed diffusion-based representations (DRL
and VDRL) with the baselines that include autoencoder (AE), variational autoencoder (VAE), simple contrastive learning (simCLR) and
its restricted variant (simCLR-Gauss) which exclude domain-specific data augmentation from the original simCLR algorithm.

CIFAR10 CIFAR100 Mini ImageNet

70 35
30
60 30
25
Model
25 DRL
Accuracy

50 20 VDRL
20 AE
VAE
40 15 15 SimCLR
SimCLR-Gauss
30 10 10

5
5
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Time Time Time

Figure 10. Comparing the low-data regime (5000 labels) downstream performance of the proposed diffusion-based representations (DRL
and VDRL) with the baselines that include autoencoder (AE), variational autoencoder (VAE), simple contrastive learning (simCLR) and
its restricted variant (simCLR-Gauss) which exclude domain-specific data augmentation from the original simCLR algorithm.

CIFAR10 CIFAR100 Mini ImageNet

40 35
70
35
30
60
30 Model
25 DRL
Accuracy

50 25 VDRL
20 AE
20 VAE
40 SimCLR
15 15 SimCLR-Gauss

30 10 10
5
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Time Time Time

Figure 11. Comparing the low-data regime (10000 labels) downstream performance of the proposed diffusion-based representations (DRL
and VDRL) with the baselines that include autoencoder (AE), variational autoencoder (VAE), simple contrastive learning (simCLR) and
its restricted variant (simCLR-Gauss) which exclude domain-specific data augmentation from the original simCLR algorithm.

16
Diffusion Based Representation Learning

Pretraining using
Dataset #labels No pretraining Improvement
DRL
CIFAR-10 100 64.12 69.79 +5.67
500 86.24 88.28 +2.04
1000 87.48 88.56 +1.08
2000 89.99 89.52 -0.47
4000 90.15 91.13 +0.98
CIFAR-100 1000 45.14 48.04 +2.90
4000 59.86 60.34 +0.48
10000 64.83 65.80 +0.97
20000 65.77 66.39 +0.62
MiniImageNet 4000 47.18 50.75 +3.57
10000 58.66 58.62 -0.04

Table 2. Classifier accuracy in % with and without DRL as pretraining of the classifier when training for 100 epochs only.

Results with Limited Data We perform additional experiments where the encoder system is as before and kept frozen,
but the MLP can only access a fraction of the training set for the downstream supervised classification task. We ablate over
three different number of labels provided to the MLP: 1000, 5000 and 10000. The results for the different datasets can be
seen in Figures 9-11 which shows that the trends are consistent even in low data regime.

G. Semi-supervised image classification

Architecture and Hyperparameters In all experiments, our encoder has the same architecture as the classifier, where
the hidden layer used to measure similarities for assigning pseudo-labels in LaplaceNet is used as the latent code in
representation learning. For all experiments, the input t to the encoder is included as a trainable parameter of the model and
initialized with t = 0.5. As done in the original paper, we train the model for 260 iterations, where each iteration consists of
assigning pseudo-labels and one epoch of supervised training on the assigned pseudo-labels. The training is preceded by
100 supervised epochs on the labeled data. We use the small WideResNet model WRN-28-2 of (Sellars et al., 2021) and the
same hyperparameters as the authors.

Evaluation with limited computation time In the following we include more detailed analysis of the scenario of a few
supervised labels and limited computational resources. Besides LaplaceNet and its version without mixup, we include an
ablation study of encoder pretraining as part of an autoencoder using binary cross-entropy as a reconstruction objective. In
addition, we propose to improve the search for the optimal value of t by the model selection, since the gradient for t is usually
noisy and small. Thus we include additional experiments where we chose the initial t based on the minimum training loss
after 100 epochs of supervised training. The optimal t is approximated by calculating the training loss for 11 equally spaced
values of t in the interval [0.001, 1]. The results are shown in Table 3. While mixup achieves no significant improvement
in the few-label case trained using 100 epochs, we can see that a simple autoencoder pretraining consistently improves
classifier accuracy. More notably, however, our proposed pretraining based on score matching achieves significantly better
results than both random initialization and autoencoder pretraining. In the t-search, we observed that for all datasets, our
proposed method selects t = 0.9, however it moves towards the interval [0.4, 0.6] during training. While this shows that
the approach of selecting t based on supervised training loss is not working, it demonstrates that the parameter t can very
well be learned in the training process, making the downstream task performance robust to the initial value of t. In our
experiments the final value of t was always in the range [0.4, 0.6], independent of the initial value of t.

H. Training on single timescales

To understand the effect of training DRL on different timescales more clearly, we limit the support of the weighting function
λ(t) to a single value of t. We analyze the resulting quality of the latent representation for different values of t using the
silhouette score with euclidean distance based on the dataset classes (Rousseeuw, 1987). It compares the average distance
between a point to all other points in its cluster with the average distance to points in the nearest different cluster. Thus

17
Diffusion Based Representation Learning

CIFAR-10 CIFAR-100 MiniImageNet

Pretraining Options
100 labels 1000 labels 4000 labels
None 64.12 45.14 47.18
None mixup 54.06 46.28 47.64
DRL 69.79 48.04 50.75
DRL t-search 67.07 47.08 50.31
Autoencoder 64.99 46.88 48.52

Table 3. Comparison of classifier accuracy in % for different pretraining methods in the case of few supervised labels when training for
100 epochs only.

CIFAR-10 CIFAR-100 MiniImageNet

Pretraining
100 labels 1000 labels 4000 labels
None 73.68 55.58 58.40
DRL 74.31 55.85 58.95
Autoencoder 58.84 55.41 57.93

Table 4. Classifier accuracy in % for autoencoder pretraining compared with the baseline and score matching as pretraining. No mixup is
applied for this ablation study.

Ours Ours Ours Ours Ours

Pretraining Basic Basic Mixup-DRL VDRL VDRL
Mixup in sup. training No Yes Yes No Yes
Dataset #labels
CIFAR-10 100 74.31 64.67 70.40 81.63 77.51
500 92.70 92.31 92.55 92.79 91.46
1000 93.24 93.42 93.14 93.60 93.33
2000 94.18 93.91 93.80 93.96 94.27
4000 94.75 95.22 94.75 95.00 94.87
CIFAR-100 1000 55.85 55.74 55.15 56.47 55.65
4000 67.22 67.47 67.09 67.54 67.52
10000 73.31 73.66 74.36 73.50 73.20
20000 76.46 76.88 77.04 76.64 76.68
MiniImageNet 4000 58.95 59.29 59.46 59.14 59.36
10000 67.31 66.63 67.31 67.46 66.79

Table 5. Evaluation of classifier accuracy in %, including the setting of using mixup during pretraining (right column). DRL pretraining is
our proposed representation learning, and ”Mixup-DRL” the respective version which additionally applies mixup during pretraining.
”VDRL” instead uses a probabilistic encoder.

18
Diffusion Based Representation Learning

0.10
0.2 0.08
Silhouette score

Silhouette score
0.1 0.06
0.04
0.0 0.02
0.00
0.1
0.02
0.2 0.4 0.6 0.8 1.0 0.2 0.4 0.6 0.8 1.0
t t
(a) MNIST (b) CIFAR-10

Figure 12. Mean and standard deviation of silhouette scores when training a DRL model on MNIST (left) and CIFAR-10 (right) using a
single t over three runs.

Gaussian Uniform + Gaussian

tinit σ(tinit )
FID ↓ FID ↓
0.5 0.71 218.95 25.02
0.6 1.66 75.11 5.15
0.7 3.88 12.57 2.98
0.8 9.10 3.05 2.99
0.9 21.33 2.97 2.94
1.0 50.00 3.01 2.99

Table 6. FID for different initial noise scales evaluated on 20k generated samples.

we measure how well the latent representation encodes classes, ignoring any other features. Note that after learning the
representation with a different distribution of t it is necessary to perform additional training with a uniform sampling of t
and a frozen encoder to achieve good sample quality.
Figure 12 shows the silhouette scores of latent codes of MNIST and CIFAR-10 samples for different values of t. In alignment
with our hypothesis of Section 2.1, training DRL on a small t and thus low noise levels leads to almost no encoded class
information in the latent representation, while the opposite is the case for a range of t which differs between the two
datasets. The decline in encoded class information for high values of t can be explained by the vanishing difference between
distributions of perturbed samples when t gets large. This shows that the distinction among the code classes represented by
the silhouette score is controlled by λ(t).

I. The choice of the initial noise scale

In the following, we evaluate image quality and diversity for different initial noise scales for CIFAR-10 dataset. Note that
we do not change σ(T ), but instead evaluate generated images for different initial times tinit , which implicitly define the
initial noise scale σ(tinit ). This reduces the number of sampling steps per image, which is 1000 × tinit and thus directly
proportional to tinit . Table 6 shows the FID of generated images for various values of tinit . As we can see, the first 200
sampling steps can safely be replaced by approximating the prior directly either with the Gaussian or the additional uniform
distribution. Interestingly, using the sum of the uniform and Gaussian random variables as a prior leads to improved image
quality. This approximation for p0.7 (x) allows us to reduce the number of sampling steps by 30% without sacrificing
image quality, which is further supported by the visual quality of generated samples shown in Figure 13. Further, note that
FID is occasionally lower for values of tinit < 1.0 than for tinit = 1. This suggests that up to these timescales, our prior
approximates the distribution better than the diffusion model when starting at tinit = 1.0.

19
Diffusion Based Representation Learning

(a) tinit = 0.5 (b) tinit = 0.6 (c) tinit = 0.7 (d) tinit = 0.8 (e) tinit = 0.9 (f) tinit = 1.0

(g) tinit = 0.5 (h) tinit = 0.6 (i) tinit = 0.7 (j) tinit = 0.8 (k) tinit = 0.9 (l) tinit = 1.0

Figure 13. Generated image samples for different values of tinit . Top row ((a)-(f)) uses the Gaussian prior, bottom row ((g)-(l)) uses the
version with an additional uniform random variable in the prior.

Momentum and Impulse
No ratings yet
Momentum and Impulse
4 pages
Chung_MedIA_2022
No ratings yet
Chung_MedIA_2022
19 pages
Consistency Models
No ratings yet
Consistency Models
42 pages
2302.07194v1
No ratings yet
2302.07194v1
52 pages
Consistency Models
No ratings yet
Consistency Models
41 pages
2024 - Denoising Autoregressive Representation Learning - Li Et Al
No ratings yet
2024 - Denoising Autoregressive Representation Learning - Li Et Al
22 pages
1-EFFECTIVE DATA AUGMENTATION WITH DIFFUSION MODELS
No ratings yet
1-EFFECTIVE DATA AUGMENTATION WITH DIFFUSION MODELS
23 pages
stimper22a (1)
No ratings yet
stimper22a (1)
22 pages
2304.12824v2
No ratings yet
2304.12824v2
31 pages
Vat D
No ratings yet
Vat D
11 pages
Stochastic Iterative Graph Matching
No ratings yet
Stochastic Iterative Graph Matching
11 pages
Stable Diffusion 3 Paper
No ratings yet
Stable Diffusion 3 Paper
28 pages
An analytic theory
No ratings yet
An analytic theory
32 pages
Generative Pretraining From Pixels
No ratings yet
Generative Pretraining From Pixels
13 pages
Zero Shot Text To Image Generation (DALL E)
No ratings yet
Zero Shot Text To Image Generation (DALL E)
20 pages
4 PLMS
No ratings yet
4 PLMS
24 pages
[2024-ICML] Variational Schrodinger Diffusion Models
No ratings yet
[2024-ICML] Variational Schrodinger Diffusion Models
24 pages
gan_diffusion
No ratings yet
gan_diffusion
9 pages
1 s2.0 S095741742302746X Main
No ratings yet
1 s2.0 S095741742302746X Main
13 pages
2502.09622v1
No ratings yet
2502.09622v1
32 pages
5450 Diffusion Model Augmented Beha (1)
No ratings yet
5450 Diffusion Model Augmented Beha (1)
25 pages
Our Classifier Is Secretly An Energy Based Model and You Should Treat It Like One
No ratings yet
Our Classifier Is Secretly An Energy Based Model and You Should Treat It Like One
23 pages
From Denoising Diffusions To Denoising Markov Models
No ratings yet
From Denoising Diffusions To Denoising Markov Models
55 pages
Cdtrans Cross-Domain Transformer For Unsupervised Domain Adaptation
No ratings yet
Cdtrans Cross-Domain Transformer For Unsupervised Domain Adaptation
14 pages
Score-Based Continuous-Time Discrete Diffusion Models
No ratings yet
Score-Based Continuous-Time Discrete Diffusion Models
16 pages
2021 emnlp-main 92论文
No ratings yet
2021 emnlp-main 92论文
14 pages
3564_generalization_for_discriminat-Supplementary Material
No ratings yet
3564_generalization_for_discriminat-Supplementary Material
19 pages
2023 - Diffusion-GAN - Wang Et Al
No ratings yet
2023 - Diffusion-GAN - Wang Et Al
26 pages
2020- 【Hard Negative】 - CONTRASTIVE LEARNING WITH Hard Negative Samples
No ratings yet
2020- 【Hard Negative】 - CONTRASTIVE LEARNING WITH Hard Negative Samples
28 pages
Dis Co
No ratings yet
Dis Co
40 pages
Approximations To The Fisher Information Metric of Deep Generative Models For Out-Of-Distribution Detection
No ratings yet
Approximations To The Fisher Information Metric of Deep Generative Models For Out-Of-Distribution Detection
32 pages
1 s2.0 S0893608023005695 Main
No ratings yet
1 s2.0 S0893608023005695 Main
14 pages
Progressive Distillation For Fast Sampling
No ratings yet
Progressive Distillation For Fast Sampling
21 pages
33370-Article Text-37438-1-2-20250410
No ratings yet
33370-Article Text-37438-1-2-20250410
9 pages
Denoising Diffusion Implicit Models
No ratings yet
Denoising Diffusion Implicit Models
22 pages
InfoNCE
No ratings yet
InfoNCE
9 pages
A Generalization of Transformer Networks To Graphs
No ratings yet
A Generalization of Transformer Networks To Graphs
8 pages
2304.00388 Multilevel CNNs For Parametric PDEs
No ratings yet
2304.00388 Multilevel CNNs For Parametric PDEs
42 pages
Geometric Latent Diffusion Model
No ratings yet
Geometric Latent Diffusion Model
18 pages
28682-Article Text-32736-1-2-20240324
No ratings yet
28682-Article Text-32736-1-2-20240324
11 pages
Understanding Diffusion Models by Feynman's Path Integral
No ratings yet
Understanding Diffusion Models by Feynman's Path Integral
27 pages
Diffusion time series prediction (general)
No ratings yet
Diffusion time series prediction (general)
30 pages
contrastive diffuser
No ratings yet
contrastive diffuser
16 pages
Understanding Self-Supervised Learning Dynamics Without Contrastive Pairs
No ratings yet
Understanding Self-Supervised Learning Dynamics Without Contrastive Pairs
23 pages
Paper About AI 1
No ratings yet
Paper About AI 1
28 pages
Spectral Normalization For GANs
No ratings yet
Spectral Normalization For GANs
26 pages
Technical paper on Sequential Recommendation
No ratings yet
Technical paper on Sequential Recommendation
9 pages
Spectral Journey - How Transformers Predict the Shortest Path
No ratings yet
Spectral Journey - How Transformers Predict the Shortest Path
12 pages
2501.19383v1
No ratings yet
2501.19383v1
25 pages
S, S & S C - T C M: Implifying Tabilizing Caling Ontinuous IME Onsistency Odels
No ratings yet
S, S & S C - T C M: Implifying Tabilizing Caling Ontinuous IME Onsistency Odels
38 pages
Vi Transformer
No ratings yet
Vi Transformer
21 pages
CADS Unleashing The Diversity of Diffusion Models Through Condition Annealed Sampling Paper 1
No ratings yet
CADS Unleashing The Diversity of Diffusion Models Through Condition Annealed Sampling Paper 1
33 pages
s41598-023-37933-0
No ratings yet
s41598-023-37933-0
17 pages
Levy Improving Distributional
No ratings yet
Levy Improving Distributional
16 pages
How To Compare Noisy Patches? Patch Similarity Beyond Gaussian Noise
No ratings yet
How To Compare Noisy Patches? Patch Similarity Beyond Gaussian Noise
30 pages
SUMSEM2023-24 CSI3901 ETH VL2023240701291 2024-06-06 Reference-Material-I
No ratings yet
SUMSEM2023-24 CSI3901 ETH VL2023240701291 2024-06-06 Reference-Material-I
10 pages
Data Augmentation With Variational Autoencoder
No ratings yet
Data Augmentation With Variational Autoencoder
12 pages
Write and Paint
No ratings yet
Write and Paint
25 pages
Exchanging-based Multimodal Fusion with Transformer
No ratings yet
Exchanging-based Multimodal Fusion with Transformer
13 pages
A Survey of Graph Neural Networks in Various Learning Paradigms Methods, Applications, and Challenges
No ratings yet
A Survey of Graph Neural Networks in Various Learning Paradigms Methods, Applications, and Challenges
70 pages
Mesh Generation: Advances and Applications in Computer Vision Mesh Generation
From Everand
Mesh Generation: Advances and Applications in Computer Vision Mesh Generation
Fouad Sabry
No ratings yet
A Comprehensive Survey On Deep Graph Representation Learning
No ratings yet
A Comprehensive Survey On Deep Graph Representation Learning
85 pages
CircularScribbleArts Small
No ratings yet
CircularScribbleArts Small
1 page
Vision Transformer Adapter For Dense Predictions
No ratings yet
Vision Transformer Adapter For Dense Predictions
20 pages
NeurIPS 2022 Poisson Flow Generative Models Supplemental Conference
No ratings yet
NeurIPS 2022 Poisson Flow Generative Models Supplemental Conference
33 pages
Interacting Particle Solutions of Fokker-Planck Equations Through Gradient-Log-Density Estimation
No ratings yet
Interacting Particle Solutions of Fokker-Planck Equations Through Gradient-Log-Density Estimation
34 pages
Cascaded Diffusion Models For High Fidelity Image Generation
No ratings yet
Cascaded Diffusion Models For High Fidelity Image Generation
33 pages
Fokker-Planck - and Langevin Equat
No ratings yet
Fokker-Planck - and Langevin Equat
16 pages
Towards Better Dynamic Graph Learning - New Architecture and Unified Library
No ratings yet
Towards Better Dynamic Graph Learning - New Architecture and Unified Library
24 pages
Stable Diffusion A Tutorial
100% (1)
Stable Diffusion A Tutorial
66 pages
WebtoonMe - A Data-Centric Approach For Full-Body Portrait Stylization
No ratings yet
WebtoonMe - A Data-Centric Approach For Full-Body Portrait Stylization
8 pages
‎‫מזמורי תהלים- תקציר‬‎
No ratings yet
‎‫מזמורי תהלים- תקציר‬‎
3 pages
PINT-Provably Expressive Temporal Graph Networks
No ratings yet
PINT-Provably Expressive Temporal Graph Networks
32 pages
PFGM++ - Unlocking The Potential of Physics-Inspired Generative Models
No ratings yet
PFGM++ - Unlocking The Potential of Physics-Inspired Generative Models
23 pages
Quadrics - Linear Algebra, Analytic Geometry, Differential Geometry
No ratings yet
Quadrics - Linear Algebra, Analytic Geometry, Differential Geometry
8 pages
2024學而思盃備考練習-K2【水印版】- 題解
No ratings yet
2024學而思盃備考練習-K2【水印版】- 題解
37 pages
High Dive Unit Problem
0% (1)
High Dive Unit Problem
8 pages
Tullio Levi Civita
No ratings yet
Tullio Levi Civita
5 pages
Place-Value Concepts: Intensive Intervention
No ratings yet
Place-Value Concepts: Intensive Intervention
54 pages
TD Matrices-Determinants
No ratings yet
TD Matrices-Determinants
2 pages
Aptitude Book - 240502 - 134150
No ratings yet
Aptitude Book - 240502 - 134150
212 pages
WEEK 4 College Algebra
100% (1)
WEEK 4 College Algebra
11 pages
MATH 223 Assignment 5A
No ratings yet
MATH 223 Assignment 5A
1 page
CALC TECH (Draft)
100% (1)
CALC TECH (Draft)
14 pages
Math Thesis in The Philippines
100% (3)
Math Thesis in The Philippines
5 pages
Assignment (1)
No ratings yet
Assignment (1)
3 pages
Final Departmental
No ratings yet
Final Departmental
3 pages
Curves in Space
No ratings yet
Curves in Space
9 pages
Understanding The Derivative Homework Packet Answers
100% (1)
Understanding The Derivative Homework Packet Answers
5 pages
MATLAB Examples - Mathematics
No ratings yet
MATLAB Examples - Mathematics
53 pages
BTech Aerospace Engineering Syllabus
No ratings yet
BTech Aerospace Engineering Syllabus
2 pages
CH 02 - Scalars & Vectors
No ratings yet
CH 02 - Scalars & Vectors
8 pages
Fermat and Euler's Theorems
No ratings yet
Fermat and Euler's Theorems
23 pages
Binary Arithmetic and Arithmetic Circuits-1
No ratings yet
Binary Arithmetic and Arithmetic Circuits-1
17 pages
Paper 1 Topical Examination Questions by KDC 2024
No ratings yet
Paper 1 Topical Examination Questions by KDC 2024
51 pages
Basic Rules of Differentiation
No ratings yet
Basic Rules of Differentiation
33 pages
Quiz - 2 (Page 2 of 4)
100% (1)
Quiz - 2 (Page 2 of 4)
10 pages
Mathematics Paper 1 HL
No ratings yet
Mathematics Paper 1 HL
16 pages
LQ2 Gen Math
No ratings yet
LQ2 Gen Math
2 pages
Teaching Exercise Matlab
No ratings yet
Teaching Exercise Matlab
28 pages
Maths P1 PP 2017
No ratings yet
Maths P1 PP 2017
24 pages
Untitled
No ratings yet
Untitled
13 pages

Diffusion Based Representation Learning

Uploaded by

Diffusion Based Representation Learning

Uploaded by

Diffusion Based Representation Learning

which is a bottleneck in diffusion-based generative

We first give a brief overview of the technical background

CIFAR10 CIFAR100 Mini ImageNet

CIFAR10 CIFAR100 Mini ImageNet

References Dhariwal, P. and Nichol, A. Diffusion models beat gans on

Vinyals, O., Blundell, C., Lillicrap, T. P., Kavukcuoglu,

A. Related work on contrastive learning

B. Denoising Score Matching

D. Architecture and Hyperparameters

(a) 2-dimensional (b) 4-dimensional

(c) 8-dimensional (d) 16-dimensional

E. Evaluation of the infinite-dimensional representation

F. Downstream Image Classification

CIFAR10 CIFAR100 Mini ImageNet

CIFAR10 CIFAR100 Mini ImageNet

CIFAR10 CIFAR100 Mini ImageNet

G. Semi-supervised image classification

H. Training on single timescales

CIFAR-10 CIFAR-100 MiniImageNet

CIFAR-10 CIFAR-100 MiniImageNet

Ours Ours Ours Ours Ours

Gaussian Uniform + Gaussian

I. The choice of the initial noise scale

You might also like