Boosting Latent Diffusion With Perceptual Objectives
Tariq Berrada Ifriqi1,2 , Pietro Astolfi1 , Jakob Verbeek1 , Melissa Hall1 , Marton Havasi1 , Michal Drozdzal1 ,
Yohann Benchetrit1 , Adriana Romero-Soriano1,3,4,5 , Karteek Alahari2
1 FAIR at Meta, 2 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, France, 3 McGill University, 4 Mila, Quebec AI Institute, 5 Canada CIFAR AI Chair
Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models. LDMs learn the data distribution in the latent space of an autoencoder (AE) and produce images by mapping the generated latents into RGB image space using the AE decoder. While this approach allows for efficient model training and sampling, it induces a disconnect between the training of the diffusion model and the decoder, resulting in a loss of detail in the generated images. To remedy this disconnect, we propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL). This loss encourages the models to create sharper and more realistic images. Our loss can be seamlessly integrated with common autoencoders used in latent diffusion models, and can be applied to different generative modeling paradigms such as DDPM with epsilon and velocity prediction, as well as flow matching. Extensive experiments with models trained on three datasets at 256 and 512 resolution show improved quantitative results – with boosts between 6% and 20% in FID – as well as qualitative improvements when using our perceptual loss.
Figure 1 Samples from models trained with and without our latent perceptual loss on CC12M. Samples
from our model with latent perceptual loss (bottom) have more detail and realistic textures.
1 Introduction
Latent diffusion models (LDMs) (Rombach et al., 2022) have enabled considerable advances in image generation, and elevated the problem of generative image modeling to a level where it has become available as a technology to the public. A critical part of this success is to define the generative model in the latent space of an autoencoder (AE), which reduces the resolution of the representation over which the model is defined, thereby making it possible to scale diffusion methods to larger datasets, resolutions, and architectures than the original pixel-based diffusion models (Dhariwal and Nichol, 2021; Sohl-Dickstein et al., 2015).
To train an LDM, all images are first projected into a latent space with the encoder of a pre-trained autoencoder, and the diffusion model is then optimized directly in this latent space. Note that when learning the diffusion model the AE decoder is not used – the diffusion model does not receive any training feedback that would ensure that all latent values reachable by the diffusion process decode to a high-quality image. This training procedure leads to a disconnect between the diffusion model and the AE decoder, prompting the LDM to produce low-quality images that oftentimes lack high-frequency image components. Moreover, we note that the latent spaces of pre-trained LDM autoencoders tend to be highly irregular, in the sense that small changes in the latent space can lead to large changes in the generated images, further exacerbating the autoencoder-diffusion disconnect problem.
In this work, we propose to alleviate this autoencoder-diffusion disconnect by including the AE decoder in the training objective of the LDM. In particular, we introduce a latent perceptual loss (LPL) that acts on the decoder's intermediate features to enrich the training signal of the LDM. This is similar to the use of perceptual losses for image-to-image translation tasks (Johnson et al., 2016; Zhang et al., 2018), but we apply this idea in the context of generative modeling and use the feature space of the pre-trained AE decoder rather than that of an external pre-trained discriminative network. Our latent perceptual loss results in sharper and more realistic images, and leads to better structural consistency than the baseline – see Figure 1.
We validate LPL on three datasets of different sizes – the commonly used ImageNet-1k (1M data points) and CC12M (12M data points), and additionally a private dataset S320M (320M data points) – as well as three generative model formulations – DDPM (Ho et al., 2020) with velocity and epsilon prediction, and a conditional flow matching model (Lipman et al., 2023). In our experiments, we report standard generative image model metrics such as FID (Heusel et al., 2017) and CLIPScore (Hessel et al., 2021), as well as Precision and Recall (Sajjadi et al., 2018; Kynkäänniemi et al., 2019). Our experiments show that the use of LPL leads to consistent performance boosts between 6% and 20% in terms of FID. Our qualitative analysis further highlights the benefits of LPL, showing images that are sharp and contain high-frequency image details.
In summary, our contributions are:
• We propose the latent perceptual loss (LPL), a perceptual loss variant leveraging the intermediate feature representations of the autoencoder's decoder.
• We present extensive experimental results on the ImageNet-1k, CC12M, and S320M datasets, demonstrating the benefits of LPL in boosting model quality by 6% to 20% in terms of FID.
• We show that LPL is effective for a variety of generative model formulations including DDPM and conditional flow matching approaches.
2 Related work
Diffusion models. The generative modeling landscape has been significantly impacted by diffusion models, surpassing previous state-of-the-art GAN-based methods (Brock et al., 2019; Karras et al., 2019, 2020, 2021). Diffusion models offer advantages such as more stable training and better scalability, and were successfully applied to a wide range of applications, including image generation (Chen et al., 2024; Ho et al., 2020), video generation (Ho et al., 2022b; Singer et al., 2023), music generation (Levy et al., 2023; San Roman et al., 2023), and text generation (Wu et al., 2023). Various improvements of the framework have been proposed, including different schedulers (Lin et al., 2024; Hang and Gu, 2024), loss weights (Choi et al., 2022; Hang et al., 2023), and more recently generalizations of the framework with flow matching (Lipman et al., 2023). In our work we evaluate the use of our latent perceptual loss in three different training paradigms: DDPM under noise and velocity prediction, as well as flow-based training with the optimal transport path.
Latent diffusion. Due to the iterative nature of the reverse diffusion process, training and sampling diffusion models is computationally demanding, in particular at high resolution. Different approaches have been explored to generate high-resolution content. For example, Ho et al. (2022a) used a cascaded approach to progressively add high-resolution details, by conditioning on previously generated lower-resolution images. A more widely adopted approach is to define the generative model in the latent space induced by a pretrained autoencoder (Rombach et al., 2022), as previously explored for discrete autoregressive generative models (Esser et al., 2021). Different architectures have been explored to implement diffusion models in the latent space, including convolutional UNet-based architectures (Rombach et al., 2022; Podell et al., 2024), and more recently
transformer-based ones (Peebles and Xie, 2023; Chen et al., 2024; Gao et al., 2023; Esser et al., 2024) which show better scaling performance. Working in a lower-resolution latent space accelerates training and inference, but training models with a loss defined in the latent space also prevents them from matching the high-frequency details of the training data distribution. Earlier approaches to address this problem include the use of a refiner model (Podell et al., 2024), which consists of a second diffusion model trained on high-resolution, high-quality data that is used to noise and denoise the initial latents, similar to how SDEdit works for image editing (Meng et al., 2022). Our latent perceptual loss addresses this issue in an orthogonal manner by introducing a loss defined across different layers of the AE decoder in the latter stages of the training process. Our approach avoids the necessity of training on specialized curated data (Dai et al., 2023), and does not increase the computational cost of inference.
Perceptual losses. The use of internal features of a fixed, pre-trained deep neural network to compare images or image distributions has become common practice, as they have been found to correlate to some extent with human judgement of similarity (Johnson et al., 2016; Zhang et al., 2018). An example of this is the widely used Fréchet Inception Distance (FID) to assess generative image models (Heusel et al., 2017). Such "perceptual" distances have also been found to be effective as a loss to train networks for image-to-image tasks and to boost image quality compared to simple ℓ1 or ℓ2 reconstruction losses. They have been used to train autoencoders (Esser et al., 2021), models for semantic image synthesis (Isola et al., 2017; Berrada et al., 2024b) and super-resolution (Suvorov et al., 2022; Jo et al., 2020), and to assess the sample diversity of generative image models (Schönfeld et al., 2021; Astolfi et al., 2024). In addition, recent works propose variants that do not require pretrained image backbones (Amir and Weiss, 2021; Czolbe et al., 2020; Veeramacheneni et al., 2023). An et al. (2024); Song et al. (2023) employed LPIPS as a metric function in pixel space to train cascaded diffusion models and consistency models, respectively. Most closely related to our work, Kang et al. (2024) used a perceptual loss defined over latents to distill LDMs into conditional GANs, but used a separate image classification network trained over latents rather than the autoencoder's decoder to obtain the features for this loss. In summary, compared to prior work on perceptual losses, our work is different in that (i) LPL is defined over the features of the decoder – which maps from latent space to RGB pixel space, rather than using a network that takes RGB images as input – and (ii) we use LPL to train latent diffusion models.
3 Method
(" (# (̂"
F*,
a. ℒ#
D!
ℒ! =∥ $" − $̂" ∥##
…
4% CN 4%( , 45%( 1% ℒ# = -# . / 0$ . 1$
$&'
(" (# (̂" 45' OD %
…
D!
45% OD ℒ!
ϵ ∼ $ 0, I
Figure 2 Overview of our approach. (a) Latent diffusion models compare the clean latents with the predicted latents. (b) Our LPL acts on the features of the autoencoder's decoder, effectively aligning the diffusion process with the decoder. $F^e_\theta$, $F^d_\theta$: autoencoder encoder and decoder; $D_\Theta$: denoiser network; CN: cross normalization layer; OD: outlier detection.
We note that the presence of ℓ2 in the LDM objective has some important implications. First, the ℓ2 norm treats all pixels in the latents as equally important and disregards the downstream structure induced by the decoder, whose objective is to reconstruct the image from its latents. This is problematic because the autoencoder's latent space has a highly irregular structure and the decoded image is not equally influenced by the different pixels in the latent code. Thus, optimizing the ℓ2 distance in the diffusion model's latent space can differ from optimizing the perceptual distance between images. Second, while an ℓ2 objective is theoretically justified in the original DDPM formulation, generative models trained with an ℓ2 reconstruction objective have been observed to produce blurry images, as is the case e.g. for VAE models (Kingma and Welling, 2014).
The problem of blurry images due to ℓ2 reconstruction losses has been addressed through the use of perceptual losses such as LPIPS (Zhang et al., 2018), which provide a significant boost in image quality in settings such as autoencoding (Esser et al., 2021), super-resolution (Ledig et al., 2017), and image-to-image generative models (Isola et al., 2017; Park et al., 2019). However, in the case of LDMs, a perceptual loss cannot be used directly on the predicted latents; instead, the latents would need to be decoded to an RGB image and then fed into a feature-extraction network – introducing significant overhead in terms of memory and computation. To avoid such overheads, we develop an alternative perceptual loss that operates directly on the feature space of the decoder.
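Concretely, such decoder features can be collected with forward hooks on the decoder blocks, as in the sketch below; the block list (here decoder.up_blocks) is a placeholder for the actual module structure of the autoencoder, and the code assumes each block returns a single tensor.

import torch
from torch import nn

def collect_decoder_features(decoder: nn.Module, blocks, z: torch.Tensor):
    """Decode latents z and return the intermediate outputs of the given decoder
    blocks, in order (one feature tensor per block)."""
    features = []
    hooks = [blk.register_forward_hook(lambda mod, inp, out: features.append(out))
             for blk in blocks]
    try:
        decoder(z)  # we only need the side effect of the hooks
    finally:
        for h in hooks:
            h.remove()
    return features

# Hypothetical usage, assuming the decoder exposes its blocks as `up_blocks`:
# d     = collect_decoder_features(ae.decoder, ae.decoder.up_blocks, z0)      # clean latents
# d_hat = collect_decoder_features(ae.decoder, ae.decoder.up_blocks, z0_hat)  # denoised latents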
We extract the intermediate features at each of the $L$ blocks of the AE decoder $F^d_\theta$ for both the original latents, $z_0$, and the estimated latents, $\hat z_0$ (where for brevity we drop the dependence of $\hat z_0$ on $t$):
$d_1, \ldots, d_L = \big\{ F^d_{\theta,l}(z_0) \big\}_{l \in [\![1,L]\!]}, \qquad \hat d_1, \ldots, \hat d_L = \big\{ F^d_{\theta,l}(\hat z_0) \big\}_{l \in [\![1,L]\!]}.$  (1)
Using these intermediate features, we can define our training objective. Our LPL, $\mathcal{L}_{\text{LPL}}$, is a weighted sum of the quadratic distances between the feature representations at the different decoding scales, obtained after normalization:
$\mathcal{L}_{\text{LPL}} = \mathbb{E}_{t \sim \mathcal{T},\, \epsilon \sim \mathcal{N}(0,\mathbf{I}),\, x_0 \sim \mathcal{D}_X}\!\left[ \mathbb{1}[\sigma_t \le \sigma_\tau] \sum_{l=1}^{L} \frac{w_l}{C_l} \sum_{c=1}^{C_l} \Big\| m_{l,c}(\hat d_{l,c}) \odot \big( \tilde d_{l,c} - \tilde{\hat d}_{l,c} \big) \Big\|_2^2 \right],$  (2)
Figure 3 Example of feature maps from the autoencoder's decoder. The presence of outliers makes the underlying feature representation difficult to exploit. l refers to the block index, while c is the channel index within the block. Top row: l = 4, c = 2; bottom row: l = 4, c = 8.
where $\tilde d_l$ is the standardized version of $d_l$ across the channel dimension, $m_{l,c}(\hat d_{l,c})$ is a binary mask masking the detected outliers in the feature map $\hat d_{l,c}$, $w_l$ is a depth-specific weighting, and $C_l$ is the channel dimensionality of the feature tensor; we explain these terms in more detail below. Moreover, to reduce both the computational complexity and memory overhead of the LPL, we only apply our loss at high signal-to-noise ratios (SNR). In particular, we impose a hard threshold $\sigma_\tau$ on the noise level and only apply the loss at timesteps where $\sigma_t \le \sigma_\tau$, i.e., where the SNR is sufficiently high; this corresponds to the indicator $\mathbb{1}[\sigma_t \le \sigma_\tau]$ in Eq. (2).
The LPL is applied in conjunction with the standard diffusion loss, resulting in the following training objective:
$\mathcal{L}_{\text{tot}} = \mathcal{L}_{\text{Diff}} + w_{\text{LPL}} \cdot \mathcal{L}_{\text{LPL}}.$  (3)
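As a small illustration of Eq. (3) and the SNR gating described above, the two loss terms can be combined per sample as follows; this sketch assumes the threshold is expressed on the noise level $\sigma_t$, as in the ablation of Figure 7, and all names are placeholders.

import torch

def combined_loss(diff_loss, lpl_loss, sigma_t, sigma_tau, w_lpl):
    """diff_loss, lpl_loss, sigma_t: per-sample tensors of shape (B,)."""
    gate = (sigma_t <= sigma_tau).to(lpl_loss.dtype)  # apply LPL only at low noise / high SNR
    return (diff_loss + w_lpl * gate * lpl_loss).mean()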
Depth-specific weighting. Empirically, we find that the loss amplitude differs significantly across decoder layers – it grows by a factor of two for layers with a factor-two increase in resolution. To balance the contributions from different decoder layers, we therefore weight them by the inverse of the upscaling factor w.r.t. the first layer, i.e. $w_l = 2^{-r_l/r_1}$, where $r_l$ is the resolution of the l-th layer.
Outlier detection. When inspecting the decoder features we find artefacts in the decoder's deeper layers. In particular, in some cases a small number of decoder activations have very high absolute values, see Figure 3. This is undesirable, as such outliers can dominate the perceptual loss, reducing its effectiveness. To prevent this, we use a simple outlier detection algorithm to mask them when computing the perceptual loss. See the supplementary material for details.
Normalization. Since features in different channels of the decoder can have significantly different statistics, we follow Zhang et al. (2018) and normalize them per channel, so that the features in every channel of every layer are zero mean and have unit variance. However, normalizing the feature maps corresponding to the original and denoised latents with different statistics can induce non-zero gradients even when the values have been correctly predicted. To obtain a coherent normalization, we use the feature statistics from the denoised latents to normalize both tensors.
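Putting the pieces above together, a minimal sketch of the per-level LPL computation is given below, assuming feature lists as in Eq. (1), precomputed outlier masks, and the per-level weights $w_l$; names and reductions are illustrative rather than the exact training code.

import torch

def latent_perceptual_loss(d, d_hat, masks, weights, eps=1e-6):
    """d, d_hat: lists of decoder features of shape (B, C_l, H_l, W_l) for the clean
    and denoised latents; masks: boolean outlier masks on d_hat; weights: the w_l."""
    loss = 0.0
    for feat, feat_hat, mask, w in zip(d, d_hat, masks, weights):
        # Shared normalization: standardize both tensors per channel using the
        # statistics of the denoised-latent features.
        mu = feat_hat.mean(dim=(-2, -1), keepdim=True)
        std = feat_hat.std(dim=(-2, -1), keepdim=True) + eps
        diff = (feat - mu) / std - (feat_hat - mu) / std
        # Mask detected outliers, take the mean squared error per channel, then
        # average over channels (the 1/C_l factor) and over the batch.
        per_channel = (diff * mask.to(diff.dtype)).pow(2).mean(dim=(-2, -1))  # (B, C_l)
        loss = loss + w * per_channel.mean()
    return loss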
Generalization to other paradigms. Our loss can also be used with other generative model formulations, such as DDPM with velocity prediction (Salimans and Ho, 2022) and flow matching (Lipman et al., 2023). To do this, the only requirement is to be able to estimate the original latents from the model predictions. Under general frameworks such as DDPM and flows, we can write the forward equation in the form $\forall t,\; x_t = a_t x_0 + b_t \epsilon_t$, where the different paradigms only differ in terms of the parameterization of $a_t$ and $b_t$. In Table 1, we provide a summary of these different formulations.
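To make this concrete, a minimal sketch of recovering the clean-latent estimate under the three paradigms considered here is given below; the conventions (in particular the flow-matching time parameterization and what the network is trained to predict) follow common practice and may differ from the exact ones summarized in Table 1.

def estimate_x0(x_t, model_out, a_t, b_t, t, paradigm):
    """Recover the clean-latent estimate from the network output under the forward
    process x_t = a_t * x_0 + b_t * eps; conventions may differ between codebases."""
    if paradigm == "eps":       # DDPM with noise prediction
        return (x_t - b_t * model_out) / a_t
    if paradigm == "v":         # velocity prediction, assuming a_t**2 + b_t**2 == 1
        return a_t * x_t - b_t * model_out
    if paradigm == "flow_ot":   # OT flow matching with a_t = 1 - t, b_t = t and the
        return x_t - t * model_out  # network predicting the velocity eps - x_0
    raise ValueError(f"unknown paradigm: {paradigm}")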
4 Experimental evaluation
In this section, we first present our experimental setup, and then go on to present our main results, as well as
qualitative results and a number of ablation studies.
Figure 4 Samples from models trained with and without our latent perceptual loss on S320M. Samples
from the model with perceptual loss (bottom row) show more realistic textures and details.
Table 3  Effect of the LPL on ImageNet-1k models at 512 resolution trained with different methods. We observe consistent improvements on all metrics when incorporating the LPL, except for the density metric, for which we observe a very slight degradation when using DDPM training.

Paradigm        DDPM-eps        DDPM-v          Flow-OT
LPL             ✗      ✓        ✗      ✓        ✗      ✓
FID (↓)         4.88   3.79     4.72   3.84     4.54   3.61
Coverage (↑)    0.80   0.82     0.80   0.83     0.82   0.85
Density (↑)     1.14   1.13     1.15   1.14     1.14   1.29
Precision (↑)   0.74   0.77     0.73   0.78     0.75   0.79
Recall (↑)      0.49   0.51     0.49   0.50     0.52   0.54
Figure 5 Power spectrum of real and generated images. Difference in (log) power spectrum between images generated with and without LPL. Using LPL strengthens frequencies at the extremes (very low and very high).
To provide insight into the effect of our perceptual loss on the frequency content of the generated images, we compare the power spectrum profiles of images generated by models trained with and without LPL on CC12M at 512 resolution, as well as a set of real images from the validation set.
In Figure 5, we plot the difference between the log-power spectra of the three image sets. The left-most panel clearly shows the presence of more high-frequency signal in the generated images when using LPL to train the model, confirming what has been observed in our qualitative analysis.
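As an illustration, the radially averaged log-power spectrum compared in Figure 5 can be computed along the following lines; this is a generic sketch rather than the exact evaluation script.

import numpy as np

def radial_log_power_spectrum(images):
    """images: array of shape (N, H, W), single-channel. Returns the log power
    averaged over the image set and over rings of equal spatial frequency |f|."""
    n, h, w = images.shape
    spec = np.abs(np.fft.fftshift(np.fft.fft2(images), axes=(-2, -1))) ** 2
    spec = spec.mean(axis=0)                                  # average over images
    yy, xx = np.indices((h, w))
    r = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2).astype(int)
    counts = np.bincount(r.ravel())
    radial = np.bincount(r.ravel(), weights=spec.ravel()) / np.maximum(counts, 1)
    return np.log(radial + 1e-12)                             # log power vs. |f|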
Figure 7 Ablation study on the impact of the noise threshold $\sigma_\tau$. We report FID, coverage, density, precision, and recall. The dashed line corresponds to the baseline without LPL; note the logarithmic scaling of the noise threshold on the horizontal axis.
Figure 8 Exploration of LPL depth. Influence of the number of decoder blocks used in LPL on FID@512; zero corresponds to not using LPL. Disk radius shows GPU memory usage: w/o LPL = 64.9 GB, LPL with 5 blocks = 83.4 GB.
Figure 9 Impact of LPL for different numbers of sampling steps (NFE). With higher numbers of sampling steps, the difference between the baseline and the model trained with LPL increases.
Figure 10 Influence of the LPL loss weight on model performance. The curve shows a sharp decrease in FID before going back up for larger weights.
LPL depth. In Figure 8 we study the influence of the number of decoder blocks used to compute the LPL. We also experimented with extending the loss to the decoder output (RGB features), which results in degraded performance compared to the baseline while inducing a considerable memory overhead.
Feature normalization. Before computing our perceptual loss, we normalize the decoder features. We compare normalizing the features of the original and predicted latents separately with normalizing both using the statistics of the predicted latent. This experiment is conducted on ImageNet-1k at 512 resolution. While the model trained with separately normalized features yields a slight improvement in FID (4.79 vs. 4.88 for the baseline w/o LPL), the model trained with shared normalization statistics leads to a much more significant improvement and obtains an FID of 3.79.
SNR threshold value. We study the influence of the noise threshold, which determines at which timesteps our perceptual loss is used during training. Lower threshold values correspond to using LPL for fewer iterations, namely those closest to the noise-free targets. We report results across several metrics in Figure 7, and illustrate the effect with qualitative examples in Figure 13 in the supplementary material. We find improved performance over the baseline without LPL for all metrics, and the best values for each metric are obtained for a threshold between three and six, except for recall, which is very stable (and better than the baseline) for all threshold values under 20.
Reweighting strategy. We compare the performance when using uniform or depth-specific weights to combine the contributions from different decoder layers in the LPL. We find that using depth-specific weights results in significant improvements in image quality compared to uniform weights: the depth-specific weights achieve an FID of 3.79, while uniform weights obtain an FID of 4.38. Hence, while both strategies improve image quality over the baseline (which achieves an FID of 4.88), reweighting the layer contributions to be approximately similar further boosts performance and improves FID by 0.59 points.
LPL and convergence. As the LPL adds a non-negligible memory overhead, since it requires evaluating and backpropagating through the latent decoder, it is interesting to explore at which point in training it should be introduced. We train models on ImageNet-1k at 512 resolution with different durations of the post-training stage. We use an initial post-training phase – of zero, 50k, or 400k iterations – in which LPL is not used, followed by another 120k iterations in which we either apply LPL or not.
Table 4  Effect of our perceptual loss on models pre-trained without LPL for a set number of iterations. In each column, we report the difference in metrics after post-training for 120k iterations with or without LPL. All metrics improve when adding LPL in the post-training phase.

Initial post-train iters     0        50k      400k
Δ FID (↓)                   −0.58    −0.78    −0.97
Δ coverage (↑)              +4.29    +3.51    +3.99
Δ density (↑)               +0.14    +0.12    +0.21
Δ precision (↑)             +4.01    +4.55    +5.89
Δ recall (↑)                +1.99    +2.32    +4.22
The results in Table 4 indicate that in each case LPL improves all metrics, and that the improvements are larger when the model has been trained longer and is closer to convergence (except for the coverage metric, for which we see the largest improvement when LPL is introduced from the start of the post-training phase). This suggests that better models (ones trained for longer) benefit more from our perceptual loss.
Influence on sampling efficiency. We conduct an experiment to assess the influence of the perceptual loss on sampling efficiency. To this end, we sample the ImageNet@512 model with different numbers of function evaluations (NFE) and compare the trends for the baseline and the model trained with our method. For this experiment, we use the DDIM sampling algorithm (Song et al., 2021). Results are reported in Figure 9, where we find that for very low numbers of function evaluations both models perform similarly. The gains from the LPL loss become considerable beyond 25 NFEs, and increase steadily with the number of function evaluations up to 100; afterwards, both models stabilize at a point where the model trained with LPL achieves an improvement of approximately 1.1 FID points over the baseline.
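For reference, a deterministic DDIM-style sampling loop with a given NFE budget can be sketched as follows, assuming the forward-process coefficients $a_t$, $b_t$ from Section 3 and an epsilon-prediction network; schedules and signatures are illustrative.

import torch

@torch.no_grad()
def ddim_sample(model, a, b, nfe, shape, device="cuda"):
    """a, b: 1-D tensors of forward-process coefficients a_t, b_t over the full schedule.
    Deterministic DDIM-style sampling with `nfe` evenly spaced function evaluations."""
    a, b = a.to(device), b.to(device)
    steps = torch.linspace(len(a) - 1, 0, nfe + 1).long()
    x = torch.randn(shape, device=device)
    for t, t_prev in zip(steps[:-1], steps[1:]):
        eps_hat = model(x, t)                          # predicted noise at step t
        x0_hat = (x - b[t] * eps_hat) / a[t]           # epsilon-prediction estimate of z_0
        x = a[t_prev] * x0_hat + b[t_prev] * eps_hat   # deterministic jump to t_prev
    return x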
Impact on EMA. Since the LPL has the effect of increasing the accuracy of the estimated latent at every timestep, it reduces fluctuations between successive iterations of the model during training. Consequently, when training with LPL the EMA momentum can be reduced to obtain optimal performance. In Figure 11 we report the results of a grid search over the EMA momentum.
5 Conclusion
In this work, we identified a disconnect between the decoder and the training of latent diffusion models: the diffusion model loss does not receive any feedback from the decoder, resulting in perceptually suboptimal generations that oftentimes lack high-frequency details. To alleviate this disconnect, we introduced a latent perceptual loss (LPL) that provides perceptual feedback from the autoencoder's decoder when training the generative model. Our quantitative results showed that the LPL is general and improves performance for models trained on a variety of datasets, image resolutions, and generative model formulations. We observe that our loss leads to improvements of 6% up to 20% in terms of FID. Our qualitative analysis shows that the introduction of LPL leads to models that produce images with better structural consistency and sharper details compared to the baseline training. Given its generality, we hope that our work will play an important role in improving the quality of future latent generative models.
References
Dan Amir and Yair Weiss. Understanding and simplifying perceptual distances. In CVPR, 2021.
Jie An, Zhengyuan Yang, Jianfeng Wang, Linjie Li, Zicheng Liu, Lijuan Wang, and Jiebo Luo. Bring metric functions
into diffusion models. In IJCAI, 2024.
Pietro Astolfi, Marlene Careil, Melissa Hall, Oscar Mañas, Matthew Muckley, Jakob Verbeek, Adriana Romero Soriano,
and Michal Drozdzal. Consistency-diversity-realism Pareto fronts of conditional image generative models. arXiv
preprint, 2406.10429, 2024.
Tariq Berrada, Pietro Astolfi, Melissa Hall, Reyhane Askari-Hemmat, Yohann Benchetrit, Marton Havasi, Matthew
Muckley, Karteek Alahari, Adriana Romero-Soriano, Jakob Verbeek, and Michal Drozdzal. On improved conditioning
mechanisms and pre-training strategies for diffusion models, 2024a. https://arxiv.org/abs/2411.03177.
Tariq Berrada, Jakob Verbeek, Camille Couprie, and Karteek Alahari. Unlocking pre-trained image backbones for
semantic image synthesis. In CVPR, 2024b.
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis.
In ICLR, 2019.
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text
pre-training to recognize long-tail visual concepts. In CVPR, 2021.
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo,
Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image
synthesis. In ICLR, 2024.
Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. Working paper
or preprint, December 2018. https://inria.hal.science/hal-01945578.
Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, and Sungroh Yoon. Perception prioritized
training of diffusion models. In CVPR, 2022.
Steffen Czolbe, Oswin Krause, Ingemar J. Cox, and Christian Igel. A loss function for generative neural networks
based on Watson’s perceptual model. In NeurIPS, 2020.
Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende,
Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li,
Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh
Ramanathan, Zijian He, Peter Vajda, and Devi Parikh. Emu: Enhancing image generation models using photogenic
needles in a haystack. arXiv preprint, 2309.15807, 2023.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image
database. In CVPR, 2009.
Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.
Patrick Esser, Robin Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In CVPR,
2021.
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik
Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik
Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion transformer is a strong image
synthesizer. In ICCV, 2023.
Tiankai Hang and Shuyang Gu. Improved noise schedule for diffusion training. arXiv preprint, 2407.03297, 2024.
Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via Min-SNR weighting strategy. In ICCV, 2023.
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation
metric for image captioning. In EMNLP, 2021.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a
two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative
Models and Downstream Applications, 2021.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded
diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47), 2022a.
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion
models. In NeurIPS, 2022b.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial
networks. In CVPR, 2017.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural
networks. In NeurIPS, 2018.
Younghyun Jo, Sejong Yang, and Seon Joo Kim. Investigating loss functions for extreme super-resolution. In Conference
on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020.
Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In
ECCV, 2016.
Minguk Kang, Richard Zhang, Connelly Barnes, Sylvain Paris, Suha Kwak, Jaesik Park, Eli Shechtman, Jun-Yan Zhu,
and Taesung Park. Distilling diffusion models into conditional GANs. In ECCV, 2024.
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks.
In CVPR, 2019.
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving
the image quality of StyleGAN. In CVPR, 2020.
Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free
generative adversarial networks. In NeurIPS, 2021.
Diederik Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall
metric for assessing generative models. In NeurIPS, 2019.
Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken,
Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a
generative adversarial network. In CVPR, 2017.
Mark Levy, Bruno Di Giorgi, Floris Weers, Angelos Katharopoulos, and Tom Nickson. Controllable music production
with diffusion models and guidance gradients. In NeurIPS Workshop on Diffusion Models, 2023.
Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are
flawed. In WACV, 2024.
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative
modeling. In ICML, 2023.
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided
image synthesis and editing with stochastic differential equations. In ICLR, 2022.
Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and
diversity metrics for generative models. In ICML, 2020.
Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive
normalization. In CVPR, 2019.
William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin
Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models. In CVPR, 2022.
Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lučić, Olivier Bousquet, and Sylvain Gelly. Assessing generative models
via precision and recall. In NeurIPS, 2018.
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In ICLR, 2022.
Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, and Alexandre Defossez. From
discrete tokens to high-fidelity audio using multi-band diffusion. In NeurIPS, 2023.
Edgar Schönfeld, Vadim Sushko, Dan Zhang, Juergen Gall, Bernt Schiele, and Anna Khoreva. You only need adversarial
supervision for semantic image synthesis. In ICLR, 2021.
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual,
Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without
text-video data. In ICLR, 2023.
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. In ICML, 2015.
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising di!usion implicit models. In ICLR, 2021.
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS,
2019.
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In ICML, 2023.
Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov,
Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with
fourier convolutions. In WACV, 2022.
Lokesh Veeramacheneni, Moritz Wolter, Hildegard Kuehne, and Juergen Gall. Fréchet wavelet distance: A domain-
agnostic metric for image generation. arXiv preprint, 2312.15289, 2023.
Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):
1661–1674, 2011.
Tong Wu, Zhihao Fan, Xiao Liu, Yeyun Gong, Yelong Shen, Jian Jiao, Hai-Tao Zheng, Juntao Li, Zhongyu Wei, Jian
Guo, Nan Duan, and Weizhu Chen. AR-Diffusion: Auto-regressive diffusion model for text generation. In NeurIPS,
2023.
Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan.
Florence-2: Advancing a unified representation for a variety of vision tasks. In CVPR, 2024.
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep
features as a perceptual metric. In CVPR, 2018.
Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, and Gang Hua. Designing
a better asymmetric VQGAN for StableDiffusion. arXiv preprint, 2306.04632, 2023.
Appendix
A Relevance of the LPL loss
Interpreting LPL. Under the DDPM paradigm for latent diffusion, a neural network is trained to model the reverse process $q(z_{t-1}|z_t)$. Under this setting, training is conducted by optimizing the KL divergence between the true reverse process and the predictor that is modeled using a neural network:
$\mathcal{L}_{t-1} = \mathbb{E}_q\!\left[ D_{KL}\big( q(z_{t-1}|z_t, z_0) \,\|\, p_\Theta(z_{t-1}|z_t) \big) \right].$  (4)
After simplification (Ho et al., 2020), the training loss resembles the denoising score matching objective (Song and Ermon, 2019) over multiple noise scales indexed by timestep $t$:
$\mathbb{E}_{x_0, \epsilon, t}\!\left[ \gamma_t \left\| D(z_t, \sigma_t; \Theta) - z_0 \right\|^2 \right],$  (5)
where $\gamma_t$ is a time-dependent weighting factor. Taking into account the global objective, which is image generation, means putting more emphasis on obtaining $\ell_2$-optimal reconstructions in image space rather than in latent space. Such a constraint can be imposed in the form of a penalty term that is added to the training objective:
$\mathcal{L}^{\text{pen}}_{t-1} = \mathbb{E}_q\!\left[ D_{KL}\big( p_F(q(z_{t-1}|z_t, z_0)) \,\|\, p_F(F^d_\theta(p_\Theta(z_{t-1}|z_t))) \big) \right],$  (6)
where $p_F$ is a projector that maps from image space to a suitable embedding space in which to compare the images. We can assume that both $p_F(q(z_{t-1}|z_t, z_0))$ and $p_F(F^d_\theta(p_\Theta(z_{t-1}|z_t)))$ map to Gaussian distributions with constant variance; such an approximation can be considered reasonable for a small enough time discretization, assuming that the projector is locally linear around $z_t$ (this local linearity assumption has been studied previously in the literature, Jacot et al. (2018); Chizat and Bach (2018)). Under these conditions, we can approximate the divergence term in the penalty as
$\propto \; \mathbb{E}_{x_0, t, \epsilon}\, \big\| F^{d+}_\theta \circ D(z_t, \sigma_t; \Theta) \,-\, F^{d+}_\theta \circ F^e_\theta(x_0) \big\|^2,$
where $F^{d+}_\theta$ is the feature projector of the decoder, which outputs the intermediate features from each block of the autoencoder's decoder. This shows that, under certain conditions, taking into account the structure of the latent space is akin to matching intermediate feature representations in the process of image decoding.
B Latent Structure
Because of the underlying structure of the latent space, certain errors can have much more detrimental effects on the quality of the decoded image than others. We illustrate this in Figure 12 by comparing the images obtained after interpolating the encoded latents to a different resolution and then back to the original resolution before decoding. While these different transformations yield similar errors in terms of MSE, especially in RGB space, the interpolation algorithm becomes crucial when working in the latent space.
An illustration of this effect is presented in Figure 12, where we degrade the quality of the latents by performing an interpolation operation to downsize the latents by a factor 1/s, followed by the reverse operation to recover latents at the original size. Such a transformation can be seen as a form of lossy compression, where different interpolation methods induce different biases in the information that is lost.
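The degradation experiment described above can be sketched as follows; ae.encode and ae.decode are placeholders for the actual autoencoder interface, and the scale factor and interpolation mode are the quantities varied in Figure 12.

import torch
import torch.nn.functional as F

@torch.no_grad()
def degrade_latents(ae, image, s=2.0, mode="bilinear"):
    """Encode an image, downscale the latents by a factor 1/s, upscale them back with
    the same interpolation mode, then decode; also returns the latent-space MSE."""
    z = ae.encode(image)                                      # (B, C, h, w) latents
    small = F.interpolate(z, scale_factor=1.0 / s, mode=mode)
    z_rec = F.interpolate(small, size=z.shape[-2:], mode=mode)
    return ae.decode(z_rec), (z_rec - z).pow(2).mean().item()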
By examining the reconstructions from the latents, we cannot conclude that there is a direct relationship between the MSE with respect to the original latent and the decoded image quality. While nearest interpolation results in the highest MSE, the reconstructed images are more perceptually similar to the target than the ones obtained with bilinear interpolation. Similarly, while bicubic interpolation with s = 1.3 achieves an MSE of 2.38, it still results in better reconstructions than bilinear interpolation with s = 2.0, which achieves a lower MSE of 1.88.
From this analysis, we see that certain kinds of errors are more detrimental than others to the generated images, in ways that go beyond simple MSE in the latent space.
Figure 12 Influence of interpolation artefacts on latent reconstruction. We downscale the image by a factor of
1/s before upscaling back to recover the original resolution. From top to bottom: bilinear interpolation in pixel space,
nearest in latent space, bilinear in latent space and bicubic interpolation in latent space.
C Outlier Detection
At deeper layers of the autoencoder's decoder, some feature maps have artefacts where small patches have a norm that is orders of magnitude higher than the rest of the feature map. These artefacts appear consistently when testing the different open-source autoencoders available online, including the ones used in our experiments1 as well as others.2
To ensure easy adaptability to different models, we propose a simple detection algorithm for these patches and mask them when computing the loss and normalizing the feature maps. Our algorithm is based on simple heuristics and is not meant to provide a state-of-the-art solution for outlier detection. Rather, it is proposed as a temporary patch for the observed issue, while the long-term solution would be to train better autoencoders that do not suffer from these outliers.
Detection algorithm. We empirically observe that the activations in every feature map approximately follow a normal distribution, while the outliers can be identified as a small subset of out-of-distribution points. To identify them, we threshold the feature map at its $q_o$ and $1-q_o$ quantiles. Since computing quantiles exactly can be computationally expensive during training, we approximate them using nearest interpolation, which amounts to finding the k-th largest value in every feature map, where $k = q_o \cdot H_f \cdot W_f$ (or $k = (1-q_o) \cdot H_f \cdot W_f$ for the maximal values). To remove small false positives that persist in the outlier mask, we apply a morphological opening, which can be seen as an erosion followed by a dilation of the mask. Pseudo-code for the outlier detection algorithm is provided in Alg. 1.
1 https://huggingface.co/stabilityai/sdxl-vae and https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5
2 https://huggingface.co/CompVis/stable-diffusion-v1-4 and https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2
import torch
from torch import nn
from math import ceil

def remove_outliers(features, down_f=1, opening=5, closing=3, m=100, quant=0.02):
    # Rescale the structuring-element sizes when the features are downsampled.
    opening = int(ceil(opening / down_f))
    closing = int(ceil(closing / down_f))
    if opening == 2:
        opening = 3
    if closing == 2:
        closing = 1
    # Threshold at the upper `quant` quantile of the absolute activations, shifted by a
    # margin m: outliers are orders of magnitude larger, so no activation is masked
    # when no outliers are present.
    flat = features.abs().flatten(-2)                                   # (B, C, H*W)
    thresh = torch.quantile(flat, 1 - quant, dim=-1, keepdim=True).unsqueeze(-1) + m
    mask = features.abs() > thresh                                      # candidate outliers
    # Morphological opening: erosion (min-pool), then dilation (max-pool; the kernel
    # size `closing` used here for the dilation is an assumption).
    mask = (-nn.MaxPool2d(opening, stride=1, padding=(opening - 1) // 2)(-mask.float())).bool()  # erosion
    mask = nn.MaxPool2d(closing, stride=1, padding=(closing - 1) // 2)(mask.float()).bool()      # dilation
    return ~mask  # True where the activation is kept, False at detected outliers
Algorithm 1 Outlier detection algorithm. The algorithm works by setting a threshold according to the upper 0.02 quantile of the activations in the feature map. Because the outliers are orders of magnitude away from the rest, we shift the threshold by an offset m that guarantees that only the outliers are thresholded, while no activations are masked when no outliers are present. Subsequently, we smooth out the predicted mask with a morphological opening that eliminates small noise in the mask.
Figure 13 Influence of the noise threshold. Rows from top to bottom: $\sigma_\tau$ = 0.8, 2.0, and 3.0. Higher thresholds allow for more detailed and coherent images. Samples obtained from a model trained on ImageNet@256.
Figure 14 Qualitative comparison of the effect of the latent perceptual loss. Models trained on ImageNet-1k at 256 resolution with (bottom) and without (top) our perceptual loss. Without the perceptual loss, the model frequently fails to generate coherent structures; with the perceptual loss, the model generates more plausible objects with sharper details. The models are finetuned for 100k iterations from a checkpoint that was trained for 200k iterations. The samples are generated without classifier-free guidance or EMA, using 50 DDIM steps.
Samples of ImageNet-1k models. In Figure 15 we show samples of models trained with and without LPL on ImageNet-1k at 512 resolution. At higher resolutions, we also observe that the model trained with LPL generates images that are sharper and present more fine-grained details compared to the baseline.
Samples of T2I models. We provide additional qualitative comparisons for our LPL loss. Figure 17 showcases results for a model trained on CC12M at 512 resolution, and Figure 16 showcases results for a model trained on S320M at 256 resolution.
Figure 15 Influence of finetuning a class-conditional model on ImageNet-1k at 512 resolution with our perceptual loss. Our perceptual loss (bottom row) leads to more realistic textures and more detailed images.
Figure 16 Qualitative comparison of samples from models trained with and without our LPL on S320M at 256 resolution.
Figure 17 Qualitative comparison of samples from models trained with and without our LPL on CC12M at 512 resolution.