Single Image Super-Resolution With Denoising Diffusion GANS
Single image super-resolution (SISR) refers to the reconstruction of a high-resolution (HR) image from the corresponding low-resolution (LR) input. However, since a single low-resolution image corresponds to multiple high-resolution images, the problem is ill-posed. In recent
years, generative model-based SISR methods have outperformed conventional SISR methods in
performance. However, the SISR methods based on GAN, VAE, and Flow have the problems of
unstable training, low sampling quality, and expensive computational cost. These models also
struggle to achieve the trifecta of diverse, high-quality, and fast sampling. In particular, denoising
diffusion probabilistic models have shown impressive variety and high quality of samples, but their
expensive sampling cost prevents them from being well applied in the real world. In this paper, we show that the fundamental reason for the slow sampling speed of diffusion-based SISR methods lies in the Gaussian assumption used in previous diffusion models, which is only applicable for small step sizes. We propose a new Single Image Super-Resolution with Denoising Diffusion GANS (SRDDGAN) method to achieve large-step denoising, sample diversity, and training stability.
Our approach combines denoising diffusion models with GANs to generate images conditionally,
using a multimodal conditional GAN to model each denoising step. SRDDGAN outperforms existing diffusion model-based methods in PSNR and perceptual quality metrics, while the added latent variable Z explores the diversity of plausible HR outputs. Notably, the SRDDGAN model infers nearly 11 times faster than the diffusion-based SR3, making it a more practical solution for real-world applications.
Single Image Super-Resolution (SISR)1 refers to the process of reconstructing a high-resolution (HR) image
from a low-resolution (LR) image, which is an essential technology in computer vision and image processing.
It has a wide range of real-world applications, including remote sensing imaging2, video surveillance3, object
detection4, and medical imaging5,6. As shown in Fig. 1, Super-Resolution is ill-posed7,8 and cannot be reversed by
deterministic mapping because an infinite number of super-resolution images can be downsampled to the same
low-resolution image. Instead, SISR can be described as learning a random mapping. When given a low-resolu-
tion image, this mapping reasonably randomly samples from its corresponding high-resolution image domain.
In order to establish the mapping between LR and HR, many generative model-based methods have emerged, which can be divided into five categories: methods based on autoregressive models9, variational autoencoders (VAEs)10, normalizing flows11, generative adversarial networks (GANs)12, and denoising diffusion probabilistic models (DDPMs)13,14. However, these generative models all face a trilemma among sampling diversity, sampling quality, and sampling speed. Autoregressive methods, such as PixelCNN15, cannot be parallelized because of their pixel-by-pixel generation, resulting in slow sampling. Such methods are typically trained with the commonly used mean squared error (MSE) loss, which may result in the sampled SR image being the average of multiple SR predictions, reducing the diversity of the sampled images. VAE-based methods, such as the conditional VAE (CVAE)16, can use additional conditions to generate more diverse SR data and provide relatively fast sampling, but they usually produce suboptimal sample quality. Normalizing
Flows-based methods, such as SRFlow17, which adopts a reversible flow generation network structure, can learn
the mapping relationship between the input LR image and the HR output image to realize image super-resolution.
It uses negative log-likelihood loss for training to avoid training instability and mode collapse, but it is prone to
large memory occupation and high sampling costs. GAN-based methods, such as SRGAN18, are commonly used
1School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004,
Guangxi, China. 2School of Information and Software Engineering, University of Electronic Science and Technology
of China, Chengdu 610000, Sichuan, China. 3School of Ocean Engineering, Guilin University of Electronic
Technology, BeiHai 536000, Guangxi, China. 4State Key Laboratory of Rail Transit Vehicle System, Southwest
Jiaotong University, Chengdu 610000, Sichuan, China. *email: [email protected]; [email protected]
Figure 1. Random SR (8×) samples generated by SRDDGAN using latent variable Z. Our method generates
diverse predicted SR images, including differences in facial attributes and hair (e.g., the hair in the second sample has a different texture from that in the fourth, and the teeth are clearly visible in the third sample but not in the fourth), while maintaining consistency with the LR images.
networks for conditional image generation and super-resolution. They combine content and adversarial losses to reconstruct SR images with better perceptual quality. They provide fast sampling but are prone to mode collapse, resulting in little diversity among the generated SR samples and unstable training. Numerous researchers19–21 have suggested
incorporating instance noise into the model input to address the instability in GAN model training. This helps to
widen the solution space of the generator and discriminator, and improves the model’s resilience to overfitting.
Recently, denoising diffusion probabilistic models (commonly known as the diffusion model) have been
recognized as powerful generative models due to their impressive performance in generating high-quality and
diverse samples. SISR methods based on the diffusion model (e.g., SR322, SRDiff23), which use a Markov24 chain to transform a latent variable with a Gaussian distribution into data with a complex distribution, address the fundamental ill-posedness of SR, and the quality of the sampled data is high. However, their main disadvantage is slow sampling caused by thousands of iteration steps, which makes them difficult to apply in the real world. Additionally, traditional denoising diffusion probabilistic models rely on unconditional
or simple conditional inputs. In contrast, SISR tasks require a more thorough utilization of low-resolution
images to restore high-frequency details fully within high-resolution images. In addition, it is well-known
that when the degradation model presumed by an image super-resolution model does not match the actual
image degradation1,25,26, it results in decreased model performance. Although studies have focused on specific
degradation models (such as bicubic downsampling), they have yet to cover the diverse degradation modes in
authentic images effectively.
In this paper, we investigate the problem of the slow sampling speed of SISR methods based on diffusion
models. We note that diffusion models usually assume that the Gaussian distribution simulates the denoising
distribution. However, the Gaussian assumption forces the step size to be small, which in turn requires many sampling steps and results in slow sampling. Conversely, taking only a few sampling steps requires the denoising distribution to be parameterized by a non-Gaussian distribution. Following this heuristic, we propose to model the denoising distribution with a multimodal distribution, which enables denoising with large steps.
Additionally, this paper introduces a more complex yet practical degradation model to address the challenge
of inadequately covering the diverse degradation modes present in natural images. This model incorporates
randomized permutations of blur, downsampling, and noise degradation to encompass a broader range of image
degradation scenarios. Furthermore, to harness the valuable information within low-resolution images (LR) more
effectively, we have employed a simple conditional generation approach and devised a specialized LR encoder
module to constrain the high-resolution (HR) solution space. Lastly, style and content loss functions have been
employed to restore certain high-frequency detail information.
In the SISR task, we introduced a novel conditional image generation approach called SRDDGAN. This
method incorporates a multimodal distribution to model the denoising distribution and utilizes conditional GAN
for modeling. Additionally, to adapt to diverse degradation modes, a more complex yet practical degradation
model has been designed in this study. An LR Encoder module has been devised to utilize valuable information
within low-resolution images (LR) efficiently. Moreover, instance noise injection has been implemented to foster
stable GAN training and provide diversity. Furthermore, style and content loss functions have been utilized
to restore high-frequency detail information. The new solution addresses current challenges and exhibits
competitive sample quality and diversity in the SISR task compared to image super-resolution models based on
diffusion models. Notably, our sampling process requires only four steps, approximately 11 times faster than
diffusion models like SR3. Compared to traditional GANs, our proposed model significantly improves training
stability and sample diversity while maintaining competitiveness in sample fidelity.
Our research makes four main contributions:
1. We attribute the slow sampling of the diffusion model-based SISR method to the Gaussian distribution
adopted in the denoising distribution and propose to employ a complex multimodal distribution to model
the denoising distribution for fast sampling. Our approach produces images in just four steps, making it a
competitive alternative to the most advanced models that require hundreds or thousands of sampling steps.
2. We propose SRDDGAN, which resolves the issue of unstable GAN training and sample diversity through
instance noise injection, and its inverse process is parameterized by conditional GANs. SRDDGAN has
introduced an intricate and pragmatic degradation model to tackle the various degradation modes found in
genuine images.
3. We have created the LR Encoder module to limit the solution space of high-resolution images. This module
extracts feature details from low-resolution images and transforms them into a latent space representation,
used as input conditions for the model. Ultimately, we aim to improve the model’s fidelity and detail recovery
by introducing style and content losses to restore and retain high-frequency details within the image.
4. The extensive experiments conducted on CelebA-HQ27, DIv2K28, and CIFAR10 datasets demonstrate the
competitive performance of the proposed model in addressing the ill-posedness and fidelity of super-
resolution images. SRDDGAN employs diffusion and reverse processes for flexible image manipulation,
such as content fusion, and showcases its capability to handle complex degradation in real-world images.
Background
This section is dedicated to the SISR task, initially presenting an overview of fundamental concepts associated
with GAN and DDPM models. Subsequently, it introduces the theoretical foundation of our approach, which
comprises four key components: first, reducing the sampling steps within the diffusion model; second, enhancing sample diversity by introducing instance noise, which is crucial for stabilizing GAN training; third, a complex and diverse degradation model; and finally, the style and content losses used to maintain style and content consistency.
GAN
We briefly review Generative Adversarial Networks (GANs) to facilitate understanding. A GAN comprises two networks, a generator and a discriminator, that learn through an adversarial process in which they play against each other. The ultimate goal of a GAN is to use the max–min game29 between the two networks to approximate the real data distribution p(x). The generator network converts random noise z into samples that follow the real data distribution, while the discriminator network is trained to differentiate between real samples (x ∼ p(x)) and generated samples (G(z)). The two networks constantly compete and learn from each other, with the ultimate goal of making the discriminator unable to tell whether a sample produced by the generator is real. The max–min game between the two networks can be expressed as follows:
$$\min_{G} \max_{D} V(G, D) = \mathbb{E}_{x \sim p(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big] \quad (1)$$
However, training GANs with the above formulation can suffer from instability and mode collapse, even though the adversarial objective between G and D itself remains fixed. Various improvements to the formulation have been proposed in practice30 to solve these problems.
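To make Eq. 1 concrete, the sketch below shows one alternating discriminator/generator update under the standard minimax objective. It is not part of the original paper: the generator G, discriminator D, optimizers, and latent dimension are placeholder assumptions, and the generator uses the common non-saturating variant of its loss.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real, opt_g, opt_d, z_dim=128):
    """One alternating update for the max-min game of Eq. (1).
    G maps noise z to images; D returns one real/fake logit per sample."""
    z = torch.randn(real.size(0), z_dim, device=real.device)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    d_real = D(real)
    d_fake = D(G(z).detach())
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: non-saturating form of minimizing log(1 - D(G(z)))
    d_fake = D(G(z))
    g_loss = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```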
DDPM
We briefly review the denoising diffusion probabilistic model, commonly known as the diffusion model. The diffusion model is a generative model that comprises two chains: a forward diffusion chain and a reverse diffusion chain.
Forward diffusion chain: Gaussian noise is gradually added to the initial data distribution x0 ∼ q(x0). As time t increases, the data becomes an independent isotropic Gaussian distribution xT. The mean of each noising step is determined by the data xt−1 at the previous time step together with a fixed value βt, while βt alone determines the variance. This process is a Markov chain30.
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big) \quad (2)$$
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \quad (3)$$
Specifically, at any time step t, q(xt | x0) can be obtained directly from x0 and βt without iteration:
$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\big), \quad \text{where } \alpha_t := 1-\beta_t,\ \ \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s \quad (4)$$
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^{2} I\big) \quad (6)$$
The training process involves optimizing the typical variational lower bound on the negative logarithm of
likelihood:
$$-\log p_\theta(x_0) \le -\log p_\theta(x_0) + D_{KL}\big(q(x_{1:T} \mid x_0)\,\|\,p_\theta(x_{1:T} \mid x_0)\big) = \mathbb{E}_q\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] \quad (7)$$
After taking the expectation on both sides of Eq. 7, we obtain the following:
$$L = \mathbb{E}_q\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] \ge -\mathbb{E}_q\big[\log p_\theta(x_0)\big] \quad (8)$$
$L$ can be further rewritten as:
$$L = \mathbb{E}_q\Big[\underbrace{D_{KL}\big(q(x_T \mid x_0)\,\|\,p_\theta(x_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} - \underbrace{\log p_\theta(x_0 \mid x_1)}_{L_0}\Big] \quad (9)$$
In the equation above, the terms L0 and LT can be treated as constants. Since the original paper14 chose a fixed variance, LT is a constant. On the other hand, L0 is processed using the method described in the original DDPM paper, which discretizes the continuous Gaussian distribution; the corresponding formula can be found in13, and it also yields a constant value for L0. Therefore, L can be further simplified as follows:
$$L = \sum_{t=2}^{T} D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) + C \quad (10)$$
Ultimately, our training objective translates to minimizing Eq. 10, where C is a constant.
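For reference, here is a minimal sketch (not from the paper) of the closed-form forward diffusion step in Eq. 4. The linear beta schedule and the image tensor layout (B, C, H, W) are assumptions, since the paper's own βt settings are only specified in its Appendix B.

```python
import torch

def make_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 0.02):
    """Linear beta schedule (illustrative values only)."""
    betas = torch.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)      # \bar{alpha}_t = prod_{s<=t} alpha_s
    return betas, alphas, alpha_bars

def q_sample(x0: torch.Tensor, t: torch.Tensor, alpha_bars: torch.Tensor):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I) (Eq. 4), no iteration needed."""
    abar = alpha_bars[t].view(-1, 1, 1, 1)         # broadcast over (B, C, H, W)
    noise = torch.randn_like(x0)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise, noise
```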
Diffusion models commonly adopt the Gaussian distribution as a denoising distribution, requiring hundreds
to thousands of steps. However, our paper specifically concentrates on a diffusion model that involves a smaller
number of steps.
Figure 2. Middle: We systematically introduce Gaussian noise to the initial data distribution during the forward
diffusion process, gradually transforming it into an independent isotropic Gaussian distribution. Top: When
denoising, the model’s step size is set to a very small value if a Gaussian distribution is assumed to be used
for the task. Bottom: However, increasing the step size leads to a more complex and multimodal denoising
distribution, which can significantly accelerate the sampling speed.
Using fewer denoising steps yields a faster sampling speed. Therefore, a more complex multimodal distribution is neces-
sary to model the denoising distribution instead of using a Gaussian distribution. From Fig. 2, it is evident that
as the step size of the denoising distribution increases, the denoising distribution becomes progressively more
complex and multimodal.
Conditional GAN
SISR is commonly described as learning a random mapping between high-resolution (HR) and low-resolution
(LR) images. However, the original diffusion model used in building the denoising distribution pθ (xt−1 |xt )
predicts x0 from xt deterministically through iterative processes, which deviates from the desired random
mapping. Our approach, on the other hand, generates the denoising distribution by passing through the
generator with a latent random variable z. As a result, our denoising distribution pθ (xt−1 |xt ) is more complex
and multimodal than the original one.
To fit the denoising model with a complex multimodal distribution, we increase the step size and reduce the number of sampling steps. Since conditional GANs33 can model complex distributions, we use them to fit the denoising distribution.
Injecting instance noise into the generator has been identified as an effective approach to enhancing the stability of GAN training and reducing the overfitting induced when the discriminator focuses on clean data. The available literature19,20 on GANs shows that incorporating noise into the generator enhances training stability. Thus, noise injection has become a prevalent technique for achieving both stable GAN training and a diverse range of generated samples.
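A minimal sketch of instance-noise injection is shown below, assuming a fixed (in practice often annealed) noise level sigma; the discriminator D and the sample tensors in the usage comment are placeholders.

```python
import torch

def add_instance_noise(x: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Perturb samples with Gaussian instance noise before the discriminator sees them."""
    return x + sigma * torch.randn_like(x)

# Illustrative use inside a GAN training step:
#   d_real = D(add_instance_noise(real))
#   d_fake = D(add_instance_noise(fake.detach()))
```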
Figure 3. The degradation model designed to address the diverse degradation modes present in authentic images.
VGG-19-based26,34,35 style and content losses have demonstrated the ability to generate clearer images and improve visual quality, notably assisting in restoring high-frequency details.
Content loss: Content loss1,25,26 is introduced into the SISR task to evaluate the perceptual quality of images.
Specifically, we employ a pre-trained classification network to measure the semantic differences between images.
This network is denoted as $\phi$, and the high-level representations extracted at the $l$-th layer are denoted $\phi^{(l)}(I)$.
The content loss is defined as the Euclidean distance between the high-level representations of the two images,
as shown below:
$$\mathcal{L}_{content}(\hat{I}, I; \phi, l) = \frac{1}{h_l w_l c_l} \sum_{i,j,k} \Big( \phi^{(l)}_{i,j,k}(\hat{I}) - \phi^{(l)}_{i,j,k}(I) \Big)^{2} \quad (12)$$
where $h_l$, $w_l$, and $c_l$ represent the height, width, and number of channels of the representations at layer $l$, respectively.
Style loss: As reconstructed images should exhibit a similar style to the target image (e.g., color, texture,
contrast), inspiration from style representations is drawn. Style loss (texture loss)1,25,26 is introduced into the
SISR task. The style of an image is regarded as the correlation between different feature channels. It is defined as
the Gram matrix $G^{(l)} \in \mathbb{R}^{c_l \times c_l}$, where $G^{(l)}_{ij}$ denotes the inner product between the vectorized feature maps $i$ and $j$ at layer $l$. The formula is represented as follows:
$$G^{(l)}_{ij}(I) = \operatorname{vec}\big(\phi^{(l)}_i(I)\big) \cdot \operatorname{vec}\big(\phi^{(l)}_j(I)\big) \quad (13)$$
where $\operatorname{vec}(\cdot)$ denotes the vectorization operation, and $\phi^{(l)}_i(I)$ represents the $i$-th channel of the $l$-th feature map of image $I$. Therefore, the style loss is expressed as:
$$\mathcal{L}_{style}(\hat{I}, I; \phi, l) = \frac{1}{c_l^{2}} \sum_{i,j} \Big( G^{(l)}_{i,j}(\hat{I}) - G^{(l)}_{i,j}(I) \Big)^{2} \quad (14)$$
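The following sketch computes the content loss of Eq. 12 and the Gram-matrix style loss of Eqs. 13–14 on truncated VGG-19 features. The chosen layer slice, the use of torchvision's pretrained weights, and the omission of ImageNet input normalization are simplifying assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Truncated VGG-19 feature extractor; which slice corresponds to e.g. relu3.3
# depends on torchvision's layer numbering and is assumed here.
features = vgg19(weights="DEFAULT").features[:16].eval()
for p in features.parameters():
    p.requires_grad_(False)

def content_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """Eq. (12): mean squared distance between feature maps
    (the 1/(h*w*c) factor is absorbed by the mean)."""
    return F.mse_loss(features(sr), features(hr))

def gram(feat: torch.Tensor) -> torch.Tensor:
    """Eq. (13): inner products between vectorized feature channels."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2))          # shape (b, c, c)

def style_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """Eq. (14): squared Gram-matrix distance scaled by 1 / c_l^2."""
    gs, gh = gram(features(sr)), gram(features(hr))
    c = gs.shape[-1]
    return ((gs - gh) ** 2).sum(dim=(1, 2)).mean() / (c ** 2)
```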
Method
This section presents our proposed Single Image Super-Resolution (SISR) task model, the Conditional Denoising
Diffusion GANS Model (SRDDGAN). The section begins by providing a brief introduction to the fundamental
concept of the model. Subsequently, a detailed description of the forward diffusion process is presented.
Furthermore, this section provides comprehensive insights into our model’s training and optimization process,
culminating with a detailed explanation of how to extrapolate our denoising model.
Figure 4. The forward diffusion process involves gradually adding Gaussian noise to the original image,
progressing from left to right until it becomes a fully Gaussian noise distribution. In contrast, the reverse
diffusion process proceeds from right to left, utilizing the source image x as the condition for iterative denoising.
Our model assumes a small value for T and defines the distribution of intermediate images in the inference
chain using a forward diffusion process. At each diffusion step, a large βt is required (see Appendix B for the specific βt settings). This process involves the gradual addition of Gaussian noise to the original data through a fixed
forward diffusion chain, denoted as q(yt |yt−1 ) (Fig. 4). Our model aims to recover the original data distribution
iteratively from noise through a reverse diffusion chain, conditioned on both the source image x and the noisy
image. To achieve this, we train a neural denoising model G to learn the reverse diffusion chain. The denoising model G is presented with a source image, a noisy image, and a latent variable z as inputs, and it predicts the output image y0.
The following sections overview the forward diffusion process and describe how our denoising model G is
trained and inferred.
It is worth noting that our approach differs from previous diffusion models, which typically require thousands
of steps. In our method, we assume that T is small, which means that βt at each diffusion step is large enough.
Given $y_0$ and $y_t$, one can obtain the posterior distribution shown in Eq. 16:
$$q(y_{t-1} \mid y_0, y_t) = \mathcal{N}\big(y_{t-1} \mid \mu_t,\ \sigma^{2} I\big) \quad (16)$$
where the mean and variance of $q(y_{t-1} \mid y_t, y_0)$ are given by Eqs. 17 and 18:
$$\mu_t = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, y_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, y_0 \quad (17)$$
$$\sigma^{2} = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\, \beta_t \quad (18)$$
The posterior distribution plays a dual role in parameterizing the reverse diffusion chain and formulating a
variational lower bound on the log-likelihood of the chain. Moving forward, we will explore using generative
adversarial networks to parameterize this denoising model.
The adversarial loss Ladv in GANs can be formulated using different types of divergence measures, such as the KL divergence, the Jensen–Shannon divergence, and others36. In this particular case, the f-divergence has been chosen.
In adversarial training, the approach is akin to the training process of most GANs. The traditional method of
training the discriminator in GANs involves using the input y0, which exposes it to a surplus of clean data and
can lead to overfitting. However, in our model, we have designed the discriminator to receive noisy target images
yt and yt−1 as input. This critical difference in the training process makes our model more stable compared to
the original GAN.
Specifically, the discriminator D(yt−1 , yt , t) takes two noisy target images yt−1 and yt as inputs and outputs
the confidence score that yt−1 is a denoised version of yt. Adversarial training follows Eq. 21:
$$\mathcal{L}_{adv} = \sum_{t \ge 1} \mathbb{E}_{q(y_t)}\Big[ \mathbb{E}_{q(y_{t-1} \mid y_t)}\big[-\log D(y_{t-1}, y_t, t)\big] + \mathbb{E}_{p_\theta(y_{t-1} \mid y_t)}\big[-\log\big(1 - D(y_{t-1}, y_t, t)\big)\big] \Big] \quad (21)$$
The objective of the discriminator is to maximize its confidence in identifying a sample from the true distribution
q(yt−1 |yt ) while minimizing its confidence in identifying a fake sample from pθ (yt−1 |yt ). Conversely, the genera-
tor aims to increase the likelihood that the fake samples it produces are classified as genuine by the discriminator.
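The sketch below illustrates how such an objective can be trained over noisy pairs (y_{t-1}, y_t), combining the posterior of Eqs. 17–18 with a generator that directly predicts y0 (as described later in this section). The generator G(x, y_t, z, t), discriminator D(y_{t-1}, y_t, t), schedule tensors, and the binary cross-entropy form of the loss are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def extract(v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Gather per-sample schedule values and reshape for broadcasting over images."""
    return v[t].view(-1, 1, 1, 1)

def posterior_sample(y0, yt, t, betas, alphas, alpha_bars):
    """Draw y_{t-1} ~ q(y_{t-1} | y_t, y_0) using the mean and variance of Eqs. 17-18."""
    abar_t = extract(alpha_bars, t)
    abar_prev = extract(alpha_bars, torch.clamp(t - 1, min=0))
    beta_t = extract(betas, t)
    mean = (extract(alphas, t).sqrt() * (1 - abar_prev) / (1 - abar_t)) * yt \
         + (abar_prev.sqrt() * beta_t / (1 - abar_t)) * y0
    var = (1 - abar_prev) / (1 - abar_t) * beta_t
    return mean + var.sqrt() * torch.randn_like(y0)

def adversarial_step(G, D, x_lr, y0, T, z_dim, betas, alphas, alpha_bars):
    """One adversarial update over noisy pairs (y_{t-1}, y_t), mirroring Eq. 21.
    G(x, y_t, z, t) predicts y0; D(y_{t-1}, y_t, t) returns one real/fake logit per sample."""
    b = y0.size(0)
    t = torch.randint(1, T, (b,), device=y0.device)
    abar_t = extract(alpha_bars, t)
    yt = abar_t.sqrt() * y0 + (1 - abar_t).sqrt() * torch.randn_like(y0)   # Eq. 4

    y_real = posterior_sample(y0, yt, t, betas, alphas, alpha_bars)        # sample from q(y_{t-1} | y_t, y_0)
    z = torch.randn(b, z_dim, device=y0.device)
    y0_pred = G(x_lr, yt, z, t)                                            # generator predicts y0 directly
    y_fake = posterior_sample(y0_pred, yt, t, betas, alphas, alpha_bars)   # sample from p_theta(y_{t-1} | y_t, x)

    ones = torch.ones(b, 1, device=y0.device)
    zeros = torch.zeros(b, 1, device=y0.device)
    d_loss = F.binary_cross_entropy_with_logits(D(y_real, yt, t), ones) \
           + F.binary_cross_entropy_with_logits(D(y_fake.detach(), yt, t), zeros)
    g_loss = F.binary_cross_entropy_with_logits(D(y_fake, yt, t), ones)
    return d_loss, g_loss
```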
Please note that the objective in Eq. 21 requires samples from the unknown distribution $q(y_{t-1} \mid y_t)$. However, we can use the identity
$$q(y_{t-1} \mid y_t) := \int q(y_0)\, q(y_t, y_{t-1} \mid y_0)\, dy_0 \big/ q(y_t) = \int q(y_0 \mid y_t)\, q(y_{t-1} \mid y_t, y_0)\, dy_0$$
to express it in terms of distributions we already know. Moreover, concerning the denoising model $p_\theta(y_{t-1} \mid y_t)$ in diffusion models, it has been proposed by14 that it can be parameterized as $p_\theta(y_{t-1} \mid y_t) := q(y_{t-1} \mid y_t, y_0)$.
Our approach differs from previous methods13,21,22 in that the generator outputs a prediction of y0 rather than a prediction of the noise. Although the noise and y0 can be converted into each other based on ᾱt and yt (Eq. 19), directly predicting y0 with the generator simplifies the model's transformation step and accelerates the inference process. This is what sets our diffusion model algorithm apart from others.
Finally, we employ style and content losses computed on VGG-19 features (relu1.2, relu2.2, relu3.3, and relu4.1) to recover high-frequency details in super-resolution image reconstruction. Following relevant literature26,35,37, our utilization of
VGG-19 content loss involves extracting content features from input and target images using a neural network
and computing the distance between these features. Meanwhile, the style loss involves extracting style features
from input and target images using a neural network and computing the distance between these features. The
model is trained by combining these loss functions. The overall loss function of the model is depicted in Eq. 22.
$$\mathcal{L}_{total} = \alpha \mathcal{L}_{adv} + \beta \mathcal{L}_{content} + \eta \mathcal{L}_{style} \quad (22)$$
where $\mathcal{L}_{adv}$ denotes the foundational adversarial loss of the SRDDGAN model, while $\mathcal{L}_{content}$ and $\mathcal{L}_{style}$ are the content and style losses of the super-resolved images computed with a pre-trained VGG-19 model. The weights $\alpha$, $\beta$, and $\eta$ control the importance of each loss term. The training process is illustrated in Fig. 5.
Inference
To perform inference in our model, we initiate the process in the reverse direction of the forward diffusion
process, starting from pure Gaussian noise yT .
$$p_\theta(y_{0:T} \mid x) = p(y_T) \prod_{t=1}^{T} p_\theta(y_{t-1} \mid y_t, x) \quad (23)$$
$$p(y_T) = \mathcal{N}(y_T;\ 0,\ I) \quad (24)$$
$$p_\theta(y_{t-1} \mid y_t, x) = \mathcal{N}\big(y_{t-1} \mid \mu_\theta(x, y_t, z, t),\ \sigma_t^{2} I\big) \quad (25)$$
Our inference procedure is based on the complex multimodal distribution pθ (yt−1 |yt , x) learned by the model.
Referring to the theory in the previous section, when the forward diffusion βt is set as large as possible, the optimal denoising distribution pθ(yt−1 | yt, x) approximates a multimodal (multi-peaked) distribution. Therefore, our inference process incorporates the conditions of a multimodal distribution, which can reasonably fit the reverse diffusion process. As per Eq. 15, ᾱT should be as small as possible when βt is set large enough, so that yT approximates a Gaussian distribution13 and Eq. 24 can be obtained. Sampling can then start from pure Gaussian noise.
To predict yt−1 directly during the denoising stage, we employ a technique akin to that used in13,14. First, the model G, trained for denoising, estimates the value of y0′ after we feed the source image x, the noisy image yt, the temporal variable t, and z into it. Then, we use the estimated value of y0′ to derive the posterior distribution
q(yt−1 |yt , y0 ) using equations (Eqs. 17 and 18). Finally, we use this posterior distribution to parameterize the
mean and variance of the parametric distribution pθ (yt−1 |yt , x) (Eqs. 26 and 27).
$$\mu_\theta = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, y_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, y_0' \quad (26)$$
$$\sigma^{2} = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\, \beta_t \quad (27)$$
Notably, the variance used here employs the default values provided by the forward diffusion variance14. Similar to the approach in the paper13,14, we employ the reparameterization trick10 to refine the model iteratively. The specific form of this technique is as follows:
$$y_{t-1} = \mu_\theta + \sigma\, \varepsilon_t, \quad \text{where } \varepsilon_t \sim \mathcal{N}(0, I) \quad (28)$$
This step is akin to Langevin dynamics13, where we iteratively refine the inference by following Eq. 28 and ultimately obtain the denoised image.
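Putting Eqs. 23–28 together, here is a minimal sketch of the few-step sampling loop; the generator G(x, y_t, z, t) (assumed to already consume the LR-encoder conditioning internally), the schedule tensors, and the zero-based time indexing are illustrative assumptions rather than the paper's implementation.

```python
import torch

@torch.no_grad()
def srddgan_sample(G, x_lr, shape, T, z_dim, betas, alphas, alpha_bars):
    """Reverse diffusion: start from pure Gaussian noise y_T (Eq. 24) and repeatedly
    let G predict y0', then sample y_{t-1} from the posterior (Eqs. 26-28)."""
    device = x_lr.device
    y = torch.randn(shape, device=device)
    for i in reversed(range(T)):
        t = torch.full((shape[0],), i, device=device, dtype=torch.long)
        z = torch.randn(shape[0], z_dim, device=device)        # latent variable for sample diversity
        y0_pred = G(x_lr, y, z, t)                              # generator predicts y0 directly

        abar_t = alpha_bars[i]
        abar_prev = alpha_bars[i - 1] if i > 0 else torch.ones_like(abar_t)
        mean = (alphas[i].sqrt() * (1 - abar_prev) / (1 - abar_t)) * y \
             + (abar_prev.sqrt() * betas[i] / (1 - abar_t)) * y0_pred      # Eq. 26
        var = (1 - abar_prev) / (1 - abar_t) * betas[i]                    # Eq. 27
        y = mean + var.sqrt() * torch.randn_like(y) if i > 0 else mean     # Eq. 28
    return y
```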
Informed Consent
The images included in our study are sourced from a publicly available dataset that contains facial data. These
images were collected and made publicly accessible by the dataset provider, who ensured compliance with the
relevant usage rules and guidelines. As the authors of this study, we have strictly adhered to these rules and
guidelines while using the dataset for our experiments.
Following StyleGAN37, we also incorporate latent variable z conditioning in the U-Net architecture,
which sets our generator apart from previous diffusion model networks. Specific settings, such as the Swish
activation function, can be found in the original paper. To confine the solution space of high-resolution images,
we developed the LR Encoder module capable of extracting feature details from the low-resolution image and
transforming them into a latent space representation. In the subsequent section, Table 1 presents examples of
hyperparameter designs for generator networks, such as the number of blocks and initial channel number. (see
Appendix A for details).
Crucially, this paper introduces the utilization of an LR Encoder that processes LR information and integrates
it into each reverse diffusion step to steer the generation toward the corresponding HR space. We opted for an
RRDB38 architecture inspired by SRFlow17, renowned for its residual-in-residual design and numerous dense
skip connections. However, we have removed the final convolutional layer from the RRDB architecture because we do not aim to produce SR outputs directly but rather to capture the latent details of the LR image. Additionally,
we have removed the BN layers due to findings in pertinent literature1,25,38 indicating their potential to introduce
unwanted artifacts and constrain the model’s capacity for generalization.
We take a comparable approach1 and build our discriminator as a convolutional neural network with ResNet blocks, designed similarly to those in the generator. The discriminator aims to discriminate between true and false yt−1, using yt and t as contextual conditions. We incorporate time conditioning by utilizing sinusoidal position embeddings, as also employed in the generator. To feed yt to the discriminator, we concatenate yt and yt−1 (see Appendix A for details).
The diffusion model presented in previous research13,14 often required hundreds or thousands of diffusion
steps during inference, resulting in slow image synthesis. Multiple improvements have been suggested to decrease
the number of diffusion steps to solve this problem. For example, previous work22,23 suggested incorporating noise intensity into the model rather than time (as in13,14), which allows for greater flexibility in choosing the
number and scheduling of diffusion steps and is effective for image super-resolution. Another intuitive approach
to speeding up diffusion model sampling is to reduce the denoising step in the reverse process. However, previous
research14 has shown that diffusion models often assume the denoising distribution learned during inverse
synthesis can be approximated as a Gaussian distribution. This is problematic because the Gaussian assumption
is only valid in the limit of many small denoising steps, which leads to slow synthesis in diffusion models. In this
paper, we propose using a non-Gaussian multimodal distribution to model the denoising distribution when the
reverse generation process uses larger step sizes (with fewer denoising steps).
Experimental Settings
Datasets: In the case of face super-resolution (8×), the same training data as SR322 is utilized, consisting of
70,000 images from FFHQ37 and 28,000 images from CelebA-HQ27. The model is evaluated on 2000 images
from CelebA-HQ. Following SR3, the HR images in the dataset are resized to 128×128 size. Subsequently, the
HR images are downsampled using a bicubic kernel to generate an LR image of size 16×16.
For general task super-resolution (4×), the same training data as SRDiff is utilized, which includes 800 images
from DIV2K28 and 2,650 images from Flickr2K39. The model is evaluated on 100 validation sets from DIV2K.
During training and testing, each image in the dataset is cropped to 128×128 to obtain the HR image. The HR
image is then downsampled using a bi-cubic kernel to generate an LR image of size 32×32. Additionally, for
the general-purpose SISR task (2×), we utilized the CIFAR-10 dataset, which comprises 60,000 images across
ten categories. During training and testing, each image in the dataset (32×32) was downsampled using bicubic
interpolation to (16×16) resolution.
Finally, to address the diverse degradation modes in authentic images and enhance the model’s robustness,
we applied a complex degradation algorithm mentioned in the second section to the low-resolution (LR) images.
This algorithm involves random permutations of blurring, downsampling, and noise.
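As an illustration of such a randomized degradation, the sketch below shuffles blur, downsampling, and additive noise; the kernel radii, noise levels, and use of Pillow operations are illustrative assumptions, not the paper's exact degradation settings.

```python
import random
import numpy as np
from PIL import Image, ImageFilter

def degrade(hr: Image.Image, scale: int = 4) -> Image.Image:
    """Apply blur, downsampling, and additive noise in a random order."""
    ops = ["blur", "down", "noise"]
    random.shuffle(ops)
    img = hr
    for op in ops:
        if op == "blur":
            img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.2, 2.0)))
        elif op == "down":
            w, h = img.size
            img = img.resize((max(1, w // scale), max(1, h // scale)), Image.BICUBIC)
        else:  # additive Gaussian noise with a random level
            arr = np.asarray(img).astype(np.float32)
            arr += np.random.normal(0.0, random.uniform(1.0, 15.0), arr.shape)
            img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    # Ensure the final LR size regardless of where downsampling happened in the order.
    w, h = hr.size
    return img.resize((w // scale, h // scale), Image.BICUBIC)
```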
Implementation details: The experimental configuration remains identical for both face SR and general SR
tasks, while the settings for other components are detailed in Table 1. The entire model training process was
carried out using 4 TITAN V 12GB and 4 3090 24GB, and the model evaluation was done using GeForce GTX
1070 8GB. Table 1 in the paper shows the model parameter settings used for training and testing the CelebA-HQ,
FFHQ, DIV2K, and Flickr2K datasets. These settings are consistent throughout the entire table. The same settings
were also used for all the variants of the SRDDGAN in the ablation experiments.
Evaluation metrics: We use classical metrics such as Peak Signal-to-Noise Ratio (PSNR) and Structural
Similarity Index (SSIM)41 to assess the difference between the reconstructed SR and the original HR images.
Additionally, we utilize Learned Perceptual Image Patch Similarity (LPIPS)42 and Low-Resolution Peak Signal-
to-Noise Ratio (LR-PSNR)17 as evaluation metrics. LPIPS measures perceptual similarity by comparing image
features rather than relying on pixel values. It is more consistent with human perception than traditional
evaluation metrics based on pixel values such as PSNR and SSIM. LR-PSNR is a recent evaluation metric for
super-resolution algorithms that calculates the PSNR between the downsampled SR image and the LR image,
reflecting the consistency between the output of the super-resolution algorithm and the LR. Additionally, we have
introduced the FID (Fréchet Inception Distance)43 and IS (Inception Score)44 metrics to assess the quality and
diversity of the generated images. Finally, to evaluate the sampling speed, we measure the clock time required to
process a single image on a GeForce GTX 1070 and the number of iterations needed to process a single image.
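For clarity, a small sketch of PSNR and LR-PSNR as used above, assuming image tensors scaled to [0, 1] and bicubic downsampling of the SR output to the LR resolution.

```python
import torch
import torch.nn.functional as F

def psnr(a: torch.Tensor, b: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = F.mse_loss(a, b)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def lr_psnr(sr: torch.Tensor, lr: torch.Tensor) -> torch.Tensor:
    """LR-PSNR: PSNR between the downsampled SR output and the original LR input."""
    sr_down = F.interpolate(sr, size=lr.shape[-2:], mode="bicubic", align_corners=False)
    return psnr(sr_down, lr)
```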
Performance
In this section, we assess the effectiveness of SRDDGAN by comparing it with various cutting-edge super-
resolution techniques on face super-resolution (8×) and general super-resolution (4×) tasks. The specifics of
these baseline models’ configurations can be found in their original research papers. Furthermore, we gauge
our model’s performance against these baseline models regarding sample quality, diversity, and sampling speed.
Face SR: Table 2 and Fig. 6 depict our evaluation of SRDDGAN on Face SR (8×) using the CelebA-HQ
validation set. We benchmarked SRDDGAN against various state-of-the-art super-resolution models, namely
PSNR-driven RRDB38 (which is a PSNR-oriented method trained using only L1 loss), GAN-based ESRGAN38, flow-based SRFlow17, and DDPM-based SR322 and SRDiff23. The evaluation metrics show that in most cases,
SRDDGAN outperforms the previous models, generating high-quality and diverse SR images that remain consistent with the LR input. Specifically:
Figure 6. Face SR (8×) visual results. The SRDDGAN-generated details are more elaborate than those produced
by SR3, SRFlow, and SRDiff. This approach circumvents the visual artifacts observed in ESRGAN, such as
distortions in the woman’s teeth and eyes. Additionally, the SR produced by the model appears more realistic
and diverse, maintaining consistency with the original image.
1. According to Table 2, SRDDGAN demonstrates superior performance over other state-of-the-art super-
resolution models in terms of perceived quality. LPIPS serves as a primary indicator in this comparison. SRD-
DGAN achieves nearly a 1× improvement in the LPIPS score compared to RRDB, showcasing its superiority.
Even compared to GAN-based methods, SRDDGAN achieves significantly better results on all reference
indicators, including PSNR, which is traditionally considered a fidelity metric. This suggests that SRDDGAN
maintains HR fidelity while also achieving better perceived quality. Compared with Flow-based and DDPM-
based methods, we achieve some competitive performance on the reference metrics. Notably, SRDDGAN
achieves the highest LR-PSNR score among all models, highlighting its consistency with the input LR image.
2. Figure 6 demonstrates that the SRDDGAN model outperforms ESRGAN in avoiding artifacts and preserving
fine details, resulting in a precise and natural-looking image. Our model also produces superior visual results
compared to SRFlow in the tooth and eye regions. In addition, when compared to the DDPM-based
method, SRDDGAN outperforms SR3 in the mouth area and generates more detailed results than SRDiff.
3. Our model (39.14M parameters) has fewer parameters than SR3 (550M) and SRFlow (40M) while converging faster, taking only 240K iterations on the same dataset to converge. In contrast, SRDiff requires approximately 300K iterations and SR3 around 1000K iterations, highlighting the high training efficiency of our SRDDGAN model.
Figure 7. General SR (4×) visual results. SRDDGAN is superior to EDSR and RRDB in generating SR images
that align with human perception instead of producing blurred hairs. Notably, only SRDDGAN successfully
preserves the horizontal stripe on the brown wall in the second image, which corresponds with the reference
image.
General SR: Table 3 and Fig. 7 display the outcomes of evaluating the generic SRDDGAN using the DIV2k
validation set. The performance of SRDDGAN was compared with other models such as EDSR45, RRDB38,
ESRGAN38, SRFlow17, and SRDiff23. For the 4× setting, we used the officially released pre-trained models of these
models for comparison. As a result, it was observed that SRDDGAN produced intricate details and exhibited
excellent perceptual quality. Specifically:
1. As shown in Table 3, EDSR and RRDB models are trained exclusively using reconstruction losses, which
results in subpar performance on the perceptual LPIPS metric. In contrast, our SRDDGAN model outperforms the GAN-based ESRGAN in terms of PSNR, LPIPS, and LR-PSNR. Notably, SRDDGAN achieves the highest LR-PSNR score among all compared models;
2. In Fig. 7, it was noted that EDSR and RRDB produced unsatisfactory visualizations due to their inadequate
generation of high-frequency details. Conversely, SRDDGAN surpassed SRDiff in perceptual quality by
generating rich and detailed visualizations. Additionally, a close examination of the reference image revealed
that SRDDGAN displayed superior perceptual details compared to SRFlow and SRDiff. In the first row,
SRDDGAN produced intricate hair details in the top right corner of the eye and a sharp, brown horizontal
line on the white wall in the second row.
High quality and diversity of sampling: Assessing various models for the image super-resolution task on
the CIFAR-10 dataset (2x upscaling), we evaluated their performance using the quantitative metrics in Table 4.
Our SRDDGAN model exhibited outstanding performance in this task, delivering remarkable results. With an
FID score of 3.92 on 50k CIFAR-10 images, SRDDGAN displayed exceptional image quality, competing com-
petitively with top diffusion models and GANs. While LDM46 required 20,000 diffusion steps for the same task,
SRDDGAN only needed four steps, showcasing its rapid sampling speed. Furthermore, SRDDGAN achieved
an IS score of 9.60, highlighting its outstanding image diversity, quality, and swift sampling performance. These
findings underscore the excellent performance of SRDDGAN in image super-resolution tasks, offering high-quality, diverse, and rapidly sampled image generation and demonstrating its potential and competitiveness in image processing.
Moreover, the results in Fig. 1 demonstrate that our model can generate diverse high-resolution (SR) images
from a single low-resolution (LR) input image. These generated images exhibit natural variations in features such
as hair tips, mouth shape, and eyebrow arches while remaining consistent with the input LR image.
Sampling speed and inference steps: Figure 8 illustrates that the SRDDGAN model surpasses other diffu-
sion-based image generation models, including DDIM56, an enhanced version of DDPM. The SRDDGAN model
possesses two primary benefits: swifter sampling speed and superior image quality generation. Our model only
requires 0.30 seconds to sample an image, whereas other diffusion-based image generation methods, such as
SR3, demand 3.29 seconds per image sampling time. As a result, our model can produce more high-quality
image samples in a shorter period. Additionally, our model shows an enhancement in PSNR evaluation metrics
relative to SR3 and SRDiff (see Table 2). Notably, despite requiring just four sampling steps, our model achieves
exceptional sample quality and speed, distinguishing us from other models.
Ablation Study
We developed two variants of the low-resolution (LR) conditioning and investigated their impact on SRDDGAN. The first model (V1) directly concatenates the low-resolution image with the noisy input and feeds them into the model. The second model (V2) extends V1 by incorporating a low-resolution encoder. Our research found that using a low-resolution encoder yields better performance metrics; refer to the results in Table 5.
Table 4. Quantitative comparison of SRDDGAN with state-of-the-art models on the CIFAR-10 dataset (×2). FID and IS are computed on 50k samples.
Figure 8. Comparison of sampling time and diffusion steps of different models on the CelebA-HQ dataset.
Table 6. Effectiveness of content and style loss for SRDDGAN on CelebA-HQ (8×).
As depicted in Table 6, the model in the third row demonstrates superior performance across all metrics, achieving a PSNR of 25.75, an SSIM of 0.76, an LPIPS of 0.132, and an LR-PSNR of 53.69. The performance difference between the second and third rows is minor, but with the inclusion of content and style losses, the fourth row exhibits enhanced image quality and consistency. Introducing style and content losses significantly boosts the model's performance, improving fidelity and perceptual similarity.
Table 7 presents a sequence of ablation experiments exploring the impact of the latent variable Z embedding dimension and the number of diffusion steps on the diffusion model. The data in rows 1, 4, 5, and 6 show that the model generates higher-quality and clearer images as the number of diffusion steps increases. Furthermore, rows 1, 2, and 3 illustrate that increasing the number
of embedding dimensions of the latent variables enhances the quality of the super-resolved image and improves
its agreement with the LR image. However, more diffusion steps result in slower inference, so T=4 and Z=256 are set as the default settings, which maintain consistency with the LR images. The last row of Table 7 reveals that
without any latent variable z, the model generates significantly poor sample quality, emphasizing the importance
of multimodal denoising distributions.
Extensions
To comprehensively evaluate the model performance of SRDDGAN, we apply it in the domain of content fusion
and real-world degraded pictures in this subsection.
Content fusion: We aim to utilize other images to modify SR images. Let x represent an LR image and y an HR image. If we are manipulating a super-resolved image, then y0 = G(x, yt, t, z) is an SR sample of x. However, we can also manipulate an existing HR image y by setting x = d↓(y), the down-scaled version of y.
Subsequently, we can modify the SR image by directly incorporating additional image content in the image space.
The following example illustrates merging one facial region (such as the eyes or mouth) of one person with the rest of another person's face. The content fusion process involves the following steps: initially, we replace the mouth region of the face image (target) with the corresponding mouth region of the other image (source) to generate a synthetic content image (Input). Subsequently, we obtain the LR image through bicubic downsampling and generate the corresponding SR image through model iteration. Lastly, we paste the mouth region of the generated SR image onto the target image while preserving the unprocessed facial area.
Figure 9 in the example showcases the transfer of facial features and eyes. The latent variable Z in our approach
enhances the diversity of the generated SR image. For instance, in comparison to the source image, the mouth
area of the sampled SR image is more varied and natural.
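The procedure above can be summarized in a short sketch; the mask-based region replacement, the bicubic downsampling factor, and the generic sr_model callable (standing in for iterative SRDDGAN sampling and assumed to upscale back to the HR resolution) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def content_fusion(target_hr, source_hr, mask, sr_model, scale=8):
    """Content fusion sketch: paste a masked region (e.g. the mouth) from source_hr
    into target_hr, downsample the composite, super-resolve it with an arbitrary
    sampler sr_model(lr) -> sr, and keep only the fused region from the result."""
    fused = mask * source_hr + (1 - mask) * target_hr                 # synthetic content image (Input)
    lr = F.interpolate(fused, scale_factor=1.0 / scale,
                       mode="bicubic", align_corners=False)           # x = d↓(y)
    sr = sr_model(lr)                                                  # e.g. iterative SRDDGAN sampling
    return mask * sr + (1 - mask) * target_hr                         # preserve the unprocessed facial area
```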
Experimental comparison on real-world datasets: To comprehensively evaluate the capability of SRDDGAN
in processing complex degraded images from the real world, we collected low-resolution (LR) images from actual
environments. As shown in Fig. 10, the quality of high-resolution (HR) images reconstructed by SRDDGAN
is significantly superior to those reconstructed by SRFlow, SRDiff, and SR3. Specifically, SRDDGAN in Fig. 10
is significantly better than the other models in detail and texture. For instance, the lines on the wall in the first
row should be straight, and the branching of the tree limbs should be clear rather than blurred. In contrast,
SRDDGAN reconstructs clear images and restores complete details and textures. Experiments on real datasets
Figure 9. SRDDGAN model integrates and coordinates the content from the source image with the target
image.
demonstrate that SRDDGAN has excellent generalizability and is suitable for single-image super-resolution
(SISR) tasks in real-world scenarios.
Conclusion
This paper introduces SRDDGAN, the first diffusion-based Single Image Super-Resolution (SISR) model that relies on only a small number of sampling steps. The study posits that in diffusion-based SISR tasks, the slow sampling speed is primarily due to the Gaussian assumption used in the denoising distribution, which holds only for small step sizes and therefore requires many denoising steps. To address this issue, SRDDGAN is proposed. This method utilizes complex multimodal distributions to model each denoising step, allowing for larger denoising strides. To alleviate the
ill-posedness of super-resolution, latent variable Z is introduced to diversify the predictions of SR. Furthermore,
to exploit the information in the low-resolution (LR) image efficiently, a custom LR encoder module is employed
to constrain the solution space of HR using a simple conditional generation approach. Finally, style and content
loss functions are combined to recover some high-frequency details.
Extensive experiments show that SRDDGAN can generate a wide range of high-quality, realistic SR images. Moreover, the model is cost-effective at inference time, making it more practical for real-world
applications. Despite exhibiting advantages in experiments, SRDDGAN still has limitations. For instance, it
tends to produce blurry results, especially in the detailed texture of features such as hair, as seen in Figs. 6 and 9.
In the future, we plan to enhance the treatment of fine texture details without altering the existing diffusion
steps. Initially, the image super-resolution reconstruction process will be divided into two stages. The initial stage
prioritizes upsampling, utilizing networks like RRDB to enlarge low-resolution images and obtain the initial
stage’s super-resolved images. In the second stage, we aim to restore residual maps of texture details, introducing
residual learning and enhancing the fusion of super-resolution networks (texture transfer networks) with existing
diffusion models to learn and recover texture details. Finally, by combining the super-resolved images generated in the first stage with the residual maps from the second stage, we aim to obtain the final super-resolved images and address the limitations observed in current super-resolution experiments. Furthermore, we aim to
broaden the research to encompass a broader range of image transformation tasks, such as medical imaging,
image coloring, and JPEG restoration.
Data availability
The dataset and code used and analyzed during the current study are available from the corresponding author
upon reasonable request. The CelebA-HQ dataset is accessible on GitHub at https://github.com/tkarras/progressive_growing_of_gans. The FFHQ dataset can be found at https://github.com/NVlabs/ffhq-dataset. CIFAR-10 data is available via https://www.cs.toronto.edu/~kriz/cifar.html. The Flickr2K dataset can be obtained from http://cv.snu.ac.kr/research/EDSR/Flickr2K.tar. The Div2K dataset is accessible at https://data.vision.ee.ethz.ch/cvl/DIV2K/.
References
1. Wang, Z., Chen, J. & Hoi, S. C. Deep learning for image super-resolution: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 43,
3365–3387. https://doi.org/10.1109/TPAMI.2020.2982166 (2020).
2. Fernandez-Beltran, R., Latorre-Carmona, P. & Pla, F. Single-frame super-resolution in remote sensing: A practical overview. Int.
J. Remote Sens. 38, 314–354. https://doi.org/10.1080/01431161.2016.1264027 (2017).
3. Rasti, P., Uiboupin, T., Escalera, S. & Anbarjafari, G. Convolutional neural network super resolution for face recognition in surveil-
lance monitoring. In Articulated Motion and Deformable Objects: 9th International Conference, AMDO 2016, Palma de Mallorca,
Spain, July 13-15, 2016, Proceedings 9, 175–184. (Springer, 2016). https://doi.org/10.1007/978-3-319-41778-3_18.
4. Haris, M., Shakhnarovich, G. & Ukita, N. Task-driven super resolution: Object detection in low-resolution images. In Neural
Information Processing: 28th International Conference, ICONIP 2021, Sanur, Bali, Indonesia, December 8–12, 2021, Proceedings,
Part V 28, 387–395 (Springer, 2021). https://doi.org/10.1007/978-3-030-92307-5_45.
5. Huang, Y., Shao, L. & Frangi, A. F. Simultaneous super-resolution and cross-modality synthesis of 3d medical images using weakly-
supervised joint convolutional sparse coding. In Proceedings of the IEEE conference on computer vision and pattern recognition,
6070–6079. https://doi.org/10.1109/cvpr.2017.613 (2017).
6. Yan, J. et al. Medical image segmentation model based on triple gate multilayer perceptron. Sci. Rep. 12, 1–14. https://doi.org/10.
1038/s41598-022-09452-x (2022).
7. Tikhonov, A. N. & Arsenin, V. Y. Solutions of Ill-Posed Problems (V. H. Winston, 1977).
8. Hansen, P. C. Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion (SIAM, 1998).
9. Bengio, Y., Ducharme, R. & Vincent, P. A neural probabilistic language model. Adv. Neural Inf. Process. Syst. 13 (2000).
10. Kingma, D. P. & Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. https://doi.org/10.48550/arXiv.
1312.6114 (2013).
11. Dinh, L., Sohl-Dickstein, J. & Bengio, S. Density estimation using real nvp. arXiv preprint arXiv:1605.08803. https://doi.org/10.
48550/arXiv.1605.08803 (2016).
12. Goodfellow, I. et al. Generative adversarial networks. Commun. ACM 63, 139–144. https://doi.org/10.1145/3422622 (2020).
13. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learning using nonequilibrium thermody-
namics. In International Conference on Machine Learning, 2256–2265 (PMLR, 2015).
14. Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851. https://doi.org/
10.48550/arXiv.2006.11239 (2020).
15. Dahl, R., Norouzi, M. & Shlens, J. Pixel recursive super resolution. In Proceedings of the IEEE International Conference on Computer
Vision, 5439–5448. https://doi.org/10.48550/arXiv.1702.00783 (2017).
16. Liu, Z.-S., Siu, W.-C. & Chan, Y.-L. Photo-realistic image super-resolution via variational autoencoders. IEEE Trans. Circuits Syst.
Video Technol. 31, 1351–1365. https://doi.org/10.1109/TCSVT.2020.3003832 (2020).
17. Lugmayr, A., Danelljan, M., Van Gool, L. & Timofte, R. Srflow: Learning the super-resolution space with normalizing flow. In
Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, 715–732
(Springer, 2020). https://doi.org/10.1007/978-3-030-58558-7_42.
18. Ledig, C. et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 4681–4690. https://doi.org/10.1109/cvpr.2017.19 (2017).
19. Arjovsky, M. & Bottou, L. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.
04862. https://doi.org/10.48550/arXiv.1701.04862 (2017).
20. Sønderby, C. K., Caballero, J., Theis, L., Shi, W. & Huszár, F. Amortised map inference for image super-resolution. arXiv preprint
arXiv:1610.04490. https://doi.org/10.48550/arXiv.1610.04490 (2016).
21. Wang, Z., Zheng, H., He, P., Chen, W. & Zhou, M. Diffusion-gan: Training gans with diffusion. arXiv preprint arXiv:2206.02262.
https://doi.org/10.48550/arXiv.2206.02262 (2022).
22. Saharia, C. et al. Image super-resolution via iterative refinement. IEEE Trans. Pattern Anal. Mach. Intell.https://doi.org/10.1109/
TPAMI.2022.3204461 (2022).
23. Li, H. et al. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing 479, 47–59. https://doi.org/
10.1016/j.neucom.2022.01.029 (2022).
24. Salimans, T., Kingma, D. & Welling, M. Markov chain Monte Carlo and variational inference: Bridging the gap. In International
Conference on Machine Learning, 1218–1226 (PMLR, 2015). https://doi.org/10.48550/arXiv.1410.6460.
25. Liu, A., Liu, Y., Gu, J., Qiao, Y. & Dong, C. Blind image super-resolution: A survey and beyond. IEEE Trans. Pattern Anal. Mach.
Intell. 45, 5461–5480. https://doi.org/10.48550/arXiv.2107.03055 (2022).
26. Zhang, K., Liang, J., Van Gool, L. & Timofte, R. Designing a practical degradation model for deep blind image super-resolution.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4791–4800. https://doi.org/10.48550/arXiv.2103.
14006 (2021).
27. Karras, T., Aila, T., Laine, S. & Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint
arXiv:1710.10196. https://doi.org/10.48550/arXiv.1710.10196 (2017).
28. Agustsson, E. & Timofte, R. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition Workshops, 126–135. https://doi.org/10.1109/cvprw.2017.150 (2017).
29. Brock, A., Donahue, J. & Simonyan, K. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:
1809.11096. https://doi.org/10.48550/arXiv.1809.11096 (2018).
30. Creswell, A. et al. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 35, 53–65. https://doi.org/10.1109/
MSP.2017.2765202 (2018).
31. Xiao, Z., Kreis, K. & Vahdat, A. Tackling the generative learning trilemma with denoising diffusion gans. arXiv preprint arXiv:
2112.07804. https://doi.org/10.48550/arXiv.2112.07804 (2021).
32. Feller, W. On the theory of stochastic processes, with particular reference to applications. In Proceedings of the [First] Berkeley
Symposium on Mathematical Statistics and Probability, vol. 1, 403–433 (University of California Press, 1949).
33. Gui, J., Sun, Z., Wen, Y., Tao, D. & Ye, J. A review on generative adversarial networks: Algorithms, theory, and applications. IEEE
Trans. Knowl. Data Eng.https://doi.org/10.1109/TKDE.2021.3130191 (2021).
34. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.
1556. https://doi.org/10.48550/arXiv.1409.1556 (2014).
35. Wang, X. et al. Superresolution reconstruction of single image for latent features. arXiv preprint arXiv:2211.12845. https://doi.
org/10.1007/s41095-023-0387-8 (2022).
36. Song, Y. et al. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456. https://doi.org/10.48550/arXiv.2011.13456 (2020).
37. Karras, T., Laine, S. & Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/
CVF Conference on Computer Vision and Pattern Recognition, 4401–4410. https://doi.org/10.1109/cvpr.2019.00453 (2019).
38. Wang, X. et al. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on
Computer Vision (ECCV) Workshops. https://doi.org/10.48550/arXiv.1809.00219 (2018).
39. Li, Z. et al. Feedback network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 3867–3876. https://doi.org/10.1109/cvpr.2019.00399 (2019).
40. Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In 7th International Conference on Learning Representations
(ICLR). New Orleans, LA, USA, May 2019 (2019).
41. Wang, Z., Bovik, A. C., Sheikh, H. R. & Simoncelli, E. P. Image quality assessment: From error visibility to structural similarity.
IEEE Trans. Image Process. 13, 600–612. https://doi.org/10.1109/TIP.2003.819861 (2004).
42. Zhang, R., Isola, P., Efros, A. A., Shechtman, E. & Wang, O. The unreasonable effectiveness of deep features as a perceptual metric.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 586–595. https://doi.org/10.1109/cvpr.2018.
00068 (2018).
43. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. Gans trained by a two time-scale update rule converge to a
local Nash equilibrium. Advances in Neural Information Processing Systems 30. https://doi.org/10.48550/arXiv.1706.08500 (2017).
44. Salimans, T. et al. Improved techniques for training gans. Advances in Neural Information Processing Systems 29. https://doi.org/
10.48550/arXiv.1606.03498 (2016).
45. Lim, B., Son, S., Kim, H., Nah, S. & Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 136–144. https://doi.org/10.48550/arXiv.1707.
02921 (2017).
46. Cao, B., Zhang, H., Wang, N., Gao, X. & Shen, D. Auto-gan: Self-supervised collaborative learning for medical image synthesis.
In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 10486–10493 (2020).
47. Vahdat, A. & Kautz, J. Nvae: A deep hierarchical variational autoencoder. Adv. Neural Inf. Process. Syst. 33, 19667–19679. https://
doi.org/10.48550/arXiv.2007.03898 (2020).
48. Sinha, A., Song, J., Meng, C. & Ermon, S. D2c: Diffusion-decoding models for few-shot conditional generation. Adv. Neural Inf.
Process. Syst. 34, 12533–12548 (2021).
49. Parmar, G., Li, D., Lee, K. & Tu, Z. Dual contradistinctive generative autoencoder. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 823–832. https://doi.org/10.48550/arXiv.2011.1006 (2021).
50. Brock, A., Donahue, J. & Simonyan, K. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:
1809.11096 (2018).
51. Karras, T. et al. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 8110–8119. https://doi.org/10.48550/arXiv.1912.04958 (2020).
52. Chan, K. C., Wang, X., Xu, X., Gu, J. & Loy, C. C. Glean: Generative latent bank for large-factor image super-resolution. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14245–14254. https://doi.org/10.48550/arXiv.2012.
00739 (2021).
53. Song, Y. & Ermon, S. Improved techniques for training score-based generative models. Adv. Neural Inf. Process. Syst. 33, 12438–
12448 (2020).
54. Zhang, Q. & Chen, Y. Diffusion normalizing flow. Adv. Neural Inf. Process. Syst. 34, 16280–16291 (2021).
55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695 (2022).
56. Song, J., Meng, C. & Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502. https://doi.org/10.48550/
arXiv.2010.02502 (2020).
Acknowledgements
This work is supported by Guangxi Science and Technology Major Project (AA19254016), Beihai city science
and technology planning project (202082033), Beihai city science and technology planning project (202082023),
Guangxi graduate student innovation project (YCSW2021174).
Author contributions
H.X.: writing—original draft, writing—review and editing, H.X., X.W.: writing—original draft, conceptualization,
data curation, validation. J.C., J.W.: conceptualization, data curation, validation. J.W., J.C., J.D., J.Y.: conceptualiza-
tion, formal analysis, writing—review and editing, supervision, and funding. Y.T., J.Y., J.D., J.C.: formal analysis,
supervision, writing—review, and editing. All authors read and approved the final manuscript.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary Information The online version contains supplementary material available at https://doi.org/
10.1038/s41598-024-52370-3.
Correspondence and requests for materials should be addressed to X.W. or J.W.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.