A Practical Contrastive Learning Framework For Single Image Super Resolution

Abstract— Contrastive learning has achieved remarkable success on various high-level tasks, but fewer contrastive learning-based methods have been proposed for low-level tasks. It is challenging to adopt vanilla contrastive learning technologies proposed for high-level visual tasks directly to low-level image restoration problems, because the acquired high-level global visual representations are insufficient for low-level tasks that require rich texture and context information. In this article, we investigate contrastive learning-based single-image super-resolution (SISR) from two perspectives: positive and negative sample construction and feature embedding. The existing methods take naive sample construction approaches (e.g., considering the low-quality input as a negative sample and the ground truth as a positive sample) and adopt a prior model (e.g., the pretrained very deep convolutional network proposed by the Visual Geometry Group (VGG)) to obtain the feature embedding. To this end, we propose a practical contrastive learning framework for SISR (PCL-SR). We involve the generation of many informative positive and hard negative samples in frequency space. Instead of utilizing an additional pretrained network, we design a simple but effective embedding network inherited from the discriminator network, which is more task-friendly. Compared with the existing benchmark methods, we retrain them with our proposed PCL-SR framework and achieve superior performance. Extensive experiments and thorough ablation studies have been conducted to show the effectiveness and technical contributions of our proposed PCL-SR. The code and resulting models will be released via https://github.com/Aitical/PCL-SISR.

Index Terms— Contrastive learning, data augmentation, feature embedding, super-resolution.

I. INTRODUCTION

Contrastive learning has achieved remarkable success on various high-level tasks, such as image clustering [8], fine-grained image classification [9], and knowledge distillation [10].

When it comes to low-level image processing tasks, there are some challenges to directly applying contrastive learning approaches. First, the learned global visual representations are inadequate for low-level tasks that call for rich texture and context information. Second, a series of data augmentations have been developed to generate positive and negative samples for high-level downstream tasks [3], [4], [11], [12]. However, except for some simple geometric augmentations (e.g., the rotation augmentation in the widely used self-ensemble scheme), most of the complicated data augmentations cannot maintain dense pixel correspondences and, thus, are not suitable for low-level tasks. Third, a meaningful latent space (or embedding space) is required for the contrastive loss. In contrast to high-level tasks that try to obtain the best semantic representations, low-level tasks aim at reconstructing restored results in data space. It is of great importance to explore a proper and meaningful embedding space where the contrastive loss can be effectively defined.

Current contrastive learning-based methods for low-level tasks mainly focus on exploiting negative samples, while taking the ground-truth image as the positive sample. For example, [13] treats the degraded image (the input hazy image) as a negative sample and presents a novel image dehazing method with a contrastive regularization. In [14] and [15], other examples in the dataset are taken as negative samples for contrastive image super-resolution and underwater image enhancement, respectively. Wang et al. [19] apply the contrastive loss to pretrain a kernel estimation model, which aims at separating different degradations and obtaining a degradation-aware representation.
TABLE I
Comparison Between the Proposed Contrastive Learning Framework and Current Contrastive Learning Approaches for Low-Level Tasks
Table I summarizes the characteristics of the most recently presented contrastive learning-based image restoration methods. The positive samples are defined as the ground truth, while the negative samples are simply defined as the degraded images or other images in the dataset [13], [14], [15]. These negative samples are dissimilar to the reconstructed image and easily distinguished, i.e., they are too distant to contribute to the contrastive loss. Depending on the specific image restoration task, another line of research [16], [17], [18], [19] tries to generate invariant (global) features of the image, which are immune to noise, rain, resolution, and blur, based on a contrastive loss. These methods overlook the ingredient of constructing effective positive and negative pairs for the reconstructed image. In addition, their contrastive losses are defined on some specific embedding space and cannot generalize well to other methods.

In this article, we propose a practical contrastive learning framework for SISR (PCL-SR), and we investigate it from two perspectives: sample construction and feature embedding. As revealed by recent studies, the super-resolved results of current deep learning methods are smooth and look implausible (they may be averaged from all possible outputs of the SR network). Based on these observations and hard negative mining studies [20], [21], we propose to generate multiple hard negative samples by applying slight blur to the ground truth and to generate multiple positive samples by simply sharpening the ground truth, resulting in informative positive and hard negative pairs for the super-resolved image. In this way, we believe that more hard negative samples will push the super-resolved image away from smooth results, while more positive samples will force the network to draw in more detailed information. On the other hand, in contrast to adopting a prior model as the embedding network (e.g., the pretrained very deep convolutional network proposed by the Visual Geometry Group (VGG) [1]), we propose to leverage a cheap and task-friendly feature embedding network, the discriminator of the SR network, to embed positive/negative/anchor samples into a proper feature space where the contrastive loss can be effectively defined.

Our contributions are summarized as follows: 1) we propose a practical and general contrastive learning framework for the single-image super-resolution (SISR) task, and we adopt a valid data augmentation strategy to generate informative positive and negative samples for SISR; 2) we rethink and explore a novel way to obtain a task-friendly embedding network, in which the contrastive loss works efficiently, by reusing the discriminator of the SR network; and 3) extensive experiments show that our method, dubbed PCL-SR, outperforms several representative SISR methods. In addition, ablation studies are conducted to analyze the different components of our proposed method.

We first review related work on SISR and contrastive learning in Section II. In Section III, we introduce our proposed PCL framework and explain it in detail. Then, Section IV describes our training settings and experimental results, including the ablation analysis, where we compare the performance of our approach with other state-of-the-art (SoTA) methods. Finally, conclusions are drawn in Section V.

II. RELATED WORK

In this section, we briefly introduce related work on self-supervised contrastive learning methods, contrastive learning methods for image restoration, contrastive learning for image-to-image (I2I) translation, and SISR methods.

A. Contrastive Learning

Contrastive learning has emerged as an effective paradigm for unsupervised representation learning by maximizing mutual information. In recent years, several works [2], [3], [4], [23], [24] have studied this paradigm. The contrastive loss, which is similar to previous works [25], [26], [27] in deep metric learning, aims to push the anchor sample away from negative samples and pull it closer to positive samples in the latent space. The selection of negative and positive samples is crucial and depends on downstream tasks. In [2], [3], and [4], randomly augmented samples from the anchor sample serve as positive samples, while samples from other images are negative. Tian et al. [11] analyze the optimal augmentation strategy in detail. Cai et al. [12] propose a feature space augmentation strategy, sampling numerous positive and negative samples from the latent feature distribution. Chuang et al. [20], Robinson et al. [21], and Sinha et al. [28] analyzed the impact of hard negative samples and exploited hard negative mining strategies to dig out more informative negative samples.

B. Contrastive Learning for Image Restoration

For single-image dehazing, Wu et al. [13] propose a method that utilizes a pretrained VGG model to obtain latent embeddings, where positive and negative samples are selected from
the ground truth and the corresponding hazy image, respectively, and the contrastive loss is computed on intermediate feature maps extracted from the VGG network. For blind super-resolution (BSR), Wang et al. [19] apply the contrastive loss to pretrain a kernel estimation model, which obtains a degradation-aware representation by distinguishing different degradations. The authors assume that patches from the same image are under the same degradation and that patches from different images are not, thus selecting positive and negative samples accordingly. Zhang et al. [18] propose a two-stage BSR method. First, an image encoder is pretrained to learn resolution-invariant features by a contrastive loss. Then, another contrastive loss is adopted to finetune the super-resolved results.

Recently, Liu et al. [22] introduced a contrastive prior into underwater image enhancement, which takes observed underwater images as negative samples and clear in-air images as positive samples, and the pretrained VGG network is adopted as the feature embedding network to provide multilayer intermediate features.

C. Contrastive Learning for I2I Translation

Contrastive unpaired image-to-image translation (CUT) [29] is a pioneering work that proposed a patch-based contrastive loss for unpaired I2I translation, which effectively obtains structure-preserving and style-transferring features in latent space. For a given patch of the input image, the positive and negative samples are the corresponding transferred patch and other random patches from the same domain, respectively. Han et al. [30] propose the dual contrastive learning framework for unsupervised image-to-image translation (DCLGAN), which utilizes a bidirectional patch-based contrastive loss between the source and target domains. In recent work, Liu et al. [22] explore the effectiveness of negative samples in patch-based contrastive loss and propose a negative pruning technology called ranking and pruning noise contrastive estimation (RankNCE), which reduces the number of negative samples by ranking the similarity scores between the corresponding negative patches and the anchor patch feature. Ko et al. [31] introduce the self-supervised paradigm proposed in [6] into the I2I task and explore a self-supervised dense consistency regularization (DCR), where two augmented samples with an overlap region are taken as the input, and the DCR is calculated on the dense representation of the overlap region.

D. Single-Image Super-Resolution

Deep-learning-based methods have dominated the SISR field in recent years. Dong et al. proposed the first convolutional neural network (CNN)-based SR method (SRCNN). Since then, various efficient and deeper architectures have been proposed for SR, and the performance on benchmark datasets has been continuously improved by newly developed network architectures [32], [33], [34], [35], [36], [37]. In addition, Zhou et al. [38] propose adaptive patch feature aggregation with a graph neural network (IGNN). Chen et al. [39] proposed a multitask pretrained image processing transformer (IPT). Liang et al. [40] develop the image restoration using Swin transformer (SwinIR) method, which introduces a hybrid architecture that combines the Swin Transformer [41] and convolutions into image restoration tasks. Besides investigating more powerful network architectures, some perceptual-driven approaches have also been proposed, utilizing perceptual loss functions to achieve better visual quality [42], [43], [44], [45].

In this article, we present PCL-SR. We investigate the contrastive learning paradigm for SISR from two perspectives: sample generation and feature embedding. In contrast to the existing methods, which propose hard negative samples via semantic discrimination [20], [21], [28] or patch feature similarity [22], we exploit task-valid hard negative samples for SISR by performing a slight blur operation. Moreover, while most existing methods adopt a pretrained VGG as the feature embedding network, we propose an adaptive frequency-aware feature embedding approach, which learns task-friendly embedded features and is sensitive to high-frequency information.

III. METHOD

In this section, we introduce our proposed PCL-SR in detail. We first introduce the preliminaries of contrastive learning, and then, we describe our positive and negative sample generation strategy and the training of our feature encoder. Finally, we present our main framework, which employs contrastive learning to further improve the performance of existing SISR works.

A. Preliminaries

Contrastive learning is one of the most powerful approaches for representation learning. It aims at pulling the anchor sample close to the positive samples and pushing it far away from the negative samples in latent space [2], [3], [4]. For an image dataset $\mathcal{I}$, the representation learning model $E$ is trained to extract representations $\mathcal{R} = \{r_i \mid r_i = E(I_i), I_i \in \mathcal{I}\}$ with the InfoNCE loss [23], [24]. The loss $\mathcal{L}_{\text{InfoNCE}}$ is based on a softmax formulation, and for the $i$th sample, the loss $\mathcal{L}_i$ is formulated as follows:

$$\mathcal{L}_i = -\log \frac{\exp(r_i^{\top} \cdot r_i^{+}/\tau)}{\exp(r_i^{\top} \cdot r_i^{+}/\tau) + \sum_{j=1}^{K} \exp(r_i^{\top} \cdot r_j^{-}/\tau)} \tag{1}$$

where $\tau$ is the temperature hyperparameter and $r_i^{+}$ denotes the representation of the positive sample, usually generated by random data augmentations from the same sample $I_i$. $K$ is the number of negative samples, and $\{r_j^{-}\}_{j=1}^{K}$ is the set of negative representations from the negative samples $\{I_j \mid I_j \in \mathcal{I}, j \neq i\}_{j=1}^{K}$, which are other images randomly selected from the dataset. Finally, the total contrastive loss is as follows:

$$\mathcal{L}_{\text{InfoNCE}} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_i. \tag{2}$$

In addition, the work in [7] modified and applied the contrastive loss to the supervised classification task, where there is more than one positive sample. For the $i$th image, this supervised contrastive loss is as follows:

$$\mathcal{L}_i = -\frac{1}{P} \sum_{p=1}^{P} \log \frac{\exp(r_i^{\top} \cdot r_p^{+}/\tau)}{\exp(r_i^{\top} \cdot r_i^{+}/\tau) + \sum_{j=1}^{K} \exp(r_i^{\top} \cdot r_j^{-}/\tau)} \tag{3}$$

where $P$ is the number of positive samples, whose representations are noted as $\{r_p^{+}\}_{p=1}^{P}$. Then, the total supervised contrastive loss $\mathcal{L}_{\text{SupCL}}$ is

$$\mathcal{L}_{\text{SupCL}} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_i. \tag{4}$$

Contrastive learning is beneficial to various downstream tasks and achieves promising performance. As described in Section I, to employ contrastive learning methods, a carefully designed sample selection and construction strategy and a task-related latent space need to be explored.
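To make the preliminaries concrete, the following is a minimal PyTorch sketch of Eqs. (3) and (4). The temperature default and the use of each positive's own similarity in the denominator are our assumptions for illustration, not settings taken from this article.

```python
import torch

def supervised_contrastive_loss(anchor, positives, negatives, tau=0.07):
    """Supervised contrastive loss for one anchor, cf. Eq. (3).

    anchor:    (D,)   embedding r_i of the anchor sample
    positives: (P, D) embeddings {r_p^+} of the positive set
    negatives: (K, D) embeddings {r_j^-} of the negative set
    tau:       temperature (0.07 is an assumed default, not from the paper)
    """
    pos = torch.exp(anchor @ positives.t() / tau)        # (P,) similarities to positives
    neg = torch.exp(anchor @ negatives.t() / tau).sum()  # scalar: sum over negatives
    # Average the per-positive log-ratio over the positive set; each
    # positive's own similarity appears in its denominator here (assumption).
    return -torch.log(pos / (pos + neg)).mean()

def total_supervised_contrastive_loss(batch):
    # Eq. (4): average the per-sample losses over a batch of
    # (anchor, positives, negatives) triples.
    return torch.stack(
        [supervised_contrastive_loss(a, p, n) for a, p, n in batch]
    ).mean()
```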
Next, we describe our sample generation strategy from the perspective of the frequency domain and how to train a task-friendly embedding network for SISR instead of using a pretrained model, such as the VGG network.

B. Positive and Negative Sample Generation

The SISR task aims to transform LR images (noted as $I^{LR}$) into sharp, realistic HR images (noted as $I^{HR}$). As LR images are formulated from an image degradation process and contain only low-frequency information, the SR model focuses on learning a reverse translation to recover the lost high-frequency components (e.g., edge and texture information). With this in mind, we adopt valid data augmentations to generate our positive and negative samples.

1) Informative Positive Sample Generation: In addition to the single HR ground truth, we further generate $K_P$ sharpened images as the positive set $\mathcal{P}_i$ by applying different high-pass kernels to the HR image. For the $i$th image, we denote its positive set as follows:

$$\mathcal{P}_i = \{P_j \mid P_j = \mathrm{Sharpen}(I_i^{HR})\}_{j=1}^{K_P} \tag{5}$$

where $K_P$ is the number of positive samples and Sharpen denotes a random sharpening function. This is different from the existing contrastive learning-based image restoration methods, which consider only the ground truth as the positive sample [13], [18]. It should be noted that, in order to generate more informative positive samples, we apply different high-pass kernels to the HR ground truth. This positive sample generation strategy is designed following two observations: 1) the objective of the SISR task is to obtain detailed results, so we can use informative positive samples to induce more high-frequency details in the reconstruction results and 2) SISR is an ill-posed problem, and the mapping between the LR and HR images is "one-to-many." That is, the ground truth should not be unique, and there are many possible HR samples besides the given ground truth [46]. Our proposed positive sample generation method can be seen as a very coarse instance of this idea.
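As an illustration of Eq. (5), the sketch below uses unsharp masking (adding back a random amount of the high-pass residual) as one possible realization of the random Sharpen function; the kernel size, sigma, and boost range are assumptions, since the article does not specify the high-pass kernels.

```python
import torch
import torchvision.transforms.functional as TF

def sharpen(hr, amount_range=(0.2, 1.0)):
    """One sharpened positive sample via unsharp masking (a high-pass boost).

    hr: (C, H, W) or (B, C, H, W) tensor in [0, 1]. The blur reference and
    boost range are illustrative assumptions.
    """
    low = TF.gaussian_blur(hr, kernel_size=5, sigma=1.0)   # low-pass reference
    amount = torch.empty(1).uniform_(*amount_range).item()
    return (hr + amount * (hr - low)).clamp(0.0, 1.0)      # add back high-pass residual

def positive_set(hr, k_p=4):
    # Eq. (5): K_P sharpened variants of the ground truth
    return [sharpen(hr) for _ in range(k_p)]
```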
2) Hard Negative Sample Generation: To introduce contrastive learning to low-level image restoration problems, recent works [13], [14], [15] simply take degraded images (e.g., the input hazy image or the LR image) or other images in the dataset as negative samples. When compared with the super-resolved image, such low-quality negative samples are dissimilar and easy to distinguish. Inspired by hard negative sample mining and adversarial training methods [20], [21], [47], a natural idea is to feed difficult examples that are similar to the ground truth as hard negative samples. Here, we generate slightly blurry images from the ground truth as our hard negative sample set $\mathcal{N}_i$; because they are close to the ground truth, they force the reconstructed SR image to become even closer to the ground truth. For the $i$th image, we denote its negative set as follows:

$$\mathcal{N}_i = \{N_j \mid N_j = \mathrm{Blur}(I_i^{HR})\}_{j=1}^{K_N} \tag{6}$$

where $K_N$ is the number of negative samples, and we use $K_N = K_P = 4$ by default. Blur is a blur function with a random Gaussian kernel. For the blur kernel setting, the size is uniformly sampled from {3, 5, 7, 9, 11}, and the kernel width is uniformly sampled from [0.3, 1.5].
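A matching sketch of Eq. (6), using the kernel-size and kernel-width settings stated above; torchvision's gaussian_blur stands in for the Blur function.

```python
import random
import torchvision.transforms.functional as TF

def negative_set(hr, k_n=4):
    """Eq. (6): K_N hard negatives as slightly blurred ground-truth images.

    Kernel sizes {3, 5, 7, 9, 11} and widths in [0.3, 1.5] follow the
    settings stated in the text; hr is a (C, H, W) or (B, C, H, W) tensor.
    """
    negatives = []
    for _ in range(k_n):
        ksize = random.choice([3, 5, 7, 9, 11])  # random Gaussian kernel size
        sigma = random.uniform(0.3, 1.5)         # random kernel width
        negatives.append(TF.gaussian_blur(hr, kernel_size=ksize, sigma=sigma))
    return negatives
```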
C. Feature Embedding Network

In this section, we introduce a simple but efficient way to obtain a task-friendly embedding network. As described in Section I, VGG-based perceptual losses are widely adopted [43], [44], and recent work [13] designs a contrastive loss based on the pretrained VGG model. We believe that a task-friendly embedding network is better, because features obtained by the pretrained VGG tend to carry high-level semantic information. In addition, compared with the SISR task, pretraining on ImageNet is a very heavy task. Furthermore, a good embedding network for SISR should be degradation-aware so that the contrastive loss can work even when the super-resolved results are very close to the ground truth. In other words, a good embedding network should be able to distinguish changes in detail.

Inspired by the adversarial learning approaches in [43] and [48], we find that the discriminator learned in the vanilla generative adversarial network (GAN) framework is degradation-aware, because it can correctly distinguish whether the input image is fake or not. With this in mind, we employ a GAN-like framework to obtain our task-friendly embedding network by forcing it to distinguish SR and HR images, as illustrated in Fig. 1. Notably, to enhance the learning of high-frequency components, we separate the image into low- and high-frequency parts, and our embedding network is trained with only the high-frequency components, which has been verified to be effective in real-world SR problems [49]. Here, we use the Haar wavelet transform to extract the informative high-frequency components. The four sub-bands decomposed by the Haar wavelet transform are noted as the low-low frequency band (LL), the low-high frequency band (LH), the high-low frequency band (HL), and the high-high frequency band (HH). Then, we stack the three high-frequency-related components (LH, HL, and HH) as the input and feed them to the discriminator network $E_D$. Instead of training a conventional real-or-fake binary classification network, we adopt the contrastive discriminator loss proposed in [50] to train our $E_D$. It is a one-against-a-batch classification in the softmax cross-entropy formulation, and thus, the loss function of $E_D$ can be formulated as follows:

$$\mathcal{L}_{CLD} = \mathbb{E}_{I^{HR}}\left[\log \frac{e^{E_D(H_w(I^{HR}))}}{e^{E_D(H_w(I^{HR}))} + \sum_{I^{LR}} e^{E_D(H_w(I^{SR}))}}\right] + \mathbb{E}_{I^{LR}}\left[\log \frac{e^{-E_D(H_w(I^{SR}))}}{e^{-E_D(H_w(I^{SR}))} + \sum_{I^{HR}} e^{-E_D(H_w(I^{HR}))}}\right] \tag{7}$$

where $H_w(\cdot)$ denotes the extraction of the stacked high-frequency sub-bands described above.
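The following sketch shows how the stacked Haar sub-bands and the one-against-a-batch loss of Eq. (7) could fit together, assuming $E_D$ returns one scalar logit per image; minimizing the cross-entropy terms below is equivalent to maximizing the log terms in Eq. (7).

```python
import torch
import torch.nn.functional as F

def haar_highfreq(x):
    """One-level Haar decomposition of x: (B, C, H, W); returns the stacked
    high-frequency sub-bands (LH, HL, HH) as a (B, 3C, H/2, W/2) tensor."""
    a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
    lh = (-a - b + c + d) / 2   # low-high band (vertical detail)
    hl = (-a + b - c + d) / 2   # high-low band (horizontal detail)
    hh = (a - b - c + d) / 2    # high-high band (diagonal detail)
    return torch.cat([lh, hl, hh], dim=1)

def contrastive_disc_loss(e_d, hr, sr):
    """One-against-a-batch discriminator loss, cf. Eq. (7); e_d is assumed
    to map a stacked sub-band tensor to one scalar logit per image."""
    real = e_d(haar_highfreq(hr)).flatten()  # (B,) logits on real HR inputs
    fake = e_d(haar_highfreq(sr)).flatten()  # (B,) logits on fake SR inputs
    b = real.size(0)
    target = torch.zeros(b, dtype=torch.long, device=real.device)
    # row i: [real_i, fake_1, ..., fake_B]; class 0 realizes the first term
    real_vs_fakes = torch.cat([real.unsqueeze(1), fake.expand(b, b)], dim=1)
    # row i: [-fake_i, -real_1, ..., -real_B]; class 0 realizes the second term
    fake_vs_reals = torch.cat([-fake.unsqueeze(1), -real.expand(b, b)], dim=1)
    return F.cross_entropy(real_vs_fakes, target) + F.cross_entropy(fake_vs_reals, target)
```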
Fig. 1. Overview of our proposed contrastive learning framework for SISR. We adopt a GAN-like training process that updates our embedding network E_D and the target SR network F_SR iteratively. We train our embedding network to learn degradation-aware features. Then, E_D is frozen, and the SR network is trained with the pixelwise reconstruction loss L_P and our contrastive loss L_CL.
TABLE II
Quantitative Comparison With SoTA Methods on Benchmark Datasets (PSNR (dB)/SSIM). Results of Our PCL Are in Bold, and the Improvements Are in Green
The embedding network and the SR network are trained alternately. When training the discriminator, we fix the SR network, whose parameters are shared from the previous step. When training the SR network, the embedding network is frozen, and its parameters are the same as those of the discriminator from the previous step.

For training, we crop patches of size 48 × 48 from the LR images together with the corresponding HR patches. We augment the training data with random horizontal flips and 90° rotations. Our model is trained by the Adam optimizer [51] with β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸. The batch size is 16.
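A minimal sketch of this alternating schedule, reusing the helpers sketched above. Here fsr (the SR network), e_d (the embedding network), loader, and the feature helper embed are placeholders, and the learning rate, the L1 pixel loss, and the weight lambda_cl are our assumptions; only the Adam hyperparameters, patch size, and batch size come from the text.

```python
import torch
import torch.nn.functional as F

opt_d = torch.optim.Adam(e_d.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
opt_g = torch.optim.Adam(fsr.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
lambda_cl = 0.1  # assumed weight for the contrastive term

for lr_patch, hr_patch in loader:  # 48x48 LR crops with HR counterparts, batch 16
    # 1) update the embedding network with the SR network fixed
    with torch.no_grad():
        sr = fsr(lr_patch)
    opt_d.zero_grad()
    contrastive_disc_loss(e_d, hr_patch, sr).backward()
    opt_d.step()

    # 2) update the SR network with the embedding network frozen
    opt_g.zero_grad()
    sr = fsr(lr_patch)
    anchor = embed(e_d, sr)  # hypothetical helper: intermediate E_D features per image
    pos = torch.stack([embed(e_d, p) for p in positive_set(hr_patch)], dim=1)  # (B, P, D)
    neg = torch.stack([embed(e_d, n) for n in negative_set(hr_patch)], dim=1)  # (B, K, D)
    l_cl = torch.stack([
        supervised_contrastive_loss(anchor[i], pos[i], neg[i])
        for i in range(anchor.size(0))
    ]).mean()
    loss = F.l1_loss(sr, hr_patch) + lambda_cl * l_cl  # L_P (L1 assumed) + L_CL
    loss.backward()
    opt_g.step()
```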
IV. EXPERIMENTS

A. Experiment Setup

1) Datasets and Metrics: Following [32] and [35], we use the diverse 2K resolution image dataset (DIV2K) [52], which contains 800 images for training and 100 images for evaluation. The test datasets include Set5 [53], Set14 [54], B100 [55], Manga109 [56], and Urban100 [57] with the upscaling factor ×4. For comparison, we measure the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) [58] on the Y channel of the transformed YCbCr space.
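For reference, a common implementation of the Y-channel PSNR used by such benchmarks is sketched below; the ITU-R BT.601 luma coefficients are the conventional choice in SR evaluation scripts and are assumed here rather than quoted from the article.

```python
import numpy as np

def psnr_y(sr, hr):
    """PSNR between two uint8 RGB images (H, W, 3) on the Y channel of YCbCr."""
    def to_y(img):
        img = img.astype(np.float64)
        # ITU-R BT.601 luma transform, as used by common SR benchmark scripts
        return 16.0 + (65.481 * img[..., 0]
                       + 128.553 * img[..., 1]
                       + 24.966 * img[..., 2]) / 255.0
    mse = np.mean((to_y(sr) - to_y(hr)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```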
2) Comparison Methods: Our proposed contrastive learning framework for the SISR task is generic, and it can be applied to any existing method. We compare our method with the SoTA methods. To evaluate our proposed method, we apply our proposed PCL framework to the existing benchmark methods enhanced deep residual networks for single image super-resolution (EDSR)¹ [32], residual channel attention network (RCAN)² [35], holistic attention network (HAN)³ [59], and non-local sparse network (NLSN)⁴ [60]. In addition, we also add some representative methods for comparison: SRCNN [61], fast super-resolution convolutional neural network (FSRCNN) [62], very deep super resolution (VDSR) [63], Laplacian pyramid super resolution network (LapSRN) [64], MemNet [65], single convolutional super-resolution network for multiple degradations (SRMDNF) [66], deep back-projection networks (D-DBPN) [34], residual dense network (RDN) [33], super-resolution feedback network (SRFBN) [67], second-order attention network (SAN) [68], IGNN [38], and SwinIR [40]. We use PyTorch [69] to implement our proposed approach; our code and all trained models are available on our project page.

¹https://github.com/sanghyun-son/EDSR-PyTorch
²https://github.com/yulunzhang/RCAN
³https://github.com/wwlCape/HAN

B. Main Results

As summarized in Table II, we tabulate the quantitative results of the different methods. EDSR-S has 16 residual blocks and 64 channels, while EDSR-L is the large version with 32 residual blocks and 256 channels. Compared with the existing methods, one can find that all of our retrained models surpass the original results on all the benchmark datasets.
Fig. 2. Visual comparison between the results of benchmark methods and our retrained counterparts. PSNR and SSIM scores are presented.
Fig. 3. Visual comparison on images with fine details. Results obtained by our PCL (marked in bold) contain clearer and more accurate reconstructed textures with fewer artifacts.
TABLE III
Qualitative Comparison on Test Datasets
TABLE IV
Ablation Study Results. Here, Column L_CL Presents Types of Embedding Networks. VGG Means Adopting the Pretrained VGG Network, and D Presents Our Learnable Discriminator. H and F Are Short for Haar and Fourier Transformation, Respectively. H + D and F + D Present That the Embedding Network Takes the Frequency-Transformed Input
TABLE V
Results of EDSR-S. The PCL-F Presents That We Retrain the EDSR-S Backbone With a Fixed Embedding Network, Which Is Pretrained With the EDSR-L Backbone
1) Task-Generalized Embedding Network: In contrast to other methods that employ a prior VGG network to build their contrastive loss [13], [14], our embedding network is learned together with the SR network, like the discriminator in the GAN framework, to obtain a task-friendly embedding network. Here, we analyze different types of embedding network. Considering that the prior VGG network is pretrained in RGB space, for a fair comparison, we remove the wavelet transform and train our embedding network in RGB space as well, indicated as Config 2. Then, we adopt the prior VGG-19 network to calculate our contrastive loss in RGB space, denoted as Config 1. The last is our full pipeline, where the wavelet transform is applied to highlight the differences in frequency space, and the embedding network is learned on the corresponding frequency maps, noted as Config 3. Results are reported in Table IV. We can find that directly employing the prior VGG network obtains even worse performance on the benchmark datasets than the baseline RCAN. This is consistent with the observations in [13] and [14], which propose their own contrastive losses instead of the original normalized ℓ2-based contrastive loss, because the original formulation cannot work properly. Since exploring a proper formulation for the contrastive loss is beyond the scope of this article, we adopt the original contrastive loss and extend it to a pixelwise loss, as in [29]. Notably, our learned embedding network in Config 2 achieves comparable performance with our basic contrastive loss, which shows that our embedding network is learned and becomes more proper for the SISR task by distinguishing whether the input is blurred. Finally, our full pipeline, employing the Haar transform, obtains the best performance.

In addition, once our embedding network is trained, it generalizes to work with any SR backbone, just like the prior VGG. Here, we retrain the EDSR-S backbone with a fixed embedding network pretrained with the EDSR-L backbone, noted with the suffix -F. Notably, when we apply our pretrained embedding network, it is used in the same way as the pretrained VGG network, and the results are shown in Table V. Interestingly, the results of our PCL-F are slightly better. We think this is reasonable, because the pretrained embedding network based on EDSR-L contains implicit knowledge that helps the small backbone learn, and using a fixed embedding network avoids the instability of GAN-like training. Considering that there are several strong GAN-based real-world SR works, such as [70], whose trained discriminators contain rich priors about the backbone and degradation, we think it is meaningful to explore these task-friendly priors in pretrained discriminators in future work.

2) Sample Construction: As presented in Table IV, we compare the performance of Configs 3 and 4 with and without the proposed task-valid sample generation, demonstrating its effectiveness. In addition, we provide a decoupled analysis of the impact of positive and negative samples in Table VI. Primarily, it is notable that the incorporation of "hard"
TABLE VI
Ablation Studies About Data Augmentations on Benchmark Datasets (PSNR (dB)/SSIM/LPIPS)
REFERENCES

[4] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9726–9735.
[5] X. Chen, H. Fan, R. Girshick, and K. He, "Improved baselines with momentum contrastive learning," 2020, arXiv:2003.04297.
[6] X. Chen and K. He, "Exploring simple Siamese representation learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 15745–15753.
[7] P. Khosla et al., "Supervised contrastive learning," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 18661–18673.
[8] Y. Li, P. Hu, J. Z. Liu, D. Peng, J. T. Zhou, and X. Peng, "Contrastive clustering," in Proc. AAAI, 2021, pp. 8547–8555.
[9] G. Bukchin et al., "Fine-grained angular contrastive learning with coarse labels," in Proc. CVPR, Jun. 2021, pp. 8730–8740.
[10] Y. Tian, D. Krishnan, and P. Isola, "Contrastive representation distillation," in Proc. ICLR, 2020, pp. 1–11.
[11] Y. Tian et al., "What makes for good views for contrastive learning?" in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 6827–6839.
[12] Q. Cai, Y. Wang, Y. Pan, T. Yao, and T. Mei, "Joint contrastive learning with infinite possibilities," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 12638–12648.
[13] H. Wu et al., "Contrastive learning for compact single image dehazing," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 10546–10555.
[14] Y. Wang et al., "Towards compact single image super-resolution via contrastive self-distillation," in Proc. 30th Int. Joint Conf. Artif. Intell., Aug. 2021, pp. 1122–1128.
[15] J. Han et al., "Single underwater image restoration by contrastive learning," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2021, pp. 2385–2388.
[16] N. Dong, M. Maggioni, Y. Yang, E. Pérez-Pellitero, A. Leonardis, and S. McDonagh, "Residual contrastive learning for image reconstruction: Learning transferable representations from noisy images," 2021, arXiv:2106.10070.
[17] X. Chen et al., "Unpaired deep image deraining using dual contrastive learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 2007–2016.
[18] J. Zhang, S. Lu, F. Zhan, and Y. Yu, "Blind image super-resolution via contrastive representation learning," 2021, arXiv:2107.00708.
[19] L. Wang et al., "Unsupervised degradation representation learning for blind super-resolution," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 10576–10585.
[20] C.-Y. Chuang, J. Robinson, Y.-C. Lin, A. Torralba, and S. Jegelka, "Debiased contrastive learning," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 8765–8775.
[21] J. D. Robinson, C.-Y. Chuang, S. Sra, and S. Jegelka, "Contrastive learning with hard negative samples," in Proc. ICLR, 2021, pp. 1–10.
[22] R. Liu, Z. Jiang, S. Yang, and X. Fan, "Twin adversarial contrastive learning for underwater image enhancement and beyond," IEEE Trans. Image Process., vol. 31, pp. 4922–4936, 2022.
[23] A. van den Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," 2018, arXiv:1807.03748.
[24] R. D. Hjelm et al., "Learning deep representations by mutual information estimation and maximization," in Proc. ICLR, 2019, pp. 1–11.
[25] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in Proc. CVPR, 2006, pp. 1735–1742.
[26] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 815–823.
[27] K. Sohn, "Improved deep metric learning with multi-class N-pair loss objective," in Proc. NeurIPS, 2016, pp. 1849–1857.
[28] A. Sinha, K. Ayush, J. Song, B. Uzkent, H. Jin, and S. Ermon, "Negative data augmentation," in Proc. ICLR, 2021, pp. 1–17.
[29] T. Park, A. A. Efros, R. Zhang, and J.-Y. Zhu, "Contrastive learning for unpaired image-to-image translation," in Proc. ECCV, 2020, pp. 319–345.
[30] J. Han, M. Shoeiby, L. Petersson, and M. A. Armin, "Dual contrastive learning for unsupervised image-to-image translation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2021, pp. 746–755.
[31] M. Ko et al., "Self-supervised dense consistency regularization for image-to-image translation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 18280–18289.
[32] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, "Enhanced deep residual networks for single image super-resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 1132–1140.
[33] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, "Residual dense network for image super-resolution," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2472–2481.
[34] M. Haris, G. Shakhnarovich, and N. Ukita, "Deep back-projection networks for super-resolution," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1664–1673.
[35] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, "Image super-resolution using very deep residual channel attention networks," in Proc. ECCV, 2018, pp. 286–301.
[36] Y. Tang, W. Gong, X. Chen, and W. Li, "Deep inception-residual Laplacian pyramid networks for accurate single-image super-resolution," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 5, pp. 1514–1528, May 2020.
[37] J. Xin, J. Li, X. Jiang, N. Wang, H. Huang, and X. Gao, "Wavelet-based dual recursive network for image super-resolution," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 2, pp. 707–720, Feb. 2022.
[38] S. Zhou, J. Zhang, W. Zuo, and C. C. Loy, "Cross-scale internal graph neural network for image super-resolution," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 3499–3509.
[39] H. Chen et al., "Pre-trained image processing transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 12294–12305.
[40] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, "SwinIR: Image restoration using Swin transformer," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 1833–1844.
[41] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9992–10002.
[42] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in Proc. ECCV, 2016, pp. 694–711.
[43] C. Ledig et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 105–114.
[44] X. Wang et al., "ESRGAN: Enhanced super-resolution generative adversarial networks," in Proc. ECCVW, vol. 11133, 2018, pp. 63–79.
[45] W. Zhang, Y. Liu, C. Dong, and Y. Qiao, "RankSRGAN: Generative adversarial networks with ranker for image super-resolution," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3096–3105.
[46] Y. Jo, S. W. Oh, P. Vajda, and S. J. Kim, "Tackling the ill-posedness of super-resolution through adaptive target generation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 16231–16240.
[47] Q. Hu, X. Wang, W. Hu, and G. Qi, "AdCo: Adversarial contrast for efficient learning of unsupervised representations from self-trained negative adversaries," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 1074–1083.
[48] I. J. Goodfellow et al., "Generative adversarial nets," in Proc. NeurIPS, vol. 27, 2014, pp. 1–11.
[49] M. Fritsche, S. Gu, and R. Timofte, "Frequency separation for real-world super-resolution," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 3599–3608.
[50] N. Yu et al., "Dual contrastive loss and attention for GANs," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 6711–6722.
[51] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2015, pp. 1–12.
[52] E. Agustsson and R. Timofte, "NTIRE 2017 challenge on single image super-resolution: Dataset and study," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 1122–1131.
[53] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. A. Morel, "Low-complexity single-image super-resolution based on nonnegative neighbor embedding," in Proc. Brit. Mach. Vis. Conf., 2012, p. 135.
[54] R. Zeyde, M. Elad, and M. Protter, "On single image scale-up using sparse-representations," in Curves and Surfaces, J.-D. Boissonnat et al., Eds. Berlin, Germany: Springer, 2012, pp. 711–730.
[55] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. 8th IEEE Int. Conf. Comput. Vis. (ICCV), Jun. 2001, pp. 416–423.
[56] Y. Matsui et al., "Sketch-based manga retrieval using Manga109 dataset," Multimedia Tools Appl., vol. 76, pp. 21811–21838, Oct. 2017.
[57] J. Huang, A. Singh, and N. Ahuja, "Single image super-resolution from transformed self-exemplars," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 5197–5206.
[58] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[59] B. Niu et al., "Single image super-resolution via a holistic attention network," in Proc. ECCV, 2020, pp. 191–207.
[60] Y. Mei, Y. Fan, and Y. Zhou, "Image super-resolution with non-local sparse attention," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 3516–3525.
[61] C. Dong, C. C. Loy, K. He, and X. Tang, "Image super-resolution using deep convolutional networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295–307, Feb. 2016.
[62] C. Dong, C. C. Loy, and X. Tang, "Accelerating the super-resolution convolutional neural network," in Proc. ECCV, 2016, pp. 391–407.
[63] J. Kim, J. K. Lee, and K. M. Lee, "Accurate image super-resolution using very deep convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1646–1654.
[64] W. Lai, J. Huang, N. Ahuja, and M. Yang, "Deep Laplacian pyramid networks for fast and accurate super-resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5835–5843.
[65] Y. Tai, J. Yang, X. Liu, and C. Xu, "MemNet: A persistent memory network for image restoration," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4549–4557.
[66] K. Zhang, W. Zuo, and L. Zhang, "Learning a single convolutional super-resolution network for multiple degradations," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3262–3271.
[67] Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon, and W. Wu, "Feedback network for image super-resolution," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3862–3871.
[68] T. Dai, J. Cai, Y. Zhang, S. Xia, and L. Zhang, "Second-order attention network for single image super-resolution," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 11057–11066.
[69] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. NeurIPS, vol. 32, 2019, pp. 1–11.
[70] X. Wang, L. Xie, C. Dong, and Y. Shan, "Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 1905–1914.

Gang Wu received the B.E. degree from the School of Computer Science and Technology, Soochow University, Jiangsu, China, in 2020. He is currently pursuing the Ph.D. degree with the Faculty of Computing, Harbin Institute of Technology, Harbin, China.
His research interests include image restoration, representation learning, and self-supervised learning.

Junjun Jiang (Senior Member, IEEE) received the B.S. degree in mathematics from Huaqiao University, Quanzhou, China, in 2009, and the Ph.D. degree in computer science from Wuhan University, Wuhan, China, in 2014.
From 2015 to 2018, he was an Associate Professor with the School of Computer Science, China University of Geosciences, Wuhan. From 2016 to 2018, he was a Project Researcher with the National Institute of Informatics (NII), Tokyo, Japan. He is currently a Professor with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China. His research interests include image processing and computer vision.
Dr. Jiang received the Best Student Paper Runner-Up Award at The International Multimedia Modeling Conference (MMM) 2017, the Finalist of the World's FIRST 10k Best Paper Award at International Conference on Multimedia and Expo (ICME) 2017, and the Best Paper Award at International Forum of Digital TV and Multimedia Communication (IFTC) 2018. He received the 2016 China Computer Federation (CCF) Outstanding Doctoral Dissertation Award and the 2015 Association for Computing Machinery (ACM) Wuhan Doctoral Dissertation Award. He is an Associate Editor of Remote Sensing.

Xianming Liu (Member, IEEE) received the B.S., M.S., and Ph.D. degrees from the Harbin Institute of Technology (HIT), Harbin, China, in 2006, 2008, and 2012, respectively, all in computer science.
In 2011, he was a Visiting Student with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON, Canada, where he was a Post-Doctoral Fellow from 2012 to 2013. He was a Project Researcher with the National Institute of Informatics (NII), Tokyo, Japan, from 2014 to 2017. He is currently a Professor with the School of Computer Science and Technology, HIT.
Dr. Liu was a recipient of the IEEE International Conference on Multimedia and Expo (ICME) 2016 Best Student Paper Award.