A Fully Progressive Approach to Single Image Super-Resolution
Figure 1: Examples of our 4× and 8× upsampling results. Our model without GAN sets a new state-of-the-art benchmark in terms of PSNR/SSIM; our GAN-extended model yields high perceptual quality and is able to hallucinate plausible details up to an 8× upsampling ratio.
[Figure 2 diagram: the input x is mapped by a scale-specific layer vs into feature space, passes through the pyramid levels u0 (2×), u1 (4×), and u2 (8×), and scale-specific reconstruction layers rs produce the output at each scale.]

Figure 2: Asymmetric pyramidal architecture. More DCUs are allocated in the lower pyramid levels to improve the reconstruction accuracy and to reduce memory consumption.
of the pyramid, we propose dense compression units, which are adapted from dense blocks [16] to suit super-resolution. Compared to existing progressive SISR models [21, 22], we improve the reconstruction accuracy by simplifying the information propagation within the network; furthermore, we propose an asymmetric pyramidal structure with more layers in the lower levels to enable high upsampling ratios while remaining efficient. To obtain more photorealistic results, we adopt the GAN framework [14] and design a discriminator that matches the progressive nature of our generator network by operating on the residual outputs of each scale. This paired progressive design allows us to obtain a multi-scale generator with a unified discriminator in a single training run.

In this framework, we can naturally utilize a form of curriculum learning, which is known to improve training [4] by organizing the learning process from easy (small upsampling factors) to hard (large upsampling factors). Compared to common multi-scale training, the proposed training strategy not only improves results for all upsampling factors, but also significantly shortens the total training time and stabilizes the GAN training.

We evaluate our progressive multi-scale approach against the state of the art on a variety of datasets, where we demonstrate improved performance in terms of traditional error measures (e.g., PSNR) as well as perceptual quality, particularly for larger upsampling ratios.

2. Related Work

Single image super-resolution (SISR) techniques have been an active area of investigation for more than a decade [12]. The ill-posed nature of this problem has typically been tackled using statistical techniques: most notably image priors such as heavy-tailed gradient distributions [10, 29], gradient profiles [32], multi-scale recurrence [13], self-examples [11], and total variation [26]. In contrast, exemplar-based approaches such as nearest-neighbor [12] and sparse dictionary learning [36, 38, 40] have exploited the inherent redundancy of large-scale image datasets. Recently, Dong et al. [6] showed the superiority of a simple three-layer convolutional network (CNN) over sparse coding techniques. Since then, deep convolutional architectures have consistently pushed the state of the art forward.

Direct vs. Progressive Reconstruction. Direct reconstruction techniques [7, 20, 23, 24, 33, 37] upscale the image to the desired spatial resolution in a single step. Early approaches [7, 20, 33] upscale the LR image in a preprocessing step, so the CNN effectively learns to deblur its input. However, this requires the network to learn a feature representation at high resolution, which is computationally expensive [30]. To overcome this limitation, many approaches operate on the low-dimensional features and perform the upsampling at the end of the network via sub-pixel convolution [30] or transposed convolution.

A popular progressive reconstruction approach is LapSRN by Lai et al. [21]. In their work, the upsampling follows the principle of Laplacian pyramids, i.e., each level learns to predict a residual that explains the difference between a simple upscale of the previous level's output and the desired result. Since the loss functions are computed at each scale, this provides a form of intermediate supervision. Lai et al. later improved their method with a deeper and wider recursive architecture and multi-scale training [22]. While [22] improved the accuracy, a considerable gap to the top-performing approach in terms of PSNR [24] remains. In particular, as we show in Section 4.2,
the Laplacian pyramidal structure aggravates the optimization difficulty. Furthermore, the recursive pyramids result in quadratic growth of computation in the higher pyramid levels, becoming the bottleneck for reducing runtime and expanding the network's capability. Lastly, in addition to a progressive generator, we also propose a progressive discriminator along with a progressive training strategy.

Figure 3: Schematic illustration of the blending procedure in curriculum training for the generator (top) and the discriminator (bottom). vs and rs denote the scale-specific input and reconstruction layers, and us denotes the pyramid of scale s. α varies from 0 to 1 during blending to control the impact of the new pyramid.

Perceptual Loss Functions. The aforementioned techniques optimize the reconstruction error by minimizing the ℓ1-norm and descendants such as the Charbonnier penalty function [21]. Although these approaches yield small reconstruction errors, they are unable to hallucinate perceptually plausible high-frequency details. To this end, Ledig et al. [23] proposed a perceptual loss function consisting of a content loss that captures perceptual similarities and an adversary to steer the reconstruction closer to the latent manifold of natural solutions. Based on this, Sajjadi et al. [28]
apply an additional texture loss to encourage similarity with the original image. In contrast to these works, we design a discriminator that operates on the residual outputs of each scale and train progressively with a strategy based on curriculum learning. With this, our GAN model is able to upsample perceptually pleasing SR images for multiple scales up to 8×.

3. Progressive Multi-scale Super-resolution

Given a set of n LR input images with corresponding HR target images {(x1, y1), . . . , (xn, yn)}, we consider the problem of estimating an upscaling function u : X → Y, where X and Y denote the spaces of LR and HR images, respectively. Finding a suitable parameterisation of the upscaling function u for large upsampling ratios is challenging: the larger the ratio, the more complex the required function class.

To this end, we propose a progressive solution to learn the upscaling function u. In the following, we present our pyramidal architecture, ProSR, for multi-scale super-resolution in Sections 3.1 and 3.2. In Section 3.3 we propose ProGanSR, a progressive multi-scale GAN for perceptual enhancement. Finally, we discuss a curriculum learning scheme in Section 3.4.

3.1. Pyramidal Decomposition

We propose a pyramidal decomposition of u into a series of simpler functions u0, . . . , us. Each function, or level, is tasked with refining the feature representation and performing a 2× upsampling of its own input. Each level of the pyramid consists of a cascade of dense compression units (DCUs) followed by a sub-pixel convolution layer. We assign more DCUs to the lower pyramid levels, resulting in an asymmetric structure. Having more computation power in the lower pyramid not only reduces the memory consumption but also increases the receptive field with respect to the original image; hence it outperforms the symmetric variant in terms of reconstruction quality and runtime. While the decomposition of u is shared among the pyramid levels, we also use two scale-specific sub-networks, denoted by vs and rs, which allow for an individual transformation between the scale-varying image space and a normalized feature space. A schematic illustration of our progressive upsampling architecture is shown in Figure 2.

To simplify learning, the network is designed to output the residual

R_s(x) = (r_s ∘ u_s ∘ · · · ∘ u_0 ∘ v_s)(x)    (1)

w.r.t. a fixed upsampling ϕ_s(x) of the input through, e.g., bicubic interpolation. Thus, for a given scaling factor s, the estimated HR image can be computed as

ŷ = R_s(x) + ϕ_s(x).    (2)

Notably, our network does not follow the Laplacian pyramid principle as in [21, 22], i.e., the intermediate sub-net outputs are neither supervised nor used as base images in the subsequent level. This design performs favorably over the Laplacian alternative, as it simplifies the backward pass and thus reduces the optimization difficulty. Additionally, we do not downsample the ground truth to create labels, which is done for the intermediate supervision in [21, 22]. This avoids artefacts that may result from subsampling.
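To make Eqs. (1) and (2) concrete, the sketch below assembles a forward pass for a target scale s in PyTorch. The module and attribute names (ProgressivePyramid, levels, to_feat, to_img) are our own illustrative choices, not the actual ProSR code; the real layer definitions are in the supplemental material.

```python
import torch.nn as nn
import torch.nn.functional as F

class ProgressivePyramid(nn.Module):
    """Illustrative sketch of Eqs. (1) and (2), not the official ProSR code."""

    def __init__(self, levels, to_feat, to_img):
        super().__init__()
        self.levels = nn.ModuleList(levels)    # u_0 .. u_S, each upsamples 2x
        self.to_feat = nn.ModuleList(to_feat)  # scale-specific v_s
        self.to_img = nn.ModuleList(to_img)    # scale-specific r_s

    def forward(self, x, s):
        # R_s(x) = (r_s o u_s o ... o u_0 o v_s)(x), Eq. (1)
        h = self.to_feat[s](x)
        for level in self.levels[: s + 1]:
            h = level(h)
        residual = self.to_img[s](h)
        # y_hat = R_s(x) + phi_s(x), Eq. (2), with bicubic upsampling as phi_s
        base = F.interpolate(x, scale_factor=2.0 ** (s + 1),
                             mode="bicubic", align_corners=False)
        return base + residual
```

Level index 0 corresponds to the 2× output, so s + 1 levels together produce a 2^(s+1)× upscale, matching the u0 (2×), u1 (4×), u2 (8×) layout of Figure 2.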
3.2. Dense Compression Units

We base the construction of each pyramid level on the recently proposed DenseNet architecture [16]. Similarly to skip connections [15], dense connections improve gradient flow, alleviating vanishing and shattered gradients [3].

The core component in each level of the pyramid is a dense compression unit (DCU), which consists of a modified densely connected block followed by a 1 × 1 convolution CONV(1,1).

The original dense layer is composed of BN-RELU-CONV(1,1)-BN-RELU-CONV(3,3). Following recent practice in super-resolution [9, 24, 39], we remove all batch normalizations. However, since the features from previous layers may have varying scales, we also remove the first ReLU so that CONV(1,1) can rescale the features. This leads to a modified dense layer composition: CONV(1,1)-RELU-CONV(3,3).

Contrary to DenseNet, we break the dense connection at the end of each DCU with a CONV(1,1) compression layer, which re-assembles information efficiently and leads to a slight performance gain in spite of the broken dense connection. For very deep models we apply pyramid-wise as well as local residual links to improve gradient propagation, as shown in Figure 2.
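As an illustration of the modified dense layer and the DCU boundary, consider the following PyTorch sketch. The bottleneck width (4× the growth rate) follows the DenseNet convention and, like the channel counts, is an assumption rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Modified dense layer: Conv(1,1)-ReLU-Conv(3,3), no batch norm."""

    def __init__(self, in_ch, growth_rate):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 4 * growth_rate, 1),  # rescales concatenated features
            nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth_rate, growth_rate, 3, padding=1),
        )

    def forward(self, x):
        # Dense connectivity: concatenate the new features to the input.
        return torch.cat([x, self.body(x)], dim=1)

class DCU(nn.Module):
    """Dense compression unit: dense block plus a Conv(1,1) compression
    layer that breaks the dense connection at the unit boundary."""

    def __init__(self, channels, growth_rate=12, n_layers=6):
        super().__init__()
        layers, ch = [], channels
        for _ in range(n_layers):
            layers.append(DenseLayer(ch, growth_rate))
            ch += growth_rate
        self.block = nn.Sequential(*layers)
        self.compress = nn.Conv2d(ch, channels, 1)  # back to `channels`

    def forward(self, x):
        return self.compress(self.block(x))
```

The compression layer keeps the channel count constant across DCUs, which is what allows the asymmetric stacking of many units in the lowest pyramid level without the quadratic channel growth of an uninterrupted dense block.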
3.3. Progressive GAN

Generative adversarial networks (GANs) [14] have emerged as a powerful method to enhance the perceptual quality of upsampled images in SISR [14, 23, 28]. However, training GANs is notoriously difficult, and success in applying GANs to SISR has been limited to single-scale upsampling at relatively low target resolutions. In order to enable multi-scale GAN-enhanced SISR, we propose a modular and progressive discriminator network similar to the generator network proposed in the previous section. As illustrated in the bottom of Figure 3, the architecture has a reverse pyramid structure {u2, u1, u0}, where each level gradually reduces the spatial dimension of the input image with AVGPOOLING. Similar to the generator, scale-specific image transformation layers vs are applied before each pyramid. To accommodate the multi-scale outputs from the generator, the network is fully convolutional and outputs a small patch of features similar to PatchGAN [18]. The complete specs of the discriminator can be found in the supplemental material.

Similar to the generator network, the discriminator operates on the residual between the original and the bicubic upsampled image. This allows both generator and discriminator to concentrate only on the important sources of variation which are not already well captured by the standard upsampling operation. Since these regions are challenging to upsample well, they correspond to the largest perceptual errors. This can also be viewed as subtracting a data-dependent baseline from the discriminator, which helps to reduce variance.

As the training objective, we use the more stable least squares loss instead of the original cross-entropy loss [25]. Denoting the predicted and real residuals as r̂ and r, the discriminator loss and generator loss for a training example i of scale s can be expressed as

L^i_{D_s} = (D(r̂_i^s))^2 + (D(r_i^s) − 1)^2    (3)

L^i_{R_s} = (D(r̂_i^s) − 1)^2 + Σ_{k∈{2,4}} ‖Φ_k(ŷ_i) − Φ_k(y_i)‖_2    (4)

where Φ_k denotes the input to the k-th pooling layer in VGG16 [31].
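A minimal sketch of the least-squares objectives in Eqs. (3) and (4), assuming a patch discriminator D that scores residual images and a hypothetical helper vgg_feats that returns the inputs to the 2nd and 4th pooling layers of a fixed VGG16:

```python
import torch

def discriminator_loss(D, fake_res, real_res):
    # Eq. (3): least-squares GAN objective for the discriminator;
    # patch scores are averaged (an assumption for the PatchGAN output).
    return (D(fake_res) ** 2).mean() + ((D(real_res) - 1) ** 2).mean()

def generator_loss(D, fake_res, sr, hr, vgg_feats):
    # Eq. (4): adversarial term plus VGG feature distances (k = 2, 4).
    adv = ((D(fake_res) - 1) ** 2).mean()
    content = sum(torch.norm(fs - fh, p=2)
                  for fs, fh in zip(vgg_feats(sr), vgg_feats(hr)))
    return adv + content
```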
3.4. Curriculum Learning

Curriculum learning [4] is a strategy to improve training by gradually increasing the difficulty of the learning task. It is often used in sequence prediction tasks and sequential decision making problems, where it can yield large speedups in training time and improvements in generalisation performance.

The pyramidal decomposition of u allows us to apply curriculum learning in a natural way. The loss for a training example (x_i^s, y_i) of scale s can be defined as

L^i_{R_s} = ‖R_s(x_i^s) + ϕ_s(x_i^s) − y_i‖_1    (5)

where x_i^s corresponds to the s× downsampled version of y_i. Then the goal at scale s is to find

θ̂_s = argmin_{θ_s} Σ_{s′≤s} Σ_i L^i_{R_{s′}},    (6)

where θ_s parameterises all functions in and below the current scale (u0, v0, r0, . . . , us, vs, rs) according to our pyramidal network shown in Figure 2. Our training curriculum starts by training only the 2× portion of the network. When we proceed to a new phase in the curriculum (e.g. to 4×), a new level of the pyramid is gradually blended in to reduce its impact on the previously trained layers. As Figure 3 shows, for the generator the predicted residual r̂s at scale s is a linear combination of the outputs from levels s and s−1, while analogously for the discriminator, the output features from the new pyramid are combined with the output of the scale-specific input layer of the previous level, vs−1, before entering the already trained pyramids {us−1, . . . , u0}. Bilinear interpolation and AVGPOOL are used to match the spatial dimensions before merging. In both cases, α controls the influence of the new pyramid and varies from 0 to 1 during the blending procedure. As a result, we incrementally add training pairs of the next scale. While a similar idea was proposed in [19] to improve high-resolution image generation, we use this strategy in the context of multi-scale training. Finally, to assemble the batches, we randomly select one of the scales s, to avoid mixing batch statistics as suggested in [2].
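The generator-side blending of Figure 3 amounts to a linear interpolation between the new level's residual and the bilinearly upsampled residual of the previous level; a sketch under our own naming assumptions, with a linear α ramp (the paper does not specify the exact schedule):

```python
import torch.nn.functional as F

def blended_residual(r_new, r_prev, alpha):
    """Blend in a newly added pyramid level (Figure 3, top).

    r_new:  residual predicted by the new level s
    r_prev: residual from level s-1 (half the spatial size)
    alpha:  blending weight, ramped from 0 to 1
    """
    r_prev_up = F.interpolate(r_prev, scale_factor=2,
                              mode="bilinear", align_corners=False)
    return alpha * r_new + (1 - alpha) * r_prev_up

def alpha_schedule(step, blend_steps):
    # Linear ramp over the blend-in phase (assumed schedule).
    return min(1.0, step / blend_steps)
```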
Compared to simple multi-scale training, where training examples from different scales are simultaneously fed to the network, such a progressive training strategy greatly shortens the total training time. Furthermore, it yields an additional performance gain for all included scales compared to single-scale and simple multi-scale training, and alleviates instabilities in GAN training.

4. Evaluation

Before we compare with popular state-of-the-art approaches, we first discuss the benefits of each of our proposed components using a small 24-layer model.

All presented models are trained with the DIV2K [34] training set, which contains 800 high-resolution images. The training details are listed in the supplemental material. For evaluation, the benchmark datasets Set5 [5], Set14 [41], BSD100 [1], Urban100 [17], and the DIV2K validation set [34] are used. As is commonly done in SISR, all evaluations are conducted on the luminance channel.
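For reference, evaluation on the luminance channel typically means converting RGB to the Y channel of YCbCr (ITU-R BT.601 weights) before computing the PSNR; a small sketch for images with values in [0, 255]:

```python
import numpy as np

def rgb_to_y(img):
    """BT.601 luma from an RGB image with values in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr):
    # Evaluate on the luminance channel only, as is common in SISR.
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```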
4.1. Ablation study

Ablation Study | Method | PSNR | Parameters | Runtime
Baseline | single dense block, dense layer BRCBRC (BN-ReLU-Conv ×2), single scale | 28.30 | 8.22M | 0.19s
Block Division | 4 DCUs | 28.32 | 1.79M | 0.11s
Architecture | asymmetric pyramid | 28.41 | 1.89M | 0.11s
Training | curriculum learning | 28.45 | 1.89M | 0.11s
Very Deep Model | increased network width and depth, longer training | 28.94 | 13.4M | 0.27s

Table 1: Overview of experiments in the ablation study. The introduction of DCUs, block division, an asymmetric pyramid layout, and curriculum learning consistently increases reconstruction quality. Reported PSNR values refer to 4× results on Set14. The runtime is measured for 4× upscaling of a 128 × 128 image.

Table 1 summarizes the consistent increase in reconstruction quality stemming from each proposed component. As a baseline, we start from a single dense block with two sub-pixel upsampling layers at the end and a residual connection from the LR input to the final output. In the following, we add the proposed components one at a time.

Asymmetric Pyramid. In this section we show the advantage of the proposed asymmetric pyramidal architecture. We compare the following constellations while keeping the total number of DCUs constant:

Direct: D−D−D−D−S−S
Asymmetric Pyramid: D−D−D−S−D−S

Here, D denotes a dense compression unit with 6 dense layers and S denotes the sub-pixel upsampler. As Table 1 shows, the asymmetric pyramidal architecture considerably improves the reconstruction accuracy compared to direct upsampling. This demonstrates the advantage of utilizing high-dimensional features directly. Furthermore, by assigning more computation to the lower pyramid, the penalty in memory and computation consumption compared to the direct upsampling approach is significantly reduced. As shown in Table 1, for the small model the asymmetric pyramid achieves the same runtime as direct upsampling.

Curriculum Learning. We extend the 4-DCU asymmetric pyramid model to 8× upsampling to quantify the benefit of curriculum learning over simultaneous multi-scale training. As Table 2 shows, simultaneous training typically has a small or even negative impact on the lowest scale (2×), which is also evident for VDSR [20] (see Table 2). On the other hand, curriculum learning always improves the reconstruction quality and outperforms simultaneous training by an average of 0.04dB.

Furthermore, curriculum learning considerably shortens the training time. As Figure 4 shows, the network reaches the same evaluation quality faster than simultaneous training, since the 2× subnet requires less computation and hence less time for each update.

4.2. Comparison with other progressive architectures

In contrast to our approach, existing progressive methods [21, 22] typically rely on deep supervision. They impose a loss on all scales, which can be denoted as

L^i_s = Σ_{s′<s} ℓ_1(ψ_{s′}(y_i), ŷ_i^{s′}) + ℓ_1(y_i^s, ŷ_i^s),    (7)

where ψ_{s′} downsamples the ground truth y_i to scale s′.
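A sketch of this deeply supervised objective, with a hypothetical helper downsample(img, k) that reduces the resolution by a factor of 2^k, standing in for ψ:

```python
def deep_supervised_loss(preds, hr, downsample, s):
    # Eq. (7), as in [21, 22]: the prediction at every intermediate scale
    # s' < s is compared against a downsampled version of the ground truth.
    loss = (preds[s] - hr).abs().mean()            # l1 at the target scale
    for sp in range(s):
        label = downsample(hr, s - sp)             # psi_{s'}(y)
        loss = loss + (preds[sp] - label).abs().mean()
    return loss
```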
Table 2: Gain of simultaneous training and curriculum learning w.r.t. single-scale training on all datasets. The average is computed accounting for the number of images in each dataset. Curriculum learning improves the training for all scales, while simultaneous training hampers the training of the lowest scale.
Figure 4: Training time comparison between curriculum learning and multiscale simultaneous learning. We train the multiscale model and plot the PSNR
evaluation of the individual scales. The elapsed epoch is encoded as the line color. Because curriculum learning activates the smaller subnets first, it requires
much less time to reach the same evaluation quality.
model | B100 2× | B100 4× | B100 8× | Set14 2× | Set14 4× | Set14 8×
single scale, ours | – | 27.44 | – | – | 28.41 | –
single scale, alt | – | 27.32 | – | – | 28.20 | –
multi scale, ours | 31.95 | 27.47 | 24.75 | 33.24 | 28.45 | 24.86
multi scale, alt | 31.92 | 27.38 | 24.70 | 33.22 | 28.28 | 24.76

Table 3: Comparison with other progressive approaches.

We also evaluate this alternative progressive architecture but observed a large decrease in PSNR, as shown in Table 3. Therefore, we conclude that it is less stable to use varying sub-scale upsampling results as base images compared to fixed interpolated results, and that using a downsampling kernel to create the HR label images could introduce undesired artefacts.

4.3. Comparison with State-of-the-art Approaches

In this section, we provide an extensive quantitative and qualitative comparison with other state-of-the-art approaches.

Quantitative Comparison. For a quantitative comparison, we benchmark against VDSR [20], DRRN [33], LapSRN [21], MsLapSRN [22], and EDSR [24]. We obtained models from Lai et al. [22] for the 8× versions of VDSR and DRRN, which have been retrained with 8× data. To produce 8× EDSR results, we extend their 4× model by adding another sub-pixel convolution layer. For training, we follow their practice, which means we initialize the weights of the 8× model from the pretrained 4× model.

Due to the discrepancy in model size among existing approaches, we divide them into two classes based on whether they have more or less than 5 million parameters. Accordingly, we provide two models of different sizes, denoted ProSRs and ProSRℓ, to compete in both classes. ProSRs has 56 dense layers in total with growth rate k = 12 and a total of 3.1M parameters. ProSRℓ has 104 dense layers with growth rate k = 40 and 15.5M parameters, which is roughly a third of the parameters of EDSR.

Table 4 summarizes the quantitative comparison with other state-of-the-art approaches in terms of PSNR. An extended list that includes SSIM scores can be found in the supplemental material. As Table 4 shows, ProSRs achieves the lowest error on most datasets. The very deep model, ProSRℓ, shows a consistent advantage at higher upsampling ratios and is comparable with EDSR at 2×. In general, our progressive design widens the margin in PSNR between our results and the state of the art as the upsampling ratio increases.

Qualitative comparison. First, we qualitatively compare our method without GAN to other methods that also minimise the ℓ1 loss or related norms. Figure 7 shows results of
our method and the most recent state-of-the-art approaches at 4× and 8×.

Concerning our perceptually-driven model with GAN, we compare with SRGAN [23] and EnhanceNet [28]. As Figure 5 shows, the hallucinated details align well with fine structures in the ground truth, even though we do not use an explicit texture matching loss as EnhanceNet [28] does. While SRGAN and EnhanceNet can only upscale 4×, our method extends to 8×; results are shown in Figure 6. We provide an extended qualitative comparison in the supplemental material.

Figure 5: Comparison of 4× GAN results (best viewed when zoomed in). Our approach is less prone to artefacts and aligns well with the original image.
5. Runtime

The asymmetric pyramid architecture contributes to a faster runtime compared to other approaches with similar reconstruction accuracy. In our test environment with an NVIDIA TITAN Xp and cuDNN 6.0, ProSRℓ takes on average 0.8s, 2.1s, and 4.4s to upsample a 520 × 520 image by 2×, 4×, and 8×, respectively. In the NTIRE challenge, we reported the runtime including the geometric ensemble, which requires 8 forward passes, one for each transformed version of the input image. Nonetheless, our runtime is still 5 times faster than that of the top-ranking team.
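The geometric ensemble of [24] averages the network outputs over the eight dihedral transforms of the input (four rotations, each with and without a horizontal flip), undoing each transform on the corresponding output; this accounts for the 8 forward passes. A sketch:

```python
import torch

def geometric_ensemble(model, x):
    """Average predictions over the 8 flip/rotation variants of x (NCHW)."""
    outputs = []
    for flip in (False, True):
        xf = torch.flip(x, dims=[-1]) if flip else x
        for k in range(4):                            # rotate by k * 90 deg
            y = model(torch.rot90(xf, k, dims=[-2, -1]))
            y = torch.rot90(y, -k, dims=[-2, -1])     # undo the rotation
            if flip:
                y = torch.flip(y, dims=[-1])          # undo the flip
            outputs.append(y)
    return torch.stack(outputs).mean(dim=0)
```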
6. NTIRE Challenge

The "New Trends in Image Restoration and Enhancement" (NTIRE) 2018 super-resolution challenge [35] aims at benchmarking SISR methods in challenging scenarios. In particular, one of the challenge tracks targets 8× upscaling, where the low-resolution images are generated with a known downsampling kernel (bicubic). We participated in this track with the ProSRℓ network. In addition to the method described above, we utilised the geometric ensemble used in [24], which yielded a 0.07dB PSNR gain on the validation set. Our model ranks 2nd in terms of SSIM and 4th in terms of PSNR. Compared to the top-ranking team, our model is marginally lower by 0.002 and 0.04dB in SSIM and PSNR respectively, but runs 5 times as fast at test time.

Other tracks in the challenge target 4× upscaling but consider unknown degradations. Given that this task differs from the bicubic 8× setting, the participating teams and the rankings differ. Without specific adaptation for this scenario, we also participated in these tracks for completeness and ranked in the mid-range (7th/9th/7th). We believe further improvement can be achieved with targeted preprocessing and extended training data.

7. Conclusion

In this work we propose a progressive approach to address SISR. We leverage an asymmetric pyramid design and Dense Compression Units in the architecture, both of which lead to improved memory efficiency and reconstruction accuracy. A matching pyramidal discriminator is proposed, which enables optimizing for perceptual quality simultaneously at multiple scales. Furthermore, we leverage a form of curriculum learning which not only increases the performance for all scales but also reduces the total training time. Our models set a new state-of-the-art benchmark in both traditional error measures and perceptual quality.
PSNR | 2× (S14 / B100 / U100 / DIV2K) | 4× (S14 / B100 / U100 / DIV2K) | 8× (S14 / B100 / U100 / DIV2K)

# params < 5M
VDSR | 33.05 / 31.90 / 30.77 / 35.26 | 28.02 / 27.29 / 25.18 / 29.72 | 24.26 / 24.49 / 21.70 / 26.22
DRRN | 33.23 / 32.05 / 31.23 / 35.49 | 28.21 / 27.38 / 25.44 / 29.95 | 24.42 / 24.59 / 21.88 / 26.37
LapSRN | 33.08 / 31.80 / 30.41 / 35.63 | 28.19 / 27.32 / 25.21 / 29.88 | 24.35 / 24.54 / 21.81 / 26.40
MsLapSRN | 33.28 / 32.05 / 31.15 / 35.62 | 28.26 / 27.43 / 25.51 / 30.39 | 24.57 / 24.65 / 22.06 / 26.52
SRDenseNet | – / – / – / – | 28.50 / 27.53 / 26.05 / – | – / – / – / –
ProSRs (ours) | 33.36 / 32.02 / 31.42 / 35.80 | 28.59 / 27.58 / 26.01 / 30.39 | 24.93 / 24.80 / 22.43 / 26.88

# params > 5M
EDSR | 33.92 / 32.32 / 32.93 / 36.47 | 28.80 / 27.71 / 26.64 / 30.71 | 24.96 / 24.83 / 22.53 / 26.96
ProSRℓ (ours) | 34.00 / 32.34 / 32.91 / 36.44 | 28.94 / 27.79 / 26.89 / 30.81 | 25.29 / 24.99 / 23.04 / 27.36

Table 4: Comparison with state-of-the-art approaches. For clarity, we highlight the best approach in blue.