A Fully Progressive Approach to Single Image Super-Resolution
Figure 1: Examples of our 4× and 8× upsampling results. Our model without GAN sets a new state-of-the-art benchmark in terms of PSNR/SSIM; our GAN-extended model yields high perceptual quality and is able to hallucinate plausible details up to an 8× upsampling ratio.
[Figure 2 diagram: the input x is mapped by a scale-specific layer vs into feature space, passes through the pyramid levels u0 (2×), u1 (4×), and u2 (8×), and scale-specific reconstruction layers rs produce the output at each scale.]

Figure 2: Asymmetric pyramidal architecture. More DCUs are allocated in the lower pyramid levels to improve the reconstruction accuracy and to reduce memory consumption.
of the pyramid, we propose dense compression units, which are adapted from dense blocks [16] to suit super-resolution. Compared to existing progressive SISR models [21, 22], we improve the reconstruction accuracy by simplifying the information propagation within the network; furthermore, we propose an asymmetric pyramidal structure with more layers in the lower levels to enable high upsampling ratios while remaining efficient. To obtain more photorealistic results, we adopt the GAN framework [14] and design a discriminator that matches the progressive nature of our generator network by operating on the residual outputs of each scale. This paired progressive design allows us to obtain a multi-scale generator with a unified discriminator in a single training run.

In this framework, we can naturally utilize a form of curriculum learning, which is known to improve training [4] by organizing the learning process from easy (small upsampling factors) to hard (large upsampling factors). Compared to common multi-scale training, the proposed training strategy not only improves results for all upsampling factors, but also significantly shortens the total training time and stabilizes the GAN training.

We evaluate our progressive multi-scale approach against the state of the art on a variety of datasets, where we demonstrate improved performance in terms of traditional error measures (e.g., PSNR) as well as perceptual quality, particularly for larger upsampling ratios.

2. Related Work

Single image super-resolution (SISR) techniques have been an active area of investigation for more than a decade [12]. The ill-posed nature of this problem has typically been tackled using statistical techniques: most notably image priors such as heavy-tailed gradient distributions [10, 29], gradient profiles [32], multi-scale recurrence [13], self-examples [11], and total variation [26]. In contrast, exemplar-based approaches such as nearest-neighbor [12] and sparse dictionary learning [36, 38, 40] have exploited the inherent redundancy of large-scale image datasets. Recently, Dong et al. [6] showed the superiority of a simple three-layer convolutional network (CNN) over sparse coding techniques. Since then, deep convolutional architectures have consistently pushed the state of the art forward.

Direct vs. Progressive Reconstruction. Direct reconstruction techniques [7, 20, 23, 24, 33, 37] upscale the image to the desired spatial resolution in a single step. Early approaches [7, 20, 33] upscale the LR image in a preprocessing step, so the CNN effectively learns to deblur its input. However, this requires the network to learn a feature representation at high resolution, which is computationally expensive [30]. To overcome this limitation, many approaches operate on the low-dimensional features and perform the upsampling at the end of the network via sub-pixel convolution [30] or transposed convolution.

A popular progressive reconstruction approach is LapSRN by Lai et al. [21]. In their work, the upsampling follows the principle of Laplacian pyramids, i.e., each level learns to predict a residual that explains the difference between a simple upscale of the previous level's output and the desired result. Since the loss functions are computed at each scale, this provides a form of intermediate supervision. Lai et al. later improved their method with a deeper and wider recursive architecture and multi-scale training [22]. While [22] improved the accuracy, a considerable gap to the top-performing approach in terms of PSNR [24] remains. In particular, as we show in Section 4.2,
the Laplacian pyramidal structure aggravates the optimization difficulty. Furthermore, the recursive pyramids result in quadratic growth of computation in the higher pyramid levels, becoming the bottleneck for reducing runtime and expanding the network's capability. Lastly, in addition to a progressive generator, we also propose a progressive discriminator along with a progressive training strategy.

Figure 3: Schematic illustration of the blending procedure in curriculum training for the generator (top) and the discriminator (bottom). vs and rs denote the scale-specific input and reconstruction layers, and us denotes the pyramid of scale s. α varies from 0 to 1 during blending to control the impact of the new pyramid.

Perceptual Loss Functions. The aforementioned techniques optimize the reconstruction error by minimizing the ℓ1-norm and descendants such as the Charbonnier penalty function [21]. Although these approaches yield small reconstruction errors, they are unable to hallucinate perceptually plausible high-frequency details. To this end, Ledig et al. [23] proposed a perceptual loss function consisting of a content loss that captures perceptual similarities and an adversary to steer the reconstruction closer to the latent manifold of natural solutions. Based on this, Sajjadi et al. [28]
apply an additional texture loss to encourage similarity with the original image. In contrast to these works, we design a discriminator that operates on the residual outputs of each scale and train progressively with a strategy based on curriculum learning. With this, our GAN model is able to upsample perceptually pleasing SR images for multiple scales up to 8×.

3. Progressive Multi-scale Super-resolution

Given a set of n LR input images with corresponding HR target images {(x1, y1), . . . , (xn, yn)}, we consider the problem of estimating an upscaling function u : X → Y, where X and Y denote the spaces of LR and HR images, respectively. Finding a suitable parameterisation of the upscaling function u for large upsampling ratios is challenging: the larger the ratio, the more complex the required function class.

To this end, we propose a progressive solution to learn the upscaling function u. In the following, we present our pyramidal architecture, ProSR, for multi-scale super-resolution in Sections 3.1 and 3.2. In Section 3.3 we propose ProGanSR, a progressive multi-scale GAN for perceptual enhancement. Finally, we discuss a curriculum learning scheme in Section 3.4.

3.1. Pyramidal Decomposition

We propose a pyramidal decomposition of u into a series of simpler functions u0, . . . , us. Each function, or level, is tasked with refining the feature representation and performing a 2× upsampling of its own input. Each level of the pyramid consists of a cascade of dense compression units (DCUs) followed by a sub-pixel convolution layer. We assign more DCUs to the lower pyramid levels, resulting in an asymmetric structure. Having more computation power in the lower pyramid not only reduces the memory consumption but also increases the receptive field with respect to the original image; hence it outperforms the symmetric variant in terms of reconstruction quality and runtime. While the decomposition of u is shared among the pyramid levels, we also use two scale-specific sub-networks, denoted by vs and rs, which allow for an individual transformation between the scale-varying image space and a normalized feature space. A schematic illustration of our progressive upsampling architecture is shown in Figure 2.

To simplify learning, the network is designed to output the residual

R_s(x) = (r_s ∘ u_s ∘ · · · ∘ u_0 ∘ v_s)(x)    (1)

w.r.t. a fixed upsampling ϕ_s(x) of the input through, e.g., bicubic interpolation. Thus, for a given scaling factor s, the estimated HR image can be computed as

ŷ = R_s(x) + ϕ_s(x).    (2)

Notably, our network does not follow the Laplacian pyramid principle as in [21, 22], i.e., the intermediate sub-net outputs are neither supervised nor used as base images in the subsequent level. This design performs favorably over the Laplacian alternative, as it simplifies the backward pass and thus reduces the optimization difficulty. Additionally, we do not downsample the ground truth to create labels, which is done for the intermediate supervision in [21, 22]. This avoids artefacts that may result from subsampling.
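To make Eqs. (1) and (2) concrete, the sketch below assembles a forward pass for a target scale s in PyTorch. The module and attribute names (ProgressivePyramid, levels, to_feat, to_img) are our own illustrative choices, not the actual ProSR code; the real layer definitions are in the supplemental material.

```python
import torch.nn as nn
import torch.nn.functional as F

class ProgressivePyramid(nn.Module):
    """Illustrative sketch of Eqs. (1) and (2), not the official ProSR code."""

    def __init__(self, levels, to_feat, to_img):
        super().__init__()
        self.levels = nn.ModuleList(levels)    # u_0 .. u_S, each upsamples 2x
        self.to_feat = nn.ModuleList(to_feat)  # scale-specific v_s
        self.to_img = nn.ModuleList(to_img)    # scale-specific r_s

    def forward(self, x, s):
        # R_s(x) = (r_s o u_s o ... o u_0 o v_s)(x), Eq. (1)
        h = self.to_feat[s](x)
        for level in self.levels[: s + 1]:
            h = level(h)
        residual = self.to_img[s](h)
        # y_hat = R_s(x) + phi_s(x), Eq. (2), with bicubic upsampling as phi_s
        base = F.interpolate(x, scale_factor=2.0 ** (s + 1),
                             mode="bicubic", align_corners=False)
        return base + residual
```

Level index 0 corresponds to the 2× output, so s + 1 levels together produce a 2^(s+1)× upscale, matching the u0 (2×), u1 (4×), u2 (8×) layout of Figure 2.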
3.2. Dense Compression Units

We base the construction of each pyramid level on the recently proposed DenseNet architecture [16]. Similarly to skip connections [15], dense connections improve gradient flow, alleviating vanishing and shattered gradients [3].

The core component in each level of the pyramid is a dense compression unit (DCU), which consists of a modified densely connected block followed by a 1 × 1 convolution CONV(1,1).

The original dense layer is composed of BN-RELU-CONV(1,1)-BN-RELU-CONV(3,3). Following recent practice in super-resolution [9, 24, 39], we remove all batch normalizations. However, since the features from previous layers may have varying scales, we also remove the first ReLU so that CONV(1,1) can rescale the features. This leads to a modified dense layer composition: CONV(1,1)-RELU-CONV(3,3).

Contrary to DenseNet, we break the dense connection at the end of each DCU with a CONV(1,1) compression layer, which re-assembles information efficiently and leads to a slight performance gain in spite of the broken dense connection. For very deep models we apply pyramid-wise as well as local residual links to improve gradient propagation, as shown in Figure 2.
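As an illustration of the modified dense layer and the DCU boundary, consider the following PyTorch sketch. The bottleneck width (4× the growth rate) follows the DenseNet convention and, like the channel counts, is an assumption rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Modified dense layer: Conv(1,1)-ReLU-Conv(3,3), no batch norm."""

    def __init__(self, in_ch, growth_rate):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 4 * growth_rate, 1),  # rescales concatenated features
            nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth_rate, growth_rate, 3, padding=1),
        )

    def forward(self, x):
        # Dense connectivity: concatenate the new features to the input.
        return torch.cat([x, self.body(x)], dim=1)

class DCU(nn.Module):
    """Dense compression unit: dense block plus a Conv(1,1) compression
    layer that breaks the dense connection at the unit boundary."""

    def __init__(self, channels, growth_rate=12, n_layers=6):
        super().__init__()
        layers, ch = [], channels
        for _ in range(n_layers):
            layers.append(DenseLayer(ch, growth_rate))
            ch += growth_rate
        self.block = nn.Sequential(*layers)
        self.compress = nn.Conv2d(ch, channels, 1)  # back to `channels`

    def forward(self, x):
        return self.compress(self.block(x))
```

The compression layer keeps the channel count constant across DCUs, which is what allows the asymmetric stacking of many units in the lowest pyramid level without the quadratic channel growth of an uninterrupted dense block.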
3.3. Progressive GAN

Generative adversarial networks (GANs) [14] have emerged as a powerful method to enhance the perceptual quality of upsampled images in SISR [14, 23, 28]. However, training GANs is notoriously difficult, and success in applying GANs to SISR has been limited to single-scale upsampling at relatively low target resolutions. In order to enable multi-scale GAN-enhanced SISR, we propose a modular and progressive discriminator network similar to the generator network proposed in the previous section. As illustrated in the bottom of Figure 3, the architecture has a reverse pyramid structure {u2, u1, u0}, where each level gradually reduces the spatial dimension of the input image with AVGPOOLING. Similar to the generator, scale-specific image transformation layers vs are applied before each pyramid. To accommodate the multi-scale outputs from the generator, the network is fully convolutional and outputs a small patch of features similar to PatchGAN [18]. The complete specs of the discriminator can be found in the supplemental material.

Similar to the generator network, the discriminator operates on the residual between the original and the bicubic upsampled image. This allows both generator and discriminator to concentrate only on the important sources of variation which are not already well captured by the standard upsampling operation. Since these regions are challenging to upsample well, they correspond to the largest perceptual errors. This can also be viewed as subtracting a data-dependent baseline from the discriminator, which helps to reduce variance.

As the training objective, we use the more stable least squares loss instead of the original cross-entropy loss [25]. Denoting the predicted and real residuals as r̂ and r, the discriminator loss and generator loss for a training example i of scale s can be expressed as

L^i_{D_s} = (D(r̂_i^s))^2 + (D(r_i^s) − 1)^2    (3)

L^i_{R_s} = (D(r̂_i^s) − 1)^2 + Σ_{k∈{2,4}} ‖Φ_k(ŷ_i) − Φ_k(y_i)‖_2    (4)

where Φ_k denotes the input to the k-th pooling layer in VGG16 [31].
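A minimal sketch of the least-squares objectives in Eqs. (3) and (4), assuming a patch discriminator D that scores residual images and a hypothetical helper vgg_feats that returns the inputs to the 2nd and 4th pooling layers of a fixed VGG16:

```python
import torch

def discriminator_loss(D, fake_res, real_res):
    # Eq. (3): least-squares GAN objective for the discriminator;
    # patch scores are averaged (an assumption for the PatchGAN output).
    return (D(fake_res) ** 2).mean() + ((D(real_res) - 1) ** 2).mean()

def generator_loss(D, fake_res, sr, hr, vgg_feats):
    # Eq. (4): adversarial term plus VGG feature distances (k = 2, 4).
    adv = ((D(fake_res) - 1) ** 2).mean()
    content = sum(torch.norm(fs - fh, p=2)
                  for fs, fh in zip(vgg_feats(sr), vgg_feats(hr)))
    return adv + content
```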
3.4. Curriculum Learning

Curriculum learning [4] is a strategy to improve training by gradually increasing the difficulty of the learning task. It is often used in sequence prediction tasks and sequential decision making problems, where it can yield large speedups in training time and improvements in generalisation performance.

The pyramidal decomposition of u allows us to apply curriculum learning in a natural way. The loss for a training example (x_i^s, y_i) of scale s can be defined as

L^i_{R_s} = ‖R_s(x_i^s) + ϕ_s(x_i^s) − y_i‖_1    (5)

where x_i^s corresponds to the s× downsampled version of y_i. Then the goal at scale s is to find

θ̂_s = argmin_{θ_s} Σ_{s′≤s} Σ_i L^i_{R_{s′}},    (6)

where θ_s parameterises all functions in and below the current scale (u0, v0, r0, . . . , us, vs, rs) according to our pyramidal network shown in Figure 2. Our training curriculum starts by training only the 2× portion of the network. When we proceed to a new phase in the curriculum (e.g. to 4×), a new level of the pyramid is gradually blended in to reduce its impact on the previously trained layers. As Figure 3 shows, for the generator the predicted residual r̂s at scale s is a linear combination of the outputs from levels s and s−1, while analogously for the discriminator, the output features from the new pyramid are combined with the output of the scale-specific input layer of the previous level, vs−1, before entering the already trained pyramids {us−1, . . . , u0}. Bilinear interpolation and AVGPOOL are used to match the spatial dimensions before merging. In both cases, α controls the influence of the new pyramid and varies from 0 to 1 during the blending procedure. As a result, we incrementally add training pairs of the next scale. While a similar idea was proposed in [19] to improve high-resolution image generation, we use this strategy in the context of multi-scale training. Finally, to assemble the batches, we randomly select one of the scales s, to avoid mixing batch statistics as suggested in [2].
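The generator-side blending of Figure 3 amounts to a linear interpolation between the new level's residual and the bilinearly upsampled residual of the previous level; a sketch under our own naming assumptions, with a linear α ramp (the paper does not specify the exact schedule):

```python
import torch.nn.functional as F

def blended_residual(r_new, r_prev, alpha):
    """Blend in a newly added pyramid level (Figure 3, top).

    r_new:  residual predicted by the new level s
    r_prev: residual from level s-1 (half the spatial size)
    alpha:  blending weight, ramped from 0 to 1
    """
    r_prev_up = F.interpolate(r_prev, scale_factor=2,
                              mode="bilinear", align_corners=False)
    return alpha * r_new + (1 - alpha) * r_prev_up

def alpha_schedule(step, blend_steps):
    # Linear ramp over the blend-in phase (assumed schedule).
    return min(1.0, step / blend_steps)
```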
Compared to simple multi-scale training, where training examples from different scales are simultaneously fed to the network, such a progressive training strategy greatly shortens the total training time. Furthermore, it yields an additional performance gain for all included scales compared to single-scale and simple multi-scale training, and alleviates instabilities in GAN training.

4. Evaluation

Before we compare with popular state-of-the-art approaches, we first discuss the benefits of each of our proposed components using a small 24-layer model.

All presented models are trained with the DIV2K [34] training set, which contains 800 high-resolution images. The training details are listed in the supplemental material. For evaluation, the benchmark datasets Set5 [5], Set14 [41], BSD100 [1], Urban100 [17], and the DIV2K validation set [34] are used. As is commonly done in SISR, all evaluations are conducted on the luminance channel.
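For reference, evaluation on the luminance channel typically means converting RGB to the Y channel of YCbCr (ITU-R BT.601 weights) before computing the PSNR; a small sketch for images with values in [0, 255]:

```python
import numpy as np

def rgb_to_y(img):
    """BT.601 luma from an RGB image with values in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr):
    # Evaluate on the luminance channel only, as is common in SISR.
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```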
4.1. Ablation study

Ablation Study | Method | PSNR | Parameters | Runtime
Baseline | single dense block, dense layer BRCBRC (BN-ReLU-Conv ×2), single scale | 28.30 | 8.22M | 0.19s
Block Division | 4 DCUs | 28.32 | 1.79M | 0.11s
Architecture | asymmetric pyramid | 28.41 | 1.89M | 0.11s
Training | curriculum learning | 28.45 | 1.89M | 0.11s
Very Deep Model | increased network width and depth, longer training | 28.94 | 13.4M | 0.27s

Table 1: Overview of experiments in the ablation study. The introduction of DCUs, block division, an asymmetric pyramid layout, and curriculum learning consistently increases reconstruction quality. Reported PSNR values refer to 4× results on Set14. The runtime is measured for 4× upscaling of a 128 × 128 image.

Table 1 summarizes the consistent increase in reconstruction quality stemming from each proposed component. As a baseline, we start from a single dense block with two sub-pixel upsampling layers at the end and a residual connection from the LR input to the final output. In the following, we add the proposed components one at a time.

Asymmetric Pyramid. In this section we show the advantage of the proposed asymmetric pyramidal architecture. We compare the following constellations while keeping the total number of DCUs constant:

Direct: D−D−D−D−S−S
Asymmetric Pyramid: D−D−D−S−D−S

Here, D denotes a dense compression unit with 6 dense layers and S denotes the sub-pixel upsampler. As Table 1 shows, the asymmetric pyramidal architecture considerably improves the reconstruction accuracy compared to direct upsampling. This demonstrates the advantage of utilizing high-dimensional features directly. Furthermore, by assigning more computation to the lower pyramid, the penalty in memory and computation consumption compared to the direct upsampling approach is significantly reduced. As shown in Table 1, for the small model the asymmetric pyramid achieves the same runtime as direct upsampling.

Curriculum Learning. We extend the 4-DCU asymmetric pyramid model to 8× upsampling to quantify the benefit of curriculum learning over simultaneous multi-scale training. As Table 2 shows, simultaneous training typically has a small or even negative impact on the lowest scale (2×), which is also evident for VDSR [20] (see Table 2). On the other hand, curriculum learning always improves the reconstruction quality and outperforms simultaneous training by an average of 0.04dB.

Furthermore, curriculum learning considerably shortens the training time. As Figure 4 shows, the network reaches the same evaluation quality faster than simultaneous training, since the 2× subnet requires less computation and hence less time for each update.

4.2. Comparison with other progressive architectures

In contrast to our approach, existing progressive methods [21, 22] typically rely on deep supervision. They impose a loss on all scales, which can be denoted as

L^i_s = Σ_{s′<s} ℓ_1(ψ_{s′}(y_i), ŷ_i^{s′}) + ℓ_1(y_i^s, ŷ_i^s),    (7)

where ψ_{s′} downsamples the ground truth y_i to scale s′.
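A sketch of this deeply supervised objective, with a hypothetical helper downsample(img, k) that reduces the resolution by a factor of 2^k, standing in for ψ:

```python
def deep_supervised_loss(preds, hr, downsample, s):
    # Eq. (7), as in [21, 22]: the prediction at every intermediate scale
    # s' < s is compared against a downsampled version of the ground truth.
    loss = (preds[s] - hr).abs().mean()            # l1 at the target scale
    for sp in range(s):
        label = downsample(hr, s - sp)             # psi_{s'}(y)
        loss = loss + (preds[sp] - label).abs().mean()
    return loss
```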
Table 2: Gain of simultaneous training and curriculum learning w.r.t. single-scale training on all datasets. The average is computed accounting for the number of images in each dataset. Curriculum learning improves the training for all scales, while simultaneous training hampers the training of the lowest scale.
Figure 4: Training time comparison between curriculum learning and multiscale simultaneous learning. We train the multiscale model and plot the PSNR
evaluation of the individual scales. The elapsed epoch is encoded as the line color. Because curriculum learning activates the smaller subnets first, it requires
much less time to reach the same evaluation quality.
model | B100 2× | B100 4× | B100 8× | Set14 2× | Set14 4× | Set14 8×
single scale, ours | – | 27.44 | – | – | 28.41 | –
single scale, alt | – | 27.32 | – | – | 28.20 | –
multi scale, ours | 31.95 | 27.47 | 24.75 | 33.24 | 28.45 | 24.86
multi scale, alt | 31.92 | 27.38 | 24.70 | 33.22 | 28.28 | 24.76

Table 3: Comparison with other progressive approaches.

We also evaluate this alternative progressive architecture but observed a large decrease in PSNR, as shown in Table 3. Therefore, we conclude that it is less stable to use varying sub-scale upsampling results as base images compared to fixed interpolated results, and that using a downsampling kernel to create the HR label images could introduce undesired artefacts.

4.3. Comparison with State-of-the-art Approaches

In this section, we provide an extensive quantitative and qualitative comparison with other state-of-the-art approaches.

Quantitative Comparison. For a quantitative comparison, we benchmark against VDSR [20], DRRN [33], LapSRN [21], MsLapSRN [22], and EDSR [24]. We obtained models from Lai et al. [22] for the 8× versions of VDSR and DRRN, which have been retrained with 8× data. To produce 8× EDSR results, we extend their 4× model by adding another sub-pixel convolution layer. For training, we follow their practice, which means we initialize the weights of the 8× model from the pretrained 4× model.

Due to the discrepancy in model size among existing approaches, we divide them into two classes based on whether they have more or less than 5 million parameters. Accordingly, we provide two models of different sizes, denoted ProSRs and ProSRℓ, to compete in both classes. ProSRs has 56 dense layers in total with growth rate k = 12 and a total of 3.1M parameters. ProSRℓ has 104 dense layers with growth rate k = 40 and 15.5M parameters, which is roughly a third of the parameters of EDSR.

Table 4 summarizes the quantitative comparison with other state-of-the-art approaches in terms of PSNR. An extended list that includes SSIM scores can be found in the supplemental material. As Table 4 shows, ProSRs achieves the lowest error on most datasets. The very deep model, ProSRℓ, shows a consistent advantage at higher upsampling ratios and is comparable with EDSR at 2×. In general, our progressive design widens the margin in PSNR between our results and the state of the art as the upsampling ratio increases.

Qualitative comparison. First, we qualitatively compare our method without GAN to other methods that also minimise the ℓ1 loss or related norms. Figure 7 shows results of
our method and the most recent state-of-the-art approaches at 4× and 8×.

Concerning our perceptually-driven model with GAN, we compare with SRGAN [23] and EnhanceNet [28]. As Figure 5 shows, the hallucinated details align well with fine structures in the ground truth, even though we do not use an explicit texture matching loss as EnhanceNet [28] does. While SRGAN and EnhanceNet can only upscale 4×, our method extends to 8×; results are shown in Figure 6. We provide an extended qualitative comparison in the supplemental material.

Figure 5: Comparison of 4× GAN results (best viewed when zoomed in). Our approach is less prone to artefacts and aligns well with the original image.
5. Runtime

The asymmetric pyramid architecture contributes to a faster runtime compared to other approaches with similar reconstruction accuracy. In our test environment with an NVIDIA TITAN Xp and cuDNN 6.0, ProSRℓ takes on average 0.8s, 2.1s, and 4.4s to upsample a 520 × 520 image by 2×, 4×, and 8×, respectively. In the NTIRE challenge, we reported the runtime including the geometric ensemble, which requires 8 forward passes, one for each transformed version of the input image. Nonetheless, our runtime is still 5 times faster than that of the top-ranking team.
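The geometric ensemble of [24] averages the network outputs over the eight dihedral transforms of the input (four rotations, each with and without a horizontal flip), undoing each transform on the corresponding output; this accounts for the 8 forward passes. A sketch:

```python
import torch

def geometric_ensemble(model, x):
    """Average predictions over the 8 flip/rotation variants of x (NCHW)."""
    outputs = []
    for flip in (False, True):
        xf = torch.flip(x, dims=[-1]) if flip else x
        for k in range(4):                            # rotate by k * 90 deg
            y = model(torch.rot90(xf, k, dims=[-2, -1]))
            y = torch.rot90(y, -k, dims=[-2, -1])     # undo the rotation
            if flip:
                y = torch.flip(y, dims=[-1])          # undo the flip
            outputs.append(y)
    return torch.stack(outputs).mean(dim=0)
```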
6. NTIRE Challenge

The "New Trends in Image Restoration and Enhancement" (NTIRE) 2018 super-resolution challenge [35] aims at benchmarking SISR methods in challenging scenarios. In particular, one of the challenge tracks targets 8× upscaling, where the low-resolution images are generated with a known downsampling kernel (bicubic). We participated in this track with the ProSRℓ network. In addition to the method described above, we utilised the geometric ensemble used in [24], which yielded a 0.07dB PSNR gain on the validation set. Our model ranks 2nd in terms of SSIM and 4th in terms of PSNR. Compared to the top-ranking team, our model is marginally lower by 0.002 and 0.04dB in SSIM and PSNR respectively, but runs 5 times as fast at test time.

Other tracks in the challenge target 4× upscaling but consider unknown degradations. Given that this task differs from the bicubic 8× setting, the participating teams and the rankings differ. Without specific adaptation for this scenario, we also participated in these tracks for completeness and ranked in the mid-range (7th/9th/7th). We believe further improvement can be achieved with targeted preprocessing and extended training data.

7. Conclusion

In this work we propose a progressive approach to address SISR. We leverage an asymmetric pyramid design and Dense Compression Units in the architecture, both of which lead to improved memory efficiency and reconstruction accuracy. A matching pyramidal discriminator is proposed, which enables optimizing for perceptual quality simultaneously at multiple scales. Furthermore, we leverage a form of curriculum learning which not only increases the performance for all scales but also reduces the total training time. Our models set a new state-of-the-art benchmark in both traditional error measures and perceptual quality.
PSNR | 2× (S14 / B100 / U100 / DIV2K) | 4× (S14 / B100 / U100 / DIV2K) | 8× (S14 / B100 / U100 / DIV2K)

# params < 5M
VDSR | 33.05 / 31.90 / 30.77 / 35.26 | 28.02 / 27.29 / 25.18 / 29.72 | 24.26 / 24.49 / 21.70 / 26.22
DRRN | 33.23 / 32.05 / 31.23 / 35.49 | 28.21 / 27.38 / 25.44 / 29.95 | 24.42 / 24.59 / 21.88 / 26.37
LapSRN | 33.08 / 31.80 / 30.41 / 35.63 | 28.19 / 27.32 / 25.21 / 29.88 | 24.35 / 24.54 / 21.81 / 26.40
MsLapSRN | 33.28 / 32.05 / 31.15 / 35.62 | 28.26 / 27.43 / 25.51 / 30.39 | 24.57 / 24.65 / 22.06 / 26.52
SRDenseNet | – / – / – / – | 28.50 / 27.53 / 26.05 / – | – / – / – / –
ProSRs (ours) | 33.36 / 32.02 / 31.42 / 35.80 | 28.59 / 27.58 / 26.01 / 30.39 | 24.93 / 24.80 / 22.43 / 26.88

# params > 5M
EDSR | 33.92 / 32.32 / 32.93 / 36.47 | 28.80 / 27.71 / 26.64 / 30.71 | 24.96 / 24.83 / 22.53 / 26.96
ProSRℓ (ours) | 34.00 / 32.34 / 32.91 / 36.44 | 28.94 / 27.79 / 26.89 / 30.81 | 25.29 / 24.99 / 23.04 / 27.36

Table 4: Comparison with state-of-the-art approaches. For clarity, we highlight the best approach in blue.