
2016 IEEE Conference on Computer Vision and Pattern Recognition

Accurate Image Super-Resolution Using Very Deep Convolutional Networks

Jiwon Kim, Jung Kwon Lee and Kyoung Mu Lee


Department of ECE, ASRI, Seoul National University, Korea
{j.kim, deruci, kyoungmu}@snu.ac.kr

Abstract

We present a highly accurate single-image super-resolution (SR) method. Our method uses a very deep convolutional network inspired by VGG-net used for ImageNet classification [19]. We find increasing our network depth shows a significant improvement in accuracy. Our final model uses 20 weight layers. By cascading small filters many times in a deep network structure, contextual information over large image regions is exploited in an efficient way. With very deep networks, however, convergence speed becomes a critical issue during training. We propose a simple yet effective training procedure. We learn residuals only and use extremely high learning rates (10^4 times higher than SRCNN [6]) enabled by adjustable gradient clipping. Our proposed method performs better than existing methods in accuracy, and visual improvements in our results are easily noticeable.

Figure 1: PSNR (dB) versus running time (s) on dataset Set5 for scale factor ×2. Our VDSR improves PSNR in comparison to the state-of-the-art methods A+, RFL, SelfEx and SRCNN (SRCNN uses the public, slower CPU implementation). VDSR outperforms SRCNN by a large margin (0.87 dB).
1. Introduction

We address the problem of generating a high-resolution (HR) image given a low-resolution (LR) image, commonly referred to as single image super-resolution (SISR) [12], [8], [9]. SISR is widely used in computer vision applications ranging from security and surveillance imaging to medical imaging, where more image details are required on demand.

Many SISR methods have been studied in the computer vision community. Early methods include interpolation such as bicubic interpolation and Lanczos resampling [7], and more powerful methods utilizing statistical image priors [20, 13] or internal patch recurrence [9].

Currently, learning methods are widely used to model a mapping from LR to HR patches. Neighbor embedding [4, 15] methods interpolate the patch subspace. Sparse coding [25, 26, 21, 22] methods use a learned compact dictionary based on sparse signal representation. Lately, random forests [18] and convolutional neural networks (CNN) [6] have also been used, with large improvements in accuracy.

Among them, Dong et al. [6] demonstrated that a CNN can be used to learn a mapping from LR to HR in an end-to-end manner. Their method, termed SRCNN, does not require any engineered features that are typically necessary in other methods [25, 26, 21, 22] and shows state-of-the-art performance.

While SRCNN successfully introduced a deep learning technique into the super-resolution (SR) problem, we find it has limitations in three aspects: first, it relies on the context of small image regions; second, training converges too slowly; third, the network only works for a single scale.

In this work, we propose a new method to practically resolve these issues.

Context We utilize contextual information spread over very large image regions. For a large scale factor, it is often the case that the information contained in a small patch is not sufficient for detail recovery (ill-posed). Our very deep network using a large receptive field takes a large image context into account.

Convergence We suggest a way to speed up the training: a residual-learning CNN and extremely high learning rates. As the LR and HR images share the same information to a large extent, explicitly modelling the residual image, which is the difference between the HR and LR images, is advantageous.
We propose a network structure for efficient learning when input and output are highly correlated. Moreover, our initial learning rate is 10^4 times higher than that of SRCNN [6]. This is enabled by residual-learning and gradient clipping.

Scale Factor We propose a single-model SR approach. Scales are typically user-specified and can be arbitrary, including fractions. For example, one might need smooth zoom-in in an image viewer or resizing to a specific dimension. Training and storing many scale-dependent models in preparation for all possible scenarios is impractical. We find a single convolutional network is sufficient for multi-scale-factor super-resolution.

Contribution In summary, in this work we propose a highly accurate SR method based on a very deep convolutional network. Very deep networks converge too slowly if small learning rates are used. Boosting the convergence rate with high learning rates leads to exploding gradients, and we resolve the issue with residual-learning and gradient clipping. In addition, we extend our work to cope with the multi-scale SR problem in a single network. Our method is relatively accurate and fast in comparison to state-of-the-art methods, as illustrated in Figure 1.

2. Related Work

SRCNN is a representative state-of-the-art method for the deep learning-based SR approach, so let us analyze and compare it with our proposed method.

2.1. Convolutional Network for Image Super-Resolution

Model SRCNN consists of three layers: patch extraction/representation, non-linear mapping and reconstruction. Filters of spatial sizes 9 × 9, 1 × 1, and 5 × 5 were used, respectively.

In [6], Dong et al. attempted to prepare deeper models, but failed to observe superior performance after a week of training. In some cases, deeper models gave inferior performance. They conclude that deeper networks do not result in better performance (Figure 9 of [6]).

However, we argue that increasing depth significantly boosts performance. We successfully use 20 weight layers (3 × 3 for each layer). Our network is very deep (20 vs. 3 [6]) and the information used for reconstruction (the receptive field) is much larger (41 × 41 vs. 13 × 13).

Training For training, SRCNN directly models high-resolution images. A high-resolution image can be decomposed into low-frequency information (corresponding to the low-resolution image) and high-frequency information (the residual image, or image details). Input and output images share the same low-frequency information. This indicates that SRCNN serves two purposes: carrying the input to the end layer and reconstructing residuals. Carrying the input to the end is conceptually similar to what an auto-encoder does. Training time might be spent on learning this auto-encoder, so that the convergence rate of learning the other part (image details) is significantly decreased. In contrast, since our network models the residual images directly, we can have much faster convergence with even better accuracy.

Scale As in most existing SR methods, SRCNN is trained for a single scale factor and is supposed to work only with the specified scale. Thus, if a new scale is on demand, a new model has to be trained. To cope with multiple-scale SR (possibly including fractional factors), we would need to construct an individual single-scale SR system for each scale of interest.

However, preparing many individual machines for all possible scenarios to cope with multiple scales is inefficient and impractical. In this work, we design and train a single network to handle the multiple-scale SR problem efficiently. This turns out to work very well. Our single machine compares favorably to a single-scale expert for the given sub-task. For three scale factors (×2, 3, 4), we can reduce the number of parameters by three-fold.

In addition to the aforementioned issues, there are some minor differences. Our output image has the same size as the input image because we pad zeros at every layer during training, whereas the output from SRCNN is smaller than the input. Finally, we simply use the same learning rates for all layers, while SRCNN uses different learning rates for different layers in order to achieve stable convergence.

3. Proposed Method

3.1. Proposed Network

For SR image reconstruction, we use a very deep convolutional network inspired by Simonyan and Zisserman [19]. The configuration is outlined in Figure 2. We use d layers, where layers except the first and the last are of the same type: 64 filters of size 3 × 3 × 64, where a filter operates on a 3 × 3 spatial region across 64 channels (feature maps). The first layer operates on the input image. The last layer, used for image reconstruction, consists of a single filter of size 3 × 3 × 64.

Figure 2: Our network structure (ILR → Conv.1, ReLU.1 → ... → Conv.D−1, ReLU.D−1 → Conv.D (residual) → HR). We cascade a pair of layers (convolutional and nonlinear) repeatedly. An interpolated low-resolution (ILR) image goes through the layers and transforms into a high-resolution (HR) image. The network predicts a residual image, and the addition of the ILR and the residual gives the desired output. We use 64 filters for each convolutional layer and some sample feature maps are drawn for visualization. Most features after applying rectified linear units (ReLU) are zero.

The network takes an interpolated low-resolution image (upscaled to the desired size) as input and predicts image details. Modelling image details is often used in super-resolution methods [21, 22, 15, 3], and we find that CNN-based methods can benefit from this domain-specific knowledge.

In this work, we demonstrate that explicitly modelling image details (residuals) has several advantages. These are further discussed later in Section 4.2.

One problem with using a very deep network to predict dense outputs is that the size of the feature map gets reduced every time a convolution operation is applied. For example, when an input of size (n + 1) × (n + 1) is applied to a network with receptive field size n × n, the output image is 1 × 1.
This is in accordance with other super-resolution methods, since many require surrounding pixels to infer center pixels correctly. This center-surround relation is useful since the surrounding region provides more constraints to this ill-posed problem (SR). For pixels near the image boundary, this relation cannot be exploited to the full extent and many SR methods crop the result image.

This methodology, however, is not valid if the required surround region is very big. After cropping, the final image is too small to be visually pleasing.

To resolve this issue, we pad zeros before convolutions to keep the sizes of all feature maps (including the output image) the same. It turns out that zero-padding works surprisingly well. For this reason, our method differs from most other methods in the sense that pixels near the image boundary are also correctly predicted.

Once image details are predicted, they are added back to the input ILR image to give the final image (HR). We use this structure for all experiments in our work.
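To make the structure concrete, here is a minimal sketch of such a network. It is our own PyTorch-style illustration (the authors' implementation uses MatConvNet), and the class name, the default depth of 20 and the single-channel (luminance) input are assumptions for the example. It shows the cascade of 3 × 3 convolutions with 64 channels, zero padding of 1 at every layer so the output keeps the input size, and the final addition of the predicted residual to the interpolated input.

```python
import torch
import torch.nn as nn

class VDSRSketch(nn.Module):
    """Illustrative VDSR-style network: `depth` weight layers of 3x3 convolutions
    with 64 channels, zero padding 1 to preserve spatial size, ReLU after every
    layer except the last, and a global residual connection."""
    def __init__(self, depth=20, channels=64):
        super().__init__()
        layers = [nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):  # middle layers operate on 64 feature maps
            layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(channels, 1, kernel_size=3, padding=1))  # last layer predicts the residual
        self.body = nn.Sequential(*layers)

    def forward(self, ilr):
        residual = self.body(ilr)   # predicted image details
        return ilr + residual       # HR estimate = ILR + residual

# Usage: an interpolated LR patch of any size keeps its size through the network.
x = torch.randn(1, 1, 41, 41)
assert VDSRSketch(depth=20)(x).shape == x.shape
```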
3.2. Training

We now describe the objective to minimize in order to find the optimal parameters of our model. Let x denote an interpolated low-resolution image and y a high-resolution image. Given a training dataset {x^(i), y^(i)}_{i=1}^N, our goal is to learn a model f that predicts values ŷ = f(x), where ŷ is an estimate of the target HR image. We minimize the mean squared error ½‖y − f(x)‖² averaged over the training set.

Residual-Learning In SRCNN, the network must preserve all input detail, since the input image is discarded and the output is generated from the learned features alone. With many weight layers, this becomes an end-to-end relation requiring very long-term memory. For this reason, the vanishing/exploding gradients problem [2] can be critical. We can solve this problem simply with residual-learning.

As the input and output images are largely similar, we define a residual image r = y − x, where most values are likely to be zero or small. We want to predict this residual image. The loss function now becomes ½‖r − f(x)‖², where f(x) is the network prediction.

In networks, this is reflected in the loss layer as follows. Our loss layer takes three inputs: the residual estimate, the network input (ILR image) and the ground truth HR image. The loss is computed as the Euclidean distance between the reconstructed image (the sum of network input and output) and the ground truth.
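As a hedged illustration of this objective, the short sketch below (our Python code, not the authors' MatConvNet loss layer) computes the loss on the reconstructed image, which is equivalent to penalizing the residual prediction; it assumes a network like the VDSRSketch above whose output is already ILR plus the predicted residual.

```python
import torch.nn.functional as F

def residual_loss(net, ilr, hr):
    """Residual-learning loss sketch: with net(ilr) = ilr + f(ilr), the halved MSE
    between the reconstruction and the ground truth equals (1/2n) * ||r - f(x)||^2
    with r = hr - ilr (n = number of pixels, from the mean reduction)."""
    reconstruction = net(ilr)            # ILR + predicted residual
    return 0.5 * F.mse_loss(reconstruction, hr)
```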

Training is carried out by optimizing the regression objective using mini-batch gradient descent based on back-propagation (LeCun et al. [14]). We set the momentum parameter to 0.9. The training is regularized by weight decay (an L2 penalty multiplied by 0.0001).

High Learning Rates for Very Deep Networks Training deep models can fail to converge within a realistic limit of time. SRCNN [6] fails to show superior performance with
more than three weight layers. While there can be various reasons, one possibility is that they stopped their training procedure before the networks converged. Their learning rate of 10^-5 is too small for a network to converge within a week on a common GPU. Looking at Fig. 9 of [6], it is not easy to say their deeper networks have converged and their performances were saturated. While more training would eventually resolve the issue, increasing depth to 20 does not seem practical with SRCNN.

It is a basic rule of thumb to make the learning rate high to boost training. But simply setting the learning rate high can also lead to vanishing/exploding gradients [2]. For that reason, we suggest an adjustable gradient clipping for maximal boost in speed while suppressing exploding gradients.

Adjustable Gradient Clipping Gradient clipping is a technique that is often used in training recurrent neural networks [17]. But, to our knowledge, its usage is limited in training CNNs. While there exist many ways to limit gradients, one of the common strategies is to clip individual gradients to a predefined range [−θ, θ].

With clipping, gradients are kept in a certain range. With stochastic gradient descent commonly used for training, the learning rate is multiplied in to adjust the step size. If a high learning rate is used, it is likely that θ is tuned to be small to avoid exploding gradients in the high learning rate regime. But as the learning rate is annealed to get smaller, the effective gradient (gradient multiplied by learning rate) approaches zero, and training can take exponentially many iterations to converge if the learning rate is decreased geometrically.

For maximal speed of convergence, we clip the gradients to [−θ/γ, θ/γ], where γ denotes the current learning rate. We find the adjustable gradient clipping makes our convergence procedure extremely fast. Our 20-layer network training is done within 4 hours, whereas the 3-layer SRCNN takes several days to train.
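A minimal sketch of this adjustable clipping rule, assuming a PyTorch-style training loop (our illustration, not the authors' MatConvNet code): each gradient element is clamped to [−θ/γ, θ/γ], so the effective step γ·g stays within [−θ, θ] no matter how the learning rate is annealed. The value of θ below is a placeholder, not a number taken from the paper.

```python
def clip_gradients_adjustable(parameters, theta, lr):
    """Clamp each gradient element to [-theta/lr, theta/lr] so that the effective
    update (gradient * lr) stays within [-theta, theta] at any learning rate."""
    bound = theta / lr
    for p in parameters:
        if p.grad is not None:
            p.grad.clamp_(-bound, bound)

# Inside a training step (model, loss and optimizer are assumed to exist):
#   loss.backward()
#   clip_gradients_adjustable(model.parameters(), theta=0.01, lr=current_lr)  # theta is a placeholder
#   optimizer.step()
```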
Multi-Scale While very deep models can boost performance, more parameters are now needed to define a network. Typically, one network is created for each scale factor. Considering that fractional scale factors are often used, we need an economical way to store and retrieve networks. For this reason, we also train a multi-scale model. With this approach, parameters are shared across all predefined scale factors. Training a multi-scale model is straightforward: training datasets for several specified scales are combined into one big dataset.

Data preparation is similar to SRCNN [5] with some differences. The input patch size is now equal to the size of the receptive field, and images are divided into sub-images with no overlap. A mini-batch consists of 64 sub-images, where sub-images from different scales can be in the same batch.
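The following is a hedged sketch of that data preparation in Python (our own illustration using Pillow and NumPy; the exact degradation pipeline, cropping and normalization used by the authors are not specified here and are assumptions). It builds (ILR, HR) sub-image pairs at several scales from one HR image so that a single pooled dataset, and hence a single mini-batch, can mix scales; it assumes a single-channel (luminance) image and a 41-pixel patch matching the receptive field of a 20-layer network.

```python
import numpy as np
from PIL import Image

def make_training_pairs(hr_img, scales=(2, 3, 4), patch=41):
    """Build (ILR, HR) sub-image pairs for several scale factors and pool them
    into one list, so one mini-batch can mix sub-images of different scales."""
    pairs = []
    w, h = hr_img.size
    hr = np.asarray(hr_img, dtype=np.float32) / 255.0
    for s in scales:
        lr = hr_img.resize((w // s, h // s), Image.BICUBIC)   # simulated LR image
        ilr = lr.resize((w, h), Image.BICUBIC)                # interpolated LR input
        x = np.asarray(ilr, dtype=np.float32) / 255.0
        for i in range(0, h - patch + 1, patch):              # non-overlapping sub-images
            for j in range(0, w - patch + 1, patch):
                pairs.append((x[i:i + patch, j:j + patch], hr[i:i + patch, j:j + patch]))
    return pairs
```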
We implement our model using the MatConvNet package [23] (http://www.vlfeat.org/matconvnet/).

Table 1: Performance (PSNR, dB) of residual and non-residual networks on the 'Set5' dataset, ×2. Residual networks rapidly approach their convergence within 10 epochs.

(a) Initial learning rate 0.1
Epoch          10      20      40      80
Residual       36.90   36.64   37.12   37.05
Non-Residual   27.42   19.59   31.38   35.66
Difference      9.48   17.05    5.74    1.39

(b) Initial learning rate 0.01
Epoch          10      20      40      80
Residual       36.74   36.87   36.91   36.93
Non-Residual   30.33   33.59   36.26   36.42
Difference      6.41    3.28    0.65    0.52

(c) Initial learning rate 0.001
Epoch          10      20      40      80
Residual       36.31   36.46   36.52   36.52
Non-Residual   33.97   35.08   36.11   36.11
Difference      2.35    1.38    0.42    0.40

4. Understanding Properties

In this section, we study three properties of our proposed method. First, we show that large depth is necessary for the task of SR. A very deep network utilizes more contextual information in an image and models complex functions with many nonlinear layers. We experimentally verify that deeper networks give better performance than shallow ones.

Second, we show that our residual-learning network converges much faster than the standard CNN. Moreover, our network gives a significant boost in performance.

Third, we show that our method with a single network performs as well as a method using multiple networks trained for each scale. We can effectively reduce the model capacity (the number of parameters) of multi-network approaches.

4.1. The Deeper, the Better

Convolutional neural networks exploit spatially-local correlation by enforcing a local connectivity pattern between neurons of adjacent layers [1]. In other words, hidden units in layer m take as input a subset of units in layer m − 1. They form spatially contiguous receptive fields.

Each hidden unit is unresponsive to variations outside of its receptive field with respect to the input. The architecture thus ensures that the learned filters produce the strongest response to a spatially local input pattern.

However, stacking many such layers leads to filters that become increasingly global (i.e. responsive to a larger region of pixel space). In other words, a filter of very large support can be effectively decomposed into a series of small filters.
Figure 3: Depth vs. performance. PSNR (dB) as a function of network depth for test scale factors (a) ×2, (b) ×3 and (c) ×4.


Figure 4: Performance curves (PSNR (dB) vs. epochs) for residual and non-residual networks, with bicubic interpolation as a baseline, for initial learning rates (a) 0.1, (b) 0.01 and (c) 0.001. The two networks are tested on the 'Set5' dataset with scale factor 2. Residual networks quickly reach state-of-the-art performance within a few epochs, whereas non-residual networks (which model the high-resolution image directly) take many epochs to reach maximum performance. Moreover, the final accuracy is higher for residual networks.

In this work, we use filters of the same size, 3 × 3, for all layers. For the first layer, the receptive field is of size 3 × 3. For the next layers, the size of the receptive field increases by 2 in both height and width. For a depth-D network, the receptive field has size (2D + 1) × (2D + 1). Its size is proportional to the depth.
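A quick check of this formula against the numbers quoted in Section 2.1 (our own arithmetic, using the 20 weight layers of the final model and SRCNN's 9, 1 and 5 filter sizes for comparison):

$$
(2D + 1)\,\big|_{D = 20} = 41,
\qquad
(9 - 1) + (1 - 1) + (5 - 1) + 1 = 13,
$$

so the receptive fields are 41 × 41 for our 20-layer network and 13 × 13 for the 3-layer SRCNN, matching the comparison made in Section 2.1.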
In the task of SR, this corresponds to the amount of contextual information that can be exploited to infer high-frequency components. A large receptive field means the network can use more context to predict image details. As SR is an ill-posed inverse problem, collecting and analyzing more neighbor pixels gives more clues. For example, if some image pattern is entirely contained in a receptive field, it is plausible that this pattern is recognized and used to super-resolve the image.

In addition, very deep networks can exploit high nonlinearities. We use 19 rectified linear units, and our networks can model very complex functions with a moderate number of channels (neurons). The advantages of making a thin deep network are well explained in Simonyan and Zisserman [19].

We now experimentally show that very deep networks significantly improve SR performance. We train and test networks of depth ranging from 5 to 20 (only counting weight layers, excluding nonlinearity layers). In Figure 3, we show the results. In most cases, performance increases as depth increases, and it improves rapidly with depth.

4.2. Residual-Learning

As we already have a low-resolution image as the input, predicting high-frequency components is enough for the purpose of SR. Although the concept of predicting residuals has been used in previous methods [21, 22, 26], it has not been studied in the context of a deep-learning-based SR framework.

In this work, we have proposed a network structure that learns residual images. We now study the effect of this modification to a standard CNN structure in detail.

First, we find that this residual network converges much faster. Two networks are compared experimentally: the residual network and the standard non-residual network.
Table 2: Scale factor experiment. Several models are trained with different scale sets (columns); quantitative evaluation (PSNR, dB) on dataset 'Set5' is provided for test scale factors 2, 3 and 4 (rows). Red color in the original indicates that the test scale is included during training. Models trained with multiple scales perform well on the trained scales.

Test \ Train    ×2      ×3      ×4      ×2,3    ×2,4    ×3,4    ×2,3,4   Bicubic
×2              37.10   30.05   28.13   37.09   37.03   32.43   37.06    33.66
×3              30.42   32.89   30.50   33.22   31.20   33.24   33.27    30.39
×4              28.43   28.73   30.84   28.70   30.86   30.94   30.95    28.42

Figure 5: (Top) Our results using a single network for all scale factors; super-resolved images over all scales are clean and sharp. (Bottom) Results of Dong et al. [5] (×3 model used for all scales); the result images are not visually pleasing. To handle multiple scales, existing methods require multiple networks.

We use depth 10 (weight layers) and scale factor 2. Performance curves for various learning rates are shown in Figure 4. All use the same learning rate scheduling mechanism mentioned above.

Second, at convergence, the residual network shows superior performance. In Figure 4, residual networks give higher PSNR when training is done.

Another remark is that if small learning rates are used, networks do not converge in the given number of epochs. If an initial learning rate of 0.1 is used, the PSNR of a residual-learning network reaches 36.90 within 10 epochs. But if 0.001 is used instead, the network never reaches the same level of performance (its performance is 36.52 after 80 epochs). In a similar manner, residual and non-residual networks show dramatic performance gaps after 10 epochs (36.90 vs. 27.42 for rate 0.1).

In short, this simple modification to a standard non-residual network structure is very powerful, and one can explore the validity of the idea in other image restoration problems where input and output images are highly correlated.

4.3. Single Model for Multiple Scales

Scale augmentation during training is a key technique to equip a network with super-resolution machines for multiple scales. Many SR processes for different scales can be executed with our multi-scale machine with much smaller capacity than that of the single-scale machines combined.

We start with an interesting experiment as follows: we train our network with a single scale factor s_train and test it under another scale factor s_test. Here, factors 2, 3 and 4, which are widely used in SR comparisons, are considered. Possible pairs (s_train, s_test) are tried for the dataset 'Set5' [15]. Experimental results are summarized in Table 2.

Performance is degraded if s_train ≠ s_test. For scale factor 2, the model trained with factor 2 gives a PSNR of 37.10 (in dB), whereas models trained with factor 3 and 4 give 30.05 and 28.13, respectively. A network trained over single-scale data is not capable of handling other scales. In many tests, it is even worse than bicubic interpolation, the method used for generating the input image.

We now test if a model trained with scale augmentation is capable of performing SR at multiple scale factors. The same network used above is trained with multiple scale factors s_train = {2, 3, 4}. In addition, we experiment with the cases s_train = {2, 3}, {2, 4}, {3, 4} for more comparisons.

We observe that the network copes with any scale used during training. When s_train = {2, 3, 4} (×2, 3, 4 in Table 2), its PSNR for each scale is comparable to that achieved from the corresponding single-scale network: 37.06 vs. 37.10 (×2), 33.27 vs. 32.89 (×3), 30.95 vs. 30.86 (×4).
Figure 6: Super-resolution results of "148026" (B100) with scale factor ×3. VDSR recovers sharp lines. (PSNR, SSIM): A+ [22] (22.92, 0.7379); RFL [18] (22.90, 0.7332); SelfEx [11] (23.00, 0.7439); SRCNN [5] (23.15, 0.7487); VDSR (ours) (23.50, 0.7777); ground truth shown for reference.

Figure 7: Super-resolution results of "38092" (B100) with scale factor ×3. The horn in the image is sharp in the result of VDSR. (PSNR, SSIM): A+ [22] (27.08, 0.7514); RFL [18] (27.08, 0.7508); SelfEx [11] (27.02, 0.7513); SRCNN [5] (27.16, 0.7545); VDSR (ours) (27.32, 0.7606); ground truth shown for reference.

Dataset   Scale   Bicubic             A+ [22]             RFL [18]            SelfEx [11]          SRCNN [5]           VDSR (ours)
                  PSNR/SSIM/time      PSNR/SSIM/time      PSNR/SSIM/time      PSNR/SSIM/time       PSNR/SSIM/time      PSNR/SSIM/time
Set5      ×2      33.66/0.9299/0.00   36.54/0.9544/0.58   36.54/0.9537/0.63   36.49/0.9537/45.78   36.66/0.9542/2.19   37.53/0.9587/0.13
Set5      ×3      30.39/0.8682/0.00   32.58/0.9088/0.32   32.43/0.9057/0.49   32.58/0.9093/33.44   32.75/0.9090/2.23   33.66/0.9213/0.13
Set5      ×4      28.42/0.8104/0.00   30.28/0.8603/0.24   30.14/0.8548/0.38   30.31/0.8619/29.18   30.48/0.8628/2.19   31.35/0.8838/0.12
Set14     ×2      30.24/0.8688/0.00   32.28/0.9056/0.86   32.26/0.9040/1.13   32.22/0.9034/105.00  32.42/0.9063/4.32   33.03/0.9124/0.25
Set14     ×3      27.55/0.7742/0.00   29.13/0.8188/0.56   29.05/0.8164/0.85   29.16/0.8196/74.69   29.28/0.8209/4.40   29.77/0.8314/0.26
Set14     ×4      26.00/0.7027/0.00   27.32/0.7491/0.38   27.24/0.7451/0.65   27.40/0.7518/65.08   27.49/0.7503/4.39   28.01/0.7674/0.25
B100      ×2      29.56/0.8431/0.00   31.21/0.8863/0.59   31.16/0.8840/0.80   31.18/0.8855/60.09   31.36/0.8879/2.51   31.90/0.8960/0.16
B100      ×3      27.21/0.7385/0.00   28.29/0.7835/0.33   28.22/0.7806/0.62   28.29/0.7840/40.01   28.41/0.7863/2.58   28.82/0.7976/0.21
B100      ×4      25.96/0.6675/0.00   26.82/0.7087/0.26   26.75/0.7054/0.48   26.84/0.7106/35.87   26.90/0.7101/2.51   27.29/0.7251/0.21
Urban100  ×2      26.88/0.8403/0.00   29.20/0.8938/2.96   29.11/0.8904/3.62   29.54/0.8967/663.98  29.50/0.8946/22.12  30.76/0.9140/0.98
Urban100  ×3      24.46/0.7349/0.00   26.03/0.7973/1.67   25.86/0.7900/2.48   26.44/0.8088/473.60  26.24/0.7989/19.35  27.14/0.8279/1.08
Urban100  ×4      23.14/0.6577/0.00   24.32/0.7183/1.21   24.19/0.7096/1.88   24.79/0.7374/394.40  24.52/0.7221/18.46  25.18/0.7524/1.06

Table 3: Average PSNR/SSIM/time (s) for scale factors ×2, ×3 and ×4 on datasets Set5, Set14, B100 and Urban100. Red color in the original indicates the best performance and blue color the second best.
Another pattern is that for large scales (×3, 4), our multi-scale network outperforms the single-scale networks: our models (×2, 3), (×3, 4) and (×2, 3, 4) give PSNRs of 33.22, 33.24 and 33.27 for test scale 3, respectively, whereas (×3) gives 32.89. Similarly, (×2, 4), (×3, 4) and (×2, 3, 4) give 30.86, 30.94 and 30.95 (vs. 30.84 by the ×4 model), respectively. From this, we observe that training over multiple scales boosts the performance for large scales.

5. Experimental Results

In this section, we evaluate the performance of our method on several datasets. We first describe the datasets used for training and testing our method. Next, the parameters necessary for training are given.

After outlining our experimental setup, we compare our method with several state-of-the-art SISR methods.

5.1. Datasets for Training and Testing

Training dataset Different learning-based methods use different training images. For example, RFL [18] has two methods, where the first one uses 91 images from Yang et al. [25] and the second one uses 291 images with the addition of 200 images from the Berkeley Segmentation Dataset [16]. SRCNN [6] uses a very large ImageNet dataset.

We use 291 images as in [18] for the benchmark against other methods in this section. In addition, data augmentation (rotation or flip) is used. For the results in previous sections, we used 91 images to train the network fast, so performances can be slightly different.

Test dataset For benchmark, we use four datasets. Datasets 'Set5' [15] and 'Set14' [26] are often used for benchmark in other works [22, 21, 5]. Dataset 'Urban100', a dataset of urban images recently provided by Huang et al. [11], is very interesting as it contains many challenging images failed by many of the existing methods. Finally, dataset 'B100', natural images in the Berkeley Segmentation Dataset used in Timofte et al. [22] and Yang and Yang [24] for benchmark, is also employed.

5.2. Training Parameters

We provide the parameters used to train our final model. We use a network of depth 20. Training uses batches of size 64. Momentum and weight decay parameters are set to 0.9 and 0.0001, respectively.

For weight initialization, we use the method described in He et al. [10]. This is a theoretically sound procedure for networks utilizing rectified linear units (ReLU).

We train all experiments over 80 epochs (9960 iterations with batch size 64). The learning rate was initially set to 0.1 and then decreased by a factor of 10 every 20 epochs. In total, the learning rate was decreased 3 times, and the learning is stopped after 80 epochs. Training takes roughly 4 hours on a GPU Titan Z.
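This recipe maps onto a standard SGD configuration. Below is a hedged sketch of such a training loop (our PyTorch-style illustration, not the authors' MatConvNet setup), reusing the VDSRSketch, residual_loss and clip_gradients_adjustable sketches given earlier; the data loader, the clipping threshold and the exact weight-initialization call are assumptions.

```python
import torch
import torch.nn.init as init

model = VDSRSketch(depth=20)
for m in model.modules():                                   # He et al. [10]-style initialization
    if isinstance(m, torch.nn.Conv2d):
        init.kaiming_normal_(m.weight, nonlinearity="relu")

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(80):
    current_lr = optimizer.param_groups[0]["lr"]
    for ilr, hr in train_loader:                            # mini-batches of 64 sub-images (assumed loader)
        optimizer.zero_grad()
        loss = residual_loss(model, ilr, hr)
        loss.backward()
        clip_gradients_adjustable(model.parameters(), theta=0.01, lr=current_lr)
        optimizer.step()
    scheduler.step()                                        # learning rate / 10 every 20 epochs
```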

5.3. Benchmark

For benchmark, we follow the publicly available framework of Huang et al. [11]. It enables the comparison of many state-of-the-art results with the same evaluation procedure.

The framework applies bicubic interpolation to the color components of an image and sophisticated models to the luminance component, as in other methods [4], [9], [26]. This is because human vision is more sensitive to details in intensity than in color.

This framework crops pixels near the image boundary. For our method, this procedure is unnecessary as our network outputs a full-sized image. For fair comparison, however, we also crop pixels by the same amount.
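For concreteness, here is a hedged sketch of that evaluation protocol in Python (our own illustration; the benchmark framework itself is not ours and its exact conventions are assumptions): PSNR is computed on the luminance channel only, after cropping a border whose width we take to be the scale factor, and the RGB-to-Y conversion uses the common ITU-R BT.601 "video range" coefficients.

```python
import numpy as np

def rgb_to_y(rgb):
    """Luminance channel in [16, 235] from an RGB image with values in [0, 255]
    (ITU-R BT.601 convention; assumed to match the benchmark framework)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr_rgb, hr_rgb, scale):
    """PSNR on the Y channel after cropping `scale` pixels from each border,
    mirroring the boundary cropping described above."""
    sr_y = rgb_to_y(np.asarray(sr_rgb, dtype=np.float64))[scale:-scale, scale:-scale]
    hr_y = rgb_to_y(np.asarray(hr_rgb, dtype=np.float64))[scale:-scale, scale:-scale]
    mse = np.mean((sr_y - hr_y) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```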
5.4. Comparisons with State-of-the-Art Methods

We provide quantitative and qualitative comparisons. The compared methods are A+ [22], RFL [18], SelfEx [11] and SRCNN [5]. In Table 3, we provide a summary of the quantitative evaluation on several datasets. Our method outperforms all previous methods on these datasets. Moreover, our method is relatively fast. The public code of SRCNN, based on a CPU implementation, is slower than the code used by Dong et al. [6] in their paper, which is based on a GPU implementation.

In Figures 6 and 7, we compare our method with the top-performing methods. In Figure 6, only our method perfectly reconstructs the line in the middle. Similarly, in Figure 7, contours are clean and vivid in our method whereas they are severely blurred or distorted in the other methods.

6. Conclusion

In this work, we have presented a super-resolution method using very deep networks. Training a very deep network is hard due to a slow convergence rate. We use residual-learning and extremely high learning rates to optimize a very deep network fast. Convergence speed is maximized, and we use gradient clipping to ensure training stability. We have demonstrated that our method outperforms the existing methods by a large margin on benchmark images. We believe our approach is readily applicable to other image restoration problems such as denoising and compression artifact removal.
References

[1] Y. Bengio, I. J. Goodfellow, and A. Courville. Deep learning. Book in preparation for MIT Press, 2015.
[2] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[3] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. Morel. Super-resolution using neighbor embedding of back-projection residuals. In Digital Signal Processing (DSP), 2013 18th International Conference on, pages 1–8. IEEE, 2013.
[4] H. Chang, D.-Y. Yeung, and Y. Xiong. Super-resolution through neighbor embedding. In CVPR, 2004.
[5] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In ECCV, 2014.
[6] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. TPAMI, 2015.
[7] C. E. Duchon. Lanczos filtering in one and two dimensions. Journal of Applied Meteorology, 18(8):1016–1022, 1979.
[8] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael. Learning low-level vision. International Journal of Computer Vision, 40(1):25–47, 2000.
[9] D. Glasner, S. Bagon, and M. Irani. Super-resolution from a single image. In ICCV, 2009.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852, 2015.
[11] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution using transformed self-exemplars. In CVPR, 2015.
[12] M. Irani and S. Peleg. Improving resolution by image registration. CVGIP: Graphical Models and Image Processing, 53(3):231–239, 1991.
[13] K. I. Kim and Y. Kwon. Single-image super-resolution using sparse regression and natural image prior. TPAMI, 2010.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[15] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. Alberi Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, 2012.
[16] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
[17] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
[18] S. Schulter, C. Leistner, and H. Bischof. Fast and accurate image upscaling with super-resolution forests. In CVPR, 2015.
[19] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[20] J. Sun, Z. Xu, and H.-Y. Shum. Image super-resolution using gradient profile prior. In CVPR, 2008.
[21] R. Timofte, V. De, and L. V. Gool. Anchored neighborhood regression for fast example-based super-resolution. In ICCV, 2013.
[22] R. Timofte, V. De Smet, and L. Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In ACCV, 2014.
[23] A. Vedaldi and K. Lenc. MatConvNet – convolutional neural networks for MATLAB. CoRR, abs/1412.4564, 2014.
[24] C.-Y. Yang and M.-H. Yang. Fast direct super-resolution by simple functions. In ICCV, 2013.
[25] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. TIP, 2010.
[26] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In Curves and Surfaces, pages 711–730. Springer, 2012.
