

Image Super-Resolution Using Deep Convolutional Networks

Chao Dong, Chen Change Loy, Member, IEEE, Kaiming He, Member, IEEE, and Xiaoou Tang, Fellow, IEEE

Abstract—We propose a deep learning method for single image super-resolution (SR). Our method directly learns an end-to-end
mapping between the low/high-resolution images. The mapping is represented as a deep convolutional neural network (CNN) that
takes the low-resolution image as the input and outputs the high-resolution one. We further show that traditional sparse-coding-based
SR methods can also be viewed as a deep convolutional network. But unlike traditional methods that handle each component
separately, our method jointly optimizes all layers. Our deep CNN has a lightweight structure, yet demonstrates state-of-the-art
restoration quality, and achieves fast speed for practical on-line usage. We explore different network structures and parameter settings
to achieve trade-offs between performance and speed. Moreover, we extend our network to cope with three color channels
simultaneously, and show better overall reconstruction quality.

Index Terms—Super-resolution, deep convolutional neural networks, sparse coding

1 INTRODUCTION

Single image super-resolution (SR), which aims at recovering a high-resolution image from a single low-resolution image, is a classical problem in computer vision. This problem is inherently ill-posed since a multiplicity of solutions exist for any given low-resolution pixel. In other words, it is an underdetermined inverse problem, whose solution is not unique. Such a problem is typically mitigated by constraining the solution space with strong prior information. To learn the prior, recent state-of-the-art methods mostly adopt the example-based [44] strategy. These methods either exploit internal similarities of the same image [5], [13], [16], [19], [45], or learn mapping functions from external low- and high-resolution exemplar pairs [2], [4], [6], [15], [22], [24], [36], [39], [40], [45], [46], [48], [49]. The external example-based methods can be formulated for generic image super-resolution, or can be designed to suit domain-specific tasks, i.e., face hallucination [29], [48], according to the training samples provided.

The sparse-coding-based (SC) method [47], [48] is one of the representative external example-based SR methods. This method involves several steps in its solution pipeline. First, overlapping patches are densely cropped from the input image and pre-processed (e.g., by mean subtraction and normalization). These patches are then encoded by a low-resolution dictionary. The sparse coefficients are passed into a high-resolution dictionary for reconstructing high-resolution patches. The overlapping reconstructed patches are aggregated (e.g., by weighted averaging) to produce the final output. This pipeline is shared by most external example-based methods, which pay particular attention to learning and optimizing the dictionaries [2], [47], [48] or building efficient mapping functions [24], [39], [40], [45]. However, the rest of the steps in the pipeline have rarely been optimized or considered in a unified optimization framework.

In this paper, we show that the aforementioned pipeline is equivalent to a deep convolutional neural network [26] (more details in Section 3.2). Motivated by this fact, we consider a convolutional neural network that directly learns an end-to-end mapping between low- and high-resolution images. Our method differs fundamentally from existing external example-based approaches, in that ours does not explicitly learn the dictionaries [39], [47], [48] or manifolds [2], [4] for modeling the patch space. These are implicitly achieved via hidden layers. Furthermore, the patch extraction and aggregation are also formulated as convolutional layers, so they are involved in the optimization. In our method, the entire SR pipeline is fully obtained through learning, with little pre/post-processing.

We name the proposed model Super-Resolution Convolutional Neural Network (SRCNN).¹ The proposed SRCNN has several appealing properties. First, its structure is intentionally designed with simplicity in mind, and yet provides superior accuracy² compared with state-of-the-art example-based methods. Fig. 1 shows a comparison on an example.

1. The implementation is available at http://mmlab.ie.cuhk.edu.hk/projects/SRCNN.html.
2. Numerical evaluations use different metrics such as the peak signal-to-noise ratio (PSNR), the structural similarity index (SSIM) [41], multi-scale SSIM [42], and the information fidelity criterion (IFC) [37], when the ground truth images are available.

C. Dong, C. C. Loy, and X. Tang are with the Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong, China. E-mail: {dc012, ccloy, xtang}@ie.cuhk.edu.hk.
K. He is with the Visual Computing Group, Microsoft Research Asia, Beijing 100080, China. E-mail: [email protected].
Manuscript received 30 Dec. 2014; revised 8 Apr. 2015; accepted 18 May 2015. Date of publication 31 May 2015; date of current version 13 Jan. 2016. Recommended for acceptance by M. S. Brown. Digital Object Identifier no. 10.1109/TPAMI.2015.2439281.

Fig. 1. The proposed super-resolution convolutional neural network surpasses the bicubic baseline with just a few training iterations, and outperforms the sparse-coding-based method [48] with moderate training. The performance may be further improved with more training iterations. More details are provided in Section 4.4.1 (the Set5 dataset with an upscaling factor 3). The proposed method provides a visually appealing reconstructed image.

Second, with moderate numbers of filters and layers, our method achieves fast speed for practical on-line usage even on a CPU. Our method is faster than a number of example-based methods, because it is fully feed-forward and does not need to solve any optimization problem on usage. Third, experiments show that the restoration quality of the network can be further improved when (i) larger and more diverse datasets are available, and/or (ii) a larger and deeper model is used. On the contrary, larger datasets/models can present challenges for existing example-based methods. Furthermore, the proposed network can cope with three channels of color images simultaneously to achieve improved super-resolution performance.

Overall, the contributions of this study are mainly in three aspects:

1) We present a fully convolutional neural network for image super-resolution. The network directly learns an end-to-end mapping between low- and high-resolution images, with little pre/post-processing beyond the optimization.
2) We establish a relationship between our deep-learning-based SR method and the traditional sparse-coding-based SR methods. This relationship provides guidance for the design of the network structure.
3) We demonstrate that deep learning is useful in the classical computer vision problem of super-resolution, and can achieve good quality and speed.

A preliminary version of this work was presented earlier [11]. The present work adds to the initial version in significant ways. First, we improve the SRCNN by introducing a larger filter size in the non-linear mapping layer, and explore deeper structures by adding non-linear mapping layers. Second, we extend the SRCNN to process three color channels (either in YCbCr or RGB color space) simultaneously. Experimentally, we demonstrate that performance can be improved in comparison to the single-channel network. Third, considerable new analyses and intuitive explanations are added to the initial results. We also extend the original experiments from the Set5 [2] and Set14 [49] test images to BSD200 [31] (200 test images). In addition, we compare with a number of recently published methods and confirm that our model still outperforms existing approaches under different evaluation metrics.

2 RELATED WORK

2.1 Image Super-Resolution
According to the image priors, single-image super-resolution algorithms can be categorized into four types: prediction models, edge-based methods, image statistical methods, and patch-based (or example-based) methods. These methods have been thoroughly investigated and evaluated in Yang et al.'s work [44]. Among them, the example-based methods [16], [24], [39], [45] achieve state-of-the-art performance.

The internal example-based methods exploit the self-similarity property and generate exemplar patches from the input image. This approach is first proposed in Glasner's work [16], and several improved variants [13], [43] are proposed to accelerate the implementation. The external example-based methods [2], [4], [6], [15], [36], [39], [46], [47], [48], [49] learn a mapping between low/high-resolution patches from external datasets. These studies vary on how to learn a compact dictionary or manifold space to relate low/high-resolution patches, and on how representation schemes can be conducted in such spaces. In the pioneering work of Freeman et al. [14], the dictionaries are directly presented as low/high-resolution patch pairs, and the nearest neighbour (NN) of the input patch is found in the low-resolution space, with its corresponding high-resolution patch used for reconstruction. Chang et al. [4] introduce a manifold embedding technique as an alternative to the NN strategy. In Yang et al.'s work [47], [48], the above NN correspondence advances to a more sophisticated sparse coding formulation. Other mapping functions such as kernel regression [24], simple functions [45], random forests [36] and anchored neighborhood regression [39], [40] are proposed to further improve the mapping accuracy and speed. The sparse-coding-based method and its several improvements [39], [40], [46] are among the state-of-the-art SR methods nowadays. In these methods, the patches are the focus of the optimization; the patch extraction and aggregation steps are considered as pre/post-processing and handled separately.

The majority of SR algorithms [2], [4], [15], [39], [46], [47], [48], [49] focus on gray-scale or single-channel image super-resolution. For color images, the aforementioned methods first transform the problem to a different color space (YCbCr or YUV), and SR is applied only on the luminance channel. There are also works attempting to super-resolve all channels simultaneously. For example, Kim and Kwon [24] and Dai et al. [7] apply their model to each RGB channel and combine them to produce the final results. However, none of them has analyzed the SR performance of different channels, and the necessity of recovering all three channels.

Fig. 2. Given a low-resolution image Y, the first convolutional layer of the SRCNN extracts a set of feature maps. The second layer maps these feature maps nonlinearly to high-resolution patch representations. The last layer combines the predictions within a spatial neighbourhood to produce the final high-resolution image F(Y).

2.2 Convolutional Neural Networks (CNN)
Convolutional neural networks date back decades [26], and deep CNNs have recently shown explosive popularity, partially due to their success in image classification [18], [25]. They have also been successfully applied to other computer vision fields, such as object detection [33], [50], face recognition [38], and pedestrian detection [34]. Several factors are of central importance in this progress: (i) the efficient training implementation on modern powerful GPUs [25], (ii) the proposal of the rectified linear unit (ReLU) [32], which makes convergence much faster while still presenting good quality [25], and (iii) the easy access to an abundance of data (like ImageNet [9]) for training larger models. Our method also benefits from these progresses.

2.3 Deep Learning for Image Restoration
There have been a few studies of using deep learning techniques for image restoration. The multi-layer perceptron (MLP), whose layers are all fully-connected (in contrast to convolutional), is applied to natural image denoising [3] and post-deblurring denoising [35]. More closely related to our work, the convolutional neural network is applied to natural image denoising [21] and removing noisy patterns (dirt/rain) [12]. These restoration problems are more or less denoising-driven. Cui et al. [5] propose to embed auto-encoder networks in their super-resolution pipeline under the notion of the internal example-based approach [16]. The deep model is not specifically designed to be an end-to-end solution, since each layer of the cascade requires independent optimization of the self-similarity search process and the auto-encoder. On the contrary, the proposed SRCNN optimizes an end-to-end mapping. Furthermore, the SRCNN is faster. It is not only a quantitatively superior method, but also a practically useful one.

3 CONVOLUTIONAL NEURAL NETWORKS FOR SUPER-RESOLUTION

3.1 Formulation
Consider a single low-resolution image: we first upscale it to the desired size using bicubic interpolation, which is the only pre-processing we perform.³ Let us denote the interpolated image as Y. Our goal is to recover from Y an image F(Y) that is as similar as possible to the ground truth high-resolution image X. For ease of presentation, we still call Y a "low-resolution" image, although it has the same size as X. We wish to learn a mapping F, which conceptually consists of three operations:

1) Patch extraction and representation: this operation extracts (overlapping) patches from the low-resolution image Y and represents each patch as a high-dimensional vector. These vectors comprise a set of feature maps, of which the number equals the dimensionality of the vectors.
2) Non-linear mapping: this operation nonlinearly maps each high-dimensional vector onto another high-dimensional vector. Each mapped vector is conceptually the representation of a high-resolution patch. These vectors comprise another set of feature maps.
3) Reconstruction: this operation aggregates the above high-resolution patch-wise representations to generate the final high-resolution image. This image is expected to be similar to the ground truth X.

We will show that all these operations form a convolutional neural network. An overview of the network is depicted in Fig. 2. Next we detail our definition of each operation.

3. Bicubic interpolation is also a convolutional operation, so it can be formulated as a convolutional layer. However, the output size of this layer is larger than the input size, so there is a fractional stride. To take advantage of popular well-optimized implementations such as cuda-convnet [25], we exclude this "layer" from learning.

3.1.1 Patch Extraction and Representation
A popular strategy in image restoration (e.g., [1]) is to densely extract patches and then represent them by a set of pre-trained bases such as PCA, DCT, Haar, etc. This is equivalent to convolving the image by a set of filters, each of which is a basis. In our formulation, we involve the optimization of these bases in the optimization of the network. Formally, our first layer is expressed as an operation F_1:

F_1(Y) = max(0, W_1 * Y + B_1),   (1)

Fig. 3. An illustration of sparse-coding-based methods in the view of a convolutional neural network.

where W_1 and B_1 represent the filters and biases respectively, and '*' denotes the convolution operation. Here, W_1 corresponds to n_1 filters of support c × f_1 × f_1, where c is the number of channels in the input image and f_1 is the spatial size of a filter. Intuitively, W_1 applies n_1 convolutions on the image, and each convolution has a kernel of size c × f_1 × f_1. The output is composed of n_1 feature maps. B_1 is an n_1-dimensional vector, whose elements are each associated with a filter. We apply the ReLU (max(0, x)) [32] on the filter responses.⁴

4. The ReLU can be equivalently considered as a part of the second operation (non-linear mapping), in which case the first operation (patch extraction and representation) becomes purely linear convolution.

3.1.2 Non-Linear Mapping
The first layer extracts an n_1-dimensional feature for each patch. In the second operation, we map each of these n_1-dimensional vectors into an n_2-dimensional one. This is equivalent to applying n_2 filters which have a trivial spatial support of 1 × 1. This interpretation is only valid for 1 × 1 filters, but it is easy to generalize to larger filters like 3 × 3 or 5 × 5. In that case, the non-linear mapping is not on a patch of the input image; instead, it is on a 3 × 3 or 5 × 5 "patch" of the feature map. The operation of the second layer is:

F_2(Y) = max(0, W_2 * F_1(Y) + B_2).   (2)

Here W_2 contains n_2 filters of size n_1 × f_2 × f_2, and B_2 is n_2-dimensional. Each of the output n_2-dimensional vectors is conceptually the representation of a high-resolution patch that will be used for reconstruction.

It is possible to add more convolutional layers to increase the non-linearity. But this increases the complexity of the model (n_2 × f_2 × f_2 × n_2 parameters for one additional layer), and thus demands more training time. We will explore deeper structures by introducing additional non-linear mapping layers in Section 4.3.3.

3.1.3 Reconstruction
In the traditional methods, the predicted overlapping high-resolution patches are often averaged to produce the final full image. The averaging can be considered as a pre-defined filter on a set of feature maps (where each position is the "flattened" vector form of a high-resolution patch). Motivated by this, we define a convolutional layer to produce the final high-resolution image:

F(Y) = W_3 * F_2(Y) + B_3.   (3)

Here W_3 corresponds to c filters of size n_2 × f_3 × f_3, and B_3 is a c-dimensional vector.

If the representations of the high-resolution patches are in the image domain (i.e., we can simply reshape each representation to form the patch), we expect the filters to act like an averaging filter; if the representations of the high-resolution patches are in some other domain (e.g., coefficients in terms of some bases), we expect W_3 to behave like first projecting the coefficients onto the image domain and then averaging. Either way, W_3 is a set of linear filters.

Interestingly, although the above three operations are motivated by different intuitions, they all lead to the same form as a convolutional layer. We put all three operations together and form a convolutional neural network (Fig. 2). In this model, all the filtering weights and biases are to be optimized. Despite the succinctness of the overall structure, our SRCNN model is carefully developed by drawing on extensive experience resulting from significant progresses in super-resolution [47], [48]. We detail the relationship in the next section.
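To make the three-layer formulation concrete, the following is a minimal sketch of the network in PyTorch (an assumption for illustration; the authors' released implementation uses cuda-convnet/Caffe). The layer sizes follow the basic setting f_1 = 9, f_2 = 1, f_3 = 5, n_1 = 64, n_2 = 32 discussed in Section 3.2. Padding is added here so that the output matches the input size, whereas the paper trains without padding and evaluates the loss only on the central pixels.

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Patch extraction (Eq. 1), non-linear mapping (Eq. 2), reconstruction (Eq. 3)."""
    def __init__(self, c=1, f1=9, f2=1, f3=5, n1=64, n2=32):
        super().__init__()
        self.layer1 = nn.Conv2d(c, n1, kernel_size=f1, padding=f1 // 2)   # W1: n1 filters of c x f1 x f1
        self.layer2 = nn.Conv2d(n1, n2, kernel_size=f2, padding=f2 // 2)  # W2: n2 filters of n1 x f2 x f2
        self.layer3 = nn.Conv2d(n2, c, kernel_size=f3, padding=f3 // 2)   # W3: c filters of n2 x f3 x f3
        self.relu = nn.ReLU(inplace=True)

    def forward(self, y):
        # y is the bicubic-interpolated "low-resolution" image, already at the target size
        h = self.relu(self.layer1(y))   # F1(Y) = max(0, W1 * Y + B1)
        h = self.relu(self.layer2(h))   # F2(Y) = max(0, W2 * F1(Y) + B2)
        return self.layer3(h)           # F(Y)  = W3 * F2(Y) + B3 (linear, no ReLU)

model = SRCNN()
out = model(torch.randn(1, 1, 33, 33))  # same spatial size as the input sub-image
```

The last layer is deliberately left linear, matching Eq. (3): reconstruction is an averaging/projection step rather than a non-linear mapping.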
3.2 Relationship to Sparse-Coding-Based Methods
We show that the sparse-coding-based SR methods [47], [48] can be viewed as a convolutional neural network. Fig. 3 shows an illustration.

In the sparse-coding-based methods, let us consider that an f_1 × f_1 low-resolution patch is extracted from the input image. Then the sparse coding solver, like Feature-Sign [28], will first project the patch onto a (low-resolution) dictionary. If the dictionary size is n_1, this is equivalent to applying n_1 linear filters (f_1 × f_1) on the input image (the mean subtraction is also a linear operation, so it can be absorbed). This is illustrated as the left part of Fig. 3.

The sparse coding solver will then iteratively process the n_1 coefficients. The outputs of this solver are n_2 coefficients, and usually n_2 = n_1 in the case of sparse coding. These n_2 coefficients are the representation of the high-resolution patch. In this sense, the sparse coding solver behaves as a special case of a non-linear mapping operator, whose spatial support is 1 × 1. See the middle part of Fig. 3.

However, the sparse coding solver is not feed-forward, i.e., it is an iterative algorithm. On the contrary, our non-linear operator is fully feed-forward and can be computed efficiently. If we set f_2 = 1, then our non-linear operator can be considered as a pixel-wise fully-connected layer. It is worth noting that "the sparse coding solver" in SRCNN refers to the first two layers, not just the second layer or the activation function (ReLU). Thus the non-linear operation in SRCNN is also well optimized through the learning process.

The above n_2 coefficients (after sparse coding) are then projected onto another (high-resolution) dictionary to produce a high-resolution patch. The overlapping high-resolution patches are then averaged. As discussed above, this is equivalent to linear convolutions on the n_2 feature maps. If the high-resolution patches used for reconstruction are of size f_3 × f_3, then the linear filters have an equivalent spatial support of size f_3 × f_3. See the right part of Fig. 3.

The above discussion shows that the sparse-coding-based SR method can be viewed as a kind of convolutional neural network (with a different non-linear mapping). But not all operations have been considered in the optimization in the sparse-coding-based SR methods. On the contrary, in our convolutional neural network, the low-resolution dictionary, high-resolution dictionary, non-linear mapping, together with mean subtraction and averaging, are all involved in the filters to be optimized. So our method optimizes an end-to-end mapping that consists of all operations.

The above analogy can also help us to design hyper-parameters. For example, we can set the filter size of the last layer to be smaller than that of the first layer, so that we rely more on the central part of the high-resolution patch (in the extreme, if f_3 = 1, we are using the center pixel with no averaging). We can also set n_2 < n_1 because the representation is expected to be sparser. A typical and basic setting is f_1 = 9, f_2 = 1, f_3 = 5, n_1 = 64, and n_2 = 32 (we evaluate more settings in the experiment section). On the whole, the estimation of a high-resolution pixel utilizes the information of (9 + 5 − 1)² = 169 pixels. Clearly, the information exploited for reconstruction is comparatively larger than that used in existing external example-based approaches, e.g., using (5 + 5 − 1)² = 81 pixels⁵ [15], [48]. This is one of the reasons why the SRCNN gives superior performance.

5. The patches are overlapped by 4 pixels in each direction.

3.3 Training
Learning the end-to-end mapping function F requires the estimation of the network parameters Θ = {W_1, W_2, W_3, B_1, B_2, B_3}. This is achieved through minimizing the loss between the reconstructed images F(Y; Θ) and the corresponding ground truth high-resolution images X. Given a set of high-resolution images {X_i} and their corresponding low-resolution images {Y_i}, we use the mean squared error (MSE) as the loss function:

L(Θ) = (1/n) Σ_{i=1}^{n} ||F(Y_i; Θ) − X_i||²,   (4)

where n is the number of training samples. Using MSE as the loss function favors a high PSNR. The PSNR is a widely-used metric for quantitatively evaluating image restoration quality, and is at least partially related to perceptual quality. It is worth noting that the convolutional neural network does not preclude the usage of other kinds of loss functions, as long as the loss functions are differentiable. If a better perceptually motivated metric is given during training, the network can flexibly adapt to that metric. On the contrary, such flexibility is in general difficult to achieve for traditional "hand-crafted" methods. Despite the proposed model being trained to favor a high PSNR, we still observe satisfactory performance when the model is evaluated using alternative evaluation metrics, e.g., SSIM and MSSIM (see Section 4.4.1).
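As a reference for how the MSE loss in Eq. (4) relates to the reported PSNR values, a small helper (a sketch, not part of the paper's code) is:

```python
import numpy as np

def psnr(x, y, peak=1.0):
    """Peak signal-to-noise ratio (dB) between two images with intensities in [0, peak]."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)  # smaller MSE -> higher PSNR
```

Since PSNR is a monotonically decreasing function of the MSE, minimizing Eq. (4) directly maximizes PSNR.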
The loss is minimized using stochastic gradient descent with standard backpropagation [27]. In particular, the weight matrices are updated as

Δ_{i+1} = 0.9 · Δ_i + η · ∂L/∂W_i^ℓ,    W_{i+1}^ℓ = W_i^ℓ + Δ_{i+1},   (5)

where ℓ ∈ {1, 2, 3} and i are the indices of layers and iterations, η is the learning rate, and ∂L/∂W_i^ℓ is the derivative. The filter weights of each layer are initialized by drawing randomly from a Gaussian distribution with zero mean and standard deviation 0.001 (and 0 for the biases). The learning rate is 10⁻⁴ for the first two layers and 10⁻⁵ for the last layer. We empirically find that a smaller learning rate in the last layer is important for the network to converge (similar to the denoising case [21]).
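A hedged sketch of how this training recipe (momentum 0.9, per-layer learning rates, Gaussian initialization) could be set up for the SRCNN module sketched in Section 3.1, assuming PyTorch and the hypothetical `model` instance defined there:

```python
import torch

# Gaussian initialization with standard deviation 0.001, zero biases
for m in model.modules():
    if isinstance(m, torch.nn.Conv2d):
        torch.nn.init.normal_(m.weight, mean=0.0, std=1e-3)
        torch.nn.init.zeros_(m.bias)

optimizer = torch.optim.SGD(
    [
        {"params": model.layer1.parameters(), "lr": 1e-4},  # first two layers: 10^-4
        {"params": model.layer2.parameters(), "lr": 1e-4},
        {"params": model.layer3.parameters(), "lr": 1e-5},  # last layer: 10^-5 for stable convergence
    ],
    momentum=0.9,
)
criterion = torch.nn.MSELoss()  # Eq. (4)

# One training step on a (low-resolution, high-resolution) sub-image batch (y, x):
# optimizer.zero_grad(); loss = criterion(model(y), x); loss.backward(); optimizer.step()
```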
In the training phase, the ground truth images {X_i} are prepared as f_sub × f_sub × c-pixel sub-images randomly cropped from the training images. By "sub-images" we mean these samples are treated as small "images" rather than "patches", in the sense that "patches" are overlapping and require some averaging as post-processing, whereas "sub-images" need not. To synthesize the low-resolution samples {Y_i}, we blur a sub-image by a Gaussian kernel, sub-sample it by the upscaling factor, and upscale it by the same factor via bicubic interpolation.
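A sketch of this low-resolution synthesis for one sub-image, assuming OpenCV; the Gaussian width and the cubic downsampling kernel are assumptions, since the paper only states that a Gaussian blur and sub-sampling are used:

```python
import cv2

def make_lr(sub_image, scale=3, sigma=0.8):
    """Blur with a Gaussian kernel, sub-sample by `scale`, then bicubic-upscale back."""
    h, w = sub_image.shape[:2]
    blurred = cv2.GaussianBlur(sub_image, (0, 0), sigma)                    # Gaussian blur
    small = cv2.resize(blurred, (w // scale, h // scale),
                       interpolation=cv2.INTER_CUBIC)                       # sub-sample by the factor
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_CUBIC)         # bicubic back to f_sub x f_sub
```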
To avoid border effects during training, all the convolutional layers have no padding, and the network produces a smaller output of size (f_sub − f_1 − f_2 − f_3 + 3)² × c. The MSE loss function is evaluated only on the difference between the central pixels of X_i and the network output. Although we use a fixed image size in training, the convolutional neural network can be applied to images of arbitrary sizes during testing.

We implement our model using the cuda-convnet package [25]. We have also tried the Caffe package [23] and observed similar performance.

4 EXPERIMENTS

We first investigate the impact of using different datasets on the model performance. Next, we examine the filters learned by our approach. We then explore different architecture designs of the network, and study the relations between super-resolution performance and factors like depth, number of filters, and filter sizes. Subsequently, we compare our method with recent state-of-the-art methods both quantitatively and qualitatively.

Following [40], super-resolution is only applied on the luminance channel (the Y channel in YCbCr color space) in Sections 4.1-4.4, so c = 1 in the first/last layer, and performance (e.g., PSNR and SSIM) is evaluated on the Y channel. Finally, we extend the network to cope with color images and evaluate the performance on different channels.
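For concreteness, a sketch of this evaluation protocol (only the luminance channel is super-resolved; the chrominance channels come from bicubic interpolation), assuming OpenCV and a hypothetical `sr_y` callable that runs the trained network on a single-channel float image:

```python
import cv2
import numpy as np

def super_resolve_luminance(img_bgr, scale, sr_y):
    """Bicubic-upscale the image, then apply SR only on the Y channel (Sections 4.1-4.4)."""
    h, w = img_bgr.shape[:2]
    up = cv2.resize(img_bgr, (w * scale, h * scale), interpolation=cv2.INTER_CUBIC)
    ycrcb = cv2.cvtColor(up, cv2.COLOR_BGR2YCrCb).astype(np.float32) / 255.0
    ycrcb[:, :, 0] = sr_y(ycrcb[:, :, 0])          # SR on Y; Cr/Cb stay bicubic
    out = (ycrcb * 255.0).clip(0, 255).astype(np.uint8)
    return cv2.cvtColor(out, cv2.COLOR_YCrCb2BGR)
```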
Fig. 4. Training with the much larger ImageNet dataset improves the performance over the use of 91 images.

4.1 Training Data
As shown in the literature, deep learning generally benefits from big data training. For comparison, we use a relatively small training set [39], [48] that consists of 91 images, and a large training set that consists of 395,909 images from the ILSVRC 2013 ImageNet detection training partition. The size of the training sub-images is f_sub = 33. Thus the 91-image dataset can be decomposed into 24,800 sub-images, which are extracted from the original images with a stride of 14, whereas ImageNet provides over 5 million sub-images even using a stride of 33. We use the basic network settings, i.e., f_1 = 9, f_2 = 1, f_3 = 5, n_1 = 64, and n_2 = 32. We use Set5 [2] as the validation set. We observe a similar trend even if we use the larger Set14 set [49]. The upscaling factor is 3. We use the sparse-coding-based method [48] as our baseline, which achieves an average PSNR value of 31.42 dB.

The test convergence curves of using different training sets are shown in Fig. 4. The training time on ImageNet is about the same as on the 91-image dataset, since the number of backpropagations is the same. As can be observed, with the same number of backpropagations (i.e., 8 × 10⁸), SRCNN+ImageNet achieves 32.52 dB, higher than the 32.39 dB yielded by the model trained on 91 images. The results positively indicate that SRCNN performance may be further boosted using a larger training set, but the effect of big data is not as impressive as that shown in high-level vision problems [25]. This is mainly because the 91 images have already captured sufficient variability of natural images. On the other hand, our SRCNN is a relatively small network (8,032 parameters), which could not overfit the 91 images (24,800 samples). Nevertheless, we adopt ImageNet, which contains more diverse data, as the default training set in the following experiments.

4.2 Learned Filters for Super-Resolution
Fig. 5 shows examples of learned first-layer filters trained on ImageNet with an upscaling factor 3. Please refer to our published implementation for upscaling factors 2 and 4. Interestingly, each learned filter has its specific functionality. For instance, the filters g and h are like Laplacian/Gaussian filters, the filters a-e are like edge detectors at different directions, and the filter f is like a texture extractor. Example feature maps of different layers are shown in Fig. 6. Obviously, the feature maps of the first layer contain different structures (e.g., edges at different directions), while those of the second layer mainly differ in intensities.

Fig. 5. The figure shows the first-layer filters trained on ImageNet with an upscaling factor 3. The filters are organized based on their respective variances.

Fig. 6. Example feature maps of different layers.

4.3 Model and Performance Trade-Offs
Based on the basic network settings (i.e., f_1 = 9, f_2 = 1, f_3 = 5, n_1 = 64, and n_2 = 32), we will progressively modify some of these parameters to investigate the best trade-off between performance and speed, and study the relations between performance and parameters.

4.3.1 Filter Number
In general, the performance would improve if we increase the network width,⁶ i.e., add more filters, at the cost of running time. Specifically, based on our network default settings of n_1 = 64 and n_2 = 32, we conduct two experiments: (i) one with a larger network with n_1 = 128 and n_2 = 64, and (ii) the other with a smaller network with n_1 = 32 and n_2 = 16. Similar to Section 4.1, we also train the two models on ImageNet and test on Set5 with an upscaling factor 3. The results observed at 8 × 10⁸ backpropagations are shown in Table 1. It is clear that superior performance can be achieved by increasing the width. However, if a fast restoration speed is desired, a small network width is preferred, which can still achieve better performance than the sparse-coding-based method (31.42 dB).

6. We use 'width' to denote the number of filters in a layer, following [17]. The term 'width' may have other meanings in the literature.

4.3.2 Filter Size
In this section, we examine the network's sensitivity to different filter sizes. In previous experiments, we set the filter sizes f_1 = 9, f_2 = 1 and f_3 = 5, so the network can be denoted as 9-1-5. First, to be consistent with sparse-coding-based methods, we fix the filter size of the second layer to f_2 = 1, and enlarge the filter sizes of the other layers to f_1 = 11 and f_3 = 7 (11-1-7). All the other settings remain the same as in Section 4.1. The result with an upscaling factor 3 on Set5 is 32.57 dB, which is slightly higher than the 32.52 dB reported in Section 4.1.

TABLE 1
The Results of Using Different Filter Numbers in SRCNN

              n1 = 128, n2 = 64   n1 = 64, n2 = 32   n1 = 32, n2 = 16
PSNR (dB)     32.60               32.52              32.26
Time (sec)    0.60                0.18               0.05

Training is performed on ImageNet, whilst the evaluation is conducted on the Set5 dataset.

This indicates that a reasonably larger filter size can grasp richer structural information, which in turn leads to better results.

Then we further examine networks with a larger filter size in the second layer. Specifically, we fix the filter sizes f_1 = 9, f_3 = 5, and enlarge the filter size of the second layer to (i) f_2 = 3 (9-3-5) and (ii) f_2 = 5 (9-5-5). Convergence curves in Fig. 7 show that using a larger filter size can significantly improve the performance. Specifically, the average PSNR values achieved by 9-3-5 and 9-5-5 on Set5 with 8 × 10⁸ backpropagations are 32.66 dB and 32.75 dB, respectively. The results suggest that utilizing neighborhood information in the mapping stage is beneficial.

Fig. 7. A larger filter size leads to better results.

However, the deployment speed will also decrease with a larger filter size. For example, the number of parameters of 9-1-5, 9-3-5, and 9-5-5 is 8,032, 24,416, and 57,184, respectively. The complexity of 9-5-5 is almost twice that of 9-3-5, but the performance improvement is marginal. Therefore, the choice of the network scale should always be a trade-off between performance and speed.
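These parameter counts follow directly from the layer shapes (weights only, biases excluded); a quick check:

```python
def srcnn_params(f1, f2, f3, n1=64, n2=32, c=1):
    """Number of weights in a three-layer SRCNN denoted f1-f2-f3 (biases excluded)."""
    return n1 * c * f1 * f1 + n2 * n1 * f2 * f2 + c * n2 * f3 * f3

print(srcnn_params(9, 1, 5))  # 8032
print(srcnn_params(9, 3, 5))  # 24416
print(srcnn_params(9, 5, 5))  # 57184
```

The middle term dominates once f_2 > 1, which is why 9-5-5 is nearly seven times larger than 9-1-5 while the first and last layers stay unchanged.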
4.3.3 Number of Layers
A recent study by He and Sun [17] suggests that a CNN can benefit from moderately increasing the depth of the network. Here, we try deeper structures by adding another non-linear mapping layer, which has n_22 = 16 filters of size f_22 = 1. We conduct three controlled experiments, i.e., 9-1-1-5, 9-3-1-5, and 9-5-1-5, which add an additional layer to 9-1-5, 9-3-5, and 9-5-5, respectively. The initialization scheme and learning rate of the additional layer are the same as for the second layer. From Figs. 8a, 8b and 8c, we can observe that the four-layer networks converge more slowly than the three-layer networks. Nevertheless, given enough training time, the deeper networks finally catch up and converge to the three-layer ones.

Fig. 8. Comparisons between three-layer and four-layer networks.

The effectiveness of deeper structures for super-resolution is found not to be as apparent as that shown in image classification [17]. Furthermore, we find that deeper networks do not always result in better performance. Specifically, if we add an additional layer with n_22 = 32 filters to the 9-1-5 network, then the performance degrades and fails to surpass the three-layer network (see Fig. 9a). If we go deeper by adding two non-linear mapping layers with n_22 = 32 and n_23 = 16 filters to 9-1-5, then we have to set a smaller learning rate to ensure convergence, but we still do not observe superior performance after a week of training (see Fig. 9a). We also tried to enlarge the filter size of the additional layer to f_22 = 3, and explored two deep structures, 9-3-3-5 and 9-3-3-3. However, from the convergence curves shown in Fig. 9b, these two networks do not show better results than the 9-3-1-5 network.

Fig. 9. Deeper structure does not always lead to better results.

All these experiments indicate that it is not "the deeper the better" in this deep model for super-resolution. This may be caused by the difficulty of training. Our CNN network contains no pooling or fully-connected layers, so it is sensitive to the initialization parameters and learning rate. When we go deeper (e.g., four or five layers), we find it hard to set appropriate learning rates that guarantee convergence.

TABLE 2
The Average Results of PSNR (dB), SSIM, IFC, NQM, WPSNR (dB) and MSSIM on the Set5 Dataset

Eval. Metric Scale Bicubic SC [48] NE+LLE [4] KK [24] ANR [39] A+ [40] SRCNN
2 33.66 - 35.77 36.20 35.83 36.54 36.66
PSNR 3 30.39 31.42 31.84 32.28 31.92 32.59 32.75
4 28.42 - 29.61 30.03 29.69 30.28 30.49
2 0.9299 - 0.9490 0.9511 0.9499 0.9544 0.9542
SSIM 3 0.8682 0.8821 0.8956 0.9033 0.8968 0.9088 0.9090
4 0.8104 - 0.8402 0.8541 0.8419 0.8603 0.8628
2 6.10 - 7.84 6.87 8.09 8.48 8.05
IFC 3 3.52 3.16 4.40 4.14 4.52 4.84 4.58
4 2.35 - 2.94 2.81 3.02 3.26 3.01
2 36.73 - 42.90 39.49 43.28 44.58 41.13
NQM 3 27.54 27.29 32.77 32.10 33.10 34.48 33.21
4 21.42 - 25.56 24.99 25.72 26.97 25.96
2 50.06 - 58.45 57.15 58.61 60.06 59.49
WPSNR 3 41.65 43.64 45.81 46.22 46.02 47.17 47.10
4 37.21 - 39.85 40.40 40.01 41.03 41.13
2 0.9915 - 0.9953 0.9953 0.9954 0.9960 0.9959
MSSSIM 3 0.9754 0.9797 0.9841 0.9853 0.9844 0.9867 0.9866
4 0.9516 - 0.9666 0.9695 0.9672 0.9720 0.9725

Even if it converges, the network may fall into a bad local minimum, and the learned filters are of less diversity even given enough training time. This phenomenon is also observed in [17], where an improper increase of depth leads to accuracy saturation or degradation for image classification. Why "deeper is not better" is still an open question, which requires investigation to better understand gradients and training dynamics in deep architectures. Therefore, we still adopt three-layer networks in the following experiments.

4.4 Comparisons to State-of-the-Arts
In this section, we show the quantitative and qualitative results of our method in comparison to state-of-the-art methods. We adopt the model with a good performance-speed trade-off: a three-layer network with f_1 = 9, f_2 = 5, f_3 = 5, n_1 = 64, and n_2 = 32, trained on ImageNet. For each upscaling factor ∈ {2, 3, 4}, we train a specific network for that factor.⁷

7. In the area of denoising [3], a specific network is trained for each noise level.

Comparisons. We compare our SRCNN with the following state-of-the-art SR methods:

- SC - the sparse-coding-based method of Yang et al. [48]
- NE+LLE - neighbour embedding + locally linear embedding method [4]
- ANR - Anchored Neighbourhood Regression method [39]
- A+ - Adjusted Anchored Neighbourhood Regression method [40], and
- KK - the method described in [24], which achieves the best performance among external example-based methods, according to the comprehensive evaluation conducted in Yang et al.'s work [44].

The implementations are all from the publicly available codes provided by the authors, and all images are down-sampled using the same bicubic kernel.

Test set. The Set5 [2] (5 images), Set14 [49] (14 images) and BSD200 [31] (200 images)⁸ datasets are used to evaluate the performance of upscaling factors 2, 3, and 4.

8. We use the same 200 images as in [44].

Evaluation metrics. Apart from the widely used PSNR and SSIM [41] indices, we also adopt four further evaluation metrics, namely the information fidelity criterion (IFC) [37], noise quality measure (NQM) [8], weighted peak signal-to-noise ratio (WPSNR) and multi-scale structural similarity index (MSSSIM) [42], which obtain high correlation with human perceptual scores as reported in [44].

4.4.1 Quantitative and Qualitative Evaluation
As shown in Tables 2, 3 and 4, the proposed SRCNN yields the highest scores on most evaluation metrics in all experiments.⁹ Note that our SRCNN results are based on the checkpoint of 8 × 10⁸ backpropagations. Specifically, for the upscaling factor 3, the average gains on PSNR achieved by SRCNN are 0.15, 0.17, and 0.13 dB higher than the next best approach, A+ [40], on the three datasets. When we look at the other evaluation metrics, we observe that SC, to our surprise, gets even lower scores than bicubic interpolation on IFC and NQM. It is clear that the results of SC are more visually pleasing than those of bicubic interpolation. This indicates that these two metrics may not truthfully reveal the image quality. Thus, regardless of these two metrics, SRCNN achieves the best performance among all methods and scaling factors.

9. The PSNR value of each image can be found in the supplementary file, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2015.2439281.

It is worth pointing out that SRCNN surpasses the bicubic baseline at the very beginning of the learning stage (see Fig. 1), and with moderate training, SRCNN outperforms existing state-of-the-art methods (see Fig. 4). Yet, the performance is far from convergence. We conjecture that better results can be obtained given longer training time (see Fig. 10).

TABLE 3
The Average Results of PSNR (dB), SSIM, IFC, NQM, WPSNR (dB) and MSSIM on the Set14 Dataset

Eval. Metric Scale Bicubic SC [48] NE+LLE [4] KK [24] ANR [39] A+ [40] SRCNN
2 30.23 - 31.76 32.11 31.80 32.28 32.45
PSNR 3 27.54 28.31 28.60 28.94 28.65 29.13 29.30
4 26.00 - 26.81 27.14 26.85 27.32 27.50
2 0.8687 - 0.8993 0.9026 0.9004 0.9056 0.9067
SSIM 3 0.7736 0.7954 0.8076 0.8132 0.8093 0.8188 0.8215
4 0.7019 - 0.7331 0.7419 0.7352 0.7491 0.7513
2 6.09 - 7.59 6.83 7.81 8.11 7.76
IFC 3 3.41 2.98 4.14 3.83 4.23 4.45 4.26
4 2.23 - 2.71 2.57 2.78 2.94 2.74
2 40.98 - 41.34 38.86 41.79 42.61 38.95
NQM 3 33.15 29.06 37.12 35.23 37.22 38.24 35.25
4 26.15 - 31.17 29.18 31.27 32.31 30.46
2 47.64 - 54.47 53.85 54.57 55.62 55.39
WPSNR 3 39.72 41.66 43.22 43.56 43.36 44.25 44.32
4 35.71 - 37.75 38.26 37.85 38.72 38.87
2 0.9813 - 0.9886 0.9890 0.9888 0.9896 0.9897
MSSSIM 3 0.9512 0.9595 0.9643 0.9653 0.9647 0.9669 0.9675
4 0.9134 - 0.9317 0.9338 0.9326 0.9371 0.9376

TABLE 4
The Average Results of PSNR (dB), SSIM, IFC, NQM, WPSNR (dB) and MSSIM on the BSD200 Dataset

Eval. Metric Scale Bicubic SC [48] NE+LLE [4] KK [24] ANR [39] A+ [40] SRCNN
2 28.38 - 29.67 30.02 29.72 30.14 30.29
PSNR 3 25.94 26.54 26.67 26.89 26.72 27.05 27.18
4 24.65 - 25.21 25.38 25.25 25.51 25.60
2 0.8524 - 0.8886 0.8935 0.8900 0.8966 0.8977
SSIM 3 0.7469 0.7729 0.7823 0.7881 0.7843 0.7945 0.7971
4 0.6727 - 0.7037 0.7093 0.7060 0.7171 0.7184
2 5.30 - 7.10 6.33 7.28 7.51 7.21
IFC 3 3.05 2.77 3.82 3.52 3.91 4.07 3.91
4 1.95 - 2.45 2.24 2.51 2.62 2.45
2 36.84 - 41.52 38.54 41.72 42.37 39.66
NQM 3 28.45 28.22 34.65 33.45 34.81 35.58 34.72
4 21.72 - 25.15 24.87 25.27 26.01 25.65
2 46.15 - 52.56 52.21 52.69 53.56 53.58
WPSNR 3 38.60 40.48 41.39 41.62 41.53 42.19 42.29
4 34.86 - 36.52 36.80 36.64 37.18 37.24
2 0.9780 - 0.9869 0.9876 0.9872 0.9883 0.9883
MSSSIM 3 0.9426 0.9533 0.9575 0.9588 0.9581 0.9609 0.9614
4 0.9005 - 0.9203 0.9215 0.9214 0.9256 0.9261

Fig. 10. The test convergence curve of SRCNN and results of other methods on the Set5 dataset.

Figs. 14, 15 and 16 show the super-resolution results of different approaches with an upscaling factor 3. As can be observed, the SRCNN produces much sharper edges than the other approaches, without any obvious artifacts across the image.

In addition, we compare with another recent deep learning method for image super-resolution, the deep network cascade (DNC) of Cui et al. [5]. As they employ a different blur kernel (a Gaussian filter with a standard deviation of 0.55), we train a specific network (9-5-5) using the same blur kernel as DNC for a fair quantitative comparison. The upscaling factor is 3 and the training set is the 91-image dataset. From the convergence curve shown in Fig. 11, we observe that our SRCNN surpasses DNC with just 2.7 × 10⁷ backpropagations, and a larger margin can be obtained given longer training time. This also demonstrates that end-to-end learning is superior to DNC, even if that model is already "deep".

TABLE 5
Average PSNR (dB) of Different Channels and Training Strategies on the Set5 Dataset

Training Strategies    Y       Cb      Cr      RGB color image
Bicubic                30.39   45.44   45.42   34.57
Y only                 32.39   45.44   45.42   36.37
YCbCr                  29.25   43.30   43.49   33.47
Y pre-train            32.19   46.49   46.45   36.32
CbCr pre-train         32.14   46.38   45.84   36.25
RGB                    32.33   46.18   46.20   36.44
KK                     32.37   44.35   44.22   36.32

Fig. 11. The test convergence curve of SRCNN and the result of DNC on the Set5 dataset.

4.4.2 Running Time
Fig. 12 shows the running time comparisons of several state-of-the-art methods, along with their restoration performance on Set14. All baseline methods are obtained from the corresponding authors' MATLAB+MEX implementations, whereas ours is in pure C++. We profile the running time of all the algorithms using the same machine (Intel CPU 3.10 GHz and 16 GB memory). Note that the processing time of our approach is highly linear in the test image resolution, since all images go through the same number of convolutions. Our method is always a trade-off between performance and speed. To show this, we train three networks for comparison, namely 9-1-5, 9-3-5, and 9-5-5. It is clear that the 9-1-5 network is the fastest, while it still achieves better performance than the next state-of-the-art method, A+. The other methods are several times or even orders of magnitude slower in comparison to the 9-1-5 network. Note that the speed gap is not mainly caused by the different MATLAB/C++ implementations; rather, the other methods need to solve complex optimization problems on usage (e.g., sparse coding or embedding), whereas our method is completely feed-forward. The 9-5-5 network achieves the best performance, but at the cost of running time. The test-time speed of our CNN can be further accelerated in many ways, e.g., by approximating or simplifying the trained networks [10], [30], with possible slight degradation in performance.

Fig. 12. The proposed SRCNN achieves state-of-the-art super-resolution quality, whilst maintaining high and competitive speed in comparison to existing external example-based methods. The chart is based on the Set14 results summarized in Table 3. The implementations of all three SRCNN networks are available on our project page.

Fig. 13. Chrominance channels of the first-layer filters using the "Y pre-train" strategy.

4.5 Experiments on Color Channels
In previous experiments, we follow the conventional approach to super-resolve color images. Specifically, we first transform the color images into the YCbCr space. The SR algorithms are only applied on the Y channel, while the Cb, Cr channels are upscaled by bicubic interpolation. It is interesting to find out if super-resolution performance can be improved if we jointly consider all three channels in the process.

Our method is flexible enough to accept more channels without altering the learning mechanism and network design. In particular, it can readily deal with three channels simultaneously by setting the number of input channels to c = 3. In the following experiments, we explore different training strategies for color image super-resolution, and subsequently evaluate their performance on different channels.

Implementation details. Training is performed on the 91-image dataset, and testing is conducted on Set5 [2]. The network settings are: c = 3, f_1 = 9, f_2 = 1, f_3 = 5, n_1 = 64, and n_2 = 32. As we have demonstrated the effectiveness of SRCNN on different scales, here we only evaluate the performance of upscaling factor 3.

Comparisons. We compare our method with the state-of-the-art color SR method, KK [24]. We also try different learning strategies for comparison:

- Y only: this is our baseline method, which is a single-channel (c = 1) network trained only on the luminance channel. The Cb, Cr channels are upscaled using bicubic interpolation.
- YCbCr: training is performed on the three channels of the YCbCr space.
- Y pre-train: first, to guarantee the performance on the Y channel, we only use the MSE of the Y channel as the loss to pre-train the network. Then we employ the MSE of all channels to fine-tune the parameters (a sketch of this channel-weighted loss is given after this list).
- CbCr pre-train: we use the MSE of the Cb, Cr channels as the loss to pre-train the network, then fine-tune the parameters on all channels.
- RGB: training is performed on the three channels of the RGB space.
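A sketch of how the "Y pre-train" and "CbCr pre-train" strategies can be expressed as a channel-weighted MSE, assuming PyTorch, channel order (Y, Cb, Cr), and illustrative weight values:

```python
import torch

def channel_mse(pred, target, weights=(1.0, 1.0, 1.0)):
    """MSE restricted to selected channels, e.g. weights (1, 0, 0) to pre-train on Y only."""
    w = torch.tensor(weights, dtype=pred.dtype, device=pred.device).view(1, -1, 1, 1)
    return torch.mean(w * (pred - target) ** 2)

# Pre-training stage with the loss on the Y channel only, then fine-tuning on all channels:
# loss = channel_mse(model(y), x, weights=(1.0, 0.0, 0.0))   # "Y pre-train" stage
# loss = channel_mse(model(y), x)                            # fine-tuning stage
```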

Fig. 14. The “butterfly” image from Set5 with an upscaling factor 3.

Fig. 15. The “ppt3” image from Set14 with an upscaling factor 3.

Fig. 16. The “zebra” image from Set14 with an upscaling factor 3.
306 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 38, NO. 2, FEBRUARY 2016

 RGB: Training is performed on the three channels of problems, such as image deblurring or simultaneous SR+
the RGB space. denoising. One could also investigate a network to cope
The results are shown in Table 5, where we have the fol- with different upscaling factors.
lowing observations. (i) If we directly train on the YCbCr
channels, the results are even worse than that of bicubic REFERENCES
interpolation. The training falls into a bad local minimum, [1] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for
due to the inherently different characteristics of the Y and designing overcomplete dictionaries for sparse representation,”
Cb, Cr channels. (ii) If we pre-train on the Y or Cb, Cr chan- IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov.
2006.
nels, the performance finally improves, but is still not better [2] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. A. Morel,
than “Y only” on the color image (see the last column of “Low-complexity single-image super-resolution based on non-
Table 5, where PSNR is computed in RGB color space). This negative neighbor embedding,” in Proc. Brit. Mach. Vis. Conf.,
suggests that the Cb, Cr channels could decrease the perfor- 2012, pp. 1–10.
[3] H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising:
mance of the Y channel when training is performed in a uni- Can plain neural networks compete with BM3D?” in Proc. IEEE
fied network. (iii) We observe that the Cb, Cr channels have Conf. Comput. Vis. Pattern Recog., 2012, pp. 2392–2399.
higher PSNR values for “Y pre-train” than for “CbCr pre- [4] H. Chang, D. Y. Yeung, and Y. Xiong, “Super-resolution through
neighbor embedding,” presented at the IEEE Conf. Comput. Vis.
train”. The reason lies on the differences between the Cb, Cr Pattern Recog., Washington, DC, USA, 2004.
channels and the Y channel. Visually, the Cb, Cr channels are [5] Z. Cui, H. Chang, S. Shan, B. Zhong, and X. Chen, “Deep network
more blurry than the Y channel, thus are less affected by the cascade for image super-resolution,” in Proc. Eur. Conf. Comput.
downsampling process. When we pre-train on the Cb, Cr Vis., 2014, pp. 49–64.
[6] D. Dai, R. Timofte, and L. Van Gool, “Jointly optimized regressors
channels, there are only a few filters being activated. Then for image super-resolution,” Eurographics, vol. 7, p. 8, 2015.
the training will soon fall into a bad local minimum during [7] S. Dai, M. Han, W. Xu, Y. Wu, Y. Gong, and A. K. Katsaggelos,
fine-tuning. On the other hand, if we pre-train on the Y chan- “Softcuts: A soft edge smoothness prior for color image super-
resolution,” IEEE Trans. Image Process., vol. 18, no. 11, pp. 969–981,
nel, more filters will be activated, and the performance on May 2009.
Cb, Cr channels will be pushed much higher. Fig. 13 shows [8] N. Damera-Venkata, T. D. Kite, W. S. Geisler, B. L. Evans, and A. C.
the Cb, Cr channels of the first-layer filters with “Y pre- Bovik, “Image quality assessment based on a degradation model,”
train”, of which the patterns largely differ from that shown IEEE Trans. Image Process., vol. 9, no. 11, pp. 636–650, Apr. 2000.
[9] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei,
in Fig. 5. (iv) Training on the RGB channels achieves the best “ImageNet: A large-scale hierarchical image database,” in Proc.
result on the color image. Different from the YCbCr channels, IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 248–255.
the RGB channels exhibit high cross-correlation among each [10] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus,
other. The proposed SRCNN is capable of leveraging such “Exploiting linear structure within convolutional networks for
efficient evaluation,” in Proc. Adv. Neural Inf. Process. Syst., 2014,
natural correspondences between the channels for recon- pp. 1269–1277.
struction. Therefore, the model achieves comparable result [11] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convo-
on the Y channel as “Y only”, and better results on Cb, Cr lutional network for image super-resolution,” in Proc. Eur. Conf.
Comput. Vis., 2014, pp. 184–199.
channels than bicubic interpolation. (v) In KK [24], super-res- [12] D. Eigen, D. Krishnan, and R. Fergus, “Restoring an image taken
olution is applied on each RGB channel separately. When we through a window covered with dirt or rain,” in Proc. IEEE Int.
transform its results to YCbCr space, the PSNR value of Y Conf. Comput. Vis., 2013, pp. 633–640.
channel is similar as “Y only”, but that of Cb, Cr channels are [13] G. Freedman and R. Fattal, “Image and video upscaling from local
self-examples,” ACM Trans. Graph., vol. 30, no. 11, p. 12, 2011.
poorer than bicubic interpolation. The result suggests that [14] W. T. Freeman, T. R. Jones, and E. C. Pasztor, “Example-based
the algorithm is biased to the Y channel. On the whole, our super-resolution,” Comput. Graph. Appl., vol. 22, no. 11, pp. 56–65,
method trained on RGB channels achieves better perfor- 2002.
[15] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, “Learning
mance than KK and the single-channel network (“Y only”). It low-level vision,” Int. J. Comput. Vis., vol. 40, no. 11, pp. 25–47,
is also worth noting that the improvement compared with 2000.
the single-channel network is not that significant (i.e., 0.07 [16] D. Glasner, S. Bagon, and M. Irani, “Super-resolution from a sin-
dB). This indicates that the Cb, Cr channels barely help in gle image,” in Proc. IEEE Int. Conf. Comput. Vis., 2009, pp. 349–356.
[17] K. He and J. Sun, “Convolutional neural networks at constrained
improving the performance. time cost,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015,
5 CONCLUSION

We have presented a novel deep learning approach for single image SR. We show that conventional sparse-coding-based SR methods can be reformulated into a deep convolutional neural network. The proposed approach, SRCNN, learns an end-to-end mapping between low- and high-resolution images, with little extra pre/post-processing beyond the optimization. With a lightweight structure, the SRCNN has achieved performance superior to the state-of-the-art methods. We conjecture that additional performance gains can be achieved by exploring more filters and different training strategies. In addition, the proposed structure, with its advantages of simplicity and robustness, could be applied to other low-level vision problems, such as image deblurring or simultaneous SR+denoising.
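For concreteness, the following is a minimal sketch of the three-layer mapping summarized above (patch extraction and representation, non-linear mapping, and reconstruction), written in PyTorch purely for illustration. The 9-1-5 kernel sizes, the 64/32 filter counts, and the use of zero padding to preserve spatial size are assumptions taken from the commonly cited default setting of the paper, not a reproduction of the authors' released model.

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """A minimal sketch of the three-layer structure, under the assumed
    9-1-5 kernel sizes and 64/32 filter counts."""

    def __init__(self, channels=1):
        # channels=1 for the Y-only model, channels=3 for the RGB variant.
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction and representation
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x):
        # x is the low-resolution image upscaled (e.g., bicubically) to the
        # target size; the network refines it into the high-resolution estimate.
        return self.layers(x)

# Training would minimize the mean squared error against ground-truth
# sub-images, e.g. with nn.MSELoss() and stochastic gradient descent.
```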
[10] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1269–1277.
[11] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 184–199.
[12] D. Eigen, D. Krishnan, and R. Fergus, “Restoring an image taken through a window covered with dirt or rain,” in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 633–640.
[13] G. Freedman and R. Fattal, “Image and video upscaling from local self-examples,” ACM Trans. Graph., vol. 30, no. 2, p. 12, 2011.
[14] W. T. Freeman, T. R. Jones, and E. C. Pasztor, “Example-based super-resolution,” Comput. Graph. Appl., vol. 22, no. 2, pp. 56–65, 2002.
[15] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, “Learning low-level vision,” Int. J. Comput. Vis., vol. 40, no. 1, pp. 25–47, 2000.
[16] D. Glasner, S. Bagon, and M. Irani, “Super-resolution from a single image,” in Proc. IEEE Int. Conf. Comput. Vis., 2009, pp. 349–356.
[17] K. He and J. Sun, “Convolutional neural networks at constrained time cost,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3791–3799.
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 346–361.
[19] J. B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 5197–5206.
[20] M. Irani and S. Peleg, “Improving resolution by image registration,” Graph. Models Image Process., vol. 53, no. 3, pp. 231–239, 1991.
[21] V. Jain and S. Seung, “Natural image denoising with convolutional networks,” in Proc. Adv. Neural Inf. Process. Syst., 2008, pp. 769–776.
[22] K. Jia, X. Wang, and X. Tang, “Image transformation based on learning dictionaries across image spaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 2, pp. 367–380, Feb. 2013.
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proc. ACM Int. Conf. Multimedia, 2014, pp. 675–678.
[24] K. I. Kim and Y. Kwon, “Single-image super-resolution using sparse regression and natural image prior,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 6, pp. 1127–1133, Jun. 2010.
[25] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[26] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Comput., vol. 1, no. 4, pp. 541–551, 1989.
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[28] H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding algorithms,” in Proc. Adv. Neural Inf. Process. Syst., 2006, pp. 801–808.
[29] C. Liu, H. Y. Shum, and W. T. Freeman, “Face hallucination: Theory and practice,” Int. J. Comput. Vis., vol. 75, no. 1, pp. 115–134, 2007.
[30] F. Mamalet and C. Garcia, “Simplifying convNets for fast learning,” in Proc. Int. Conf. Artif. Neural Netw., 2012, pp. 58–65.
[31] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. IEEE Int. Conf. Comput. Vis., 2001, vol. 2, pp. 416–423.
[32] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. Int. Conf. Mach. Learn., 2010, pp. 807–814.
[33] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C.-C. Loy, and X. Tang, “DeepID-Net: Deformable deep convolutional neural networks for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 2403–2412.
[34] W. Ouyang and X. Wang, “Joint deep learning for pedestrian detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 2056–2063.
[35] C. J. Schuler, H. C. Burger, S. Harmeling, and B. Scholkopf, “A machine learning approach for non-blind image deconvolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 1067–1074.
[36] S. Schulter, C. Leistner, and H. Bischof, “Fast and accurate image upscaling with super-resolution forests,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3791–3799.
[37] H. R. Sheikh, A. C. Bovik, and G. De Veciana, “An information fidelity criterion for image quality assessment using natural scene statistics,” IEEE Trans. Image Process., vol. 14, no. 12, pp. 2117–2128, Dec. 2005.
[38] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1988–1996.
[39] R. Timofte, V. De Smet, and L. Van Gool, “Anchored neighborhood regression for fast example-based super-resolution,” in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 1920–1927.
[40] R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in Proc. Asian Conf. Comput. Vis., 2014, pp. 111–126.
[41] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[42] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Proc. IEEE Conf. Rec. 37th Asilomar Conf. Signals, Syst. Comput., 2003, vol. 2, pp. 1398–1402.
[43] C. Y. Yang, J. B. Huang, and M. H. Yang, “Exploiting self-similarities for single frame super-resolution,” in Proc. Asian Conf. Comput. Vis., 2010, pp. 497–510.
[44] C. Y. Yang, C. Ma, and M. H. Yang, “Single-image super-resolution: A benchmark,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 372–386.
[45] J. Yang, Z. Lin, and S. Cohen, “Fast image super-resolution based on in-place example regression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 1059–1066.
[46] J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. Huang, “Coupled dictionary training for image super-resolution,” IEEE Trans. Image Process., vol. 21, no. 8, pp. 3467–3478, Aug. 2012.
[47] J. Yang, J. Wright, T. Huang, and Y. Ma, “Image super-resolution as sparse representation of raw image patches,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2008, pp. 1–8.
[48] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861–2873, Nov. 2010.
[49] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in Proc. 7th Int. Conf. Curves Surfaces, 2012, pp. 711–730.
[50] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based R-CNNs for fine-grained category detection,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 834–849.

Chao Dong received the BS degree in information engineering from the Beijing Institute of Technology, China, in 2011. He is currently working toward the PhD degree in the Department of Information Engineering, Chinese University of Hong Kong. His research interests include image super-resolution and denoising.

Chen Change Loy received the PhD degree in computer science from the Queen Mary University of London in 2010. He is currently a research assistant professor in the Department of Information Engineering, Chinese University of Hong Kong. Previously, he was a postdoctoral researcher at Vision Semantics Ltd. His research interests include computer vision and pattern recognition, with a focus on face analysis, deep learning, and visual surveillance. He is a member of the IEEE.

Kaiming He received the BS degree from Tsinghua University in 2007, and the PhD degree from the Chinese University of Hong Kong in 2011. He has been a researcher at Microsoft Research Asia (MSRA) since 2011. His research interests include computer vision and computer graphics. He won the Best Paper Award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2009. He is a member of the IEEE.

Xiaoou Tang (S'93-M'96-SM'02-F'09) received the BS degree from the University of Science and Technology of China, Hefei, in 1990, the MS degree from the University of Rochester, New York, in 1991, and the PhD degree from the Massachusetts Institute of Technology, Cambridge, in 1996. He is a professor in the Department of Information Engineering and an associate dean (Research) of the Faculty of Engineering, Chinese University of Hong Kong. He was the group manager of the Visual Computing Group, Microsoft Research Asia, from 2005 to 2008. His research interests include computer vision, pattern recognition, and video processing. He received the Best Paper Award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2009. He was a program chair of the IEEE International Conference on Computer Vision (ICCV) 2009, and he is an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and the International Journal of Computer Vision. He is a fellow of the IEEE.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.