Image Super-Resolution Using Deep Convolutional Networks
Abstract—We propose a deep learning method for single image super-resolution (SR). Our method directly learns an end-to-end
mapping between low- and high-resolution images. The mapping is represented as a deep convolutional neural network (CNN) that
takes the low-resolution image as the input and outputs the high-resolution one. We further show that traditional sparse-coding-based
SR methods can also be viewed as a deep convolutional network. But unlike traditional methods that handle each component
separately, our method jointly optimizes all layers. Our deep CNN has a lightweight structure, yet demonstrates state-of-the-art
restoration quality, and achieves fast speed for practical on-line usage. We explore different network structures and parameter settings
to achieve trade-offs between performance and speed. Moreover, we extend our network to cope with three color channels
simultaneously, and show better overall reconstruction quality.
1 INTRODUCTION

example-based methods, because it is fully feed-forward and does not need to solve any optimization problem on usage. Third, experiments show that the restoration quality of the network can be further improved when (i) larger and more diverse datasets are available, and/or (ii) a larger and deeper model is used. On the contrary, larger datasets/models can present challenges for existing example-based methods. Furthermore, the proposed network can cope with three channels of color images simultaneously to achieve improved super-resolution performance.

Fig. 1. The proposed super-resolution convolutional neural network surpasses the bicubic baseline with just a few training iterations, and outperforms the sparse-coding-based method [48] with moderate training. The performance may be further improved with more training iterations. More details are provided in Section 4.4.1 (the Set5 dataset with an upscaling factor 3). The proposed method provides visually appealing reconstructed images.

Overall, the contributions of this study are mainly in three aspects:
1) We present a fully convolutional neural network for image super-resolution. The network directly learns an end-to-end mapping between low- and high-resolution images, with little pre/post-processing beyond the optimization.
2) We establish a relationship between our deep-learning-based SR method and the traditional sparse-coding-based SR methods. This relationship provides guidance for the design of the network structure.
3) We demonstrate that deep learning is useful in the classical computer vision problem of super-resolution, and can achieve good quality and speed.
A preliminary version of this work was presented earlier [11]. The present work adds to the initial version in significant ways. First, we improve the SRCNN by introducing a larger filter size in the non-linear mapping layer and by exploring deeper structures. Second, we extend the SRCNN to cope with three color channels simultaneously.

2 RELATED WORK

2.1 Image Super-Resolution
According to the image priors, single-image super-resolution algorithms can be categorized into four types: prediction models, edge-based methods, image statistical methods, and patch-based (or example-based) methods. These methods have been thoroughly investigated and evaluated in Yang et al.'s work [44]. Among them, the example-based methods [16], [24], [39], [45] achieve state-of-the-art performance.
The internal example-based methods exploit the self-similarity property and generate exemplar patches from the input image. This idea was first proposed in Glasner's work [16], and several improved variants [13], [43] have been proposed to accelerate the implementation. The external example-based methods [2], [4], [6], [15], [36], [39], [46], [47], [48], [49] learn a mapping between low/high-resolution patches from external datasets. These studies vary in how to learn a compact dictionary or manifold space to relate low/high-resolution patches, and in how representation schemes can be conducted in such spaces. In the pioneering work of Freeman et al. [14], the dictionaries are directly presented as low/high-resolution patch pairs, and the nearest neighbour (NN) of the input patch is found in the low-resolution space, with its corresponding high-resolution patch used for reconstruction. Chang et al. [4] introduce a manifold embedding technique as an alternative to the NN strategy. In Yang et al.'s work [47], [48], the above NN correspondence advances to a more sophisticated sparse coding formulation. Other mapping functions such as kernel regression [24], simple functions [45], random forests [36] and anchored neighborhood regression [39], [40] have been proposed to further improve the mapping accuracy and speed. The sparse-coding-based method and its several improvements [39], [40], [46] are among the state-of-the-art SR methods nowadays. In these methods, the patches are the focus of the optimization; the patch extraction and aggregation steps are considered as pre/post-processing and handled separately.
The majority of SR algorithms [2], [4], [15], [39], [46], [47], [48], [49] focus on gray-scale or single-channel image super-resolution. For color images, the aforementioned methods first transform the problem to a different color space (YCbCr or YUV), and SR is applied only on the luminance channel. There are also works attempting to super-resolve all channels simultaneously. For example, Kim and Kwon [24] and Dai et al. [7] apply their model to each RGB channel and combine them to produce the final results. However, none
of them has analyzed the SR performance of different channels, and the necessity of recovering all three channels.

2.2 Convolutional Neural Networks (CNN)
Convolutional neural networks date back decades [26], and deep CNNs have recently shown an explosive popularity, partially due to their success in image classification [18], [25]. They have also been successfully applied to other computer vision fields, such as object detection [33], [50], face recognition [38], and pedestrian detection [34]. Several factors are of central importance in this progress: (i) the efficient training implementation on modern powerful GPUs [25], (ii) the proposal of the rectified linear unit (ReLU) [32], which makes convergence much faster while still presenting good quality [25], and (iii) the easy access to an abundance of data (like ImageNet [9]) for training larger models. Our method also benefits from this progress.

2.3 Deep Learning for Image Restoration
There have been a few studies of using deep learning techniques for image restoration. The multi-layer perceptron (MLP), whose layers are all fully-connected (in contrast to convolutional), is applied for natural image denoising [3] and post-deblurring denoising [35]. More closely related to our work, the convolutional neural network is applied for natural image denoising [21] and removing noisy patterns (dirt/rain) [12]. These restoration problems are more or less denoising-driven. Cui et al. [5] propose to embed auto-encoder networks in their super-resolution pipeline under the notion of the internal example-based approach [16]. The deep model is not specifically designed to be an end-to-end solution, since each layer of the cascade requires independent optimization of the self-similarity search process and the auto-encoder. On the contrary, the proposed SRCNN optimizes an end-to-end mapping. Further, the SRCNN is faster. It is not only a quantitatively superior method, but also a practically useful one.

3 CONVOLUTIONAL NEURAL NETWORKS FOR SUPER-RESOLUTION

Fig. 2. Given a low-resolution image Y, the first convolutional layer of the SRCNN extracts a set of feature maps. The second layer maps these feature maps nonlinearly to high-resolution patch representations. The last layer combines the predictions within a spatial neighbourhood to produce the final high-resolution image F(Y).

3.1 Formulation
Consider a single low-resolution image: we first upscale it to the desired size using bicubic interpolation, which is the only pre-processing we perform.3 Let us denote the interpolated image as Y. Our goal is to recover from Y an image F(Y) that is as similar as possible to the ground truth high-resolution image X. For the ease of presentation, we still call Y a "low-resolution" image, although it has the same size as X. We wish to learn a mapping F, which conceptually consists of three operations:
1) Patch extraction and representation. This operation extracts (overlapping) patches from the low-resolution image Y and represents each patch as a high-dimensional vector. These vectors comprise a set of feature maps, of which the number equals the dimensionality of the vectors.
2) Non-linear mapping. This operation nonlinearly maps each high-dimensional vector onto another high-dimensional vector. Each mapped vector is conceptually the representation of a high-resolution patch. These vectors comprise another set of feature maps.
3) Reconstruction. This operation aggregates the above high-resolution patch-wise representations to generate the final high-resolution image. This image is expected to be similar to the ground truth X.
We will show that all these operations form a convolutional neural network. An overview of the network is depicted in Fig. 2. Next we detail our definition of each operation.

3. Bicubic interpolation is also a convolutional operation, so it can be formulated as a convolutional layer. However, the output size of this layer is larger than the input size, so there is a fractional stride. To take advantage of popular well-optimized implementations such as cuda-convnet [25], we exclude this "layer" from learning.

3.1.1 Patch Extraction and Representation
A popular strategy in image restoration (e.g., [1]) is to densely extract patches and then represent them by a set of pre-trained bases such as PCA, DCT, Haar, etc. This is equivalent to convolving the image by a set of filters, each of which is a basis. In our formulation, we involve the optimization of these bases into the optimization of the network. Formally, our first layer is expressed as an operation F1:

F_1(Y) = \max(0, W_1 * Y + B_1),   (1)

where W1 and B1 represent the filters and biases respectively, and '*' denotes the convolution operation. Here, W1 corresponds to n1 filters of support c × f1 × f1, where c is the number of channels in the input image and f1 is the spatial size of a filter. Intuitively, W1 applies n1 convolutions on the image, and each convolution has a kernel size of c × f1 × f1. The output is composed of n1 feature maps. B1 is an n1-dimensional vector, each element of which is associated with a filter. We apply the ReLU (max(0, x)) [32] on the filter responses.4

4. The ReLU can be equivalently considered as a part of the second operation (Non-linear mapping), and the first operation (Patch extraction and representation) becomes purely linear convolution.
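To make the convolution-plus-ReLU form of Eq. (1) concrete, the following is a minimal sketch in PyTorch; it is an illustration rather than the authors' cuda-convnet implementation, and the 33 × 33 input size is only an assumed example value (f1 = 9 and n1 = 64 follow the typical setting discussed later in Section 3.2).

import torch
import torch.nn.functional as F

n1, c, f1 = 64, 1, 9
W1 = torch.randn(n1, c, f1, f1) * 0.001   # n1 filters of support c x f1 x f1, Gaussian init (Section 3.3)
B1 = torch.zeros(n1)                      # biases start at zero

Y = torch.rand(1, c, 33, 33)              # a bicubic-upscaled single-channel sub-image (assumed size)

# Eq. (1): F1(Y) = max(0, W1 * Y + B1); no padding, so the output shrinks by f1 - 1 pixels per side.
F1_Y = F.relu(F.conv2d(Y, W1, B1))        # shape: (1, 64, 25, 25), i.e., n1 feature maps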
3.1.2 Non-Linear Mapping
The first layer extracts an n1-dimensional feature for each patch. In the second operation, we map each of these n1-dimensional vectors into an n2-dimensional one. This is equivalent to applying n2 filters which have a trivial spatial support 1 × 1. This interpretation is only valid for 1 × 1 filters. But it is easy to generalize to larger filters like 3 × 3 or 5 × 5. In that case, the non-linear mapping is not on a patch of the input image; instead, it is on a 3 × 3 or 5 × 5 "patch" of the feature map. The operation of the second layer is:

F_2(Y) = \max(0, W_2 * F_1(Y) + B_2).   (2)

Here W2 contains n2 filters of size n1 × f2 × f2, and B2 is n2-dimensional. Each of the output n2-dimensional vectors is conceptually a representation of a high-resolution patch that will be used for reconstruction.
It is possible to add more convolutional layers to increase the non-linearity. But this can increase the complexity of the model (n2 × f2 × f2 × n2 parameters for one layer), and thus demands more training time. We will explore deeper structures by introducing additional non-linear mapping layers in Section 4.3.3.

3.1.3 Reconstruction
In the traditional methods, the predicted overlapping high-resolution patches are often averaged to produce the final full image. The averaging can be considered as a pre-defined filter on a set of feature maps (where each position is the "flattened" vector form of a high-resolution patch). Motivated by this, we define a convolutional layer to produce the final high-resolution image:

F(Y) = W_3 * F_2(Y) + B_3.   (3)

Here W3 corresponds to c filters of a size n2 × f3 × f3, and B3 is a c-dimensional vector.
If the representations of the high-resolution patches are in the image domain (i.e., we can simply reshape each representation to form the patch), we expect that the filters act like an averaging filter; if the representations of the high-resolution patches are in some other domains (e.g., coefficients in terms of some bases), we expect that W3 behaves like first projecting the coefficients onto the image domain and then averaging. Either way, W3 is a set of linear filters.
Interestingly, although the above three operations are motivated by different intuitions, they all lead to the same form as a convolutional layer. We put all three operations together and form a convolutional neural network (Fig. 2). In this model, all the filtering weights and biases are to be optimized. Despite the succinctness of the overall structure, our SRCNN model is carefully developed by drawing on extensive experience from significant progress in super-resolution [47], [48]. We detail the relationship in the next section.
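Putting Eqs. (1)-(3) together, the whole SRCNN is three stacked convolutions with ReLUs after the first two. Below is a minimal sketch of such a network in PyTorch, written under the typical 9-1-5 setting given later in Section 3.2 (f1 = 9, f2 = 1, f3 = 5, n1 = 64, n2 = 32, c = 1); it is an illustrative reimplementation, not the authors' original cuda-convnet/Caffe code.

import torch.nn as nn

class SRCNN(nn.Module):
    """Three-operation SRCNN: patch extraction, non-linear mapping, reconstruction."""
    def __init__(self, c=1, f1=9, f2=1, f3=5, n1=64, n2=32):
        super().__init__()
        # No padding ("valid" convolutions), so the output is slightly smaller than the input,
        # matching the border handling described in Section 3.3.
        self.patch_extraction = nn.Conv2d(c, n1, kernel_size=f1)    # Eq. (1)
        self.nonlinear_mapping = nn.Conv2d(n1, n2, kernel_size=f2)  # Eq. (2)
        self.reconstruction = nn.Conv2d(n2, c, kernel_size=f3)      # Eq. (3), purely linear
        self.relu = nn.ReLU(inplace=True)

    def forward(self, y):
        # y is the bicubic-interpolated "low-resolution" image Y (same size as the target X).
        h = self.relu(self.patch_extraction(y))
        h = self.relu(self.nonlinear_mapping(h))
        return self.reconstruction(h)

Because all three layers are convolutional, the same trained module can be applied to images of arbitrary size at test time, as noted in Section 3.3.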
3.2 Relationship to Sparse-Coding-Based Methods
We show that the sparse-coding-based SR methods [47], [48] can be viewed as a convolutional neural network. Fig. 3 shows an illustration.
In the sparse-coding-based methods, let us consider that an f1 × f1 low-resolution patch is extracted from the input image. Then the sparse coding solver, like Feature-Sign [28], will first project the patch onto a (low-resolution) dictionary. If the dictionary size is n1, this is equivalent to applying n1 linear filters (f1 × f1) on the input image (the mean subtraction is also a linear operation, so it can be absorbed). This is illustrated as the left part of Fig. 3.
The sparse coding solver will then iteratively process the n1 coefficients. The outputs of this solver are n2 coefficients, and usually n2 = n1 in the case of sparse coding. These n2 coefficients are the representation of the high-resolution patch. In this sense, the sparse coding solver behaves as a special case of a non-linear mapping operator, whose spatial support is 1 × 1. See the middle part of Fig. 3. However, the sparse coding solver is not feed-forward, i.e., it is an iterative algorithm. On the contrary, our non-linear operator is fully feed-forward and can be computed efficiently. If we set f2 = 1, then our non-linear operator can be considered as a pixel-wise fully-connected layer. It is worth noting that "the sparse coding solver" in SRCNN refers to the first two layers, but not just the second layer or the activation function (ReLU). Thus the nonlinear operation in SRCNN is also well optimized through the learning process.
The above n2 coefficients (after sparse coding) are then projected onto another (high-resolution) dictionary to produce a high-resolution patch. The overlapping high-resolution patches are then averaged. As discussed above, this is equivalent to linear convolutions on the n2 feature maps. If the high-resolution patches used for reconstruction are of size f3 × f3, then the linear filters have an equivalent spatial support of size f3 × f3. See the right part of Fig. 3.
The above discussion shows that the sparse-coding-based SR method can be viewed as a kind of convolutional neural network (with a different non-linear mapping). But not all operations have been considered in the optimization in the sparse-coding-based SR methods. On the contrary, in our convolutional neural network, the low-resolution dictionary, high-resolution dictionary, non-linear mapping, together with mean subtraction and averaging, are all involved in the filters to be optimized. So our method optimizes an end-to-end mapping that consists of all operations.
The above analogy can also help us to design hyper-parameters. For example, we can set the filter size of the last layer to be smaller than that of the first layer, and thus we rely more on the central part of the high-resolution patch (in the extreme, if f3 = 1, we are using the center pixel with no averaging). We can also set n2 < n1 because it is expected to be sparser. A typical and basic setting is f1 = 9, f2 = 1, f3 = 5, n1 = 64, and n2 = 32 (we evaluate more settings in the experiment section). On the whole, the estimation of a high-resolution pixel utilizes the information of (9 + 5 − 1)^2 = 169 pixels. Clearly, the information exploited for reconstruction is comparatively larger than that used in existing external example-based approaches, e.g., using (5 + 5 − 1)^2 = 81 pixels5 [15], [48]. This is one of the reasons why the SRCNN gives superior performance.

5. The patches are overlapped with 4 pixels in each direction.
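To make the 169-pixel figure explicit: with three stride-1 convolutions, the receptive field of one output pixel spans f1 + f2 + f3 − 2 pixels in each dimension, so for the 9-1-5 setting

(f_1 + f_2 + f_3 - 2)^2 = (9 + 1 + 5 - 2)^2 = 13^2 = 169,

which coincides with (9 + 5 − 1)^2 because the 1 × 1 non-linear mapping layer adds no spatial extent.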
3.3 Training
Learning the end-to-end mapping function F requires the estimation of network parameters Θ = {W1, W2, W3, B1, B2, B3}. This is achieved through minimizing the loss between the reconstructed images F(Y; Θ) and the corresponding ground truth high-resolution images X. Given a set of high-resolution images {Xi} and their corresponding low-resolution images {Yi}, we use mean squared error (MSE) as the loss function:

L(\Theta) = \frac{1}{n} \sum_{i=1}^{n} \| F(Y_i; \Theta) - X_i \|^2,   (4)

where n is the number of training samples. Using MSE as the loss function favors a high PSNR. The PSNR is a widely-used metric for quantitatively evaluating image restoration quality, and is at least partially related to the perceptual quality. It is worth noticing that the convolutional neural networks do not preclude the usage of other kinds of loss functions, as long as the loss functions are differentiable. If a better perceptually motivated metric is given during training, it is flexible for the network to adapt to that metric. On the contrary, such flexibility is in general difficult to achieve for traditional "hand-crafted" methods. Although the proposed model is trained favoring a high PSNR, we still observe satisfactory performance when the model is evaluated using alternative evaluation metrics, e.g., SSIM and MSSSIM (see Section 4.4.1).
The loss is minimized using stochastic gradient descent with the standard backpropagation [27]. In particular, the weight matrices are updated as

\Delta_{i+1} = 0.9 \cdot \Delta_i + \eta \cdot \frac{\partial L}{\partial W_i^{\ell}}, \quad W_{i+1}^{\ell} = W_i^{\ell} + \Delta_{i+1},   (5)

where ℓ ∈ {1, 2, 3} and i are the indices of layers and iterations, η is the learning rate, and ∂L/∂W_i^ℓ is the derivative. The filter weights of each layer are initialized by drawing randomly from a Gaussian distribution with zero mean and standard deviation 0.001 (and 0 for biases). The learning rate is 10^-4 for the first two layers, and 10^-5 for the last layer. We empirically find that a smaller learning rate in the last layer is important for the network to converge (similar to the denoising case [21]).
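The update rule of Eq. (5) is plain SGD with momentum 0.9 and a per-layer learning rate. A minimal training-loop sketch in PyTorch follows; it is an illustration only (the original experiments used cuda-convnet [25]), it reuses the SRCNN module sketched at the end of Section 3.1, and `training_pairs` is an assumed iterable of (Y, X) sub-image batches.

import torch
import torch.nn as nn

model = SRCNN()                       # three-layer network from the earlier sketch
for m in (model.patch_extraction, model.nonlinear_mapping, model.reconstruction):
    nn.init.normal_(m.weight, mean=0.0, std=0.001)   # Gaussian init, std 0.001
    nn.init.zeros_(m.bias)                           # biases start at 0

# Learning rate 1e-4 for the first two layers, 1e-5 for the last layer, momentum 0.9.
# (PyTorch folds the learning rate into the momentum buffer slightly differently from
# Eq. (5), but it implements the same momentum scheme.)
optimizer = torch.optim.SGD([
    {"params": model.patch_extraction.parameters(), "lr": 1e-4},
    {"params": model.nonlinear_mapping.parameters(), "lr": 1e-4},
    {"params": model.reconstruction.parameters(), "lr": 1e-5},
], momentum=0.9)

mse = nn.MSELoss()
for Y, X in training_pairs:           # training_pairs: assumed (blurred/upscaled Y, ground-truth X) batches
    out = model(Y)
    # Valid convolutions shrink the output, so compare against the central crop of X (Section 3.3);
    # border > 0 for the 9-1-5 setting, and square sub-images are assumed.
    border = (X.shape[-1] - out.shape[-1]) // 2
    loss = mse(out, X[..., border:-border, border:-border])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()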
In the training phase, the ground truth images {Xi} are prepared as fsub × fsub × c-pixel sub-images randomly cropped from the training images. By "sub-images" we mean these samples are treated as small "images" rather than "patches", in the sense that "patches" are overlapping and require some averaging as post-processing but "sub-images" need not. To synthesize the low-resolution samples {Yi}, we blur a sub-image by a Gaussian kernel, sub-sample it by the upscaling factor, and upscale it by the same factor via bicubic interpolation.
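The synthesis of a low-resolution sample from a ground-truth sub-image therefore has three steps: Gaussian blur, sub-sampling by the upscaling factor, and upscaling back by the same factor. A small sketch of this preparation (an illustration; the kernel width sigma is an assumed value, the sub-image side is assumed to be a multiple of the scale, and cubic spline interpolation stands in for bicubic):

import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def make_lr_sample(x_sub, scale=3, sigma=1.0):
    """x_sub: ground-truth sub-image as a 2-D float array in [0, 1] (single channel)."""
    blurred = gaussian_filter(x_sub, sigma=sigma)   # Gaussian blur (sigma is an assumed value)
    small = blurred[::scale, ::scale]               # sub-sample by the upscaling factor
    y_sub = zoom(small, scale, order=3)             # upscale back by the same factor
    return np.clip(y_sub, 0.0, 1.0)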
To avoid border effects during training, all the convolutional layers have no padding, and the network produces a smaller output ((fsub − f1 − f2 − f3 + 3)^2 × c). The MSE loss function is evaluated only by the difference between the central pixels of Xi and the network output. Although we use a fixed image size in training, the convolutional neural network can be applied on images of arbitrary sizes during testing.
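For concreteness, the output side length is the sub-image side minus the total shrinkage of the three valid convolutions; a tiny helper (ours, with fsub = 33 used only as an example value) makes the formula explicit:

def output_size(f_sub, f1=9, f2=1, f3=5):
    # Each valid convolution of size f removes f - 1 pixels per dimension.
    return f_sub - f1 - f2 - f3 + 3

print(output_size(33))   # -> 21, i.e., a 21 x 21 x c output for a 33 x 33 x c sub-image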
We implement our model using the cuda-convnet package [25]. We have also tried the Caffe package [23] and observed similar performance.
4 EXPERIMENTS
We first investigate the impact of using different datasets on the model performance. Next, we examine the filters learned by our approach. We then explore different architecture designs of the network, and study the relations between super-resolution performance and factors like depth, number of filters, and filter sizes. Subsequently, we compare our method with recent state-of-the-art methods both quantitatively and qualitatively. Following [40], super-resolution is only applied on the luminance channel (Y channel in YCbCr color space) in Sections 4.1-4.4, so c = 1 in the first and last layers.
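In this single-channel protocol, a color test image is converted to YCbCr, the network super-resolves only the luminance channel, and the chrominance channels remain bicubic-interpolated before converting back to RGB for display. A sketch of that pipeline (an illustration; `srcnn_y` is a hypothetical name for a trained single-channel model, and scikit-image is used only for the color-space conversion):

import numpy as np
from skimage.color import rgb2ycbcr, ycbcr2rgb

def super_resolve_color(img_rgb, srcnn_y):
    """img_rgb: H x W x 3 uint8 image, already bicubic-upscaled to the target size."""
    ycbcr = rgb2ycbcr(img_rgb)
    y = ycbcr[..., 0] / 255.0
    # srcnn_y is assumed to return an output of the same size as its input (e.g., via border
    # padding at test time) and to use the same luminance normalization as during training.
    ycbcr[..., 0] = np.clip(srcnn_y(y), 0.0, 1.0) * 255.0
    return ycbcr2rgb(ycbcr)            # Cb and Cr stay bicubic-interpolated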
Fig. 4. Training with the much larger ImageNet dataset improves the
performance over the use of 91 images.
TABLE 1
The Results of Using Different Filter Numbers in SRCNN

             n1 = 128, n2 = 64    n1 = 64, n2 = 32    n1 = 32, n2 = 16
PSNR (dB)          32.60                32.52                32.26
Time (sec)          0.60                 0.18                 0.05
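The speed differences in Table 1 track the weight counts of the three settings. For the 9-1-5 architecture with c = 1, the number of weights is n1·f1^2 + n1·n2·f2^2 + n2·f3^2 (biases omitted); a quick computation (ours) gives:

def srcnn_weights(n1, n2, c=1, f1=9, f2=1, f3=5):
    return n1 * c * f1**2 + n2 * n1 * f2**2 + c * n2 * f3**2

for n1, n2 in [(128, 64), (64, 32), (32, 16)]:
    print(n1, n2, srcnn_weights(n1, n2))
# -> (128, 64): 20160   (64, 32): 8032   (32, 16): 3504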
Fig. 7. A larger filter size leads to better results.
Fig. 9. Deeper structure does not always lead to better results.
TABLE 2
The Average Results of PSNR (dB), SSIM, IFC, NQM, WPSNR (dB) and MSSSIM on the Set5 Dataset
Eval. Mat Scale Bicubic SC [48] NE+LLE [4] KK [24] ANR [39] A+ [40] SRCNN
2 33.66 - 35.77 36.20 35.83 36.54 36.66
PSNR 3 30.39 31.42 31.84 32.28 31.92 32.59 32.75
4 28.42 - 29.61 30.03 29.69 30.28 30.49
2 0.9299 - 0.9490 0.9511 0.9499 0.9544 0.9542
SSIM 3 0.8682 0.8821 0.8956 0.9033 0.8968 0.9088 0.9090
4 0.8104 - 0.8402 0.8541 0.8419 0.8603 0.8628
2 6.10 - 7.84 6.87 8.09 8.48 8.05
IFC 3 3.52 3.16 4.40 4.14 4.52 4.84 4.58
4 2.35 - 2.94 2.81 3.02 3.26 3.01
2 36.73 - 42.90 39.49 43.28 44.58 41.13
NQM 3 27.54 27.29 32.77 32.10 33.10 34.48 33.21
4 21.42 - 25.56 24.99 25.72 26.97 25.96
2 50.06 - 58.45 57.15 58.61 60.06 59.49
WPSNR 3 41.65 43.64 45.81 46.22 46.02 47.17 47.10
4 37.21 - 39.85 40.40 40.01 41.03 41.13
2 0.9915 - 0.9953 0.9953 0.9954 0.9960 0.9959
MSSSIM 3 0.9754 0.9797 0.9841 0.9853 0.9844 0.9867 0.9866
4 0.9516 - 0.9666 0.9695 0.9672 0.9720 0.9725
it converges, the network may fall into a bad local minimum, and the learned filters are of less diversity even given enough training time. This phenomenon is also observed in [17], where improper increase of depth leads to accuracy saturation or degradation for image classification. Why "deeper is not better" is still an open question, which requires investigations to better understand gradients and training dynamics in deep architectures. Therefore, we still adopt three-layer networks in the following experiments.

4.4 Comparisons to State-of-the-Arts
In this section, we show the quantitative and qualitative results of our method in comparison to state-of-the-art methods. We adopt the model with a good performance-speed trade-off: a three-layer network with f1 = 9, f2 = 5, f3 = 5, n1 = 64, and n2 = 32 trained on ImageNet. For each upscaling factor ∈ {2, 3, 4}, we train a specific network for that factor.7

7. In the area of denoising [3], for each noise level a specific network is trained.

Comparisons. We compare our SRCNN with the state-of-the-art SR methods:
SC - sparse coding-based method of Yang et al. [48]
NE+LLE - neighbour embedding + locally linear embedding method [4]
ANR - Anchored Neighbourhood Regression method [39]
A+ - Adjusted Anchored Neighbourhood Regression method [40], and
KK - the method described in [24], which achieves the best performance among external example-based methods, according to the comprehensive evaluation conducted in Yang et al.'s work [44]
The implementations are all from the publicly available codes provided by the authors, and all images are down-sampled using the same bicubic kernel.

Test set. The Set5 [2] (5 images), Set14 [49] (14 images) and BSD200 [31] (200 images)8 are used to evaluate the performance of upscaling factors 2, 3, and 4.

8. We use the same 200 images as in [44].

Evaluation metrics. Apart from the widely used PSNR and SSIM [41] indices, we also adopt another four evaluation metrics, namely the information fidelity criterion (IFC) [37], noise quality measure (NQM) [8], weighted peak signal-to-noise ratio (WPSNR) and multi-scale structure similarity index (MSSSIM) [42], which obtain high correlation with human perceptual scores as reported in [44].
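Of these metrics, PSNR is the one used most often in the tables that follow. As a reference, a minimal luminance-channel implementation is sketched below; the convention of shaving a border equal to the upscaling factor before measuring is a common practice assumed here and may differ from the exact protocol behind the reported numbers.

import numpy as np

def psnr(ref, est, shave=3, peak=255.0):
    """ref, est: 2-D luminance arrays in [0, peak]; shave: border cropped before comparison."""
    ref = ref[shave:-shave, shave:-shave].astype(np.float64)
    est = est[shave:-shave, shave:-shave].astype(np.float64)
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)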
4.4.1 Quantitative and Qualitative Evaluation
As shown in Tables 2, 3 and 4, the proposed SRCNN yields the highest scores in most evaluation metrics in all experiments.9 Note that our SRCNN results are based on the checkpoint of 8 × 10^8 backpropagations. Specifically, for the upscaling factor 3, the average gains on PSNR achieved by SRCNN are 0.15, 0.17, and 0.13 dB higher than the next best approach, A+ [40], on the three datasets. When we take a look at other evaluation metrics, we observe that SC, to our surprise, gets even lower scores than the bicubic interpolation on IFC and NQM. It is clear that the results of SC are more visually pleasing than those of bicubic interpolation. This indicates that these two metrics may not truthfully reveal the image quality. Thus, regardless of these two metrics, SRCNN achieves the best performance among all methods and scaling factors.
It is worth pointing out that SRCNN surpasses the bicubic baseline at the very beginning of the learning stage (see Fig. 1), and with moderate training, SRCNN outperforms existing state-of-the-art methods (see Fig. 4). Yet, the performance is far from convergence. We conjecture that better results can be obtained given longer training time (see Fig. 10).

9. The PSNR value of each image can be found in the supplementary file, available on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2015.2439281.
TABLE 3
The Average Results of PSNR (dB), SSIM, IFC, NQM, WPSNR (dB) and MSSSIM on the Set14 Dataset
Eval. Mat Scale Bicubic SC [48] NE+LLE [4] KK [24] ANR [39] A+ [40] SRCNN
2 30.23 - 31.76 32.11 31.80 32.28 32.45
PSNR 3 27.54 28.31 28.60 28.94 28.65 29.13 29.30
4 26.00 - 26.81 27.14 26.85 27.32 27.50
2 0.8687 - 0.8993 0.9026 0.9004 0.9056 0.9067
SSIM 3 0.7736 0.7954 0.8076 0.8132 0.8093 0.8188 0.8215
4 0.7019 - 0.7331 0.7419 0.7352 0.7491 0.7513
2 6.09 - 7.59 6.83 7.81 8.11 7.76
IFC 3 3.41 2.98 4.14 3.83 4.23 4.45 4.26
4 2.23 - 2.71 2.57 2.78 2.94 2.74
2 40.98 - 41.34 38.86 41.79 42.61 38.95
NQM 3 33.15 29.06 37.12 35.23 37.22 38.24 35.25
4 26.15 - 31.17 29.18 31.27 32.31 30.46
2 47.64 - 54.47 53.85 54.57 55.62 55.39
WPSNR 3 39.72 41.66 43.22 43.56 43.36 44.25 44.32
4 35.71 - 37.75 38.26 37.85 38.72 38.87
2 0.9813 - 0.9886 0.9890 0.9888 0.9896 0.9897
MSSSIM 3 0.9512 0.9595 0.9643 0.9653 0.9647 0.9669 0.9675
4 0.9134 - 0.9317 0.9338 0.9326 0.9371 0.9376
TABLE 4
The Average Results of PSNR (dB), SSIM, IFC, NQM, WPSNR (dB) and MSSSIM on the BSD200 Dataset
Eval. Mat Scale Bicubic SC [48] NE+LLE [4] KK [24] ANR [39] A+ [40] SRCNN
2 28.38 - 29.67 30.02 29.72 30.14 30.29
PSNR 3 25.94 26.54 26.67 26.89 26.72 27.05 27.18
4 24.65 - 25.21 25.38 25.25 25.51 25.60
2 0.8524 - 0.8886 0.8935 0.8900 0.8966 0.8977
SSIM 3 0.7469 0.7729 0.7823 0.7881 0.7843 0.7945 0.7971
4 0.6727 - 0.7037 0.7093 0.7060 0.7171 0.7184
2 5.30 - 7.10 6.33 7.28 7.51 7.21
IFC 3 3.05 2.77 3.82 3.52 3.91 4.07 3.91
4 1.95 - 2.45 2.24 2.51 2.62 2.45
2 36.84 - 41.52 38.54 41.72 42.37 39.66
NQM 3 28.45 28.22 34.65 33.45 34.81 35.58 34.72
4 21.72 - 25.15 24.87 25.27 26.01 25.65
2 46.15 - 52.56 52.21 52.69 53.56 53.58
WPSNR 3 38.60 40.48 41.39 41.62 41.53 42.19 42.29
4 34.86 - 36.52 36.80 36.64 37.18 37.24
2 0.9780 - 0.9869 0.9876 0.9872 0.9883 0.9883
MSSSIM 3 0.9426 0.9533 0.9575 0.9588 0.9581 0.9609 0.9614
4 0.9005 - 0.9203 0.9215 0.9214 0.9256 0.9261
Figs. 14, 15 and 16 show the super-resolution results of different approaches by an upscaling factor 3. As can be observed, the SRCNN produces much sharper edges than other approaches, without any obvious artifacts across the image.
In addition, we compare with another recent deep learning method for image super-resolution (DNC) of Cui et al. [5]. As they employ a different blur kernel (a Gaussian filter with a standard deviation of 0.55), we train a specific network (9-5-5) using the same blur kernel as DNC for a fair quantitative comparison. The upscaling factor is 3 and the training set is the 91-image dataset. From the convergence curve shown in Fig. 11, we observe that our SRCNN surpasses DNC with just 2.7 × 10^7 backprops, and a larger margin can be obtained given longer training time. This also demonstrates that end-to-end learning is superior to DNC, even if that model is already "deep".

Fig. 10. The test convergence curve of SRCNN and results of other methods on the Set5 dataset.
TABLE 5
Average PSNR (dB) of Different Channels and Training Strategies on the Set5 Dataset
Fig. 14. The “butterfly” image from Set5 with an upscaling factor 3.
Fig. 15. The “ppt3” image from Set14 with an upscaling factor 3.
Fig. 16. The “zebra” image from Set14 with an upscaling factor 3.
RGB: Training is performed on the three channels of the RGB space.
The results are shown in Table 5, where we have the following observations. (i) If we directly train on the YCbCr channels, the results are even worse than those of bicubic interpolation. The training falls into a bad local minimum, due to the inherently different characteristics of the Y and Cb, Cr channels. (ii) If we pre-train on the Y or Cb, Cr channels, the performance finally improves, but is still not better than "Y only" on the color image (see the last column of Table 5, where PSNR is computed in RGB color space). This suggests that the Cb, Cr channels could decrease the performance of the Y channel when training is performed in a unified network. (iii) We observe that the Cb, Cr channels have higher PSNR values for "Y pre-train" than for "CbCr pre-train". The reason lies in the differences between the Cb, Cr channels and the Y channel. Visually, the Cb, Cr channels are more blurry than the Y channel, and thus are less affected by the downsampling process. When we pre-train on the Cb, Cr channels, there are only a few filters being activated, and the training soon falls into a bad local minimum during fine-tuning. On the other hand, if we pre-train on the Y channel, more filters will be activated, and the performance on the Cb, Cr channels will be pushed much higher. Fig. 13 shows the Cb, Cr channels of the first-layer filters with "Y pre-train", whose patterns largely differ from those shown in Fig. 5. (iv) Training on the RGB channels achieves the best result on the color image. Different from the YCbCr channels, the RGB channels exhibit high cross-correlation among each other. The proposed SRCNN is capable of leveraging such natural correspondences between the channels for reconstruction. Therefore, the model achieves comparable results on the Y channel as "Y only", and better results on the Cb, Cr channels than bicubic interpolation. (v) In KK [24], super-resolution is applied on each RGB channel separately. When we transform its results to YCbCr space, the PSNR value of the Y channel is similar to "Y only", but those of the Cb, Cr channels are poorer than bicubic interpolation. The result suggests that the algorithm is biased to the Y channel. On the whole, our method trained on RGB channels achieves better performance than KK and the single-channel network ("Y only"). It is also worth noting that the improvement compared with the single-channel network is not that significant (i.e., 0.07 dB). This indicates that the Cb, Cr channels barely help in improving the performance.
5 CONCLUSION
We have presented a novel deep learning approach for single image SR. We show that conventional sparse-coding-based SR methods can be reformulated into a deep convolutional neural network. The proposed approach, SRCNN, learns an end-to-end mapping between low- and high-resolution images, with little extra pre/post-processing beyond the optimization. With a lightweight structure, the SRCNN has achieved superior performance to the state-of-the-art methods. We conjecture that additional performance can be further gained by exploring more filters and different training strategies. Besides, the proposed structure, with its advantages of simplicity and robustness, could be applied to other low-level vision problems, such as image deblurring or simultaneous SR+denoising. One could also investigate a network to cope with different upscaling factors.

REFERENCES
[1] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[2] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. A. Morel, "Low-complexity single-image super-resolution based on non-negative neighbor embedding," in Proc. Brit. Mach. Vis. Conf., 2012, pp. 1–10.
[3] H. C. Burger, C. J. Schuler, and S. Harmeling, "Image denoising: Can plain neural networks compete with BM3D?" in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2012, pp. 2392–2399.
[4] H. Chang, D. Y. Yeung, and Y. Xiong, "Super-resolution through neighbor embedding," presented at the IEEE Conf. Comput. Vis. Pattern Recog., Washington, DC, USA, 2004.
[5] Z. Cui, H. Chang, S. Shan, B. Zhong, and X. Chen, "Deep network cascade for image super-resolution," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 49–64.
[6] D. Dai, R. Timofte, and L. Van Gool, "Jointly optimized regressors for image super-resolution," Eurographics, vol. 7, p. 8, 2015.
[7] S. Dai, M. Han, W. Xu, Y. Wu, Y. Gong, and A. K. Katsaggelos, "Softcuts: A soft edge smoothness prior for color image super-resolution," IEEE Trans. Image Process., vol. 18, no. 11, pp. 969–981, May 2009.
[8] N. Damera-Venkata, T. D. Kite, W. S. Geisler, B. L. Evans, and A. C. Bovik, "Image quality assessment based on a degradation model," IEEE Trans. Image Process., vol. 9, no. 11, pp. 636–650, Apr. 2000.
[9] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 248–255.
[10] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1269–1277.
[11] C. Dong, C. C. Loy, K. He, and X. Tang, "Learning a deep convolutional network for image super-resolution," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 184–199.
[12] D. Eigen, D. Krishnan, and R. Fergus, "Restoring an image taken through a window covered with dirt or rain," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 633–640.
[13] G. Freedman and R. Fattal, "Image and video upscaling from local self-examples," ACM Trans. Graph., vol. 30, no. 11, p. 12, 2011.
[14] W. T. Freeman, T. R. Jones, and E. C. Pasztor, "Example-based super-resolution," Comput. Graph. Appl., vol. 22, no. 11, pp. 56–65, 2002.
[15] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, "Learning low-level vision," Int. J. Comput. Vis., vol. 40, no. 11, pp. 25–47, 2000.
[16] D. Glasner, S. Bagon, and M. Irani, "Super-resolution from a single image," in Proc. IEEE Int. Conf. Comput. Vis., 2009, pp. 349–356.
[17] K. He and J. Sun, "Convolutional neural networks at constrained time cost," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3791–3799.
[18] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 346–361.
[19] J. B. Huang, A. Singh, and N. Ahuja, "Single image super-resolution from transformed self-exemplars," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 5197–5206.
[20] M. Irani and S. Peleg, "Improving resolution by image registration," Graph. Models Image Process., vol. 53, no. 11, pp. 231–239, 1991.
[21] V. Jain and S. Seung, "Natural image denoising with convolutional networks," in Proc. Adv. Neural Inf. Process. Syst., 2008, pp. 769–776.
[22] K. Jia, X. Wang, and X. Tang, "Image transformation based on learning dictionaries across image spaces," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 367–380, Feb. 2013.
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 675–678.
[24] K. I. Kim and Y. Kwon, "Single-image super-resolution using sparse regression and natural image prior," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 6, pp. 1127–1133, Jun. 2010.
[25] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[26] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Comput., vol. 1, no. 4, pp. 541–551, 1989.
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[28] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Proc. Adv. Neural Inf. Process. Syst., 2006, pp. 801–808.
[29] C. Liu, H. Y. Shum, and W. T. Freeman, "Face hallucination: Theory and practice," Int. J. Comput. Vis., vol. 75, no. 11, pp. 115–134, 2007.
[30] F. Mamalet and C. Garcia, "Simplifying convNets for fast learning," in Proc. Int. Conf. Artif. Neural Netw., 2012, pp. 58–65.
[31] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. IEEE Int. Conf. Comput. Vis., 2001, vol. 2, pp. 416–423.
[32] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. Int. Conf. Mach. Learn., 2010, pp. 807–814.
[33] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C.-C. Loy, and X. Tang, "DeepID-Net: Deformable deep convolutional neural networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 2403–2412.
[34] W. Ouyang and X. Wang, "Joint deep learning for pedestrian detection," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 2056–2063.
[35] C. J. Schuler, H. C. Burger, S. Harmeling, and B. Scholkopf, "A machine learning approach for non-blind image deconvolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 1067–1074.
[36] S. Schulter, C. Leistner, and H. Bischof, "Fast and accurate image upscaling with super-resolution forests," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3791–3799.
[37] H. R. Sheikh, A. C. Bovik, and G. De Veciana, "An information fidelity criterion for image quality assessment using natural scene statistics," IEEE Trans. Image Process., vol. 14, no. 11, pp. 2117–2128, Dec. 2005.
[38] Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1988–1996.
[39] R. Timofte, V. De Smet, and L. Van Gool, "Anchored neighborhood regression for fast example-based super-resolution," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 1920–1927.
[40] R. Timofte, V. De Smet, and L. Van Gool, "A+: Adjusted anchored neighborhood regression for fast super-resolution," in Proc. IEEE Asian Conf. Comput. Vis., 2014, pp. 111–126.
[41] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 11, pp. 600–612, Apr. 2004.
[42] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Proc. IEEE Conf. Rec. 37th Asilomar Conf. Signals, Syst. Comput., 2003, vol. 2, pp. 1398–1402.
[43] C. Y. Yang, J. B. Huang, and M. H. Yang, "Exploiting self-similarities for single frame super-resolution," in Proc. IEEE Asian Conf. Comput. Vis., 2010, pp. 497–510.
[44] C. Y. Yang, C. Ma, and M. H. Yang, "Single-image super-resolution: A benchmark," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 372–386.
[45] J. Yang, Z. Lin, and S. Cohen, "Fast image super-resolution based on in-place example regression," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 1059–1066.
[46] J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. Huang, "Coupled dictionary training for image super-resolution," IEEE Trans. Image Process., vol. 21, no. 11, pp. 3467–3478, Aug. 2012.
[47] J. Yang, J. Wright, T. Huang, and Y. Ma, "Image super-resolution as sparse representation of raw image patches," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2008, pp. 1–8.
[48] J. Yang, J. Wright, T. S. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861–2873, Nov. 2010.
[49] R. Zeyde, M. Elad, and M. Protter, "On single image scale-up using sparse-representations," in Proc. 7th Int. Conf. Curves Surfaces, 2012, pp. 711–730.
[50] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, "Part-based R-CNNs for fine-grained category detection," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 834–849.

Chao Dong received the BS degree in information engineering from the Beijing Institute of Technology, China, in 2011. He is currently working toward the PhD degree in the Department of Information Engineering, Chinese University of Hong Kong. His research interests include image super-resolution and denoising.

Chen Change Loy received the PhD degree in computer science from the Queen Mary University of London in 2010. He is currently a research assistant professor in the Department of Information Engineering, Chinese University of Hong Kong. Previously, he was a postdoctoral researcher at Vision Semantics Ltd. His research interests include computer vision and pattern recognition, with focus on face analysis, deep learning, and visual surveillance. He is a member of the IEEE.

Kaiming He received the BS degree from Tsinghua University in 2007, and the PhD degree from the Chinese University of Hong Kong in 2011. He has been a researcher at Microsoft Research Asia (MSRA) since 2011. His research interests include computer vision and computer graphics. He won the Best Paper Award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2009. He is a member of the IEEE.

Xiaoou Tang (S'93-M'96-SM'02-F'09) received the BS degree from the University of Science and Technology of China, Hefei, in 1990, the MS degree from the University of Rochester, New York, in 1991, and the PhD degree from the Massachusetts Institute of Technology, Cambridge, in 1996. He is a professor in the Department of Information Engineering and an associate dean (Research) of the Faculty of Engineering, Chinese University of Hong Kong. He was the group manager of the Visual Computing Group, Microsoft Research Asia, from 2005 to 2008. His research interests include computer vision, pattern recognition, and video processing. He received the Best Paper Award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2009. He was a program chair of the IEEE International Conference on Computer Vision (ICCV) 2009, and he is an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and the International Journal of Computer Vision. He is a fellow of the IEEE.