MADNet: A Fast and Lightweight Network for Single-Image Super Resolution

Abstract—Recently, deep convolutional neural networks (CNNs) have been successfully applied to the single-image super-resolution (SISR) task with great improvement in terms of both peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). However, most of the existing CNN-based SR models require high computing power, which considerably limits their real-world applications. In addition, most CNN-based methods rarely explore the intermediate features that are helpful for final image recovery. To address these issues, in this article, we propose a dense lightweight network, called MADNet, for stronger multiscale feature expression and feature correlation learning. Specifically, a residual multiscale module with an attention mechanism (RMAM) is developed to enhance the informative multiscale feature representation ability. Furthermore, we present a dual residual-path block (DRPB) that utilizes the hierarchical features from the original low-resolution images. To take advantage of the multilevel features, dense connections are employed among blocks. The comparative results demonstrate the superior performance of our MADNet model while employing considerably fewer multiadds and parameters.

Index Terms—Channel attention, dense connections, image super resolution, lightweight, multiscale mechanism.

Manuscript received June 23, 2019; revised November 5, 2019; accepted January 17, 2020. Date of publication March 4, 2020; date of current version February 17, 2021. This work was supported in part by the National Natural Science Foundation of China under Grant 61702129, Grant 61772149, Grant U1701267, and Grant 61866009; in part by the National Key Research and Development Program of China under Grant 2018AAA0100305; in part by the China Postdoctoral Science Foundation under Grant 2018M633047; in part by the Guangxi Science and Technology Project under Grant 2019GXNSFAA245014, Grant AD18281079, Grant AD18216004, Grant 2017GXNFDA198025, and Grant AA18118039; and in part by the Innovation Project of GUET Graduate Education under Grant 2019YCXS048. This article was recommended by Associate Editor H. Lu. (Corresponding author: Zhenbing Liu.)

Rushi Lan is with the Guangxi Key Laboratory of Image and Graphic Intelligent Processing, Guilin University of Electronic Technology, Guilin 541004, China, and also with the School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China.

Long Sun, Zhenbing Liu, and Cheng Pang are with the Guangxi Key Laboratory of Images and Graphics Intelligent Processing, Guilin University of Electronic Technology, Guilin 541004, China (e-mail: [email protected]).

Huimin Lu is with the Department of Mechanical and Control Engineering, Kyushu Institute of Technology, Kitakyushu 8048550, Japan.

Xiaonan Luo is with the National Local Joint Engineering Research Center of Satellite Navigation and Location Service, Guilin University of Electronic Technology, Guilin 541004, China.

Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCYB.2020.2970104

I. INTRODUCTION

SINGLE-IMAGE super resolution (SISR) is an essential and classical problem in low-level computer vision that is concerned with reconstructing a visually high-resolution (HR) image from its low-resolution (LR) input. In practice, SISR is generally difficult to solve because of its ill-posed nature, wherein multiple HR images can map to the same LR version. Addressing SISR has proven to be useful in many practical cases, such as video streaming [44], [50]; remote sensing [16], [58]; and medical imaging [31], [45], [48].

To mitigate this problem, numerous SR approaches have been proposed from different perspectives, including interpolation-based [17], reconstruction-based [33], and example-based methods [23], [25], [40], [41], [49]. The former two kinds of methods are simple and efficient but suffer a dramatic drop in restoration performance as the scale factor increases, whereas the example-based methods, which analyze the relationships between LR and HR pairs, achieve satisfactory performance but involve time-consuming operations.

Recently, owing to the powerful feature representation capability of the deep convolutional neural network (CNN), CNN-based methods have been proposed to learn a nonlinear mapping from an interpolated or LR version to its corresponding high-quality output. By entirely utilizing the inherent relations among images in training datasets, these models have provided outstanding performance in SR tasks [5], [7], [18], [22], [27], [30], [56], [57]. Ranging from the SRCNN [5], which has only three convolution layers (Conv layers), to the recent RCAN [56], which has over 400 layers, these approaches clearly illustrate that as the model becomes deeper, the performance improves.

Although CNN-based models have achieved state-of-the-art performance, these methods face some limitations.
1) Most CNN-based frameworks gain improvement by substantially increasing the depth or width of the network; this means that they rely heavily on computation to produce the HR images, limiting their real-world applications.
2) Most existing CNN-based SR models seldom utilize multiscale representations for image super resolution and do not fully use the hierarchical features.

Consequently, it is important to design a lightweight architecture that is practical enough to solve the aforementioned problems. The general way to build a lightweight network is to reduce the number of model parameters and computational operations (multiadds). Based on this concept, we provide a feasible solution that combines the multiscale mechanism and dense connections. Specifically, an efficient feature extraction network (EFEN) is proposed for exploring feature maps, and an upsampling network (UN) is used for enlarging
Fig. 2. Architecture of our proposed model (MADNet), which contains two subnetworks: an EFEN and a UN. The former includes three DRPBs; the latter
is constructed by three sets of Conv layers and a pixel-shuffle layer.
In the Res2Net module, the input features are divided into several groups, and each of the parallel groups utilizes a smaller filter to extract features and is connected with the others via residual shortcuts.

Recently, Li et al. [26] introduced a multiscale residual network to exploit the image features and achieve a significant performance gain for image super resolution. However, they simply concatenate the information obtained with two different filter sizes while ignoring the granular-level multiscale features; as a result, their block cannot cover a large range of receptive fields and causes a considerable computational burden. Importantly, for image SR, features with more multiscale information are more accurate for reconstruction, while an SR model with fewer parameters is more feasible for real applications.

C. Attention Mechanism

Attention in human perception refers to how visual systems adaptively exploit a sequence of information items and selectively focus on salient areas [12]. Recently, several attempts have introduced attention processing to improve the performance of CNNs for various computer vision tasks [12], [43], [47], [56].

Hu et al. [12] employed an attention module to exploit the interchannel relationship. In their work, the squeeze-and-excitation (SE) module utilizes global average-pooled features to calculate channelwise attention and achieves considerable improvement for image classification. Woo et al. [47] further exploited this schema for both spatial and channelwise attention. In addition, Wang et al. [43] proposed a novel attention block for video classification in which nonlocal operations are used to capture spatial attention.

III. METHODOLOGY

In this section, we first present the network framework of MADNet in detail and then describe the multiscale module, which is the core of the proposed method. After that, the loss functions are illustrated, and discussions relating the proposed method to other related algorithms are provided at the end of this section.

A. Network Framework

As shown in Fig. 2, the proposed MADNet consists of two components: 1) an EFEN and 2) a UN.

The EFEN utilizes two successive Conv layers with kernel sizes of 3 × 3 and 1 × 1 for simply detecting low-level features from the input image. Then, to extract the global and local image features, the output is fed to the DRPBs, and all the results of the intermediary blocks are connected to the following block as dense connections. Let I_LR represent the original input image and I_SR be the output; then, this stage can be formulated as

F_{FEA} = H_{EFEN}(I_{LR}) = H_{DRPB}(H_{LL}(I_{LR}))    (1)

where H_EFEN(·) is the feature extraction function and can be divided into the shallow feature extraction step H_LL(·) and the representation learning step H_DRPB(·). F_FEA denotes the output feature map from the EFEN.

Finally, we concatenate all of the feature maps for further fusion. After fusing, these features are processed by two Conv layers and a pixel-shuffle layer to generate the HR image

I_{SR} = H_{UP}(F_{FEA}) = H_{GEN}(H_{CON}(F_{FEA}))    (2)

where H_UP(·) denotes the upsampling procedure and contains two stages: 1) H_CON(·) means feature concatenation and fusion and 2) H_GEN(·) represents the subsequent processing.

B. Efficient Feature Extraction Network

We now describe our EFEN (see Fig. 2) in detail. It is stacked with two Conv layers and three DRPBs, and a single DRPB contains a sequence of our proposed residual modules, that is, it operates with the multiscale module and the attention mechanism. The details regarding this structure are presented as follows.

DRPB: The DRPB contains M = 3 proposed multiscale modules. To utilize different-level features and enhance the representation capability of our model, we adopt a dense connection structure for the EFEN, that is, the dth DRPB relays intermediate features to all of the subsequent blocks. The mth multiscale module [see Fig. 3(c)] in the dth DRPB can be represented as

F_{d,m} = H_{d,m}(F_{d,m-1})    (3)
where F_{d-1} and F_d are the outputs of the (d − 1)th and dth DRPB, respectively. Such a connection schema allows more low-frequency information to be bypassed during training. In fact, to confirm the effectiveness of this combination form, we compare several types of residual blocks and elaborate on the details in Section IV.
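For illustration, a minimal PyTorch sketch of the overall EFEN + UN data flow described above is given below. The channel width, the placeholder block used when no DRPB implementation is supplied, and the exact layer counts inside the UN are assumptions of this sketch rather than the exact configuration of MADNet.

```python
import torch
import torch.nn as nn

class MADNetSkeleton(nn.Module):
    """Sketch of the pipeline: shallow 3x3/1x1 convs, three densely
    connected DRPBs, feature fusion, and a pixel-shuffle upsampler."""
    def __init__(self, channels=64, num_blocks=3, scale=3, drpb=None):
        super().__init__()
        # Shallow feature extraction H_LL: a 3x3 conv followed by a 1x1 conv.
        self.shallow = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 1),
        )
        # Each DRPB receives the concatenation of H_LL and all preceding DRPB outputs.
        blocks = []
        for d in range(num_blocks):
            in_ch = channels * (d + 1)
            blocks.append(drpb(in_ch, channels) if drpb else
                          nn.Sequential(nn.Conv2d(in_ch, channels, 3, padding=1), nn.PReLU()))
        self.blocks = nn.ModuleList(blocks)
        # UN: fuse the concatenated features, then map to scale^2 channels per
        # output plane and rearrange them into HR pixels with a pixel-shuffle layer.
        self.fuse = nn.Conv2d(channels * (num_blocks + 1), channels, 1)
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x_lr):
        feats = [self.shallow(x_lr)]              # H_LL(I_LR)
        for block in self.blocks:                 # dense connections among DRPBs
            feats.append(block(torch.cat(feats, dim=1)))
        return self.upsample(self.fuse(torch.cat(feats, dim=1)))
```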
RMAM: Multiscale representations are essential for various vision tasks [9], such as semantic segmentation, object detection, and image classification. The multiscale feature extraction ability of CNNs leads to effective representations. In addition, we focus on solving the efficiency limitation that is essentially present in real-world SR applications. To balance the performance and the computational budget, the channel split strategy is introduced in the residual layer. Meanwhile, the channel attention mechanism [12] is employed to learn discriminative representations. It was empirically found that our multiscale module is not only efficient but also accurate.

Multiscale Structure: Most previous CNN-based SR models do not consider multiscale representations. To exploit such information, MSRN [26] was introduced to detect features at different scales for accurate super-resolution reconstruction. However, the receptive fields within MSRN are limited, and its computational complexity is fairly high. Inspired by Inception [34] and RFB [28], we propose a multiscale module [see Fig. 4(d)] to learn the multiscale representation ability. First, we apply a 1 × 1 Conv to reduce the dimension of the input data for lessening the computational burden and then send the result to the following four parallel branches. Except for the leftmost branch (which includes a single 3 × 3 convolution layer), the other branches contain two normal convolutional layers (e.g., 1 × 1 and 3 × 3) and a depthwise convolution with a dilation rate r = 2, 3, and 5, respectively, denoted by MS(·). These smaller filters first obtain features from the processed input feature maps f_i and then use a large range of receptive fields to describe the information. Specifically, the output of the previous branch is connected to the next branch via an elementwise sum. This procedure is repeated several times until the outputs from all branches are processed. This procedure can be defined as

F_i = \begin{cases} \mathrm{Conv}_3(f_i), & i = 1 \\ \mathrm{MS}_i(f_i), & 1 < i \le 4 \end{cases}    (5)

where Conv_3(·) denotes the process of the left branch, and F_i is the output. Then

FU_i = \begin{cases} F_i, & i = 1 \\ F_1 + \cdots + F_{i-1}, & 1 < i \le 4 \end{cases}    (6)

where FU_i means the mixed features that potentially receive feature information from all preceding feature splits.

Fig. 4. Comparison of different multiscale modules. From top to bottom: (a) inception module (simplified form) [34], (b) RFB module [28], (c) MSRB module [26], and (d) RMAM module.

After extracting the feature maps, we fuse these features at different scale spaces. The feature maps from all branches are concatenated and sent to the SE block for exploring discriminative representations among channels. For better preserving the inherent information, the output features are then fused with the original input tensors in a residual-like manner. From our observation, this schema is useful for utilizing features at different scale spaces.
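One way to read (5) and (6) is as a cascade of four branches whose running partial sums feed the next branch before all branch outputs are concatenated, recalibrated by the SE block, and added back to the input. The sketch below illustrates this flow; the channel widths, the way the reduced features are routed to each branch, and the layer ordering inside a branch are assumptions made here for illustration rather than the exact published configuration.

```python
import torch
import torch.nn as nn

class MultiscaleBranch(nn.Module):
    """Branch i > 1 of the sketch: 1x1 and 3x3 convs followed by a
    depthwise convolution with a given dilation rate (MS_i in (5))."""
    def __init__(self, ch, dilation):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 1),
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation, groups=ch),
            nn.PReLU(),
        )

    def forward(self, x):
        return self.body(x)

class RMAMSketch(nn.Module):
    """Illustrative RMAM: 1x1 reduction, four parallel branches mixed as in
    (5)-(6), concatenation, SE attention, and a residual connection."""
    def __init__(self, channels=64, reduced=32, se_block=None):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, 1)
        self.branch1 = nn.Conv2d(reduced, reduced, 3, padding=1)       # Conv_3 in (5)
        self.branches = nn.ModuleList(
            [MultiscaleBranch(reduced, r) for r in (2, 3, 5)]          # MS_2 .. MS_4
        )
        self.fuse = nn.Conv2d(4 * reduced, channels, 1)
        self.se = se_block if se_block is not None else nn.Identity()  # SE block

    def forward(self, x):
        f = self.reduce(x)
        outputs = [self.branch1(f)]                   # F_1
        mixed = outputs[0]                            # FU_2 = F_1
        for branch in self.branches:                  # elementwise sums feed the next branch
            out = branch(mixed)
            outputs.append(out)
            mixed = mixed + out
        fused = self.fuse(torch.cat(outputs, dim=1))  # concatenate all branch outputs
        return x + self.se(fused)                     # residual-like fusion with the input
```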
Channel Attention Mechanism: The attention mechanism is popular in numerous vision tasks since it adaptively recalibrates the channelwise feature responses by explicitly modeling interdependencies between channels [12]. Recently, this strategy was introduced to further improve CNN-based SR performance [56].

Let V = [v_1, ..., v_n] denote the input data that contain n feature maps, and let the spatial shape of each feature map be H × W. Then, the statistic S_c of the cth feature map f_c is calculated as

S_c = H_{AVGP}(f_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_c(i, j)    (7)

where H_AVGP(·) means the global average pooling operation, and f_c(i, j) represents the corresponding value of f_c. The attention statistic of the feature f_c is

A_c = F(w_1 \, \delta(w_2 S_c))    (8)

where F(·) is the ReLU activation function, and δ(·) represents the sigmoid function, which can be treated as a gating mechanism. w_1 is the weight of a dimension-increasing layer (i.e., a 1 × 1 convolution layer for upscaling), and w_2 denotes the weight of a dimension-reduction layer (i.e., a 1 × 1 convolution layer for downscaling). The downscaling layer first reduces the number of input channels by a reduction factor r with w_2, activates the result with the activation function δ, and then upscales it back to the original channel space with w_1. The attention statistic A_c is then used to rescale the input feature map f_c:

\hat{f}_c = A_c \cdot f_c.    (9)
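The SE-style recalibration in (7)-(9) can be sketched as follows. The sketch follows the common squeeze-and-excitation ordering [12] (ReLU after the channel reduction and a sigmoid gate after the expansion); reading (8) literally would swap the two activations. The channel width and the reduction factor r = 16 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """SE-style channel attention: squeeze with global average pooling (7),
    excite with two 1x1 convolutions (8), and rescale the input (9)."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)               # S_c = H_AVGP(f_c)
        self.excite = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),   # w_2: reduce by factor r
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),   # w_1: restore the channel count
            nn.Sigmoid(),                                    # gating mechanism
        )

    def forward(self, x):
        a = self.excite(self.squeeze(x))   # A_c, one weight per channel
        return x * a                       # \hat{f}_c = A_c * f_c

# Usage with the RMAM sketch above (illustrative only):
# rmam = RMAMSketch(channels=64, se_block=SEAttention(64))
```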
Densely Connected Structure: Due to our DRPB and the multiscale module, the information can be perceived at very different scales. To go a step further in assimilating multilevel features, we densely connect each DRPB. The mth block DRPB_m (see Fig. 2) can be represented as

DRPB_m = \mathrm{Concate}(H_{LL}, DRPB_1, \ldots, DRPB_{m-1}).    (10)

Concatenating the preceding features as the input of DRPB_m, the output is also connected to the subsequent blocks employing the same process. Such a dense connection structure [13] allows more abundant low-frequency information to be bypassed during training.
C. Upsampling Network

As stated in Section II, our proposed model directly processes the original input images so that it can extract features efficiently. The final high-quality image I_SR is reconstructed in the UN, and all of the features from the EFEN are concatenated at the input layer of the UN; thus, the dimension of the input data is rather large. Therefore, we use a 1 × 1 convolution to reduce the input dimension before generating the HR pixels.

Then, the magnification layer reshapes the feature maps to a high-level space and outputs nine channels, where each channel represents a real-valued tensor of the upsampled pixels.
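A minimal sketch of this upsampling path is given below. With scale = 3 and a single output plane, the magnification conv emits the nine channels mentioned above; the channel widths and the single-plane output are assumptions of this sketch, not the exact UN configuration.

```python
import torch
import torch.nn as nn

class UpsamplingNetwork(nn.Module):
    """Sketch of the UN: 1x1 dimension reduction of the concatenated EFEN
    features, a refinement conv, and a sub-pixel (pixel-shuffle) layer [32]."""
    def __init__(self, in_channels=256, mid_channels=64, out_channels=1, scale=3):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, 1)   # shrink the concatenated input
        self.refine = nn.Conv2d(mid_channels, mid_channels, 3, padding=1)
        self.expand = nn.Conv2d(mid_channels, out_channels * scale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)                    # rearrange channels into HR pixels

    def forward(self, concatenated_features):
        x = self.refine(self.reduce(concatenated_features))
        return self.shuffle(self.expand(x))                      # (N, out_channels, sH, sW)
```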
D. Loss Function

We consider two types of loss functions that measure the difference between the HR output I_SR and its corresponding ground truth I_GT. The first one is the mean absolute error (MAE), also called the l_1-norm, which is formulated as follows:

L_1 = \| I_{SR} - I_{GT} \|_1.    (11)

Alternatively, the mean-square error (MSE) can be used; however, in previous work [27], it was experimentally found to be a poor choice for recovering clear images.

Given that MAE or MSE tends to lead to a smooth result, we additionally introduce a total variation (TV) regularizer [10], [29] to constrain the smoothness of I_SR:

L_{TV} = \| \nabla_h(I_{SR}) \|_2^2 + \| \nabla_v(I_{SR}) \|_2^2 = \sum_{i,j} \big[ (I_{SR}^{i,j+1} - I_{SR}^{i,j})^2 + (I_{SR}^{i+1,j} - I_{SR}^{i,j})^2 \big]    (12)

where ∇_h(·) and ∇_v(·) denote the gradient operators along the horizontal and vertical directions, respectively.

Thus, the second loss function is defined as follows:

L_F = L_1 + \lambda L_{TV}.    (13)

We train our model with these losses, empirically finding that the L_F loss can obtain better performance than the L_1 loss and that λ = 1e−5 works well. As shown in Fig. 7, the L_F loss enables our model to produce sharper SR results.
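The combined objective in (11)-(13) can be written compactly as in the short PyTorch sketch below; the per-pixel reductions (mean for the MAE term, sum for the TV term) and λ = 1e−5 are choices of this illustration, not the authors' released training code.

```python
import torch

def tv_regularizer(sr: torch.Tensor) -> torch.Tensor:
    """Squared-gradient TV term of (12) for an (N, C, H, W) batch."""
    grad_h = sr[:, :, :, 1:] - sr[:, :, :, :-1]   # horizontal differences I(i, j+1) - I(i, j)
    grad_v = sr[:, :, 1:, :] - sr[:, :, :-1, :]   # vertical differences   I(i+1, j) - I(i, j)
    return grad_h.pow(2).sum() + grad_v.pow(2).sum()

def lf_loss(sr: torch.Tensor, gt: torch.Tensor, lam: float = 1e-5) -> torch.Tensor:
    """L_F = L_1 + lambda * L_TV, combining (11) and (12) as in (13)."""
    l1 = torch.abs(sr - gt).mean()                # mean absolute error (MAE)
    return l1 + lam * tv_regularizer(sr)
```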
E. Relation to Other CNN Methods

Relation to Res2Net: The motivation for exploiting the multiscale potential is similar between the Res2Net [9] module and our RMAM. However, there are three main differences in our mechanism.
1) In general, Res2Net is used in high-level computer vision tasks (e.g., semantic segmentation and object recognition), and some inherent operations of this model are not suitable for image SR, such as batch normalization (BN) layers, which increase the computational complexity and hinder the reconstruction performance of the network. Thus, we remove these layers.
2) The procedure of extracting features is different. In Res2Net, the input features are evenly split into several groups, and each group is processed by a corresponding 3 × 3 convolution except for the first part, where the convolutional output is added to the preceding feature and then fed into the next. However, in our model, we stack three convolutional layers with different kernel sizes and dilation rates for effectively extracting information. All of the previous outputs are added to the following group for integrating multiscale features.
3) For learning the discriminative representation, the SE block [12] is embedded to recalibrate the channelwise features.

Relation to MSRN: We summarize the main differences between MSRN [26] and our MADNet. The first one is the design of the basic module. In MSRN, the multiscale residual block (MSRB) mainly combines parallel convolutions with multiscale feature fusion by residual learning [11], operating on all feature channels. Such an approach leads to heavy computation. However, our multiscale module is based on several convolutional branches and introduces a split-and-concatenation strategy to effectively process features and reduce the number of parameters. The second one is the activation function. MSRN uses the ReLU function, whereas we utilize the PReLU activation function. As the comparison in Fig. 5 shows, in the negative part, PReLU introduces a learnable parameter that can counterweigh the positive mean of the ReLU, making it slightly symmetric; moreover, previous experiments have proven that PReLU converges faster than ReLU and obtains better performance [55]. Thus, our proposed multiscale module possesses more powerful representational ability.

Relation to MemNet: MemNet [38] uses a dense block and various shortcuts. The differences in our method are listed as follows. First, Lim et al. trained the network with the L2 loss, while it was empirically found that training with the L1 loss
TABLE I
EFFECTS OF DIFFERENT RESIDUAL STRUCTURES MEASURED ON THE SET14 ×3 DATASET IN 200 EPOCHS

TABLE II
RESULTS OF AN ABLATION STUDY ON THE EFFECT OF THE SE BLOCK. THE EVALUATION IS ON THE SET5 AND B100 TEST SETS

Fig. 5. (a) ReLU versus (b) PReLU. PReLU introduces a learnable parameter that can counterweigh the positive mean of the ReLU, making it slightly symmetric.

Fig. 6. Effect of MADNet with different residual structures. The curves are based on the PSNR (dB) on DIV2K (val) with an upsampling factor of 3 in 200 epochs.

IV. EXPERIMENTAL RESULTS

In this section, we first briefly depict the experimental implementation as well as the training and testing datasets; the ablation studies follow this step. Finally, we compare our

A. Training Details

As shown in Fig. 2, the input and output data of our network are RGB images. During training, in each mini-batch, we randomly crop 16 color patches with a specific size (i.e., 96 × 96 for ×2, 144 × 144 for ×3, and 192 × 192 for ×4) from the LR
TABLE IV
QUANTITATIVE COMPARISONS OF THE STATE-OF-THE-ART SUPER-RESOLUTION MODELS ON PUBLIC BENCHMARKS. RED/BLUE TEXT MEANS THE BEST/SECOND-BEST PERFORMANCE

TABLE V
AVERAGE INFERENCE TIME (SECONDS) AND RECONSTRUCTION PERFORMANCE. THE RESULTS ARE EVALUATED ON THE SET14, B100, AND DIV2K DATASETS FOR ×4 SR
different scenes; the Urban100 set includes 100 urban building images in the real world. Both peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [46] results are calculated on the final SR images on the Y channel of the transformed YCbCr color space. The LR images are downscaled from the corresponding HR ones using bicubic downsampling.
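For reference, the evaluation protocol just described (PSNR computed on the Y channel of the YCbCr conversion of the SR and HR images) can be sketched as follows; the BT.601 conversion coefficients and the 8-bit peak value of 255 are assumptions of this illustration.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Luminance (Y) channel of an 8-bit RGB image using ITU-R BT.601 weights."""
    img = img.astype(np.float64)
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

def psnr_y(sr: np.ndarray, hr: np.ndarray) -> float:
    """PSNR between the Y channels of the SR result and the HR ground truth."""
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return float(10.0 * np.log10(255.0 ** 2 / mse))
```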
C. Ablation Study

To provide a better understanding of the proposed method, an ablation study is first conducted here from the following perspectives: the residual-path block, the SE block, and the loss function.

1) Study of the Residual-Path Block: Fig. 3 illustrates three different residual structures. We first conduct the ablation experiment on these structures, and the corresponding results are presented in Fig. 6 and Table I. In Table I, the baseline is a plain structure without any shortcuts, RPB1 utilizes residual learning between the first and last modules, RPB2 connects the first two modules via additive shortcuts, and the DRPB is as illustrated in the previous section.

It can be seen that the blocks with residual learning show better performance than the baseline because the residual path allows earlier features to pass into later layers. It can also be observed that the DRPB form exhibits better and more stable performance as the training epochs increase. This result mainly occurs because the dual residual path effectively promotes information propagation.

2) Study of the SE Block: To evaluate the performance of the SE block components in the RMAM, we remove the SE block such that the entire network does not take account of the attention mechanism. Observing the results shown in Table II, the attention schema brings absolute improvements, and the PSNR value improves by approximately 0.9 and 0.8 dB on Set5 and B100, respectively.

3) Study of the Loss Function: To examine the effect of the mentioned loss functions, we trained two versions of our network. Expressed formally, let the first model be "L1" (i.e., using the L1 loss for training) and the other be "LF" (i.e., using the enhanced LF loss for training). We tried different linear combinations of L1 and LF with different weights. Moreover, it was found that λ = 1e−5 achieves a tradeoff between PSNR and visual quality. Fig. 7 shows this perception that the LF loss leads to sharper images with more details. In addition, we test the performance on benchmarks. The corresponding results are illustrated in Table III. The LF loss achieves better results with regard to both PSNR and SSIM. For example, LF gains a PSNR improvement of 0.05 dB on the Set14 dataset with a scaling factor of 4.

D. Comparison With State-of-the-Art Methods

We compare the proposed method with benchmark SR models on two commonly used image quality metrics, namely, PSNR and SSIM. Note that we use the number of parameters and multiadds to measure the model size. The multiadds metric is defined as in [1], that is, the number of multiply-accumulate operations, and we assume the SR output size to be 1280 × 720 when calculating multiadds. The geometric self-ensembling strategy [27], [41] is used for further evaluation and marked with "+" in this article. Note that we reimplement IDN [15] with PyTorch; the official TensorFlow implementation is at https://github.com/Zheng222/IDN-tensorflow.

As shown in Fig. 1, we compare our model against various state-of-the-art algorithms in terms of multiadds on the Urban100 dataset with an upscaling factor of 3. Here, our MADNet method outperforms all state-of-the-art lightweight models that have less than 2M parameters. Specifically, MADNet has a model size similar to those of DRCN [19], MemNet [38], and SRMDNF [54], while achieving better performance than all of them.

The quantitative comparisons with several state-of-the-art methods are listed in Table IV. Our model outperforms the existing models by a large margin on different scaling factors, except for CARN [1]. It can be seen that although our method has considerably fewer parameters and multiadds, it achieves comparable or even better performance. Considering the GPU runtime, we mainly compare the proposed method with the latest CARN model and use the official codes to test the running time. As shown in Table V, our proposed model spends an average of 0.0455, 0.0162, and 0.1117 s to reconstruct an image on the Set14, B100, and DIV2K (100 validation pictures in total) datasets for scale factor 4, respectively, running roughly as fast as the CARN series.

Fig. 8 presents the visual comparisons on the B100 and Urban100 datasets for the ×4 scale. The figure shows that our method works better than the other comparative ones, and the reconstructed SR images are closer to the HR ones in detail.
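As an aside on the multiadds convention above, the cost of a plain convolution layer for a fixed 1280 × 720 output can be estimated with the short helper below. The counting rule (kernel area × input channels × output channels × output pixels) is the usual one and is only meant to illustrate the reporting convention, not to reproduce the exact figures in Table IV.

```python
def conv_multiadds(in_ch: int, out_ch: int, kernel: int,
                   out_h: int = 720, out_w: int = 1280) -> int:
    """Multiply-accumulate operations of one conv layer for a fixed output size."""
    return kernel * kernel * in_ch * out_ch * out_h * out_w

# Example: a single 3x3 conv with 64 input and 64 output channels evaluated
# on a 1280 x 720 output costs roughly 34 G multiadds.
print(conv_multiadds(64, 64, 3) / 1e9)
```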
Fig. 8. Visual qualitative comparisons with the bicubic degradation model for ×4 SR on benchmarks.
REFERENCES

[1] N. Ahn, B. Kang, and K.-A. Sohn, "Fast, accurate, and lightweight super-resolution with cascading residual network," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 256–272.
[2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, "Contour detection and hierarchical image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 898–916, May 2011.
[3] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. A. Morel, "Low-complexity single-image super-resolution based on nonnegative neighbor embedding," in Proc. Brit. Mach. Vis. Conf., 2012, pp. 1–10.
[4] L. Chen, J. Pan, and Q. Li, "Robust face image super-resolution via joint learning of subdivided contextual model," IEEE Trans. Image Process., vol. 28, no. 12, pp. 5897–5909, Dec. 2019.
[5] C. Dong, C. C. Loy, K. He, and X. Tang, "Learning a deep convolutional network for image super-resolution," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 184–199.
[6] C. Dong, C. C. Loy, K. He, and X. Tang, "Image super-resolution using deep convolutional networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295–307, Jan. 2016.
[7] C. Dong, C. C. Loy, and X. Tang, "Accelerating the super-resolution convolutional neural network," in Computer Vision—ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham, Switzerland: Springer Int., 2016, pp. 391–407.
[8] Y. Fan et al., "Balanced two-stage residual networks for image super-resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR) Workshops, Jul. 2017, pp. 1157–1164.
[9] S. Gao, M. Cheng, K. Zhao, X. Zhang, M. Yang, and P. H. S. Torr, "Res2Net: A new multi-scale backbone architecture," CoRR, vol. abs/1904.01169, pp. 1–10, Sep. 2019. [Online]. Available: http://arxiv.org/abs/1904.01169
[10] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang, "Toward convolutional blind denoising of real photographs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 1712–1722.
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[12] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7132–7141.
[13] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2261–2269.
[14] J.-B. Huang, A. Singh, and N. Ahuja, "Single image super-resolution from transformed self-exemplars," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 5197–5206.
[15] Z. Hui, X. Wang, and X. Gao, "Fast and accurate single image super-resolution via information distillation network," in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 723–731.
[16] K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, and J. Jiang, "Edge-enhanced GAN for remote sensing image super-resolution," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5799–5812, Aug. 2019.
[17] R. Keys, "Cubic convolution interpolation for digital image processing," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-29, no. 6, pp. 1153–1160, Dec. 1981.
[18] J. Kim, J. K. Lee, and K. M. Lee, "Accurate image super-resolution using very deep convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1646–1654.
[19] J. Kim, J. K. Lee, and K. M. Lee, "Deeply-recursive convolutional network for image super-resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1637–1645.
[20] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent., 2015.
[21] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, "Deep Laplacian pyramid networks for fast and accurate super-resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 624–632.
[22] R. Lan et al., "Cascading and enhanced residual networks for accurate single image super-resolution," IEEE Trans. Cybern., early access, doi: 10.1109/TCYB.2019.2952710.
[23] R. Lan, Y. Zhou, Z. Liu, and X. Luo, "Prior knowledge-based probabilistic collaborative representation for visual recognition," IEEE Trans. Cybern., early access, doi: 10.1109/TCYB.2018.2880290.
[24] C. Ledig et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 105–114.
[25] B. Li, R. Liu, J. Cao, J. Zhang, Y.-K. Lai, and X. Liu, "Online low-rank representation learning for joint multi-subspace recovery and clustering," IEEE Trans. Image Process., vol. 27, no. 1, pp. 335–348, Jan. 2018.
[26] J. Li, F. Fang, K. Mei, and G. Zhang, "Multi-scale residual network for image super-resolution," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 527–542.
[27] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, "Enhanced deep residual networks for single image super-resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR) Workshops, Jul. 2017, pp. 1132–1140.
[28] S. Liu, D. Huang, and Y. Wang, "Receptive field block net for accurate and fast object detection," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 404–419.
[29] A. Marquina and S. J. Osher, "Image super-resolution by TV-regularization and Bregman iteration," J. Sci. Comput., vol. 37, no. 3, pp. 367–382, 2008.
[30] J. Pan et al., "Learning dual convolutional neural networks for low-level vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 3070–3079.
[31] S. C. Park, M. K. Park, and M. G. Kang, "Super-resolution image reconstruction: A technical overview," IEEE Signal Process. Mag., vol. 20, no. 3, pp. 21–36, May 2003.
[32] W. Shi et al., "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1874–1883.
[33] J. Sun, Z. Xu, and H.-Y. Shum, "Image super-resolution using gradient profile prior," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[34] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Proc. ICLR Workshop, 2016, pp. 4278–4284. [Online]. Available: https://arxiv.org/abs/1602.07261
[35] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[36] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[37] Y. Tai, J. Yang, and X. Liu, "Image super-resolution via deep recursive residual network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2790–2798.
[38] Y. Tai, J. Yang, X. Liu, and C. Xu, "MemNet: A persistent memory network for image restoration," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4549–4557.
[39] R. Timofte et al., "NTIRE 2017 challenge on single image super-resolution: Methods and results," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR) Workshops, Jul. 2017, pp. 1110–1121.
[40] R. Timofte, V. De Smet, and L. Van Gool, "A+: Adjusted anchored neighborhood regression for fast super-resolution," in Computer Vision—ACCV 2014, D. Cremers, I. Reid, H. Saito, and M.-H. Yang, Eds. Cham, Switzerland: Springer Int., 2015, pp. 111–126.
[41] R. Timofte, R. Rothe, and L. Van Gool, "Seven ways to improve example-based single image super resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1865–1873.
[42] C. Wang, Z. Li, and J. Shi, "Lightweight image super-resolution with adaptive weighted learning network," 2019. [Online]. Available: arXiv:1904.02358.
[43] X. Wang, R. B. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7794–7803.
[44] Z. Wang et al., "Multi-memory convolutional neural network for video super-resolution," IEEE Trans. Image Process., vol. 28, no. 5, pp. 2530–2544, May 2019.
[45] Z. Wang, J. Chen, and S. C. H. Hoi, "Deep learning for image super-resolution: A survey," CoRR, vol. abs/1902.06068, pp. 1–24, Feb. 2019. [Online]. Available: http://arxiv.org/abs/1902.06068
[46] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[47] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 3–19.
[48] B. Wronski et al., "Handheld multi-frame super-resolution," ACM Trans. Graph., vol. 38, no. 4, pp. 1–18, Jul. 2019. [Online]. Available: http://doi.acm.org/10.1145/3306346.3323024
[49] J. Yang, J. Wright, T. S. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861–2873, Nov. 2010.
[50] P. Yi, Z. Wang, K. Jiang, Z. Shao, and J. Ma, "Multi-temporal ultra dense memory network for video super-resolution," IEEE Trans. Circuits Syst. Video Technol., early access, doi: 10.1109/TCSVT.2019.2925844.
[51] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," in Proc. Int. Conf. Learn. Represent., 2016.
[52] M. D. Zeiler, G. W. Taylor, and R. Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 2018–2025.
[53] R. Zeyde, M. Elad, and M. Protter, "On single image scale-up using sparse-representations," in Curves and Surfaces, J.-D. Boissonnat et al., Eds. Heidelberg, Germany: Springer, 2012, pp. 711–730.
[54] K. Zhang, W. Zuo, and L. Zhang, "Learning a single convolutional super-resolution network for multiple degradations," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 3262–3271.
[55] Y. Zhang, L. Sun, C. Yan, X. Ji, and Q. Dai, "Adaptive residual networks for high-quality image restoration," IEEE Trans. Image Process., vol. 27, no. 7, pp. 3150–3163, Jul. 2018.
[56] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, "Image super-resolution using very deep residual channel attention networks," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 294–310.
[57] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, "Residual dense network for image super-resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 2472–2481.
[58] L. Zhou, Z. Wang, Y. Luo, and Z. Xiong, "Separability and compactness network for image recognition and superresolution," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11, pp. 3275–3286, Nov. 2019.

Zhenbing Liu received the B.S. degree from Qufu Normal University, Qufu, China, and the M.S. and Ph.D. degrees from the Huazhong University of Science and Technology, Wuhan, China. He was a Visiting Scholar with the Department of Radiology, University of Pennsylvania, Philadelphia, PA, USA, in 2015. He is currently a Professor and a Doctoral Supervisor with the School of Computer and Information Security, Guilin University of Electronic Technology, Guilin, China. His main research interests include image processing, machine learning, and pattern recognition.

Huimin Lu received the M.S. degrees in electrical engineering from the Kyushu Institute of Technology, Kitakyushu, Japan, and Yangzhou University, Yangzhou, China, in 2011, and the Ph.D. degree in electrical engineering from the Kyushu Institute of Technology in 2014. From 2013 to 2016, he was a JSPS Research Fellow with the Kyushu Institute of Technology, where he is currently an Associate Professor. He is an Excellent Young Researcher with MEXT, Tokyo, Japan. His research interests include computer vision, robotics, artificial intelligence, and ocean observation.