
Exploring Sparsity in Image Super-Resolution for Efficient Inference

Longguang Wang¹, Xiaoyu Dong²,³, Yingqian Wang¹, Xinyi Ying¹, Zaiping Lin¹, Wei An¹, Yulan Guo¹*
¹National University of Defense Technology  ²The University of Tokyo  ³RIKEN AIP
{wanglongguang15,yulan.guo}@nudt.edu.cn

Abstract

Current CNN-based super-resolution (SR) methods process all locations equally, with computational resources being uniformly assigned in space. However, since the missing details in low-resolution (LR) images mainly exist in regions of edges and textures, fewer computational resources are required for flat regions. Therefore, existing CNN-based methods involve redundant computation in flat regions, which increases their computational cost and limits their applications on mobile devices. In this paper, we explore the sparsity in image SR to improve the inference efficiency of SR networks. Specifically, we develop a Sparse Mask SR (SMSR) network that learns sparse masks to prune redundant computation. Within our SMSR, spatial masks learn to identify "important" regions while channel masks learn to mark redundant channels in those "unimportant" regions. Consequently, redundant computation can be accurately localized and skipped while maintaining comparable performance. It is demonstrated that our SMSR achieves state-of-the-art performance with 41%/33%/27% of FLOPs being reduced for ×2/3/4 SR. Code is available at: https://github.com/LongguangWang/SMSR.

Figure 1. Trade-off between PSNR performance, number of parameters and FLOPs. Results are achieved on Set5 for ×2 SR.

1. Introduction

The goal of single image super-resolution (SR) is to recover a high-resolution (HR) image from a single low-resolution (LR) observation. Due to the powerful feature representation and model fitting capabilities of deep neural networks, CNN-based SR methods have achieved significant performance improvements over traditional ones. Recently, many efforts have been made towards real-world applications, including few-shot SR [38, 39], blind SR [12, 49, 42], and scale-arbitrary SR [15, 43]. With the popularity of intelligent edge devices (such as smartphones and VR glasses), performing SR on these devices is in high demand. Due to the limited resources of edge devices¹, efficient SR is crucial for applications on these devices.

¹For example, the computational performance of the Kirin 990 and the RTX 2080Ti is 0.9 and 13.4 tFLOPS, respectively.

Since the pioneering work of SRCNN [8], deeper networks have been extensively studied for image SR. In VDSR [19], the SR network is first deepened to 20 layers. Then, a very deep and wide architecture with over 60 layers is introduced in EDSR [29]. Later, Zhang et al. further increased the network depth to over 100 and 400 layers in RDN [51] and RCAN [50], respectively. Although a deep network usually improves SR performance, it also leads to high computational cost and limits applications on mobile devices. To address this problem, several efforts have been made to reduce model size through information distillation [17] and efficient feature reuse [2]. Nevertheless, these networks still involve redundant computation. Compared to an HR image, the missing details in its LR counterpart mainly exist in regions of edges and textures. Consequently, fewer computational resources are required in flat regions. However, these CNN-based SR methods process all locations equally, resulting in redundant computation within flat regions.

In this paper, we explore the sparsity in image SR to improve the inference efficiency of SR networks. We first study the intrinsic sparsity of the image SR task and then investigate the feature sparsity in existing SR networks. To fully exploit this sparsity for efficient inference, we propose a sparse mask SR (SMSR) network to dynamically skip redundant computation at a fine-grained level. Our SMSR learns spatial masks to identify "important" regions (e.g., edge and texture regions) and uses channel masks to mark redundant channels in those "unimportant" regions.

These two kinds of masks work jointly to accurately localize redundant computation. During network training, we soften these binary masks using the Gumbel softmax trick to make them differentiable. During inference, we use sparse convolution to skip redundant computation. It is demonstrated that our SMSR can effectively localize and prune redundant computation to achieve better efficiency while producing promising results (Fig. 1).

Our main contributions can be summarized as: 1) We develop an SMSR network to dynamically skip redundant computation for efficient image SR. In contrast to existing works that focus on lightweight network designs, we explore a different route by pruning redundant computation to improve inference efficiency. 2) We propose to localize redundant computation by learning spatial and channel masks. These two kinds of masks work jointly for fine-grained localization of redundant computation. 3) Experimental results show that our SMSR achieves state-of-the-art performance with better inference efficiency. For example, our SMSR outperforms previous methods on Set14 for ×2 SR with a significant speedup on mobile devices (Table 2).

2. Related Work

In this section, we first review several major works on CNN-based single image SR. Then, we discuss the CNN acceleration techniques related to our work, including adaptive inference and network pruning.

Single Image SR. CNN-based methods have dominated the research of single image SR due to their strong representation and fitting capabilities. Dong et al. [8] first introduced a three-layer network to learn an LR-to-HR mapping for single image SR. Then, a deep network with 20 layers was proposed in VDSR [19]. Recently, deeper networks have been extensively studied for image SR. Lim et al. [29] proposed a very deep and wide network (namely, EDSR) by cascading modified residual blocks. Zhang et al. [51] further combined residual learning and dense connections to build RDN with over 100 layers. Although these networks achieve state-of-the-art performance, their high computational cost and memory footprint limit their applications on mobile devices.

To address this problem, several lightweight networks were developed [22, 17, 2]. Specifically, distillation blocks were proposed for feature learning in IDN [17], while a cascading mechanism was introduced to encourage efficient feature reuse in CARN [2]. Different from these manually designed networks, Chu et al. [6] developed a compact architecture using neural architecture search (NAS). Recently, Lee et al. [24] introduced a distillation framework to leverage knowledge learned by powerful teacher SR networks to boost the performance of lightweight student SR networks. Although these lightweight SR networks successfully reduce the model size, redundant computation is still involved and hinders them from achieving better computational efficiency. In contrast to many existing works that focus on compact architecture designs, few efforts have been made to exploit the redundancy in SR networks for efficient inference.

Adaptive Inference. Adaptive inference techniques [44, 37, 36, 11, 26] have attracted increasing interest since they can adapt the network structure according to the input. One active branch of adaptive inference techniques is to dynamically select an inference path at the level of layers. Specifically, Wu et al. [45] proposed a BlockDrop approach for ResNets to dynamically drop several residual blocks for efficiency. Mullapudi et al. [36] proposed HydraNet with multiple branches and used a gating approach to dynamically choose a subset of them at test time. Another popular branch is early-stopping techniques that skip the computation at a location whenever it is deemed to be unnecessary [46]. On top of ResNets, Figurnov et al. [9] proposed a spatially adaptive computation time (SACT) mechanism to stop computation for a spatial position when the features become "good enough". Liu et al. [31] introduced adaptive inference for SR by producing a map of local network depth to adapt the number of convolutional layers applied at different locations. However, these adaptive inference methods only focus on spatial redundancy without considering redundancy in the channel dimension.

Network Pruning. Network pruning [13, 32, 33] is widely used to remove a set of redundant parameters for network acceleration. As a popular branch of network pruning methods, structured pruning approaches are usually used to prune the network at the level of channels or even layers [25, 32, 33, 14]. Specifically, Li et al. [25] used the L1 norm to measure the importance of different filters and then pruned the less important ones. Liu et al. [32] imposed a sparsity constraint on the scaling factors of batch normalization layers and identified channels with lower scaling factors as less informative. Different from these static structured pruning methods, Lin et al. [30] conducted runtime neural network pruning according to the input image. Recently, Gao et al. [10] introduced a feature boosting and suppression method to dynamically prune unimportant channels at inference time. Nevertheless, these network pruning methods treat all spatial locations equally without taking their different importance into consideration.

3. Sparsity in Image Super-Resolution

In this section, we first illustrate the intrinsic sparsity of the single image SR task and then investigate the feature sparsity in state-of-the-art SR networks.

Given an HR image I^HR and its LR version I^LR (e.g., ×4 downsampled), we super-resolve I^LR using Bicubic and RCAN to obtain I^SR_Bicubic and I^SR_RCAN, respectively. Figure 2 shows the absolute difference between I^SR_Bicubic, I^SR_RCAN and I^HR in the luminance channel. It can be observed from Fig. 2(b) that I^SR_Bicubic is "good enough" for flat regions, with noticeable missing details in only a small proportion of regions (∼17% of pixels with |I^HR − I^SR_Bicubic| > 0.1).

Figure 2. Absolute difference between I^SR_Bicubic, I^SR_RCAN and I^HR in the luminance channel. (a) I^HR; (b) |I^HR − I^SR_Bicubic|; (c) |I^HR − I^SR_RCAN|; (d) |I^SR_RCAN − I^SR_Bicubic|.

Figure 3. Visualization of feature maps after the ReLU layer in the first backbone block of RCAN. Note that sparsity is defined as the ratio of zeros in the corresponding channels.
That is, the SR task is intrinsically sparse in the spatial domain. Compared to Bicubic, RCAN performs better in edge regions while achieving comparable performance in flat regions (Fig. 2(c)). Although RCAN focuses on recovering high-frequency details in edge regions (Fig. 2(d)), those flat regions are equally processed at the same time. Consequently, redundant computation is involved.

Figure 3 illustrates the feature maps after the ReLU layer in a backbone block of RCAN. It can be observed that the spatial sparsity varies significantly across channels. Moreover, a considerable number of channels are quite sparse (sparsity ≥ 0.8), with only edge and texture regions being activated. That is, computation in those flat regions is redundant since these regions are not activated after the ReLU layer. In summary, RCAN activates only a few channels for "unimportant" regions (e.g., flat regions) and more channels for "important" regions (e.g., edge regions). More results achieved with different SR networks and backbone blocks are provided in the supplemental material.

Motivated by these observations, we learn sparse masks to localize and skip redundant computation for efficient inference. Specifically, our spatial masks dynamically identify "important" regions while the channel masks mark redundant channels in those "unimportant" regions. Compared to network pruning methods [10, 30, 14], we take region redundancy into consideration and only prune channels for "unimportant" regions. Different from adaptive inference networks [37, 27], we further investigate the redundancy in the channel dimension to localize redundant computation at a finer-grained level.

4. Our SMSR Network

Our SMSR network uses sparse mask modules (SMMs) to prune redundant computation for efficient image SR. Within each SMM, spatial and channel masks are first generated to localize redundant computation, as shown in Fig. 4. Then, the redundant computation is dynamically skipped using L densely-connected sparse mask convolutions. Since only necessary computation is performed, our SMSR can achieve better efficiency while maintaining comparable performance.

4.1. Sparse Mask Generation

1) Training Phase

Spatial Mask. The goal of the spatial mask is to identify "important" regions in feature maps (i.e., 0 for "unimportant" regions and 1 for "important" ones). To make the binary spatial mask learnable, we use the Gumbel softmax distribution to approximate the one-hot distribution [18]. Specifically, an input feature F ∈ R^(C×H×W) is first fed to an hourglass block to produce F^spa ∈ R^(2×H×W), as shown in Fig. 5(a). Then, the Gumbel softmax trick is used to obtain a softened spatial mask M^spa_k ∈ R^(H×W):

    M_k^{spa}[x,y] = \frac{\exp\left(\left(F^{spa}[1,x,y] + G_k^{spa}[1,x,y]\right)/\tau\right)}{\sum_{i=1}^{2}\exp\left(\left(F^{spa}[i,x,y] + G_k^{spa}[i,x,y]\right)/\tau\right)},    (1)

where x, y are vertical and horizontal indices, G^spa_k ∈ R^(2×H×W) is a Gumbel noise tensor with all elements following the Gumbel(0, 1) distribution, and τ is a temperature parameter. When τ → ∞, samples from the Gumbel softmax distribution become uniform, i.e., all elements in M^spa_k are 0.5. When τ → 0, samples from the Gumbel softmax distribution become one-hot, i.e., M^spa_k becomes binary. In practice, we start at a high temperature and anneal to a small one to obtain binary spatial masks.
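A minimal PyTorch-style sketch of Eq. (1) is given below. The exact layers of the hourglass block are not specified at this point in the paper, so the block shown here (and the choice of which of the two logits is treated as the "important" class) is an illustrative assumption rather than the authors' implementation.

```python
# Sketch of the softened spatial mask of Eq. (1) with an assumed hourglass block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialMask(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # Illustrative hourglass: downsample, process, upsample, predict 2 logits per pixel.
        self.hourglass = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(channels, 2, 3, padding=1),
        )

    def forward(self, feat: torch.Tensor, tau: float) -> torch.Tensor:
        logits = self.hourglass(feat)                      # F^spa: (N, 2, H, W)
        gumbel = -torch.log(-torch.log(                    # G^spa ~ Gumbel(0, 1)
            torch.rand_like(logits).clamp_min(1e-20)).clamp_min(1e-20))
        soft = F.softmax((logits + gumbel) / tau, dim=1)   # Eq. (1): softmax over the 2 classes
        return soft[:, 1:2]                                # assumed: second logit = "important"


mask_net = SpatialMask()
feat = torch.randn(1, 64, 48, 48)
m_spa = mask_net(feat, tau=1.0)   # values in (0, 1); approaches {0, 1} as tau -> 0
print(m_spa.shape)                # torch.Size([1, 1, 48, 48])
```

As τ is annealed towards a small value, the softened mask approaches the binary mask used at inference time.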

Figure 4. An overview of our SMSR network.

Figure 5. An illustration of sparse mask generation and sparse mask convolution.

Channel Mask. In addition to spatial masks, channel masks are used to mark redundant channels in those "unimportant" regions (i.e., 0 for redundant channels and 1 for preserved ones). Here, we also use the Gumbel softmax trick to produce binary channel masks. For the l-th convolutional layer in the k-th SMM, we feed an auxiliary parameter S_{k,l} ∈ R^(2×C) to a Gumbel softmax layer to generate softened channel masks M^ch_{k,l} ∈ R^C:

    M_{k,l}^{ch}[c] = \frac{\exp\left(\left(S_{k,l}[1,c] + G_{k,l}^{ch}[1,c]\right)/\tau\right)}{\sum_{i=1}^{2}\exp\left(\left(S_{k,l}[i,c] + G_{k,l}^{ch}[i,c]\right)/\tau\right)},    (2)

where c is the channel index and G^ch_{k,l} ∈ R^(2×C) is a Gumbel noise tensor. In our experiments, S_{k,l} is initialized with random values drawn from a Gaussian distribution N(0, 1).

Sparsity Regularization. Based on the spatial and channel masks, we define a sparsity term η_{k,l}:

    \eta_{k,l} = \frac{1}{C \times H \times W} \sum_{c,x,y} \left( \left(1 - M_{k,l}^{ch}[c]\right) \times M_k^{spa}[x,y] + M_{k,l}^{ch}[c] \times I[x,y] \right),    (3)

where I ∈ R^(H×W) is a tensor of all ones. Note that η_{k,l} represents the ratio of activated locations in the output feature maps. To encourage the output features to be sparser, with fewer locations being activated, we further introduce a sparsity regularization loss:

    L_{reg} = \frac{1}{K \times L} \sum_{k,l} \eta_{k,l},    (4)

where K is the number of SMMs and L is the number of sparse mask convolutional layers within each SMM.

Training Strategy. During the training phase, the temperature parameter τ in the Gumbel softmax layers is annealed using the schedule τ = max(0.4, 1 − t/T_temp), where t is the number of epochs and T_temp is empirically set to 500 in our experiments. As τ gradually decreases, the Gumbel softmax distribution is forced to approach a one-hot distribution to produce binary spatial and channel masks.
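The channel mask of Eq. (2) and the regularization of Eqs. (3)-(4) can be sketched in the same spirit. The snippet below is an illustration under the stated definitions, with an assumed convention for which of the two logits means "preserved"; it is not the released implementation.

```python
# Sketch of Eq. (2) (channel mask), Eq. (3) (sparsity term), Eq. (4) (regularization loss)
# and the temperature annealing schedule described in the training strategy.
import torch
import torch.nn.functional as F

def channel_mask(s_kl: torch.Tensor, tau: float) -> torch.Tensor:
    """Eq. (2): s_kl is the auxiliary parameter S_{k,l} of shape (2, C); returns M^ch of shape (C,)."""
    gumbel = -torch.log(-torch.log(torch.rand_like(s_kl).clamp_min(1e-20)).clamp_min(1e-20))
    soft = F.softmax((s_kl + gumbel) / tau, dim=0)   # softmax over the two classes
    return soft[1]                                   # assumed convention: second logit = "preserved"

def sparsity_term(m_ch: torch.Tensor, m_spa: torch.Tensor) -> torch.Tensor:
    """Eq. (3): ratio of activated locations given M^ch of shape (C,) and M^spa of shape (H, W)."""
    c = m_ch.numel()
    h, w = m_spa.shape
    activated = (1 - m_ch)[:, None, None] * m_spa[None] + m_ch[:, None, None]  # broadcast to (C, H, W)
    return activated.sum() / (c * h * w)

def sparsity_loss(etas) -> torch.Tensor:
    """Eq. (4): average of eta_{k,l} over all K x L sparse mask convolutions."""
    return torch.stack(list(etas)).mean()

def temperature(epoch: int, t_temp: int = 500) -> float:
    """Annealing schedule tau = max(0.4, 1 - t / T_temp)."""
    return max(0.4, 1.0 - epoch / t_temp)

# Toy usage with C = 64 channels and a 48 x 48 feature map:
s = torch.randn(2, 64)                     # S_{k,l}, initialized from N(0, 1)
m_ch = channel_mask(s, tau=temperature(0))
m_spa = torch.rand(48, 48)                 # stand-in for a softened spatial mask
loss_reg = sparsity_loss([sparsity_term(m_ch, m_spa)])
print(float(loss_reg))
```

After training, replacing the Gumbel softmax with an argmax over the two logits yields the binary masks used at inference, as described next.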

Table 1. Comparative results achieved on Set14 by our SMSR with different settings for ×2 SR.

| Model    | Spatial Mask | Channel Mask | Conv    | #Params. | Sparsity | FLOPs | PSNR  | SSIM   |
|----------|--------------|--------------|---------|----------|----------|-------|-------|--------|
| 1        | ✗            | ✗            | Vanilla | 926K     | 0        | 1.00× | 33.65 | 0.9180 |
| 2        | ✗            | ✓            | Vanilla | 587K     | 0.46     | 0.60× | 33.53 | 0.9169 |
| 3        | ✓            | ✗            | Sparse  | 985K     | 0.42     | 0.65× | 33.60 | 0.9176 |
| 4 (Ours) | ✓            | ✓            | Sparse  | 985K     | 0.46     | 0.61× | 33.64 | 0.9179 |

2) Inference Phase

During training, the Gumbel softmax distributions are forced to approach one-hot distributions as τ decreases. Therefore, we replace the Gumbel softmax layers with argmax layers after training to obtain binary spatial and channel masks, as shown in Fig. 5(c).

4.2. Sparse Mask Convolution

1) Training Phase

To enable backpropagation of gradients at all locations, we do not explicitly perform sparse convolution during training. Instead, we multiply the results of a vanilla "dense" convolution with the predicted spatial and channel masks, as shown in Fig. 5(b). Specifically, the input feature F is first multiplied with M^ch_{k,l−1} and (1 − M^ch_{k,l−1}) to obtain F^D and F^S, respectively. That is, channels with "dense" and "sparse" feature maps in F are separated. Next, F^D and F^S are passed to two convolutions with shared weights. The resulting features are then multiplied with different combinations of (1 − M^ch_{k,l}), M^ch_{k,l} and M^spa_k to activate different parts of the features. Finally, all these features are summed up to generate the output feature F^out. Thanks to the Gumbel softmax trick used in mask generation, gradients at all locations can be preserved to optimize the kernel weights of the convolutional layers.

2) Inference Phase

During the inference phase, sparse convolution is performed based on the predicted spatial and channel masks, as shown in Fig. 5(d). Taking the l-th layer in the k-th SMM as an example, its kernel is first split into four sub-kernels according to M^ch_{k,l−1} and M^ch_{k,l} to obtain four convolutions. Meanwhile, the input feature F is split into F^D and F^S based on M^ch_{k,l−1}. Then, F^D is fed to convolutions ➀ and ➁ to produce F^D2D and F^D2S, while F^S is fed to convolutions ➂ and ➃ to produce F^S2D and F^S2S. Note that F^D2D is produced by a vanilla "dense" convolution, while F^D2S, F^S2D and F^S2S are generated by sparse convolutions with only "important" regions (marked by M^spa_k) being computed. Finally, the features obtained from these four branches are summed and concatenated to produce the output feature F^out. Using sparse mask convolution, computation for redundant channels within those "unimportant" regions can be skipped for efficient inference.

4.3. Discussion

Different from many recent works that use lightweight network designs [17, 2, 6] or knowledge distillation [24] for efficient SR, we speed up SR networks by pruning redundant computation. Previous adaptive inference and network pruning methods focus on redundant computation in the spatial and channel dimensions independently. Directly applying these approaches cannot fully exploit the redundancy in SR networks and suffers a notable performance drop, as demonstrated in Sec. 5.2. In contrast, our SMSR provides a unified framework that considers redundancy in both the spatial and channel dimensions. It is demonstrated that our spatial and channel masks are well compatible with each other and facilitate our SMSR to obtain fine-grained localization of redundant computation.

5. Experiments

5.1. Implementation Details

We used 800 training images and 100 validation images from the DIV2K dataset [1] as the training and validation sets. For evaluation, we used five benchmark datasets including Set5 [4], Set14 [48], B100 [34], Urban100 [16], and Manga109 [35]. Peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) were used as evaluation metrics to measure SR performance. Following the evaluation protocol in [50, 51], we cropped borders and calculated the metrics in the luminance channel.

During training, 16 LR patches of size 96 × 96 and their corresponding HR patches were randomly cropped. Data augmentation was then performed through random rotation and flipping. We set C = 64, L = 4 and K = 5 for our SMSR. We used the Adam method [21] with β1 = 0.9 and β2 = 0.999 for optimization. The initial learning rate was set to 2 × 10^−4 and halved after every 200 epochs. The training was stopped after 1000 epochs. The overall loss for training is defined as L = L_SR + λL_reg, where L_SR is the L1 loss between SR results and HR images and L_reg is defined in Eq. 4. To maintain training stability, we used a warm-up strategy λ = λ0 × min(t/T_warm, 1), where t is the number of epochs, T_warm is empirically set to 50, and λ0 is set to 0.1.

5.2. Model Analysis

We first conduct experiments to demonstrate the effectiveness of sparse masks. Then, we investigate the effect of sparsity and visualize the sparse masks for discussion. Finally, we compare our learning-based masks with heuristic ones.

Table 2. Comparative results achieved on Set14 by our SMSR with different sparsities for ×2 SR.

| Model       | Conv    | λ0  | Sparsity | #Params. | FLOPs | Memory | Time (GPU) | Time (CPU) | Time (Kirin 990) | Time (Kirin 810) | PSNR  | SSIM   |
|-------------|---------|-----|----------|----------|-------|--------|------------|------------|------------------|------------------|-------|--------|
| baseline    | Vanilla | 0   | 0        | 926K     | 1.00× | 1.00×  | 1.00×      | 1.00×      | 1.00×            | 1.00×            | 33.65 | 0.9180 |
| 5           | Sparse  | 0.1 | 0.46     | 985K     | 0.61× | 0.89×  | 1.22×      | 0.79×      | 0.64×            | 0.57×            | 33.64 | 0.9179 |
| 6           | Sparse  | 0.2 | 0.64     | 985K     | 0.46× | 0.87×  | 1.11×      | 0.73×      | 0.55×            | 0.50×            | 33.61 | 0.9174 |
| 7           | Sparse  | 0.3 | 0.73     | 985K     | 0.38× | 0.85×  | 1.04×      | 0.68×      | 0.54×            | 0.45×            | 33.52 | 0.9169 |
| IDN [17]    | -       | -   | -        | 553K     | 0.57× | 0.91×  | 1.04×      | 0.73×      | 0.71×            | 0.60×            | 33.30 | 0.9148 |
| CARN [2]    | -       | -   | -        | 1592K    | 0.99× | 1.01×  | 1.00×      | 0.89×      | 0.96×            | 1.15×            | 33.52 | 0.9166 |
| FALSR-A [6] | -       | -   | -        | 1021K    | 1.04× | 2.02×  | 1.11×      | 1.05×      | 1.02×            | 0.92×            | 33.55 | 0.9168 |
Figure 6. Visualization of sparse masks. Blue and green regions in M^ch represent channels with "dense" and "sparse" feature maps, respectively. In M^spa, "important" locations are shown in yellow.

Figure 7. Comparison of sparsities achieved in different SMMs on butterfly for different scale factors.

Figure 8. Comparison between learning-based masks (red circles) and gradient-based masks (yellow and green circles) on Set14.

Effectiveness of Sparse Masks. To demonstrate the effectiveness of our sparse masks, we first introduced variant 1 by removing both spatial and channel masks. Then, we developed variants 2 and 3 by adding only channel masks and only spatial masks, respectively. Comparative results are shown in Table 1. Without spatial and channel masks, all locations and all channels are processed equally. Therefore, variant 1 has a high computational cost. Using channel masks, redundant channels are pruned at all spatial locations. Therefore, variant 2 can be considered a pruned version of variant 1. Although variant 2 has fewer parameters and FLOPs, it suffers a notable performance drop (33.53 vs. 33.65) since beneficial information in the "important" regions of these pruned channels is discarded. With only spatial masks, variant 3 suffers from a conflict between efficiency and performance since redundant computation in the channel dimension cannot be well handled. Consequently, its FLOPs are reduced at the cost of a performance drop (33.60 vs. 33.65). Using both spatial and channel masks, our SMSR can effectively localize and skip redundant computation at a finer-grained level to reduce FLOPs by 39% while maintaining comparable performance (33.64 vs. 33.65).

Effect of Sparsity. To investigate the effect of sparsity, we retrained our SMSR with larger values of λ0 to encourage higher sparsity. An Nvidia RTX 2080Ti, an Intel i9-9900K, and Kirin 990/810 chips were used as the GPU, CPU, and mobile processor platforms for evaluation. For a fair comparison of memory consumption and inference time, all convolutional layers in the backbone of the different networks were implemented using im2col-based [5] convolutions, since different implementation methods (e.g., Winograd [23] and FFT [41]) have different computational costs. Comparative results are presented in Table 2.

As λ0 increases, our SMSR produces higher sparsities with more FLOPs and memory consumption being reduced. Further, our network also achieves a significant speedup on CPU and mobile processors. Due to its irregular and fragmented memory access patterns, sparse convolution cannot make full use of the characteristics of general GPUs (e.g., memory coalescing) and relies on specialized designs to improve memory locality and cache hit rate for acceleration [47]. Therefore, the advantage of our SMSR cannot be fully exploited on GPUs without specific optimization. Compared to other state-of-the-art methods, our SMSR (variant 5) obtains better performance with lower memory consumption and shorter inference time on mobile processors. This clearly demonstrates the great potential of our SMSR for applications on mobile devices.

Visualization of Sparse Masks. We visualize the sparse masks generated in the first SMM for ×2 SR in Fig. 6. More results are provided in the supplemental material. It can be seen that locations around edges and textures in M^spa are considered "important" ones, which is consistent with our observations in Sec. 3. Moreover, we can see that there are more sparse channels (i.e., green regions in M^ch) in deep layers than in shallow layers. This means that a subset of channels in shallow layers is informative enough for "unimportant" regions and that our network progressively focuses more on "important" regions as the depth increases. Overall, our spatial and channel masks work jointly for fine-grained localization of redundant computation.
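The channel sparsity referred to throughout this analysis, and used to annotate Fig. 3, is simply the ratio of zero entries per channel after a ReLU. A tiny sketch of that measurement (illustrative, with a random tensor standing in for a real feature map):

```python
# Per-channel sparsity of a post-ReLU feature map: ratio of zero entries in each channel.
import torch

def channel_sparsity(feat: torch.Tensor) -> torch.Tensor:
    """feat: post-ReLU feature map of shape (C, H, W); returns per-channel ratio of zeros."""
    return (feat == 0).float().mean(dim=(1, 2))

feat = torch.relu(torch.randn(64, 48, 48))   # stand-in for a backbone feature map
s = channel_sparsity(feat)
print(f"mean sparsity: {s.mean().item():.2f}, channels with sparsity >= 0.8: {(s >= 0.8).sum().item()}")
```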

Table 4. Comparative results achieved for ×2/3/4 SR. PSNR/SSIM results of previous works are directly copied from the corresponding papers. FLOPs is computed based on HR images with a resolution of 720p (1280 × 720). For SMSR, average sparsities on all datasets (0.49/0.39/0.33 for ×2/3/4 SR) are used to calculate FLOPs, with full FLOPs being shown in brackets. Best and second best results are highlighted and underlined in the original paper.

| Model        | Scale | #Params | FLOPs           | Set5         | Set14        | B100         | Urban100     | Manga109     |
|--------------|-------|---------|-----------------|--------------|--------------|--------------|--------------|--------------|
| Bicubic      | ×2    | -       | -               | 33.66/0.9299 | 30.24/0.8688 | 29.56/0.8431 | 26.88/0.8403 | 30.80/0.9339 |
| SRCNN [8]    | ×2    | 57K     | 52.7G           | 36.66/0.9542 | 32.45/0.9067 | 31.36/0.8879 | 29.50/0.8946 | 35.60/0.9663 |
| VDSR [19]    | ×2    | 665K    | 612.6G          | 37.53/0.9590 | 33.05/0.9130 | 31.90/0.8960 | 30.77/0.9140 | 37.22/0.9750 |
| DRCN [20]    | ×2    | 1774K   | 9788.7G         | 37.63/0.9588 | 33.04/0.9118 | 31.85/0.8942 | 30.75/0.9133 | 37.55/0.9732 |
| LapSRN [22]  | ×2    | 813K    | 29.9G           | 37.52/0.9591 | 33.08/0.9130 | 31.08/0.8950 | 30.41/0.9101 | 37.27/0.9740 |
| MemNet [40]  | ×2    | 677K    | 623.9G          | 37.78/0.9597 | 33.28/0.9142 | 32.08/0.8978 | 31.31/0.9195 | 37.72/0.9740 |
| SRFBN-S [28] | ×2    | 282K    | 574.4G          | 37.78/0.9597 | 33.35/0.9156 | 32.00/0.8970 | 31.41/0.9207 | 38.06/0.9757 |
| IDN [17]     | ×2    | 553K    | 127.7G          | 37.83/0.9600 | 33.30/0.9148 | 32.08/0.8985 | 31.27/0.9196 | 38.01/0.9749 |
| CARN [2]     | ×2    | 1592K   | 222.8G          | 37.76/0.9590 | 33.52/0.9166 | 32.09/0.8978 | 31.92/0.9256 | 38.36/0.9765 |
| FALSR-A [6]  | ×2    | 1021K   | 234.7G          | 37.82/0.9595 | 33.55/0.9168 | 32.12/0.8987 | 31.93/0.9256 | -/-          |
| SMSR         | ×2    | 985K    | 131.6G (224.1G) | 38.00/0.9601 | 33.64/0.9179 | 32.17/0.8990 | 32.19/0.9284 | 38.76/0.9771 |
| Bicubic      | ×3    | -       | -               | 30.39/0.8682 | 27.55/0.7742 | 27.21/0.7385 | 24.46/0.7349 | 26.95/0.8556 |
| SRCNN [8]    | ×3    | 57K     | 52.7G           | 32.75/0.9090 | 29.30/0.8215 | 28.41/0.7863 | 26.24/0.7989 | 30.48/0.9117 |
| VDSR [19]    | ×3    | 665K    | 612.6G          | 33.67/0.9210 | 29.78/0.8320 | 28.83/0.7990 | 27.14/0.8290 | 32.01/0.9340 |
| DRCN [20]    | ×3    | 1774K   | 9788.7G         | 33.82/0.9226 | 29.76/0.8311 | 28.80/0.7963 | 27.14/0.8279 | 32.24/0.9343 |
| MemNet [40]  | ×3    | 677K    | 623.9G          | 34.09/0.9248 | 30.01/0.8350 | 28.96/0.8001 | 27.56/0.8376 | 32.51/0.9369 |
| SRFBN-S [28] | ×3    | 375K    | 686.4G          | 34.20/0.9255 | 30.10/0.8372 | 28.96/0.8010 | 27.66/0.8415 | 33.02/0.9404 |
| IDN [17]     | ×3    | 553K    | 57.0G           | 34.11/0.9253 | 29.99/0.8354 | 28.95/0.8013 | 27.42/0.8359 | 32.71/0.9381 |
| CARN [2]     | ×3    | 1592K   | 118.8G          | 34.29/0.9255 | 30.29/0.8407 | 29.06/0.8034 | 28.06/0.8493 | 33.50/0.9440 |
| SMSR         | ×3    | 993K    | 67.8G (100.5G)  | 34.40/0.9270 | 30.33/0.8412 | 29.10/0.8050 | 28.25/0.8536 | 33.68/0.9445 |
| Bicubic      | ×4    | -       | -               | 28.42/0.8104 | 26.00/0.7027 | 25.96/0.6675 | 23.14/0.6577 | 24.89/0.7866 |
| SRCNN [8]    | ×4    | 57K     | 52.7G           | 30.48/0.8628 | 27.50/0.7513 | 26.90/0.7101 | 24.52/0.7221 | 27.58/0.8555 |
| VDSR [19]    | ×4    | 665K    | 612.6G          | 31.35/0.8830 | 28.02/0.7680 | 27.29/0.7260 | 25.18/0.7540 | 28.83/0.8870 |
| DRCN [20]    | ×4    | 1774K   | 9788.7G         | 31.53/0.8854 | 28.02/0.7670 | 27.23/0.7233 | 25.18/0.7524 | 28.93/0.8854 |
| LapSRN [22]  | ×4    | 813K    | 149.4G          | 31.54/0.8850 | 28.19/0.7720 | 27.32/0.7270 | 25.21/0.7560 | 29.09/0.8900 |
| MemNet [40]  | ×4    | 677K    | 623.9G          | 31.74/0.8893 | 28.26/0.7723 | 27.40/0.7281 | 25.50/0.7630 | 29.42/0.8942 |
| SRFBN-S [28] | ×4    | 483K    | 852.9G          | 31.98/0.8923 | 28.45/0.7779 | 27.44/0.7313 | 25.71/0.7719 | 29.91/0.9008 |
| IDN [17]     | ×4    | 553K    | 32.3G           | 31.82/0.8903 | 28.25/0.7730 | 27.41/0.7297 | 25.41/0.7632 | 29.41/0.8942 |
| CARN [2]     | ×4    | 1592K   | 90.9G           | 32.13/0.8937 | 28.60/0.7806 | 27.58/0.7349 | 26.07/0.7837 | 30.47/0.9084 |
| SMSR         | ×4    | 1006K   | 41.6G (57.2G)   | 32.12/0.8932 | 28.55/0.7808 | 27.55/0.7351 | 26.11/0.7868 | 30.54/0.9085 |

Table 3. Comparison between learning-based masks and gradient-based masks. Results are achieved on Set14 for ×2 SR.

| M^spa          | #Params. | α  | Sparsity | PSNR  | SSIM   |
|----------------|----------|----|----------|-------|--------|
| Gradient-based | 926K     | 30 | 0.51     | 33.48 | 0.9163 |
| Gradient-based | 926K     | 30 | 0.62     | 33.42 | 0.9155 |
| Gradient-based | 926K     | 30 | 0.72     | 33.33 | 0.9151 |
| Gradient-based | 926K     | 50 | 0.50     | 33.45 | 0.9162 |
| Gradient-based | 926K     | 50 | 0.61     | 33.39 | 0.9153 |
| Gradient-based | 926K     | 50 | 0.71     | 33.30 | 0.9150 |
| Learning-based | 985K     | -  | 0.46     | 33.64 | 0.9179 |
| Learning-based | 985K     | -  | 0.64     | 33.61 | 0.9174 |
| Learning-based | 985K     | -  | 0.73     | 33.52 | 0.9169 |

We further investigate the sparsities achieved by our SMMs for different scale factors. Specifically, we feed an LR image (×2 downsampled) to the ×2/3/4 SMSR networks and compare the sparsities in their SMMs. As shown in Fig. 7, the sparsities decrease for larger scale factors in most SMMs. Since more details need to be reconstructed for larger scale factors, more locations are marked as "important" ones (with sparsities being decreased).

Learning-based Masks vs. Heuristic Masks. As regions of edges are usually identified as important ones in our spatial masks (Fig. 6), another straightforward choice is to use heuristic masks. KernelGAN [3] follows this idea to identify regions with large gradients as important ones when applying ZSSR [38] and uses a masked loss to focus on these regions. To demonstrate the effectiveness of the learning-based masks in our SMSR, we introduced a variant with gradient-induced masks. Specifically, we consider locations with gradients larger than a threshold α as important ones and keep the spatial mask fixed within the network. The performance of this variant is compared to our SMSR in Table 3. Compared to learning-based masks, the variant with gradient-based masks suffers a notable performance drop at comparable sparsity (e.g., 33.52 vs. 33.33/33.30). Further, we can see from Fig. 8 that learning-based masks facilitate our SMSR to achieve a better trade-off between SR performance and computational efficiency. With fixed heuristic masks, it is difficult to obtain fine-grained localization of redundant computation. In contrast, learning-based masks enable our SMSR to accurately localize redundant computation and produce better results.

5.3. Comparison with State-of-the-art Methods

We compare our SMSR with nine state-of-the-art methods, including SRCNN [8], VDSR [19], DRCN [20], LapSRN [22], MemNet [40], SRFBN-S [28], IDN [17], CARN [2], and FALSR-A [6]. As this paper focuses on lightweight SR networks (< 2M parameters), several recent works with large models (e.g., EDSR [29] (∼40M), RCAN [50] (∼15M) and SAN [7] (∼15M)) are not included for comparison. Quantitative results are presented in Table 4 and visualization results are shown in Figs. 9 and 10.

Figure 9. Visual comparison on the Urban100 dataset for ×4 SR (img_004 and img_033; GT, Bicubic, VDSR, LapSRN, SRFBN-S, IDN, CARN, and ours).

Figure 10. Visual comparison on a real-world image (LR image, Bicubic, SRFBN-S, CARN, and SMSR).

Quantitative Results. As shown in Table 4, our SMSR outperforms the state-of-the-art methods on most datasets. For example, our SMSR achieves much better performance than CARN for ×2 SR, with the number of parameters and the FLOPs being reduced by 38% and 41%, respectively. With a comparable model size, our SMSR performs favorably against FALSR-A and achieves better inference efficiency in terms of FLOPs (131.6G vs. 234.7G). With comparable computational complexity in terms of FLOPs (131.6G vs. 127.7G), our SMSR achieves much higher PSNR values than IDN. Using sparse masks to skip redundant computation, our SMSR reduces FLOPs by 41%/33%/27% for ×2/3/4 SR while maintaining state-of-the-art performance. We further show the trade-off between performance, number of parameters and FLOPs in Fig. 1. We can see that our SMSR achieves the best PSNR performance with low computational cost.

Qualitative Results. Figure 9 compares the qualitative results achieved on Urban100. Compared to other methods, our SMSR produces better visual results with fewer artifacts, such as the lattices in img_004 and the stripes on the building in img_033. We further tested our SMSR on a real-world image to demonstrate its effectiveness. As shown in Fig. 10, our SMSR achieves better perceptual quality while the other methods suffer notable artifacts.

6. Conclusion

In this paper, we explore the sparsity in image SR to improve the inference efficiency of SR networks. Specifically, we develop a sparse mask SR network to prune redundant computation. Our spatial and channel masks work jointly to localize redundant computation at a fine-grained level such that our network can effectively reduce computational cost while maintaining comparable performance. Extensive experiments demonstrate that our network achieves state-of-the-art performance with a significant FLOPs reduction and a speedup on mobile devices.

References

[1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In CVPRW, pages 1122-1131, 2017.
[2] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In ECCV, pages 252-268, 2018.
[3] Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. Blind super-resolution kernel estimation using an internal-GAN. In NeurIPS, pages 284-293, 2019.
[4] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie-Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, pages 1-10, 2012.
[5] Kumar Chellapilla, Sidd Puri, and Patrice Simard. High performance convolutional neural networks for document processing. In IWFHR, 2006.
[6] Xiangxiang Chu, Bo Zhang, Hailong Ma, Ruijun Xu, Jixiang Li, and Qingyuan Li. Fast, accurate and lightweight super-resolution with neural architecture search. In ICPR, 2020.
[7] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In CVPR, 2019.
[8] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In ECCV, pages 184-199, 2014.
[9] Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry P. Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. In CVPR, pages 1790-1799, 2017.
[10] Xitong Gao, Yiren Zhao, Lukasz Dudziak, Robert D. Mullins, and Cheng-Zhong Xu. Dynamic channel pruning: Feature boosting and suppression. In ICLR, 2019.
[11] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3D semantic segmentation with submanifold sparse convolutional networks. In CVPR, pages 9224-9232, 2018.
[12] Jinjin Gu, Hannan Lu, Wangmeng Zuo, and Chao Dong. Blind super-resolution with iterative kernel correction. In CVPR, 2019.
[13] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NeurIPS, pages 1135-1143, 2015.
[14] Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In CVPR, pages 4340-4349, 2019.
[15] Xuecai Hu, Haoyuan Mu, Xiangyu Zhang, Zilei Wang, Jian Sun, and Tieniu Tan. Meta-SR: A magnification-arbitrary network for super-resolution. In CVPR, 2019.
[16] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR, pages 5197-5206, 2015.
[17] Zheng Hui, Xiumei Wang, and Xinbo Gao. Fast and accurate single image super-resolution via information distillation network. In CVPR, 2018.
[18] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In ICLR, 2017.
[19] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, pages 1646-1654, 2016.
[20] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. In CVPR, pages 1637-1645, 2016.
[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[22] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep Laplacian pyramid networks for fast and accurate super-resolution. In CVPR, pages 5835-5843, 2017.
[23] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In CVPR, pages 4013-4021, 2016.
[24] Wonkyung Lee, Junghyup Lee, Dohyung Kim, and Bumsub Ham. Learning with privileged information for efficient image super-resolution. In ECCV, 2020.
[25] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In ICLR, 2017.
[26] Hao Li, Hong Zhang, Xiaojuan Qi, Ruigang Yang, and Gao Huang. Improved techniques for training adaptive deep networks. In ICCV, pages 1891-1900, 2019.
[27] Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, and Xiaoou Tang. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In CVPR, pages 6459-6468, 2017.
[28] Zhen Li, Jinglei Yang, Zheng Liu, Xiaomin Yang, Gwanggil Jeon, and Wei Wu. Feedback network for image super-resolution. In CVPR, pages 3867-3876, 2018.
[29] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In CVPR, 2017.
[30] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In NeurIPS, pages 2181-2191, 2017.
[31] Ming Liu, Zhilu Zhang, Liya Hou, Wangmeng Zuo, and Lei Zhang. Deep adaptive inference networks for single image super-resolution. In ECCVW, 2020.
[32] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, pages 2755-2763, 2017.
[33] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In ICCV, pages 5068-5076, 2017.
[34] David Martin, Charless Fowlkes, Doron Tal, Jitendra Malik, et al. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
[35] Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using Manga109 dataset. Multimedia Tools Appl., 76(20):21811-21838, 2017.
[36] Ravi Teja Mullapudi, William R. Mark, Noam Shazeer, and Kayvon Fatahalian. HydraNets: Specialized dynamic architectures for efficient inference. In CVPR, pages 8080-8089, 2018.
[37] Mengye Ren, Andrei Pokrovsky, Bin Yang, and Raquel Urtasun. SBNet: Sparse blocks network for fast inference. In CVPR, pages 8711-8720, 2018.
[38] Assaf Shocher, Nadav Cohen, and Michal Irani. "Zero-shot" super-resolution using deep internal learning. In CVPR, 2018.
[39] Jae Woong Soh, Sunwoo Cho, and Nam Ik Cho. Meta-transfer learning for zero-shot super-resolution. In CVPR, 2020.
[40] Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu. MemNet: A persistent memory network for image restoration. In ICCV, pages 4549-4557, 2017.
[41] Nicolas Vasilache, Jeff Johnson, Michaël Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. In ICLR, 2015.
[42] Longguang Wang, Yingqian Wang, Xiaoyu Dong, Qingyu Xu, Jungang Yang, Wei An, and Yulan Guo. Unsupervised degradation representation learning for blind super-resolution. In CVPR, 2021.
[43] Longguang Wang, Yingqian Wang, Zaiping Lin, Jungang Yang, Wei An, and Yulan Guo. Learning for scale-arbitrary super-resolution from scale-specific networks. arXiv, 2020.
[44] Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez. SkipNet: Learning dynamic routing in convolutional networks. In ECCV, volume 11217, pages 420-436, 2018.
[45] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S. Davis, Kristen Grauman, and Rogério Schmidt Feris. BlockDrop: Dynamic inference paths in residual networks. In CVPR, pages 8817-8826, 2018.
[46] Zhenda Xie, Zheng Zhang, Xizhou Zhu, Gao Huang, and Stephen Lin. Spatially adaptive inference with stochastic feature sampling and interpolation. In ECCV, 2020.
[47] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. Scalpel: Customizing DNN pruning to the underlying hardware parallelism. In ISCA, 2017.
[48] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, volume 6920, pages 711-730, 2010.
[49] Kai Zhang, Luc Van Gool, and Radu Timofte. Deep unfolding network for image super-resolution. In CVPR, 2020.
[50] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In ECCV, pages 1646-1654, 2018.
[51] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In CVPR, pages 2472-2481, 2018.

