

Deep Convolution Networks for Compression Artifacts Reduction

Ke Yu, Chao Dong, Chen Change Loy, Member, IEEE, Xiaoou Tang, Fellow, IEEE

Ke Yu is with the Department of Electronic Engineering, Tsinghua University, Beijing, P.R. China, e-mail: [email protected]. Chao Dong, Chen Change Loy (corresponding author), and Xiaoou Tang are with the Department of Information Engineering, The Chinese University of Hong Kong, e-mail: {dc012,ccloy,xtang}@ie.cuhk.edu.hk.

arXiv:1608.02778v1 [cs.CV] 9 Aug 2016

Abstract—Lossy compression introduces complex compression artifacts, particularly blocking artifacts, ringing effects and blurring. Existing algorithms either focus on removing blocking artifacts and produce blurred output, or restore sharpened images that are accompanied by ringing effects. Inspired by the success of deep convolutional networks (DCN) on super-resolution [6], we formulate a compact and efficient network for seamless attenuation of different compression artifacts. To meet the speed requirement of real-world applications, we further accelerate the proposed baseline model by layer decomposition and the joint use of large-stride convolutional and deconvolutional layers. This also leads to a more general CNN framework that has a close relationship with the conventional Multi-Layer Perceptron (MLP). Finally, the modified network achieves a speed-up of 7.5× with almost no performance loss compared to the baseline model. We also demonstrate that a deeper model can be effectively trained with features learned in a shallow network. Following a similar "easy to hard" idea, we systematically investigate three practical transfer settings and show the effectiveness of transfer learning in low-level vision problems. Our method shows superior performance over the state-of-the-art methods both on benchmark datasets and a real-world use case.

Index Terms—Convolutional Network, Deconvolution, Compression artifacts, JPEG compression

Fig. 1. Example compressed images and our restoration results on the JPEG compression scheme and the real use case – Twitter. (a) Left: the JPEG-compressed image, where we can see blocking artifacts, ringing effects and blurring on the eyes, and abrupt intensity changes on the face. Right: the image restored by the proposed deep model (AR-CNN), where these compression artifacts are removed and sharp details are produced. (b) Left: the Twitter-compressed image, which is first re-scaled to a small image and then compressed on the server side. Right: the image restored by the proposed deep model (AR-CNN).

I. INTRODUCTION

Lossy compression (e.g., JPEG, WebP and HEVC-MSP) is one class of data encoding methods that uses inexact approximations for representing the encoded content. In this age of information explosion, lossy compression is indispensable and inevitable for companies (e.g., Twitter and Facebook) to save bandwidth and storage space. However, compression by its nature introduces undesired complex artifacts, which severely reduce the user experience (e.g., Figure 1). These artifacts not only decrease perceptual visual quality, but also adversely affect various low-level image processing routines that take compressed images as input, e.g., contrast enhancement [19], super-resolution [6], [39], and edge detection [4]. Despite the huge demand, effective compression artifacts reduction remains an open problem.

Various compression schemes bring different kinds of compression artifacts, which are all complex and signal-dependent. Take JPEG compression as an example: the discontinuities between adjacent 8×8 pixel blocks result in blocking artifacts, while the coarse quantization of the high-frequency components brings ringing effects and blurring, as depicted in Figure 1(a). As an improved version of JPEG, JPEG 2000 adopts the wavelet transform to avoid blocking artifacts, but still exhibits ringing effects and blurring. Apart from the widely adopted compression standards, commercial companies have also introduced their own compression schemes to meet specific requirements. For example, Twitter and Facebook compress uploaded high-resolution images by first re-scaling and then compressing them. The combined compression strategies also introduce severe ringing effects and blurring, but in a different manner (see Figure 1(b)).

To cope with various compression artifacts, different approaches have been proposed, some of which are designed for a specific compression standard, especially JPEG. For instance, deblocking oriented approaches [21], [27], [35] perform filtering along the block boundaries to reduce only blocking artifacts. Liew et al. [20] and Foi et al. [8] use thresholding by wavelet transform and Shape-Adaptive DCT transform, respectively. With the help of problem-specific priors (e.g., the quantization table), Liu et al. [22] exploit residual redundancies in the DCT domain and propose a sparsity-based dual-domain (DCT and pixel domains) approach. Wang et al. [45] further introduce deep sparse-coding networks to the DCT and pixel domains and achieve superior performance.
This kind of method can be referred to as soft decoding for a specific compression standard (e.g., JPEG), and can hardly be extended to other compression schemes. Alternatively, data-driven learning-based methods have better generalization ability. Jung et al. [15] propose a restoration method based on sparse representation. Kwon et al. [18] adopt Gaussian process (GP) regression to achieve both super-resolution and compression artifact removal. The adjusted anchored neighborhood regression (A+) approach [29] is also used to enhance JPEG 2000 images. These methods can be easily generalized for different tasks.

Deep learning has shown impressive results on both high-level and low-level vision problems. In particular, the Super-Resolution Convolutional Neural Network (SRCNN) proposed by Dong et al. [6] shows the great potential of an end-to-end DCN in image super-resolution. The study also points out that a conventional sparse-coding-based image restoration model can be equally seen as a deep model. However, if we directly apply SRCNN to compression artifact reduction, the features extracted by its first layer could be noisy, leading to undesirable noisy patterns in reconstruction. Thus the three-layer SRCNN is not well suited for restoring compressed images, especially in dealing with complex artifacts.

To eliminate the undesired artifacts, we improve SRCNN by embedding one or more "feature enhancement" layers after the first layer to clean the noisy features. Experiments show that the improved model, namely Artifacts Reduction Convolutional Neural Networks (AR-CNN), is exceptionally effective in suppressing blocking artifacts while retaining edge patterns and sharp details (see Figure 1). Different from the JPEG-specific models, AR-CNN is equally effective in coping with different compression schemes, including JPEG, JPEG 2000, Twitter and so on.

However, the network scale increases significantly when we add another layer, making it hard to apply in real-world applications. Generally, the high computational cost has been a major bottleneck for most previous methods [45]. When delving into the network structure, we find two key factors that restrict the inference speed. First, the added "feature enhancement" layer accounts for almost 95% of the total parameters. Second, when we adopt a fully-convolutional structure, the time complexity increases quadratically with the spatial size of the input image.

To accelerate the inference process while still maintaining good performance, we investigate a more efficient framework with two main modifications. For the redundant parameters, we insert another "shrinking" layer with 1 × 1 filters between the first two layers. For the large computation load of convolution, we use large-stride convolution filters in the first layer and the corresponding deconvolution filters in the last layer. Then the convolution operations in the middle layers are conducted on smaller feature maps, leading to much faster inference. Experiments show that the modified network, namely Fast AR-CNN, can be 7.5 times faster than the baseline AR-CNN with almost no performance loss. This further helps us formulate a more general CNN framework for low-level vision problems. We also reveal its close relationship with the conventional Multi-Layer Perceptron [3].

Another issue we met is how to effectively train a deeper DCN. As pointed out in SRCNN [7], training a five-layer network becomes a bottleneck. The difficulty of training is partially due to the sub-optimal initialization settings. The aforementioned difficulty motivates us to investigate a better way to train a deeper model for low-level vision problems. We find that this can be effectively solved by transferring the features learned in a shallow network to a deeper one and fine-tuning simultaneously¹. This strategy has also been proven successful in learning a deeper CNN for image classification [32]. Following a similar general intuitive idea, easy to hard, we discover other interesting transfer settings in our low-level vision task: (1) We transfer the features learned in a high-quality compression model (easier) to a low-quality one (harder), and find that it converges faster than random initialization. (2) In the real use case, companies tend to apply different compression strategies (including re-scaling) according to their purposes (e.g., Figure 1(b)). We transfer the features learned in a standard compression model (easier) to a real use case (harder), and find that it performs better than learning from scratch.

The contributions of this study are four-fold: (1) We formulate a new deep convolutional network for efficient reduction of various compression artifacts. Extensive experiments, including those on real use cases, demonstrate the effectiveness of our method over state-of-the-art methods [8] both perceptually and quantitatively. (2) We progressively modify the baseline model AR-CNN and present a more efficient network structure, which achieves a speed-up of 7.5× compared to the baseline AR-CNN while still maintaining state-of-the-art performance. (3) We verify that reusing the features in shallow networks is helpful in learning a deeper model for compression artifacts reduction. Under the same intuitive idea – easy to hard, we reveal a number of interesting and practical transfer settings.

The preliminary version of this work was published earlier [5]. In this work, we make significant improvements in both methodology and experiments. First, in the methodology, we add an analysis of the computational cost of the proposed model, and point out two key factors that affect the time efficiency. Then we propose the corresponding acceleration strategies, and extend the baseline model to a more general and efficient network structure. In the experiments, we adopt data augmentation to further push the performance. In addition, we conduct experiments on JPEG 2000 images and show superior performance to the state-of-the-art methods [18], [28], [29]. A detailed investigation of network settings of the new framework is presented afterwards.

¹ Generally, the transfer learning method will train a base network first, then copy the learned parameters or features of several layers to the corresponding layers of a target network. These transferred layers can be left frozen or fine-tuned on the target dataset. The remaining layers are randomly initialized and trained on the target task.
Fig. 2. The framework of the Artifacts Reduction Convolutional Neural Network (AR-CNN). The network consists of four convolutional layers, each of which is responsible for a specific operation. It then optimizes the four operations (i.e., feature extraction, feature enhancement, mapping and reconstruction) jointly in an end-to-end framework. Example feature maps shown at each step ("noisy", "cleaner" and "restored" feature maps between the compressed input image and the reconstructed output) illustrate the functionality of each operation. They are normalized for better visualization.

II. RELATED WORK

Existing algorithms can be classified into deblocking oriented and restoration oriented methods. The deblocking oriented methods focus on removing blocking and ringing artifacts. In the spatial domain, different kinds of filters [21], [27], [35] have been proposed to adaptively deal with blocking artifacts in specific regions (e.g., edge, texture, and smooth regions). In the frequency domain, Liew et al. [20] utilize the wavelet transform and derive thresholds at different wavelet scales for denoising. The most successful deblocking oriented method is perhaps the Pointwise Shape-Adaptive DCT (SA-DCT) [8], which is widely acknowledged as the state-of-the-art approach [13], [19]. However, like most deblocking oriented methods, SA-DCT cannot reproduce sharp edges, and tends to overly smooth texture regions.

The restoration oriented methods regard the compression operation as distortion and aim to reduce such distortion. These methods include the projection onto convex sets based method (POCS) [41], solving an MAP problem (FoE) [33], the sparse-coding-based method [15], the semi-local Gaussian process model [18], the Regression Tree Fields based method (RTF) [13] and adjusted anchored neighborhood regression (A+) [29]. The RTF takes the results of SA-DCT [8] as bases and produces globally consistent image reconstructions with a regression tree field model. It can also be optimized for any differentiable loss function (e.g., SSIM), but often at the cost of performing sub-optimally on other evaluation metrics. As a recent method for image super-resolution [34], A+ [29] has also been successfully applied to compression artifacts reduction. In their method, the input image is decomposed into overlapping patches and sparsely represented by a dictionary of anchoring points. Then the uncompressed patches are predicted by multiplying with the corresponding linear regressors. They obtain impressive results on JPEG 2000 images, but have not tested on other compression schemes.

To deal with a specific compression standard, especially JPEG, some recent progress incorporates information from dual domains (DCT and pixel domains) and achieves impressive results. Specifically, Liu et al. [22] apply sparse-coding in the DCT domain to eliminate the quantization error, then restore the lost high-frequency components in the pixel domain. On their basis, Wang et al. [45] replace the sparse-coding steps with deep neural networks in both domains and achieve superior performance. These methods all require problem-specific prior knowledge (e.g., the quantization table) and operate on the 8×8 pixel blocks, thus they cannot be generalized to other compression schemes, such as JPEG 2000 and Twitter.

The Super-Resolution Convolutional Neural Network (SRCNN) [6] is closely related to our work. In that study, independent steps in the sparse-coding-based method are formulated as different convolutional layers and optimized in a unified network. It shows the potential of deep models in low-level vision problems like super-resolution. However, the problem of compression is different from super-resolution in that the former consists of different kinds of artifacts. Designing a deep model for compression restoration requires a deep understanding of the different artifacts. We show that directly applying the SRCNN architecture to compression restoration will result in undesired noisy patterns in the reconstructed image.

Transfer learning in deep neural networks has become popular since the success of deep learning in image classification [17]. The features learned from ImageNet show good generalization ability [44] and have become a powerful tool for several high-level vision problems, such as Pascal VOC image classification [25] and object detection [9], [30]. Yosinski et al. [43] have also tried to quantify the degree to which a particular layer is general or specific. Overall, transfer learning has been systematically investigated in high-level vision problems, but not in low-level vision tasks. In this study, we explore several transfer settings on compression artifacts reduction and show the effectiveness of transfer learning in low-level vision problems.

III. METHODOLOGY

Our proposed approach is based on the current successful low-level vision model – SRCNN [6]. To have a better understanding of our work, we first give a brief overview of SRCNN. Then we explain the insights that lead to a deeper network and present our new model. Subsequently, we explore three types of transfer learning strategies that help in training a deeper and better network.
A. Review of SRCNN

The SRCNN aims at learning an end-to-end mapping, which takes the low-resolution image $Y$ (after interpolation) as input and directly outputs the high-resolution one $F(Y)$. The network contains three convolutional layers, each of which is responsible for a specific task. Specifically, the first layer performs patch extraction and representation, which extracts overlapping patches from the input image and represents each patch as a high-dimensional vector. Then the non-linear mapping layer maps each high-dimensional vector of the first layer to another high-dimensional vector, which is conceptually the representation of a high-resolution patch. At last, the reconstruction layer aggregates the patch-wise representations to generate the final output. The network can be expressed as:

$F_0(Y) = Y;$  (1)
$F_i(Y) = \max(0,\, W_i * F_{i-1}(Y) + B_i), \quad i \in \{1, 2\};$  (2)
$F(Y) = W_3 * F_2(Y) + B_3,$  (3)

where $W_i$ and $B_i$ represent the filters and biases of the $i$-th layer respectively, $F_i$ is the output feature maps and "$*$" denotes the convolution operation. The $W_i$ contains $n_i$ filters of support $n_{i-1} \times f_i \times f_i$, where $f_i$ is the spatial support of a filter, $n_i$ is the number of filters, and $n_0$ is the number of channels in the input image. Note that there are no pooling or fully-connected layers in SRCNN, so the final output $F(Y)$ is of the same size as the input image. The Rectified Linear Unit (ReLU, $\max(0, x)$) [24] is applied on the filter responses. These three steps are analogous to the basic operations in the sparse-coding-based super-resolution methods [40], and this close relationship lays the theoretical foundation for its successful application in super-resolution. Details can be found in the paper [6].

B. Convolutional Neural Network for Compression Artifacts Reduction

Insights. In sparse-coding-based methods and SRCNN, the first step – feature extraction – determines what should be emphasized and restored in the following stages. However, as various compression artifacts are coupled together, the extracted features are usually noisy and ambiguous for accurate mapping. In the experiments on reducing JPEG compression artifacts (see Section VI-A2), we find that some quantization noise coupled with high-frequency details is further enhanced, bringing unexpected noisy patterns around sharp edges. Moreover, blocking artifacts in flat areas are misrecognized as normal edges, causing abrupt intensity changes in smooth regions. Inspired by the feature enhancement step in super-resolution [38], we introduce a feature enhancement layer after the feature extraction layer in SRCNN to form a new and deeper network – AR-CNN. This layer maps the "noisy" features to a relatively "cleaner" feature space, which is equivalent to denoising the feature maps.

Formulation. The overview of the new network AR-CNN is shown in Figure 2. The three layers of SRCNN remain unchanged in the new model. To conduct feature enhancement, we extract new features from the $n_1$ feature maps of the first layer, and combine them to form another set of feature maps. Overall, the AR-CNN consists of four layers, namely the feature extraction, feature enhancement, mapping and reconstruction layers.

Different from SRCNN, which adopts ReLU as the activation function, we use the Parametric Rectified Linear Unit (PReLU) [11] in the new networks. To distinguish ReLU and PReLU, we define a general activation function as:

$\mathrm{PReLU}(x_j) = \max(x_j, 0) + a_j \cdot \min(0, x_j),$  (4)

where $x_j$ is the input signal of the activation on the $j$-th channel, and $a_j$ is the coefficient of the negative part. The parameter $a_j$ is set to zero for ReLU, but is learnable for PReLU. We choose PReLU mainly to avoid the "dead features" [44] caused by zero gradients in ReLU. We represent the whole network as:

$F_0(Y) = Y;$  (5)
$F_i(Y) = \mathrm{PReLU}(W_i * F_{i-1}(Y) + B_i), \quad i \in \{1, 2, 3\};$  (6)
$F(Y) = W_4 * F_3(Y) + B_4,$  (7)

where the meaning of the variables is the same as that in Equation 1, and the second layer ($W_2$, $B_2$) is the added feature enhancement layer.

It is worth noticing that AR-CNN is not equal to a deeper SRCNN that contains more than one non-linear mapping layer². A deeper SRCNN imposes more non-linearity in the mapping stage, which equals adopting a more robust regressor between the low-level features and the final output. Similar ideas have been proposed in some sparse-coding-based methods [2], [16]. However, as compression artifacts are complex, low-level features extracted by a single layer are noisy. Thus the performance bottleneck lies in the features, not the regressor. AR-CNN improves the mapping accuracy by enhancing the extracted low-level features, and the first two layers together can be regarded as a better feature extractor. This leads to better performance than a deeper SRCNN. Experimental results of AR-CNN, SRCNN and deeper SRCNN will be shown in Section VI-A2.

C. Model Learning

Given a set of ground truth images $\{X_i\}$ and their corresponding compressed images $\{Y_i\}$, we use Mean Squared Error (MSE) as the loss function:

$L(\Theta) = \frac{1}{n} \sum_{i=1}^{n} \| F(Y_i; \Theta) - X_i \|^2,$  (8)

where $\Theta = \{W_1, W_2, W_3, W_4, B_1, B_2, B_3, B_4\}$ and $n$ is the number of training samples. The loss is minimized using stochastic gradient descent with standard backpropagation. We adopt a batch-mode learning method with a batch size of 128.

² Adding non-linear mapping layers has been suggested as an extension of SRCNN in [6].
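To make the formulation concrete, the following is a minimal PyTorch sketch of the baseline 64(9)-32(7)-16(1)-1(5) AR-CNN with PReLU activations and the MSE objective of Equation (8). It is an illustrative re-implementation written for this document rather than the authors' code (the paper's experiments use Caffe), and the single learning rate here simplifies the per-layer learning rates used later in the experiments.

```python
import torch
import torch.nn as nn

class ARCNN(nn.Module):
    """Baseline AR-CNN 64(9)-32(7)-16(1)-1(5): feature extraction,
    feature enhancement, mapping and reconstruction, with PReLU."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=9, padding=4),   # feature extraction
            nn.PReLU(64),
            nn.Conv2d(64, 32, kernel_size=7, padding=3),  # feature enhancement
            nn.PReLU(32),
            nn.Conv2d(32, 16, kernel_size=1),             # mapping
            nn.PReLU(16),
            nn.Conv2d(16, 1, kernel_size=5, padding=2),   # reconstruction
        )

    def forward(self, y):
        return self.body(y)

model = ARCNN()
criterion = nn.MSELoss()                                  # Eq. (8)
optimizer = torch.optim.SGD(model.parameters(), lr=5e-4)

y = torch.rand(128, 1, 24, 24)   # batch of compressed 24x24 sub-images
x = torch.rand(128, 1, 24, 24)   # corresponding ground-truth sub-images
loss = criterion(model(y), x)
loss.backward()
optimizer.step()
```

Because the network is fully convolutional, the same model can be applied to whole images of arbitrary size at test time.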
IV. ACCELERATING AR-CNN

Although AR-CNN is already much smaller than most existing deep models (e.g., AlexNet [17] and Deepid-net [26]), it is still unsatisfactory for practical or even real-time on-line applications. Specifically, with an additional layer, AR-CNN is several times larger than SRCNN in network scale. In this section, we progressively accelerate the proposed baseline model while preserving its reconstruction quality. First, we analyze the computational complexity of AR-CNN and find the most influential factors. Then we re-design the network by layer decomposition and the joint use of large-stride convolutional and deconvolutional layers. We further make it a more general framework, and compare it with the conventional Multi-Layer Perceptron (MLP).

A. Complexity Analysis

As AR-CNN consists purely of convolutional layers, the total number of parameters can be calculated as:

$N = \sum_{i=1}^{d} n_{i-1} \cdot n_i \cdot f_i^2,$  (9)

where $i$ is the layer index, $d$ is the number of layers and $f_i$ is the spatial size of the filters. The number of filters of the $i$-th layer is denoted by $n_i$, and the number of input channels is $n_{i-1}$. If we include the spatial size of the output feature maps $m_i$, we obtain the expression for the time complexity:

$O\{\sum_{i=1}^{d} n_{i-1} \cdot n_i \cdot f_i^2 \cdot m_i^2\}.$  (10)

For our baseline model AR-CNN, we set d = 4, n0 = 1, n1 = 64, n2 = 32, n3 = 16, n4 = 1, f1 = 9, f2 = 7, f3 = 1, f4 = 5, namely 64(9)-32(7)-16(1)-1(5). First, we analyze the parameters of each layer in Table I. We find that the "feature enhancement" layer accounts for almost 95% of the total parameters. Obviously, if we want to reduce the parameters, the second layer should be the breakthrough point.

TABLE I: Analysis of network parameters in AR-CNN.

layer No.  | 1     | 2       | 3     | 4     | total
Number     | 5,184 | 100,352 | 512   | 400   | 106,448
Percentage | 4.87% | 94.27%  | 0.48% | 0.38% | 100%

On the other hand, the spatial size of the output feature maps $m_i$ also plays an important role in the overall time complexity (see Equation (10)). In conventional low-level vision models like SRCNN, the spatial size of all intermediate feature maps remains the same as that of the input image. However, this is not the case for high-level vision models like AlexNet [17], which consist of large-stride (stride > 1) convolution filters. Generally, a reasonably larger stride can significantly speed up the convolution operation with little cost in accuracy, thus the stride size should be another key factor for improving our network. Based on the above observations, we explore a more efficient network structure in the next subsection.

Fig. 3. Illustration of the convolution and deconvolution process. (a) When the stride is 1, convolution and deconvolution can be regarded as equivalent operations: each output pixel is determined by the same number of input pixels (in the orange circle) in convolution and deconvolution. (b) When the stride is larger than 1, the convolution performs downsampling, and the deconvolution performs upsampling.

B. Acceleration Strategies

Layer decomposition. We first reduce the complexity of the "feature enhancement" layer. This layer plays two roles simultaneously. One is to denoise the input feature maps with a set of large filters (i.e., 7×7), and the other is to map the high-dimensional features to a relatively low-dimensional feature space (i.e., from 64 to 32). This indicates that we can replace it with two connected layers, each of which is responsible for a single task. To be specific, we decompose the "feature enhancement" layer into a "shrinking" layer with 32 1 × 1 filters and an "enhancement" layer with 32 7 × 7 filters, as shown in Figure 4. Note that 1 × 1 filters are widely used to reduce the feature dimensions in deep models [23]. Then we can calculate the parameters as follows:

$32 \cdot 7^2 \cdot 64 = 100{,}352 \;\rightarrow\; 32 \cdot 1^2 \cdot 32 + 32 \cdot 7^2 \cdot 32 = 51{,}200.$  (11)

It is clear that the parameters are reduced almost by half. Correspondingly, the overall network scale also decreases by 46.17%. We denote the modified network as 64(9)-32(1)-32(7)-16(1)-1(5). In Section VI-D1, we will show that this model achieves almost the same restoration quality as the baseline model 64(9)-32(7)-16(1)-1(5).
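The bookkeeping behind Equation (9) and the layer-decomposition saving of Equation (11) can be checked with a few lines of Python (a sketch written for this document; biases are ignored, as in the paper's count):

```python
def conv_params(channels, kernels):
    """Total parameters of a purely convolutional network, Eq. (9):
    N = sum_i n_{i-1} * n_i * f_i^2  (biases ignored)."""
    return sum(n_in * n_out * f * f
               for (n_in, n_out), f in zip(zip(channels[:-1], channels[1:]), kernels))

# Baseline AR-CNN 64(9)-32(7)-16(1)-1(5), as in Table I.
print(conv_params([1, 64, 32, 16, 1], [9, 7, 1, 5]))   # 106448

# Layer decomposition, Eq. (11): one 64->32 7x7 layer becomes a
# 64->32 1x1 "shrinking" layer plus a 32->32 7x7 "enhancement" layer.
print(64 * 32 * 7 * 7)                                  # 100352 before
print(64 * 32 * 1 * 1 + 32 * 32 * 7 * 7)                # 51200 after
```

Note that the decomposed pair keeps the 7 × 7 receptive field of the original enhancement filters while roughly halving that layer's parameters.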
Fig. 4. The framework of the Fast AR-CNN (feature extraction, shrinking, enhancement, mapping and reconstruction, with stride s > 1 at both ends). There are two main modifications based on the original AR-CNN. First, the layer decomposition splits the original "feature enhancement" layer into a "shrinking" layer and an "enhancement" layer. Then the large-stride convolutional and deconvolutional layers significantly decrease the spatial size of the feature maps of the middle layers. The overall shape of the framework is like an hourglass, which is thick at the ends and thin in the middle.

Large-stride convolution and deconvolution. Another acceleration strategy is to increase the stride size (e.g., stride s > 1) in the first convolutional layer. In AR-CNN, the first layer plays a similar role (i.e., feature extractor) as in high-level vision deep models, thus it is a worthy attempt to increase its stride size, e.g., from 1 to 2.

However, this will result in a smaller output and affect the end-to-end mapping structure. To address this problem, we replace the last convolutional layer of AR-CNN (Figure 2) with a deconvolutional layer. The deconvolution can be regarded as the opposite operation of convolution. Specifically, if we set the stride s = 1, the function of a deconvolution filter is equal to that of a convolution filter (see Figure 3(a)). For a larger stride s > 1, the convolution performs sub-sampling, while the deconvolution performs up-sampling (see Figure 3(b)). Therefore, if we use the same stride for the first and the last layer, the output will remain the same size as the input, as depicted in Figure 4. After the joint use of large-stride convolutional and deconvolutional layers, the spatial size of the feature maps $m_i$ becomes $m_i/s$, which reduces the overall time complexity significantly.

Although the above modifications improve the time efficiency, they may also influence the restoration quality. To further improve the performance, we can expand the mapping layer (i.e., use more mapping filters) and enlarge the filter size of the deconvolutional layer. For instance, we can set the number of mapping filters to be the same as that of the first-layer filters (i.e., from 16 to 64), and use the same filter size for the first and the last layer (i.e., f1 = f5 = 9). This is a feasible solution but not a strict rule. In general, it can be seen as a compensation for the low time complexity. In Section VI-D1, we investigate different settings through a series of controlled experiments, and find a good trade-off between performance and complexity.

Fast AR-CNN. Through the above modifications, we arrive at a more efficient network structure. If we set s = 2, the modified model can be represented as 64(9)-32(1)-32(7)-64(1)-1[9]-s2, where the square brackets refer to the deconvolution filter. We name the new model Fast AR-CNN. The number of its overall parameters is 56,496 by Equation (9). The acceleration ratio can then be calculated as $106{,}448/56{,}496 \cdot 2^2 \approx 7.5$. Note that this network achieves similar results as the baseline model, as shown in Section VI-D1.

C. A General Framework

When we relax the network settings, such as the filter number, filter size, and stride, we obtain a more general framework with some appealing properties, as follows.

(1) The overall "shape" of the network is like an "hourglass", which is thick at the ends and thin in the middle. The shrinking and the mapping layers control the width of the network. They are all 1 × 1 filters and contribute little to the overall complexity.

(2) The choice of the stride can be very flexible. The previous low-level vision CNNs, such as SRCNN and AR-CNN, can be seen as a special case of s = 1, where the deconvolutional layer is equal to a convolutional layer. When s > 1, the time complexity decreases by a factor of s² at some cost to reconstruction quality.

(3) When we adopt all 1 × 1 filters in the middle layers, the network works very similarly to a Multi-Layer Perceptron (MLP) [3]. The MLP processes each patch individually. Input patches are extracted from the image with a stride s, and the output patches are aggregated (i.e., averaged) over the overlapping areas. In our framework, the patches are also extracted with a stride s, but in a convolution manner. The output patches are also aggregated (i.e., summed) over overlapping areas, but in a deconvolution manner. If the filter size of the middle layers is set to 1, then each output patch is determined purely by a single input patch, which is almost the same as an MLP. However, when we set a larger filter size for the middle layers, the receptive field of an output patch increases, leading to much better performance. This also reveals why the CNN structure can theoretically outperform the conventional MLP. Here, we present the general framework as

$n_1(f_1) - n_2(1) - n_3(f_3) \times m - n_4(1) - n_5[f_5] - s,$  (12)

where $f$ and $n$ represent the filter size and the number of filters respectively. The number of middle layers is denoted as $m$, and can be used to design a deeper network. As we focus more on speed, we just set m = 1 in the following experiments. Figure 4 shows the overall structure of the new framework. We believe that this framework can be applied to more low-level vision problems, such as denoising and deblurring, but this is beyond the scope of this paper.
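As an illustration of the hourglass design, the PyTorch sketch below builds the 64(9)-32(1)-32(7)-64(1)-1[9]-s2 layout with a stride-2 convolution at the front and a stride-2 transposed convolution ("deconvolution") at the end. It is an assumption-laden re-implementation rather than the authors' Caffe model: the padding and output_padding values are chosen here so that the output matches the input size, whereas the Caffe deconvolution used in the paper's experiments crops (s − 1) border pixels instead.

```python
import torch
import torch.nn as nn

class FastARCNN(nn.Module):
    """Sketch of Fast AR-CNN, 64(9)-32(1)-32(7)-64(1)-1[9]-s2.
    The middle layers operate on feature maps downsampled by the stride s,
    and the final transposed convolution restores the original size."""
    def __init__(self, s=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, 9, stride=s, padding=4),     # feature extraction
            nn.PReLU(64),
            nn.Conv2d(64, 32, 1),                         # shrinking
            nn.PReLU(32),
            nn.Conv2d(32, 32, 7, padding=3),              # enhancement
            nn.PReLU(32),
            nn.Conv2d(32, 64, 1),                         # mapping (expanded)
            nn.PReLU(64),
            nn.ConvTranspose2d(64, 1, 9, stride=s, padding=4,
                               output_padding=s - 1),     # reconstruction
        )

    def forward(self, y):
        return self.body(y)

y = torch.rand(1, 1, 120, 120)          # input side divisible by the stride
print(FastARCNN()(y).shape)             # torch.Size([1, 1, 120, 120])
```

With s = 1 the same layout collapses back to an ordinary fully-convolutional network, matching property (2) above.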
V. EASY-HARD TRANSFER

Transfer learning in deep models provides an effective way of initialization. In fact, conventional initialization strategies (i.e., randomly drawn from Gaussian distributions with fixed standard deviations [17]) are found to be unsuitable for training a very deep model, as reported in [11]. To address this issue, He et al. [11] derive a robust initialization method for rectifier nonlinearities, and Simonyan et al. [32] propose to use the pre-trained features of a shallow network for initialization.

In low-level vision problems (e.g., super-resolution), it is observed that training a network beyond 4 layers encounters convergence problems, even when a large number of training images (e.g., ImageNet) are provided [6]. We also meet this difficulty during the training of AR-CNN. To this end, we systematically investigate several transfer settings in training a low-level vision network, following an intuitive idea of "easy-hard transfer". Specifically, we attempt to reuse the features learned in a relatively easier task to initialize a deeper or harder network. Interestingly, the concept of "easy-hard transfer" has already been pointed out in a neuro-computation study [10], where prior training on an easy discrimination can help learn a second, harder one.

Formally, we define the base (or source) task as A and the target tasks as Bi, i ∈ {1, 2, 3}. As shown in Figure 5, the base network baseA is a four-layer AR-CNN trained on a large dataset dataA, of which the images are compressed using a standard compression scheme with compression quality qA. All layers in baseA are randomly initialized from a Gaussian distribution. We will transfer one or two layers of baseA to different target tasks (see Figure 5). Such transfers can be described as follows.

Fig. 5. Easy-hard transfer settings. First row: the baseline 4-layer network (baseA) trained with dataA-qA. Second row: the 5-layer AR-CNN (targetB1) targeted at dataA-qA. Third row: the AR-CNN (targetB2) targeted at dataA-qB. Fourth row: the AR-CNN (targetB3) targeted at Twitter data. Green boxes indicate the features transferred from the base network, and gray boxes represent random initialization. The ellipsoidal bars between weight vectors represent the activation functions.

Transfer shallow to deeper model. As indicated by [7], a five-layer network is sensitive to the initialization parameters and learning rate. Thus we transfer the first two layers of baseA to a five-layer network targetB1. Then we randomly initialize its remaining layers³ and train all layers on the same dataset dataA. This is conceptually similar to what is applied in image classification [32], but this approach has never been validated in low-level vision problems.

Transfer high to low quality. Images of low compression quality contain more complex artifacts. Here we use the features learned from high-compression-quality images as a starting point to help learn more complicated features in the DCN. Specifically, the first layer of targetB2 is copied from baseA and trained on images that are compressed with a lower compression quality qB.

Transfer standard to real use case. We then explore whether the features learned under a standard compression scheme can be generalized to other real use cases, which often contain more complex artifacts due to different levels of re-scaling and compression. We transfer the first layer of baseA to the network targetB3, and train all layers on the new dataset.

Discussion. Why are the features learned from relatively easy tasks helpful? First, features from a well-trained network can provide a good starting point. The rest of a deeper model can then be regarded as a shallow one, which is easier to converge. Second, features learned in different tasks always have a lot in common. For instance, Figure 6 shows the features learned under different JPEG compression qualities. Obviously, filters a, b, c of high quality are very similar to filters a', b', c' of low quality. This kind of feature can be reused or improved during fine-tuning, making the convergence faster and more stable. Furthermore, a deep network for a hard problem can be seen as an insufficiently biased learner with an overly large hypothesis space to search, and is therefore prone to overfitting. The few transfer settings we investigate introduce good bias to enable the learner to acquire a concept with greater generality. Experimental results in Section VI-C validate the above analysis.

Fig. 6. First-layer filters of AR-CNN learned under different JPEG compression qualities. (a) High compression quality (quality 20 in the MATLAB encoder). (b) Low compression quality (quality 10 in the MATLAB encoder).

VI. EXPERIMENTS

We use the BSDS500 dataset [1] as our training set. Specifically, its disjoint training set (200 images) and test set (200 images) are all used for training, and its validation set (100 images) is used for validation. To use the dataset more efficiently, we adopt data augmentation for the training images in two steps. 1) Scaling: each image is scaled by a factor of 0.9, 0.8, 0.7 and 0.6. 2) Rotation: each image is rotated by 90, 180 and 270 degrees. Then our augmented training set is 5 × 4 = 20 times the original one. We only focus on the restoration of the luminance channel (in YCrCb space) in this paper.

The training image pairs {Y, X} are prepared as follows. Images in the training set are decomposed into 24 × 24 sub-images⁴ $X = \{X_i\}_{i=1}^{n}$. Then the compressed samples $Y = \{Y_i\}_{i=1}^{n}$ are generated from the training samples. The sub-images are extracted from the ground truth images with a stride of 20. Thus the augmented 400 × 20 = 8000 training images provide 1,870,336 training samples. We adopt zero padding for the layers with a filter size larger than 1. As the training is implemented with the Caffe package [14], the deconvolution filter will output a feature map with an (s − 1)-pixel cut on the borders (s is the stride of the first convolutional layer). Specifically, given a 24 × 24 input $Y_i$, AR-CNN produces a (24 − s + 1) × (24 − s + 1) output. Hence, the loss (Eqn. (8)) is computed by comparing against the upper-left (24 − s + 1) × (24 − s + 1) pixels of the ground truth sub-image $X_i$. In the training phase, we follow [6], [12] and use a smaller learning rate (5 × 10⁻⁵) in the last layer and a comparably larger one (5 × 10⁻⁴) in the remaining layers.

³ Random initialization of the remaining layers is also applied similarly for tasks B2 and B3.
⁴ We use sub-images because we regard each sample as an image rather than a big patch.
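The sub-image extraction and compression pipeline described above can be sketched in Python as follows. This is only an illustrative approximation: it uses Pillow's JPEG encoder instead of the MATLAB encoder used in the paper (their quality scales are similar but not identical), and the file name and function names are invented for this example.

```python
from io import BytesIO
import numpy as np
from PIL import Image

def augment(img):
    """Scaling (x1.0, 0.9, 0.8, 0.7, 0.6) combined with rotation
    (0, 90, 180, 270 degrees) gives the 5 x 4 = 20 variants per image."""
    scaled = [img] + [img.resize((int(img.width * s), int(img.height * s)),
                                 Image.BICUBIC) for s in (0.9, 0.8, 0.7, 0.6)]
    return [im.rotate(a, expand=True) for im in scaled for a in (0, 90, 180, 270)]

def training_pairs(img, quality=10, size=24, stride=20):
    """Yield (compressed, ground-truth) luminance sub-images in [0, 1]."""
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    comp = Image.open(buf)

    y_gt = np.asarray(img.convert("YCbCr"), dtype=np.float32)[..., 0] / 255.0
    y_cp = np.asarray(comp.convert("YCbCr"), dtype=np.float32)[..., 0] / 255.0

    h, w = y_gt.shape
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            yield (y_cp[top:top + size, left:left + size],
                   y_gt[top:top + size, left:left + size])

# Usage (hypothetical file name):
# pairs = [p for im in augment(Image.open("bsds_0001.png"))
#          for p in training_pairs(im, quality=10)]
```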
A. Experiments on JPEG-compressed Images

We first compare our methods with some state-of-the-art algorithms, including the deblocking oriented method SA-DCT [8], the deep model SRCNN [6] and the restoration based RTF [13], on restoring JPEG-compressed images. As in other compression artifacts reduction methods (e.g., RTF [13]), we apply the standard JPEG compression scheme, and use the JPEG quality settings q = 40, 30, 20, 10 (from high quality to very low quality) in the MATLAB JPEG encoder. We use the LIVE1 dataset [31] (29 images) as the test set to evaluate both the quantitative and qualitative performance. The LIVE1 dataset contains images with diverse properties. It is widely used in image quality assessment [36] as well as in super-resolution [39]. To have a comprehensive quantitative evaluation, we apply PSNR, structural similarity (SSIM) [36]⁵, and PSNR-B [42] for quality assessment. We want to emphasize the use of PSNR-B. It is designed specifically to assess blocky and deblocked images.

We use the baseline network settings – f1 = 9, f2 = 7, f3 = 1, f4 = 5, n1 = 64, n2 = 32, n3 = 16 and n4 = 1, denoted as 64(9)-32(7)-16(1)-1(5) or simply AR-CNN. A specific network is trained for each JPEG quality. Parameters are randomly initialized from a Gaussian distribution with a standard deviation of 0.001.

1) Comparison with SA-DCT: We first compare AR-CNN with SA-DCT [8], which is widely regarded as the state-of-the-art deblocking oriented method [13], [19]. The quantitative results of PSNR, SSIM and PSNR-B are shown in Table II. On the whole, our AR-CNN outperforms SA-DCT on all JPEG qualities and evaluation metrics by a large margin. Note that the gains on PSNR-B are much larger than those on PSNR. This indicates that AR-CNN could produce images with fewer blocking artifacts. We have also conducted an evaluation on the 5 classical test images used in [8]⁶, and observed the same trend. The results are shown in Table III.

TABLE II: The average results of PSNR (dB), SSIM, and PSNR-B (dB) on the LIVE1 dataset.

Eval. Mat | Quality | JPEG   | SA-DCT | AR-CNN
PSNR      | 10      | 27.77  | 28.65  | 29.13
PSNR      | 20      | 30.07  | 30.81  | 31.40
PSNR      | 30      | 31.41  | 32.08  | 32.69
PSNR      | 40      | 32.35  | 32.99  | 33.63
SSIM      | 10      | 0.7905 | 0.8093 | 0.8232
SSIM      | 20      | 0.8683 | 0.8781 | 0.8886
SSIM      | 30      | 0.9000 | 0.9078 | 0.9166
SSIM      | 40      | 0.9173 | 0.9240 | 0.9306
PSNR-B    | 10      | 25.33  | 28.01  | 28.74
PSNR-B    | 20      | 27.57  | 29.82  | 30.69
PSNR-B    | 30      | 28.92  | 30.92  | 32.15
PSNR-B    | 40      | 29.96  | 31.79  | 33.12

TABLE III: The average results of PSNR (dB), SSIM, and PSNR-B (dB) on the 5 classical test images [8].

Eval. Mat | Quality | JPEG   | SA-DCT | AR-CNN
PSNR      | 10      | 27.82  | 28.88  | 29.04
PSNR      | 20      | 30.12  | 30.92  | 31.16
PSNR      | 30      | 31.48  | 32.14  | 32.52
PSNR      | 40      | 32.43  | 33.00  | 33.34
SSIM      | 10      | 0.7800 | 0.8071 | 0.8111
SSIM      | 20      | 0.8541 | 0.8663 | 0.8694
SSIM      | 30      | 0.8844 | 0.8914 | 0.8967
SSIM      | 40      | 0.9011 | 0.9055 | 0.9101
PSNR-B    | 10      | 25.21  | 28.16  | 28.75
PSNR-B    | 20      | 27.50  | 29.75  | 30.60
PSNR-B    | 30      | 28.94  | 30.83  | 31.99
PSNR-B    | 40      | 29.92  | 31.59  | 32.80

To compare the visual quality, we present some restored images with q = 10, 20 in Figure 10. From the qualitative results, we can see that AR-CNN produces much sharper edges with much less blocking and ringing artifacts compared with SA-DCT. The visual quality has been largely improved in all aspects compared with the state-of-the-art method. Furthermore, AR-CNN is superior to SA-DCT in implementation speed. SA-DCT needs 3.4 seconds to process a 256 × 256 image, while AR-CNN only takes 0.5 second. Both are implemented using C++ on a PC with an Intel i3 CPU (3.1GHz) and 16GB RAM.

2) Comparison with SRCNN: As discussed in Section III-B, SRCNN is not suitable for compression artifacts reduction. For comparison, we train two SRCNN networks with different settings. (i) The original SRCNN (9-1-5) with f1 = 9, f3 = 5, n1 = 64 and n2 = 32. (ii) Deeper SRCNN (9-1-1-5) with an additional non-linear mapping layer (f3 = 1, n3 = 16). They all use the BSDS500 dataset for training and validation as in Section VI. The compression quality is q = 10.

TABLE IV: The average results of PSNR (dB), SSIM, and PSNR-B (dB) on the LIVE1 dataset with q = 10.

Eval. Mat | JPEG   | SRCNN  | Deeper SRCNN | AR-CNN
PSNR      | 27.77  | 28.91  | 28.92        | 29.13
SSIM      | 0.7905 | 0.8175 | 0.8189       | 0.8232
PSNR-B    | 25.33  | 28.52  | 28.46        | 28.74

Quantitative results tested on the LIVE1 dataset are shown in Table IV. We can see that the two SRCNN networks are inferior on all evaluation metrics. From the convergence curves shown in Figure 7, it is clear that AR-CNN achieves higher PSNR from the beginning of the learning stage. Furthermore, from their restored images in Figure 11, we find that the two SRCNN networks both produce images with noisy edges and unnatural smooth regions. These results demonstrate our statements in Section III-B. The success of training a deep model needs comprehensive understanding of the problem and careful design of the model structure.

Fig. 7. Comparisons with SRCNN and Deeper SRCNN (average test PSNR in dB versus number of backprops).

⁵ We use the unweighted structural similarity defined over fixed 8 × 8 windows as in [37].
⁶ The 5 test images in [8] are baboon, barbara, boats, lenna and peppers.
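For reference, the PSNR figures reported in Tables II–IV can be reproduced with a few lines of generic Python (SSIM and PSNR-B follow their respective references [36], [42] and are not re-implemented here):

```python
import numpy as np

def psnr(restored, reference, peak=1.0):
    """PSNR in dB between two images with values in [0, peak]."""
    diff = restored.astype(np.float64) - reference.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```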
3) Comparison with RTF: RTF [13] is a recent state-of-the-art restoration oriented method. Without their deblocking code, we can only compare with the released deblocking results. Their model is trained on the training set (200 images) of the BSDS500 dataset, but all images are down-scaled by a factor of 0.5 [13]. To have a fair comparison, we also train new AR-CNN networks on the same half-sized 200 images. Testing is performed on the test set of the BSDS500 dataset (images scaled by a factor of 0.5), which is also consistent with [13]. We compare with two RTF variants. One is the plain RTF, which uses the filter bank and is optimized for PSNR. The other is the RTF+SA-DCT, which includes SA-DCT as a base method and is optimized for MAE. The latter achieves the highest PSNR value among all RTF variants [13].

TABLE V: The average results of PSNR (dB), SSIM, and PSNR-B (dB) on the test set of the BSDS500 dataset.

Eval. Mat | Quality | JPEG   | RTF    | RTF+SA-DCT | AR-CNN
PSNR      | 10      | 26.62  | 27.66  | 27.71      | 27.79
PSNR      | 20      | 28.80  | 29.84  | 29.87      | 30.00
SSIM      | 10      | 0.7904 | 0.8177 | 0.8186     | 0.8228
SSIM      | 20      | 0.8690 | 0.8864 | 0.8871     | 0.8899
PSNR-B    | 10      | 23.54  | 26.93  | 26.99      | 27.32
PSNR-B    | 20      | 25.62  | 28.80  | 28.80      | 29.15

As shown in Table V, we obtain superior performance to the plain RTF, and even better performance than the combination of RTF and SA-DCT, especially under the more representative PSNR-B metric. Moreover, training on such a small dataset has largely restricted the ability of AR-CNN. The performance of AR-CNN will further improve given more training images.

B. Experiments on JPEG 2000 Images

As mentioned in the introduction, the proposed AR-CNN is effective in dealing with various compression schemes. In this section, we conduct experiments on the JPEG 2000 standard, and compare with the state-of-the-art method – adjusted anchored neighborhood regression (A+) [29]. To have a fair comparison, we follow A+ in the choice of datasets and software. Specifically, we adopt the 91-image dataset [40] for training and 16 classical images [18] for testing. The images are compressed using the JPEG 2000 encoder from the Kakadu software package⁷. We also adopt the same training strategy as A+. To test on images degraded at 0.1 bits per pixel (BPP), the training images are compressed at 0.3 BPP instead of 0.1 BPP. As indicated in [29], the regressors can more easily pick up the artifact patterns at a lower compression rate, leading to better performance. We use the same AR-CNN network structure (64(9)-32(7)-16(1)-1(5)) as in the JPEG experiments. Figure 8 shows the patterns of the learned first-layer filters, which differ a lot from those for JPEG images (see Figure 6).

Fig. 8. First-layer filters of AR-CNN learned for JPEG 2000 at 0.3 BPP.

Apart from A+, we compare our results against another two methods – SLGP [18] and FoE [28]. The PSNR gains on the 16 test images are shown in Figure 9. It is observed that our method outperforms the others on most test images. For the average performance, we achieve a PSNR gain of 0.353 dB, better than A+ with 0.312 dB, SLGP with 0.192 dB and FoE with 0.115 dB. Note that the improvement is already significant in such a difficult scenario – JPEG 2000 at 0.1 BPP [29]. Figure 12 shows some qualitative results, where our method achieves better PSNR and SSIM than A+. However, we also notice that AR-CNN is inferior to the other methods on the tenth image in Figure 9. The restoration results for this image are shown in Figure 13. It is observed that the result of AR-CNN is still visually pleasant, and the lower PSNR is mainly due to the chromatic aberration in smooth regions. The above experiments demonstrate the generalization ability of AR-CNN in handling different compression standards.

Fig. 9. PSNR gain comparison of the proposed AR-CNN against A+, SLGP and FoE on the 16 test images. The x axis corresponds to the image index. The average PSNR gains across the dataset are marked with solid lines.

During training, we also find that AR-CNN is hard to converge using the random initialization mentioned in Section VI-A. We solve the problem by adopting the transfer learning strategy. To be specific, we can transfer the first-layer filters of a well-trained three-layer network to the four-layer AR-CNN, or we can reuse the features of AR-CNN trained on the JPEG images. These correspond to different "easy-hard transfer" strategies – transfer shallow to deeper model and transfer standard to real use case – which are detailed in the following section.

C. Experiments on Easy-Hard Transfer

We show the experimental results of different "easy-hard transfer" settings on JPEG-compressed images.

⁷ http://www.kakadusoftware.com
Fig. 10. Results on image "parrots" (q = 10) show that AR-CNN is better than SA-DCT at removing blocking artifacts. PSNR/SSIM/PSNR-B – JPEG: 32.46 dB/0.8558/29.64 dB; SA-DCT: 33.88 dB/0.9015/33.02 dB; AR-CNN: 34.37 dB/0.9079/34.10 dB.

Fig. 11. Results on image "monarch" show that AR-CNN is better than SRCNN at removing ringing effects. PSNR/SSIM/PSNR-B – JPEG: 30.12 dB/0.8817/26.86 dB; SRCNN: 32.58 dB/0.9298/31.52 dB; Deeper SRCNN: 32.60 dB/0.9301/31.47 dB; AR-CNN: 32.88 dB/0.9343/32.22 dB.

Fig. 12. Results on image "lenna" compressed with JPEG 2000 at 0.1 BPP. PSNR/SSIM – JPEG: 29.86 dB/0.8258; A+: 30.52 dB/0.8349; AR-CNN: 30.61 dB/0.8394.

Fig. 13. Results on image "pepper" compressed with JPEG 2000 at 0.1 BPP. PSNR/SSIM – JPEG: 29.69 dB/0.8010; A+: 30.48 dB/0.8137; AR-CNN: 29.57 dB/0.7997.

Fig. 14. Restoration results of AR-CNN on Twitter-compressed images. PSNR – Twitter: 25.42 dB; Baseline: 28.20 dB; Transfer q10: 28.57 dB.
11

TABLE VI 27.8

AverageRtestRPSNRRqdB)
E XPERIMENTAL SETTINGS OF “ EASY- HARD TRANSFER ”. T HE “9-7-1-5”
AND “9-7-3-1-5” ARE SHORT FOR 64(9)-32(7)-16(1)-1(5) AND 27.7
64(9)-32(7)-16(3)-16(1)-1(5), RESPECTIVELY.
27.6
transferR1Rlayer
transfer short network training initialization transferR2Rlayers
strategy form structure dataset strategy 27.5 base−q10
base base-q10 9-7-1-5 BSDS-q10 Gaussian (0, 0.001) 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
network base-q20 9-7-1-5 BSDS-q20 Gaussian (0, 0.001) NumberRofRbackprops xR10
8

shallow base-q10 9-7-1-5 BSDS-q10 Gaussian (0, 0.001)


to transfer deeper 9-7-3-1-5 BSDS-q10 1,2 layers of base-q10 Fig. 16. Transfer high to low quality.
deep He [11] 9-7-3-1-5 BSDS-q10 He et al. [11]
high base-q10 9-7-1-5 BSDS-q10 Gaussian (0, 0.001) 25.2

Average(test(PSNR(idB)
to transfer 1 layer 9-7-1-5 BSDS-q10 1 layer of base-q20
low transfer 2 layers 9-7-1-5 BSDS-q10 1,2 layer of base-q20 25
standard base-Twitter 9-7-1-5 Twitter Gaussian (0, 0.001)
to transfer q10 9-7-1-5 Twitter 1 layer of base-q10 24.8
real transfer q20 9-7-1-5 Twitter 1 layer of base-q20 transfer(q10
24.6 transfer(q20
base−Twitter

2 4 6 8 10 12
Number(of(backprops x(10
7

27.8
AveragedtestdPSNRd]dB-

Fig. 17. Transfer standard to real use case.


27.7

27.6
convergence speed.
transferddeeper 3) Transfer standard to real use case – Twitter: Online
27.5 Hed[11]
base-q10 Social Media like Twitter are popular platforms for message
0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
posting. However, Twitter will compress the uploaded images
Numberdofdbackprops × 10 8 on the server-side. For instance, a typical 8 mega-pixel (MP)
Fig. 15. Transfer shallow to deeper model. image (3264 × 2448) will result in a compressed and re-
scaled version with a fixed resolution of 600 × 450. Such re-
settings are shown in Table VI. Take the base network as scaling and compression will introduce very complex artifacts,
an example, the “base-q10” is a four-layer AR-CNN 64(9)- making restoration difficult for existing deblocking algorithms
32(7)-16(1)-1(5) trained on the BSDS500 [1] dataset (400 (e.g., SA-DCT). However, AR-CNN can fit to the new data
images) under the compression quality q = 10. Parameters are easily. Further, we want to show that features learned under
initialized by randomly drawing from a Gaussian distribution standard compression schemes could also facilitate training on
with zero mean and standard deviation 0.001. Figures 15 - 17 a completely different dataset. We use 40 photos of resolution
show the convergence curves on the validation set. 3264 × 2448 taken by mobile phones (totally 335,209 training
2) Transfer high to low quality: Results are shown in Figure 16. Obviously, the two networks with transferred features converge faster than the one trained from scratch. For example, to reach an average PSNR of 27.77 dB, the "transfer 1 layer" network takes only 1.54 × 10^8 backprops, roughly half of that for "base-q10". Moreover, "transfer 1 layer" also outperforms "transfer 2 layers" by a slight margin throughout the training phase. One reason for this is that only initializing the first layer provides the network with more flexibility in adapting to a new dataset. This also indicates that a good starting point can help train a better network with a higher convergence speed.
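The "transfer 1 layer" and "transfer 2 layers" settings can be written with the same hypothetical helper as in the sketch above, copying only the lowest layer(s) of a base-q20 model into an otherwise randomly initialized q10 model (again an illustration, not the original code):

base_q20 = arcnn([(64, 9), (32, 7), (16, 1), (1, 5)])  # assumed trained on q = 20 data
net_q10  = arcnn([(64, 9), (32, 7), (16, 1), (1, 5)])  # to be trained on q = 10 data

transfer_layers(base_q20, net_q10, n_convs=1)    # "transfer 1 layer"
# transfer_layers(base_q20, net_q10, n_convs=2)  # "transfer 2 layers"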
3) Transfer standard to real use case – Twitter: Online social media such as Twitter are popular platforms for message posting. However, Twitter compresses the uploaded images on the server side. For instance, a typical 8 mega-pixel (MP) image (3264 × 2448) will result in a compressed and re-scaled version with a fixed resolution of 600 × 450. Such re-scaling and compression introduce very complex artifacts, making restoration difficult for existing deblocking algorithms (e.g., SA-DCT). However, AR-CNN can fit the new data easily. Further, we want to show that features learned under standard compression schemes can also facilitate training on a completely different dataset. We use 40 photos of resolution 3264 × 2448 taken by mobile phones (335,209 training subimages in total) and their Twitter-compressed versions^8 to train three networks with the initialization settings listed in Table VI. From Figure 17, we observe that the "transfer q10" and "transfer q20" networks converge much faster than the "base-Twitter" network trained from scratch. Specifically, "transfer q10" takes 6 × 10^7 backprops to achieve 25.1 dB, while "base-Twitter" uses 10 × 10^7 backprops. Besides the faster convergence, transferred features also lead to higher PSNR values than "base-Twitter". This observation suggests that features learned under standard compression schemes are also transferable to real use cases. Some restoration results are shown in Figure 14; both networks achieve satisfactory quality improvements over the compressed version.

^8 We have shared this dataset on our project page: http://mmlab.ie.cuhk.edu.hk/projects/ARCNN.html.
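As a rough sketch of how aligned training sub-images could be cropped from such photo pairs, the snippet below assumes the ground-truth photo and its compressed counterpart have already been brought to the same resolution; the patch size and stride are illustrative choices, not the authors' exact protocol.

import numpy as np
from PIL import Image

def extract_pairs(gt_path, compressed_path, patch=32, stride=20):
    """Crop aligned (compressed, ground-truth) grayscale sub-image pairs.
    Both images are assumed to be pre-aligned to the same resolution."""
    gt = np.asarray(Image.open(gt_path).convert("L"), np.float32) / 255.0
    comp = np.asarray(Image.open(compressed_path).convert("L"), np.float32) / 255.0
    assert gt.shape == comp.shape, "images must be aligned to the same size"
    inputs, labels = [], []
    h, w = gt.shape
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            inputs.append(comp[y:y + patch, x:x + patch])
            labels.append(gt[y:y + patch, x:x + patch])
    return inputs, labels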
D. Experiments on Acceleration Strategies

In this section, we conduct a set of controlled experiments to demonstrate the effectiveness of the proposed acceleration strategies. Following the descriptions in Section IV, we progressively modify the baseline AR-CNN by layer decomposition, adopting large-stride layers and expanding the mapping layer. The networks are trained on JPEG images under the quality q = 10. We further test the performance of Fast AR-CNN on different compression qualities (q = 10, 20, 30, 40). As all the modified networks are deeper than the baseline model, we adopt the proposed transfer learning strategy (transfer shallow to deeper model) for fast and stable training. The base network is also "base-q10" as in Section VI-C1. All the quantitative results are listed in Table VII.

TABLE VII
THE EXPERIMENTAL RESULTS OF DIFFERENT SETTINGS.

Group               Setting          PSNR (dB)   SSIM     PSNR-B (dB)
layer replacement   base-q10         29.13       0.8232   28.74
                    replace deeper   29.13       0.8234   28.72
stride              s = 1            29.13       0.8234   28.72
                    s = 2            29.07       0.8232   28.66
                    s = 3            28.78       0.8178   28.44
mapping filters     n4 = 16          29.07       0.8232   28.66
                    n4 = 48          29.04       0.8238   28.58
                    n4 = 64          29.10       0.8246   28.65
                    n4 = 80          29.10       0.8244   28.69
JPEG quality        fast-q10         29.10       0.8246   28.65
                    base-q10         29.13       0.8232   28.74
                    fast-q20         31.29       0.8873   30.54
                    base-q20         31.40       0.8886   30.69
                    fast-q30         32.41       0.9124   31.43
                    base-q30         32.69       0.9166   32.15
                    fast-q40         33.43       0.9306   32.51
                    base-q40         33.63       0.9306   33.12

Fig. 18. The performance of using different stride sizes. (Plot: average test PSNR (dB) versus number of backprops ×10^8 for s = 1, 2, 3.)

Fig. 19. The performance of using different mapping filters. (Plot: average test PSNR (dB) versus number of backprops ×10^8 for n4 = 16, 48, 64, 80.)
1) Layer decomposition: The layer decomposition strategy replaces the "feature enhancement" layer with a "shrinking" layer and an "enhancement" layer, yielding the modified network 64(9)-32(1)-32(7)-16(1)-1(5). The experimental results are shown in Table VII, from which we can see that "replace deeper" achieves almost the same performance as "base-q10" on all metrics. This indicates that layer decomposition is an effective strategy for reducing the network parameters with almost no performance loss.
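A quick weight count (ours, biases ignored) makes the saving explicit: the single 7×7 layer applied on 64 channels is replaced by a cheap 1×1 shrinking layer followed by a 7×7 layer on only 32 channels.

def conv_weights(c_in, c_out, k):
    # number of weights in a c_in -> c_out convolution with a k x k kernel
    return c_in * c_out * k * k

enhancement_on_64 = conv_weights(64, 32, 7)                              # original 32(7) layer: 100,352
shrink_then_enhance = conv_weights(64, 32, 1) + conv_weights(32, 32, 7)  # 32(1) + 32(7): 52,224
print(enhancement_on_64, shrink_then_enhance)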
2) Stride size: We then introduce the large-stride convolutional and deconvolutional layers and change the stride size. Generally, a larger stride leads to much narrower feature maps and faster inference, but at the risk of worse reconstruction quality. In order to find a good trade-off setting, we conduct experiments with different stride sizes, as shown in the "stride" part of Table VII. The network settings for s = 1, s = 2 and s = 3 are 64(9)-32(1)-32(7)-16(1)-1(5), 64(9)-32(1)-32(7)-16(1)-1[9]-s2 and 64(9)-32(1)-32(7)-16(1)-1[9]-s3, respectively. From the results in Table VII, we can see that there are only small differences between "s = 1" and "s = 2" on all metrics. But when we further enlarge the stride size, the performance declines dramatically, e.g., the PSNR value drops by more than 0.2 dB from "s = 2" to "s = 3". Convergence curves in Figure 18 exhibit a similar trend, where "s = 3" achieves inferior performance to "s = 1" and "s = 2" on the validation set^9. With little performance loss yet a 7.5 times speedup, stride s = 2 clearly balances performance and time complexity. Thus we adopt stride s = 2 in the following experiments.

^9 As the validation set (BSD500 validation set) is different from the test set (LIVE1 dataset), the results in Table VII and Figure 18 are different.
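The sketch below illustrates the s = 2 variant: the first convolution runs at stride 2 so the middle layers operate on quarter-size feature maps, and a 9×9 stride-2 deconvolution (the 1[9]-s2 layer) restores the original resolution. The PyTorch rendering and the padding/output_padding choices are our assumptions, made so that the output matches the input size for even-sized inputs; this is not the authors' original implementation.

import torch
import torch.nn as nn

class StridedARCNNSketch(nn.Module):
    """Illustrative 64(9)-32(1)-32(7)-n4(1)-1[9]-s2 stack."""
    def __init__(self, n4=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, 9, stride=2, padding=4), nn.ReLU(True),  # 64(9), stride 2
            nn.Conv2d(64, 32, 1), nn.ReLU(True),                      # 32(1) shrinking
            nn.Conv2d(32, 32, 7, padding=3), nn.ReLU(True),           # 32(7) enhancement
            nn.Conv2d(32, n4, 1), nn.ReLU(True),                      # n4(1) mapping
            nn.ConvTranspose2d(n4, 1, 9, stride=2, padding=4,
                               output_padding=1),                     # 1[9]-s2 reconstruction
        )

    def forward(self, x):
        return self.body(x)

x = torch.randn(1, 1, 240, 320)
print(StridedARCNNSketch()(x).shape)  # torch.Size([1, 1, 240, 320])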
3) Mapping filters: As mentioned in Section IV, we can increase the number of mapping filters to compensate for the performance loss. In the "mapping filters" part of Table VII, we compare a set of experiments that differ only in the number of mapping filters. To be specific, the network setting is 64(9)-32(1)-32(7)-n4(1)-1[9]-s2 with n4 = 16, 48, 64, 80. The convergence curves shown in Figure 19^10 better reflect their differences. Obviously, using more filters achieves better performance, but the improvement is marginal beyond n4 = 64. Thus we adopt n4 = 64, which is also consistent with our comment in Section IV. Finally, we arrive at the optimal network setting, 64(9)-32(1)-32(7)-64(1)-1[9]-s2, namely Fast AR-CNN, which achieves similar performance to the baseline model 64(9)-32(7)-16(1)-1(5) but is 7.5 times faster.

^10 As the validation set (BSD500 validation set) is different from the test set (LIVE1 dataset), the results in Table VII and Figure 19 are different.
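A back-of-the-envelope count (ours) shows why enlarging the mapping layer is affordable: with stride 2, the mapping and reconstruction layers run on roughly a quarter of the input pixels, so even n4 = 64 adds only a modest number of multiplications per input pixel.

def mults_per_input_pixel(n4, spatial_ratio=0.25):
    # approximate multiplications per input pixel for the mapping (1x1) and
    # reconstruction (9x9 deconvolution) layers on quarter-size feature maps
    mapping = 32 * n4 * 1 * 1
    reconstruction = n4 * 1 * 9 * 9
    return (mapping + reconstruction) * spatial_ratio

for n4 in (16, 48, 64, 80):
    print(n4, mults_per_input_pixel(n4))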
4) JPEG quality: In the above experiments, we mainly focus on a very low quality, q = 10. Here we examine the capacity of the new network on different compression qualities. In the "JPEG quality" part of Table VII, we compare Fast AR-CNN with the baseline AR-CNN on quality q = 10, 20, 30, 40. For example, "fast-q10" and "base-q10" represent 64(9)-32(1)-32(7)-64(1)-1[9]-s2 and 64(9)-32(7)-16(1)-1(5) on quality q = 10, respectively. From the quantitative results, we observe that Fast AR-CNN is comparable with AR-CNN on low qualities such as q = 10 and q = 20, but it is inferior to AR-CNN on high qualities such as q = 30 and q = 40. This phenomenon is reasonable. As low-quality images contain much less information, extracting features in a sparse way (using a large stride) does little harm to the restoration quality. On the contrary, for high-quality images, adjacent image patches may differ a lot, so when we adopt a large stride, we lose information that is useful for restoration. Nevertheless, the proposed Fast AR-CNN still outperforms the state-of-the-art methods (as presented in Section VI-A) on different compression qualities.
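For completeness, the snippet below sketches how compressed inputs at different qualities and the corresponding PSNR values can be produced; it uses Pillow's JPEG encoder as a stand-in, whereas the benchmarks in this paper are typically built with the MATLAB JPEG encoder, so absolute numbers will differ.

import io
import numpy as np
from PIL import Image

def compress(img, q):
    """Return a JPEG-compressed copy of a grayscale PIL image at quality q."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=q)
    buf.seek(0)
    return Image.open(buf).convert("L")

def psnr(a, b):
    a, b = np.asarray(a, np.float64), np.asarray(b, np.float64)
    mse = np.mean((a - b) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)

original = Image.open("example.png").convert("L")  # hypothetical test image
for q in (10, 20, 30, 40):
    print(q, round(psnr(original, compress(original, q)), 2))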
VII. CONCLUSION

Applying deep models to low-level vision problems requires a deep understanding of the problem itself. In this paper, we carefully study the compression process and propose a four-layer convolutional network, AR-CNN, which is extremely effective in dealing with various compression artifacts. We then propose two acceleration strategies to reduce its time complexity while maintaining good performance. We further systematically investigate three easy-to-hard transfer settings that could facilitate training a deeper or better network, and verify the effectiveness of transfer learning in low-level vision problems.

REFERENCES

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011. 7, 11
[2] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. A. Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In British Machine Vision Conference, 2012. 4
[3] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In IEEE Conference on Computer Vision and Pattern Recognition, pages 2392–2399, 2012. 2, 6
[4] P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In IEEE International Conference on Computer Vision, pages 1841–1848, 2013. 1
[5] C. Dong, Y. Deng, C. Change Loy, and X. Tang. Compression artifacts reduction by a deep convolutional network. In IEEE International Conference on Computer Vision, pages 576–584, 2015. 2
[6] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184–199, 2014. 1, 2, 3, 4, 7, 8
[7] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2015. 2, 7
[8] A. Foi, V. Katkovnik, and K. Egiazarian. Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images. IEEE Transactions on Image Processing, 16(5):1395–1411, 2007. 1, 2, 3, 8
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014. 3
[10] M. A. Gluck and C. E. Myers. Hippocampal mediation of stimulus representation: A computational theory. Hippocampus, 3(4):491–516, 1993. 7
[11] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1026–1034, 2015. 4, 6, 11
[12] V. Jain and S. Seung. Natural image denoising with convolutional networks. In Advances in Neural Information Processing Systems, pages 769–776, 2009. 8
[13] J. Jancsary, S. Nowozin, and C. Rother. Loss-specific training of non-parametric image restoration models: A new state of the art. In European Conference on Computer Vision, pages 112–125, 2012. 3, 8, 9
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678, 2014. 7
[15] C. Jung, L. Jiao, H. Qi, and T. Sun. Image deblocking via sparse representation. Signal Processing: Image Communication, 27(6):663–677, 2012. 2, 3
[16] K. I. Kim and Y. Kwon. Single-image super-resolution using sparse regression and natural image prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(6):1127–1133, 2010. 4
[17] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012. 3, 5, 6
[18] Y. Kwon, K. I. Kim, J. Tompkin, J. H. Kim, and C. Theobalt. Efficient learning of image super-resolution and compression artifact removal with semi-local Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1792–1805, 2015. 2, 3, 9
[19] Y. Li, F. Guo, R. T. Tan, and M. S. Brown. A contrast enhancement framework with JPEG artifacts suppression. In European Conference on Computer Vision, pages 174–188, 2014. 1, 3, 8
[20] A.-C. Liew and H. Yan. Blocking artifacts suppression in block-coded images using overcomplete wavelet representation. IEEE Transactions on Circuits and Systems for Video Technology, 14(4):450–461, 2004. 1, 3
[21] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz. Adaptive deblocking filter. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):614–619, 2003. 1, 3
[22] X. Liu, X. Wu, J. Zhou, and D. Zhao. Data-driven sparsity-based restoration of JPEG-compressed images in dual transform-pixel domain. In IEEE Conference on Computer Vision and Pattern Recognition, volume 16, pages 1395–1411, 2015. 1, 3
[23] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2014. 5
[24] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, pages 807–814, 2010. 4
[25] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724, 2014. 3
[26] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C.-C. Loy, et al. DeepID-Net: Deformable deep convolutional neural networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2403–2412, 2015. 5
[27] H. C. Reeve III and J. S. Lim. Reduction of blocking effects in image coding. Optical Engineering, 23(1):230134, 1984. 1, 3
[28] S. Roth and M. J. Black. Fields of experts. International Journal of Computer Vision, 82(2):205–229, 2009. 2, 9
[29] R. Rothe, R. Timofte, and L. Van Gool. Efficient regression priors for reducing image compression artifacts. In International Conference on Image Processing, pages 1543–1547, 2015. 2, 3, 9
[30] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013. 3
[31] H. R. Sheikh, Z. Wang, L. Cormack, and A. C. Bovik. LIVE image quality assessment database release 2, 2005. 8
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014. 2, 6, 7
[33] D. Sun and W.-K. Cham. Postprocessing of low bit-rate block DCT coded images based on a fields of experts prior. IEEE Transactions on Image Processing, 16(11):2743–2751, 2007. 3
[34] R. Timofte, V. De Smet, and L. Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision, pages 111–126. Springer, 2014. 3
[35] C. Wang, J. Zhou, and S. Liu. Adaptive non-local means filter for image deblocking. Signal Processing: Image Communication, 28(5):522–530, 2013. 1, 3
[36] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. 8
[37] Z. Wang and E. P. Simoncelli. Maximum differentiation (MAD) competition: A methodology for comparing computational models of perceptual quantities. Journal of Vision, 8(12):8, 2008. 8
[38] Z. Xiong, X. Sun, and F. Wu. Image hallucination with feature enhancement. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2074–2081, 2009. 4
[39] C.-Y. Yang, C. Ma, and M.-H. Yang. Single-image super-resolution: A benchmark. In European Conference on Computer Vision, pages 372–386, 2014. 1, 8
[40] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010. 4, 9
[41] Y. Yang, N. P. Galatsanos, and A. K. Katsaggelos. Projection-based spatially adaptive reconstruction of block-transform compressed images. IEEE Transactions on Image Processing, 4(7):896–908, 1995. 3
[42] C. Yim and A. C. Bovik. Quality assessment of deblocked images. IEEE Transactions on Image Processing, 20(1):88–98, 2011. 8
[43] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014. 3
[44] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833, 2014. 3, 4
[45] Z. Wang, D. Liu, S. Chang, Q. Ling, and T. S. Huang. D3: Deep dual-domain based fast restoration of JPEG-compressed images. In IEEE Conference on Computer Vision and Pattern Recognition, 2016. 1, 2, 3
