Deep Convolution Networks For Compression Artifacts Reduction
Wang et al. [45] further introduce deep sparse-coding networks to the DCT and pixel domains and achieve superior performance. These methods can be referred to as soft decoding for a specific compression standard (e.g., JPEG), and can hardly be extended to other compression schemes. Alternatively, data-driven learning-based methods have better generalization ability. Jung et al. [15] propose a restoration method based on sparse representation. Kwon et al. [18] adopt Gaussian process (GP) regression to achieve both super-resolution and compression artifact removal. The adjusted anchored neighborhood regression (A+) approach [29] is also used to enhance JPEG 2000 images. These methods can be easily generalized for different tasks.

Deep learning has shown impressive results on both high-level and low-level vision problems. In particular, the Super-Resolution Convolutional Neural Network (SRCNN) proposed by Dong et al. [6] shows the great potential of an end-to-end DCN in image super-resolution. The study also points out that the conventional sparse-coding-based image restoration model can be equally seen as a deep model. However, if we directly apply SRCNN to compression artifact reduction, the features extracted by its first layer could be noisy, leading to undesirable noisy patterns in the reconstruction. Thus the three-layer SRCNN is not well suited for restoring compressed images, especially when dealing with complex artifacts.

To eliminate the undesired artifacts, we improve SRCNN by embedding one or more "feature enhancement" layers after the first layer to clean the noisy features. Experiments show that the improved model, namely Artifacts Reduction Convolutional Neural Networks (AR-CNN), is exceptionally effective in suppressing blocking artifacts while retaining edge patterns and sharp details (see Figure 1). Different from the JPEG-specific models, AR-CNN is equally effective in coping with different compression schemes, including JPEG, JPEG 2000, Twitter and so on.

However, the network scale increases significantly when we add another layer, making it hard to apply in real-world applications. Generally, the high computational cost has been a major bottleneck for most previous methods [45]. When delving into the network structure, we find two key factors that restrict the inference speed. First, the added "feature enhancement" layer accounts for almost 95% of the total parameters. Second, when we adopt a fully-convolutional structure, the time complexity increases quadratically with the spatial size of the input image.

To accelerate the inference process while still maintaining good performance, we investigate a more efficient framework with two main modifications. For the redundant parameters, we insert a "shrinking" layer with 1 × 1 filters between the first two layers. For the large computational load of convolution, we use large-stride convolution filters in the first layer and the corresponding deconvolution filters in the last layer. The convolution operations in the middle layers are then conducted on smaller feature maps, leading to much faster inference. Experiments show that the modified network, namely Fast AR-CNN, can be 7.5 times faster than the baseline AR-CNN with almost no performance loss. This further helps us formulate a more general CNN framework for low-level vision problems. We also reveal its close relationship with the conventional Multi-Layer Perceptron [3].

Another issue we met is how to effectively train a deeper DCN. As pointed out in SRCNN [7], training a five-layer network becomes a bottleneck. The difficulty of training is partially due to the sub-optimal initialization settings. The aforementioned difficulty motivates us to investigate a better way to train a deeper model for low-level vision problems. We find that this can be effectively solved by transferring the features learned in a shallow network to a deeper one and fine-tuning simultaneously¹. This strategy has also been proven successful in learning a deeper CNN for image classification [32]. Following a similar general intuitive idea, easy to hard, we discover other interesting transfer settings in our low-level vision task: (1) We transfer the features learned in a high-quality compression model (easier) to a low-quality one (harder), and find that it converges faster than random initialization. (2) In real use cases, companies tend to apply different compression strategies (including re-scaling) according to their purposes (e.g., Figure 1(b)). We transfer the features learned in a standard compression model (easier) to a real use case (harder), and find that it performs better than learning from scratch.

The contributions of this study are four-fold: (1) We formulate a new deep convolutional network for efficient reduction of various compression artifacts. Extensive experiments, including those on real use cases, demonstrate the effectiveness of our method over state-of-the-art methods [8] both perceptually and quantitatively. (2) We progressively modify the baseline model AR-CNN and present a more efficient network structure, which achieves a speed-up of 7.5× compared to the baseline AR-CNN while still maintaining state-of-the-art performance. (3) We verify that reusing the features in shallow networks is helpful in learning a deeper model for compression artifacts reduction. Under the same intuitive idea – easy to hard – we reveal a number of interesting and practical transfer settings.

The preliminary version of this work was published earlier [5]. In this work, we make significant improvements in both methodology and experiments. First, in the methodology, we add an analysis of the computational cost of the proposed model, and point out two key factors that affect the time efficiency. We then propose the corresponding acceleration strategies, and extend the baseline model to a more general and efficient network structure. In the experiments, we adopt data augmentation to further push the performance. In addition, we conduct experiments on JPEG 2000 images and show superior performance to the state-of-the-art methods [18], [28], [29]. A detailed investigation of the network settings of the new framework is presented afterwards.

¹ Generally, the transfer learning method will train a base network first, and copy the learned parameters or features of several layers to the corresponding layers of a target network. These transferred layers can be left frozen or fine-tuned on the target dataset. The remaining layers are randomly initialized and trained for the target task.
Fig. 2. The framework of the Artifacts Reduction Convolutional Neural Network (AR-CNN). The network consists of four convolutional layers, each of which is responsible for a specific operation; the four operations (i.e., feature extraction, feature enhancement, mapping and reconstruction) are optimized jointly in an end-to-end framework. The example feature maps shown for each step illustrate the functionality of each operation; they are normalized for better visualization.
II. RELATED WORK

Existing algorithms can be classified into deblocking oriented and restoration oriented methods. The deblocking oriented methods focus on removing blocking and ringing artifacts. In the spatial domain, different kinds of filters [21], [27], [35] have been proposed to adaptively deal with blocking artifacts in specific regions (e.g., edge, texture, and smooth regions). In the frequency domain, Liew et al. [20] utilize the wavelet transform and derive thresholds at different wavelet scales for denoising. The most successful deblocking oriented method is perhaps the Pointwise Shape-Adaptive DCT (SA-DCT) [8], which is widely acknowledged as the state-of-the-art approach [13], [19]. However, like most deblocking oriented methods, SA-DCT cannot reproduce sharp edges and tends to overly smooth texture regions.

The restoration oriented methods regard the compression operation as distortion and aim to reduce such distortion. These methods include the projection onto convex sets based method (POCS) [41], solving a MAP problem (FoE) [33], the sparse-coding-based method [15], the semi-local Gaussian process model [18], the Regression Tree Fields based method (RTF) [13] and adjusted anchored neighborhood regression (A+) [29]. The RTF takes the results of SA-DCT [8] as bases and produces globally consistent image reconstructions with a regression tree field model. It can also be optimized for any differentiable loss function (e.g., SSIM), but often at the cost of performing sub-optimally on other evaluation metrics. As a recent method for image super-resolution [34], A+ [29] has also been successfully applied to compression artifacts reduction. In their method, the input image is decomposed into overlapping patches and sparsely represented by a dictionary of anchoring points. Then the uncompressed patches are predicted by multiplying with the corresponding linear regressors. They obtain impressive results on JPEG 2000 images, but have not tested on other compression schemes.

To deal with a specific compression standard, especially JPEG, some recent methods incorporate information from dual domains (the DCT and pixel domains) and achieve impressive results. Specifically, Liu et al. [22] apply sparse-coding in the DCT domain to eliminate the quantization error, then restore the lost high frequency components in the pixel domain. On their basis, Wang et al. [45] replace the sparse-coding steps with deep neural networks in both domains and achieve superior performance. These methods all require problem-specific prior knowledge (e.g., the quantization table) and operate on the 8×8 pixel blocks, thus they cannot be generalized to other compression schemes, such as JPEG 2000 and Twitter.

The Super-Resolution Convolutional Neural Network (SRCNN) [6] is closely related to our work. In that study, independent steps in the sparse-coding-based method are formulated as different convolutional layers and optimized in a unified network. It shows the potential of deep models in low-level vision problems like super-resolution. However, the problem of compression is different from super-resolution in that the former consists of different kinds of artifacts. Designing a deep model for compression restoration requires a deep understanding of the different artifacts. We show that directly applying the SRCNN architecture to compression restoration results in undesired noisy patterns in the reconstructed image.

Transfer learning in deep neural networks has become popular since the success of deep learning in image classification [17]. The features learned from ImageNet show good generalization ability [44] and have become a powerful tool for several high-level vision problems, such as Pascal VOC image classification [25] and object detection [9], [30]. Yosinski et al. [43] have also tried to quantify the degree to which a particular layer is general or specific. Overall, transfer learning has been systematically investigated in high-level vision problems, but not in low-level vision tasks. In this study, we explore several transfer settings on compression artifacts reduction and show the effectiveness of transfer learning in low-level vision problems.

III. METHODOLOGY

Our proposed approach is based on the successful low-level vision model – SRCNN [6]. To have a better understanding of our work, we first give a brief overview of SRCNN. Then we explain the insights that lead to a deeper network and present our new model. Subsequently, we explore three types of transfer learning strategies that help in training a deeper and better network.
A. Review of SRCNN

The SRCNN aims at learning an end-to-end mapping, which takes the low-resolution image Y (after interpolation) as input and directly outputs the high-resolution one F(Y). The network contains three convolutional layers, each of which is responsible for a specific task. Specifically, the first layer performs patch extraction and representation, which extracts overlapping patches from the input image and represents each patch as a high-dimensional vector. Then the non-linear mapping layer maps each high-dimensional vector of the first layer to another high-dimensional vector, which is conceptually the representation of a high-resolution patch. At last, the reconstruction layer aggregates the patch-wise representations to generate the final output. The network can be expressed as:

F_0(Y) = Y;  (1)
F_i(Y) = max(0, W_i ∗ F_{i−1}(Y) + B_i),  i ∈ {1, 2};  (2)
F(Y) = W_3 ∗ F_2(Y) + B_3,  (3)

where W_i and B_i represent the filters and biases of the i-th layer respectively, F_i is the output feature maps and "∗" denotes the convolution operation. W_i contains n_i filters of support n_{i−1} × f_i × f_i, where f_i is the spatial support of a filter, n_i is the number of filters, and n_0 is the number of channels in the input image. Note that there are no pooling or fully-connected layers in SRCNN, so the final output F(Y) is of the same size as the input image. The Rectified Linear Unit (ReLU, max(0, x)) [24] is applied on the filter responses. These three steps are analogous to the basic operations in the sparse-coding-based super-resolution methods [40], and this close relationship lays the theoretical foundation for its successful application in super-resolution. Details can be found in the paper [6].

B. Convolutional Neural Network for Compression Artifacts Reduction

Insights. In sparse-coding-based methods and SRCNN, the first step – feature extraction – determines what should be emphasized and restored in the following stages. However, as various compression artifacts are coupled together, the extracted features are usually noisy and ambiguous for accurate mapping. In the experiments on reducing JPEG compression artifacts (see Section VI-A2), we find that some quantization noises coupled with high frequency details are further enhanced, bringing unexpected noisy patterns around sharp edges. Moreover, blocking artifacts in flat areas are misrecognized as normal edges, causing abrupt intensity changes in smooth regions. Inspired by the feature enhancement step in super-resolution [38], we introduce a feature enhancement layer after the feature extraction layer in SRCNN to form a new and deeper network – AR-CNN. This layer maps the "noisy" features to a relatively "cleaner" feature space, which is equivalent to denoising the feature maps.

Formulation. The overview of the new network AR-CNN is shown in Figure 2. The three layers of SRCNN remain unchanged in the new model. To conduct feature enhancement, we extract new features from the n_1 feature maps of the first layer, and combine them to form another set of feature maps. Overall, the AR-CNN consists of four layers, namely the feature extraction, feature enhancement, mapping and reconstruction layers.

Different from SRCNN, which adopts ReLU as the activation function, we use the Parametric Rectified Linear Unit (PReLU) [11] in the new networks. To distinguish ReLU and PReLU, we define a general activation function as:

PReLU(x_j) = max(x_j, 0) + a_j · min(0, x_j),  (4)

where x_j is the input signal of the activation f on the j-th channel, and a_j is the coefficient of the negative part. The parameter a_j is set to zero for ReLU, but is learnable for PReLU. We choose PReLU mainly to avoid the "dead features" [44] caused by zero gradients in ReLU. We represent the whole network as:

F_0(Y) = Y;  (5)
F_i(Y) = PReLU(W_i ∗ F_{i−1}(Y) + B_i),  i ∈ {1, 2, 3};  (6)
F(Y) = W_4 ∗ F_3(Y) + B_4,  (7)

where the meaning of the variables is the same as that in Equation 1, and the second layer (W_2, B_2) is the added feature enhancement layer.

It is worth noting that AR-CNN is not equal to a deeper SRCNN that contains more than one non-linear mapping layer². A deeper SRCNN imposes more non-linearity in the mapping stage, which is equivalent to adopting a more robust regressor between the low-level features and the final output. Similar ideas have been proposed in some sparse-coding-based methods [2], [16]. However, as compression artifacts are complex, low-level features extracted by a single layer are noisy. Thus the performance bottleneck lies in the features rather than the regressor. AR-CNN improves the mapping accuracy by enhancing the extracted low-level features, and the first two layers together can be regarded as a better feature extractor. This leads to better performance than a deeper SRCNN. Experimental results of AR-CNN, SRCNN and deeper SRCNN will be shown in Section VI-A2.

C. Model Learning

Given a set of ground truth images {X_i} and their corresponding compressed images {Y_i}, we use the Mean Squared Error (MSE) as the loss function:

L(Θ) = (1/n) Σ_{i=1}^{n} ‖F(Y_i; Θ) − X_i‖²,  (8)

where Θ = {W_1, W_2, W_3, W_4, B_1, B_2, B_3, B_4} and n is the number of training samples. The loss is minimized using stochastic gradient descent with the standard backpropagation. We adopt a batch-mode learning method with a batch size of 128.

² Adding non-linear mapping layers has been suggested as an extension of SRCNN in [6].
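For concreteness, the four-layer AR-CNN of Equations (5)–(7) can be sketched in a few lines of a modern deep learning framework. The snippet below is only an illustration and assumes PyTorch (not necessarily the framework used by the authors); the filter numbers and sizes follow the 64(9)-32(7)-16(1)-1(5) baseline used later in the experiments, and zero padding is used so that the output keeps the input size.

```python
import torch
import torch.nn as nn

class ARCNN(nn.Module):
    """Sketch of the baseline AR-CNN: 64(9)-32(7)-16(1)-1(5) with PReLU (Eqs. 5-7)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=9, padding=4),   # feature extraction (W1, B1)
            nn.PReLU(num_parameters=64),                   # channel-wise a_j as in Eq. (4)
            nn.Conv2d(64, 32, kernel_size=7, padding=3),   # feature enhancement (W2, B2)
            nn.PReLU(num_parameters=32),
            nn.Conv2d(32, 16, kernel_size=1),              # mapping (W3, B3)
            nn.PReLU(num_parameters=16),
        )
        # reconstruction (W4, B4): linear output, no activation (Eq. 7)
        self.reconstruction = nn.Conv2d(16, 1, kernel_size=5, padding=2)

    def forward(self, y):
        return self.reconstruction(self.features(y))

model = ARCNN()
y = torch.randn(1, 1, 24, 24)          # a single-channel compressed sub-image
print(model(y).shape)                  # torch.Size([1, 1, 24, 24])
```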
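Continuing the sketch above, a training step with the MSE loss of Equation (8), stochastic gradient descent and a batch size of 128 could look as follows; the data here are random placeholders, not the BSDS500 sub-images used in the paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder pairs of compressed inputs Y_i and ground-truth images X_i.
Y = torch.randn(512, 1, 24, 24)
X = torch.randn(512, 1, 24, 24)
loader = DataLoader(TensorDataset(Y, X), batch_size=128, shuffle=True)

model = ARCNN()                       # the sketch defined above
criterion = nn.MSELoss()              # Eq. (8): mean squared error
optimizer = torch.optim.SGD(model.parameters(), lr=5e-4)

for compressed, target in loader:     # one epoch of batch-mode SGD
    optimizer.zero_grad()
    loss = criterion(model(compressed), target)
    loss.backward()                   # standard backpropagation
    optimizer.step()
```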
Fig. 4. The framework of the Fast AR-CNN. There are two main modifications based on the original AR-CNN. First, the layer decomposition splits the
original “feature enhancement” layer into a “shrinking” layer and an “enhancement” layer. Then the large-stride convolutional and deconvolutional layers
significantly decrease the spatial size of the feature maps of the middle layers. The overall shape of the framework is like an hourglass, which is thick at the
ends and thin in the middle.
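To make the hourglass structure of Fig. 4 concrete, the following sketch (again assuming PyTorch purely for illustration) instantiates the 64(9)-32(1)-32(7)-64(1)-1[9]-s2 configuration: a stride-s first convolution, a 1 × 1 shrinking layer, the 7 × 7 enhancement layer, a 1 × 1 layer with the expanded number of mapping filters, and a stride-s deconvolution (transposed convolution) that restores the input resolution. Relaxing the filter numbers and sizes, and repeating the middle layer m times, gives the general framework of Equation (12) discussed below.

```python
import torch
import torch.nn as nn

class FastARCNN(nn.Module):
    """Sketch of Fast AR-CNN: 64(9)-32(1)-32(7)-64(1)-1[9]-s2."""
    def __init__(self, s: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=9, stride=s, padding=4),  # large-stride feature extraction
            nn.PReLU(num_parameters=64),
            nn.Conv2d(64, 32, kernel_size=1),                      # shrinking layer (1x1)
            nn.PReLU(num_parameters=32),
            nn.Conv2d(32, 32, kernel_size=7, padding=3),           # enhancement on s-times smaller maps
            nn.PReLU(num_parameters=32),
            nn.Conv2d(32, 64, kernel_size=1),                      # expanded mapping filters (1x1)
            nn.PReLU(num_parameters=64),
        )
        # stride-s deconvolution brings the feature maps back to the input size
        self.deconv = nn.ConvTranspose2d(64, 1, kernel_size=9, stride=s,
                                         padding=4, output_padding=s - 1)

    def forward(self, y):
        return self.deconv(self.body(y))

x = torch.randn(1, 1, 64, 64)
print(FastARCNN()(x).shape)   # torch.Size([1, 1, 64, 64]) for the default s = 2
```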
Figure 3(b)). Therefore, if we use the same stride for the first and the last layer, the output will remain the same size as the input, as depicted in Figure 4. With the joint use of large-stride convolutional and deconvolutional layers, the spatial size of the feature maps m_i will become m_i/s, which reduces the overall time complexity significantly.

Although the above modifications improve the time efficiency, they may also influence the restoration quality. To further improve the performance, we can expand the mapping layer (i.e., use more mapping filters) and enlarge the filter size of the deconvolutional layer. For instance, we can set the number of mapping filters to be the same as that of the first-layer filters (i.e., from 16 to 64), and use the same filter size for the first and the last layer (i.e., f_1 = f_5 = 9). This is a feasible solution but not a strict rule. In general, it can be seen as a compensation for the low time complexity. In Section VI-D1, we investigate different settings through a series of controlled experiments, and find a good trade-off between performance and complexity.

Fast AR-CNN. Through the above modifications, we arrive at a more efficient network structure. If we set s = 2, the modified model can be represented as 64(9)-32(1)-32(7)-64(1)-1[9]-s2, where the square bracket refers to the deconvolution filter. We name the new model Fast AR-CNN. The number of its overall parameters is 56,496 by Equation 9. The acceleration ratio can then be calculated as 106448/56496 · 2² = 7.5. Note that this network could achieve similar results as the baseline model, as shown in Section VI-D1.

C. A General Framework

When we relax the network settings, such as the filter number, filter size, and stride, we can obtain a more general framework with some appealing properties, as follows.

(1) The overall "shape" of the network is like an "hourglass", which is thick at the ends and thin in the middle. The shrinking and the mapping layers control the width of the network. They are all 1 × 1 filters and contribute little to the overall complexity.

(2) The choice of the stride can be very flexible. The previous low-level vision CNNs, such as SRCNN and AR-CNN, can be seen as a special case of s = 1, where the deconvolutional layer is equal to a convolutional layer. When s > 1, the time complexity will decrease s² times at the cost of the reconstruction quality.

(3) When we adopt all 1 × 1 filters in the middle layers, the framework works very similarly to a Multi-Layer Perceptron (MLP) [3]. The MLP processes each patch individually. Input patches are extracted from the image with a stride s, and the output patches are aggregated (i.e., averaged) on the overlapping areas. In our framework, the patches are also extracted with a stride s, but in a convolution manner. The output patches are also aggregated (i.e., summed) on overlapping areas, but in a deconvolution manner. If the filter size of the middle layers is set to 1, then each output patch is determined purely by a single input patch, which is almost the same as an MLP. However, when we set a larger filter size for the middle layers, the receptive field of an output patch increases, leading to much better performance. This also reveals why the CNN structure can theoretically outperform the conventional MLP. Here, we present the general framework as

n_1(f_1) − n_2(1) − n_3(f_3)×m − n_4(1) − n_5[f_5] − s,  (12)

where f and n represent the filter size and the number of filters respectively. The number of middle layers is denoted as m, and it can be used to design a deeper network. As we focus more on speed, we just set m = 1 in the following experiments. Figure 4 shows the overall structure of the new framework. We believe that this framework can be applied to more low-level vision problems, such as denoising and deblurring, but this is beyond the scope of this paper.

V. EASY-HARD TRANSFER

Transfer learning in deep models provides an effective way of initialization. In fact, conventional initialization strategies (i.e., randomly drawn from Gaussian distributions with fixed standard deviations [17]) are found not suitable for training a very deep model, as reported in [11]. To address this issue, He et al. [11] derive a robust initialization method for rectifier nonlinearities, and Simonyan et al. [32] propose to use the pre-trained features of a shallow network for initialization.

In low-level vision problems (e.g., super-resolution), it is observed that training a network beyond 4 layers would
Fig. 6. First layer filters of AR-CNN learned under different JPEG compression qualities.

TABLE II
AVERAGE PSNR (dB), SSIM AND PSNR-B (dB) RESULTS ON THE LIVE1 DATASET.

Eval. Mat | Quality | JPEG   | SA-DCT | AR-CNN
PSNR      | 10      | 27.77  | 28.65  | 29.13
PSNR      | 20      | 30.07  | 30.81  | 31.40
PSNR      | 30      | 31.41  | 32.08  | 32.69
PSNR      | 40      | 32.35  | 32.99  | 33.63
SSIM      | 10      | 0.7905 | 0.8093 | 0.8232
SSIM      | 20      | 0.8683 | 0.8781 | 0.8886
SSIM      | 30      | 0.9000 | 0.9078 | 0.9166
SSIM      | 40      | 0.9173 | 0.9240 | 0.9306
PSNR-B    | 10      | 25.33  | 28.01  | 28.74
PSNR-B    | 20      | 27.57  | 29.82  | 30.69
PSNR-B    | 30      | 28.92  | 30.92  | 32.15
PSNR-B    | 40      | 29.96  | 31.79  | 33.12

TABLE III
AVERAGE PSNR (dB), SSIM AND PSNR-B (dB) RESULTS ON THE 5 CLASSICAL TEST IMAGES USED IN [8].

Eval. Mat | Quality | JPEG   | SA-DCT | AR-CNN
PSNR      | 10      | 27.82  | 28.88  | 29.04
PSNR      | 20      | 30.12  | 30.92  | 31.16
PSNR      | 30      | 31.48  | 32.14  | 32.52
PSNR      | 40      | 32.43  | 33.00  | 33.34
SSIM      | 10      | 0.7800 | 0.8071 | 0.8111
SSIM      | 20      | 0.8541 | 0.8663 | 0.8694
SSIM      | 30      | 0.8844 | 0.8914 | 0.8967
SSIM      | 40      | 0.9011 | 0.9055 | 0.9101
PSNR-B    | 10      | 25.21  | 28.16  | 28.75
PSNR-B    | 20      | 27.50  | 29.75  | 30.60
PSNR-B    | 30      | 28.94  | 30.83  | 31.99
PSNR-B    | 40      | 29.92  | 31.59  | 32.80

TABLE IV
THE AVERAGE RESULTS OF PSNR (dB), SSIM, PSNR-B (dB) ON THE LIVE1 DATASET WITH q = 10.

Eval. Mat | JPEG   | SRCNN  | Deeper SRCNN | AR-CNN
PSNR      | 27.77  | 28.91  | 28.92        | 29.13
SSIM      | 0.7905 | 0.8175 | 0.8189       | 0.8232
PSNR-B    | 25.33  | 28.52  | 28.46        | 28.74

layer). Specifically, given a 24 × 24 input Y_i, AR-CNN produces a (24 − s + 1) × (24 − s + 1) output. Hence, the loss (Eqn. (8)) was computed by comparing against the up-left (24 − s + 1) × (24 − s + 1) pixels of the ground truth sub-image X_i. In the training phase, we follow [6], [12] and use a smaller learning rate (5 × 10⁻⁵) in the last layer and a comparably larger one (5 × 10⁻⁴) in the remaining layers.
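The layer-wise learning rates just described map naturally onto optimizer parameter groups. Below is a minimal sketch under the same PyTorch assumption as before; the stand-in model simply mirrors the four-layer baseline and is not the authors' implementation.

```python
import torch
import torch.nn as nn

# Stand-in four-layer network: the last module plays the role of the reconstruction layer.
model = nn.Sequential(
    nn.Conv2d(1, 64, 9, padding=4), nn.PReLU(64),
    nn.Conv2d(64, 32, 7, padding=3), nn.PReLU(32),
    nn.Conv2d(32, 16, 1), nn.PReLU(16),
    nn.Conv2d(16, 1, 5, padding=2),
)

last_layer_params = list(model[-1].parameters())
other_params = [p for m in list(model)[:-1] for p in m.parameters()]

# Smaller learning rate for the last layer, a comparably larger one for the rest,
# following the 5e-5 / 5e-4 setting described above.
optimizer = torch.optim.SGD(
    [{"params": other_params, "lr": 5e-4},
     {"params": last_layer_params, "lr": 5e-5}],
    lr=5e-4,
)
```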
A. Experiments on JPEG-compressed Images

We first compare our methods with some state-of-the-art algorithms, including the deblocking oriented method SA-DCT [8], the deep model SRCNN [6] and the restoration based RTF [13], on restoring JPEG-compressed images. As in other compression artifacts reduction methods (e.g., RTF [13]), we apply the standard JPEG compression scheme, and use the JPEG quality settings q = 40, 30, 20, 10 (from high quality to very low quality) in the MATLAB JPEG encoder. We use the LIVE1 dataset [31] (29 images) as the test set to evaluate both the quantitative and qualitative performance. The LIVE1 dataset contains images with diverse properties. It is widely used in image quality assessment [36] as well as in super-resolution [39]. To have a comprehensive quantitative evaluation, we apply the PSNR, structural similarity (SSIM) [36]⁵, and PSNR-B [42] for quality assessment. We want to emphasize the use of PSNR-B, as it is designed specifically to assess blocky and deblocked images.

We use the baseline network settings – f_1 = 9, f_2 = 7, f_3 = 1, f_4 = 5, n_1 = 64, n_2 = 32, n_3 = 16 and n_4 = 1, denoted as 64(9)-32(7)-16(1)-1(5) or simply AR-CNN. A specific network is trained for each JPEG quality. Parameters are randomly initialized from a Gaussian distribution with a standard deviation of 0.001.

1) Comparison with SA-DCT: We first compare AR-CNN with SA-DCT [8], which is widely regarded as the state-of-the-art deblocking oriented method [13], [19]. The quantitative results of PSNR, SSIM and PSNR-B are shown in Table II. On the whole, our AR-CNN outperforms SA-DCT on all JPEG qualities and evaluation metrics by a large margin. Note that the gains on PSNR-B are much larger than those on PSNR. This indicates that AR-CNN could produce images with fewer blocking artifacts. We have also conducted an evaluation on the 5 classical test images used in [8]⁶, and observed the same trend. The results are shown in Table III.

To compare the visual quality, we present some restored images with q = 10, 20 in Figure 10. From the qualitative results, we can see that AR-CNN produces much sharper edges with much less blocking and ringing artifacts compared with SA-DCT. The visual quality has been largely improved on all aspects compared with the state-of-the-art method. Furthermore, AR-CNN is superior to SA-DCT in implementation speed: SA-DCT needs 3.4 seconds to process a 256 × 256 image, while AR-CNN only takes 0.5 second. Both are implemented in C++ on a PC with an Intel i3 CPU (3.1GHz) and 16GB RAM.

2) Comparison with SRCNN: As discussed in Section III-B, SRCNN is not suitable for compression artifacts reduction. For comparison, we train two SRCNN networks with different settings. (i) The original SRCNN (9-1-5) with f_1 = 9, f_3 = 5, n_1 = 64 and n_2 = 32. (ii) Deeper SRCNN (9-1-1-5) with an additional non-linear mapping layer (f_3 = 1, n_3 = 16). They all use the BSDS500 dataset for training and validation as in Section VI. The compression quality is q = 10.

Quantitative results tested on the LIVE1 dataset are shown in Table IV. We can see that the two SRCNN networks are inferior on all evaluation metrics. From the convergence curves shown in Figure 7, it is clear that AR-CNN achieves higher PSNR from the beginning of the learning stage. Furthermore, from their restored images in Figure 11, we find that the two SRCNN networks both produce images with noisy edges and unnatural smooth regions. These results demonstrate our statements in Section III-B. The success of training a deep model requires a comprehensive understanding of the problem and a careful design of the model structure.

3) Comparison with RTF: RTF [13] is a recent state-of-the-art restoration oriented method. Without their deblocking code,

⁵ We use the unweighted structural similarity defined over fixed 8 × 8 windows as in [37].
⁶ The 5 test images in [8] are baboon, barbara, boats, lenna and peppers.
Fig. 7. Comparisons with SRCNN and Deeper SRCNN.

we can only compare with the released deblocking results. Their model is trained on the training set (200 images) of the BSDS500 dataset, but all images are down-scaled by a factor of 0.5 [13]. To have a fair comparison, we also train new AR-CNN networks on the same half-sized 200 images. Testing is performed on the test set of the BSDS500 dataset (images scaled by a factor of 0.5), which is also consistent with [13]. We compare with two RTF variants. One is the plain RTF, which uses the filter bank and is optimized for PSNR. The other is the RTF+SA-DCT, which includes SA-DCT as a base method and is optimized for MAE. The latter achieves the highest PSNR value among all RTF variants [13].

TABLE V
THE AVERAGE RESULTS OF PSNR (dB), SSIM, PSNR-B (dB) ON THE TEST SET OF THE BSDS500 DATASET.

Eval. Mat | Quality | JPEG   | RTF    | RTF+SA-DCT | AR-CNN
PSNR      | 10      | 26.62  | 27.66  | 27.71      | 27.79
PSNR      | 20      | 28.80  | 29.84  | 29.87      | 30.00
SSIM      | 10      | 0.7904 | 0.8177 | 0.8186     | 0.8228
SSIM      | 20      | 0.8690 | 0.8864 | 0.8871     | 0.8899
PSNR-B    | 10      | 23.54  | 26.93  | 26.99      | 27.32
PSNR-B    | 20      | 25.62  | 28.80  | 28.80      | 29.15

As shown in Table V, we obtain better performance than the plain RTF, and even better performance than the combination of RTF and SA-DCT, especially under the more representative PSNR-B metric. Moreover, training on such a small dataset has largely restricted the ability of AR-CNN. The performance of AR-CNN will further improve given more training images.

B. Experiments on JPEG 2000 Images

As mentioned in the introduction, the proposed AR-CNN is effective in dealing with various compression schemes. In this section, we conduct experiments on the JPEG 2000 standard, and compare with the state-of-the-art method – adjusted anchored neighborhood regression (A+) [29]. To have a fair comparison, we follow A+ in the choice of datasets and software. Specifically, we adopt the 91-image dataset [40] for training and 16 classical images [18] for testing. The images are compressed using the JPEG 2000 encoder from the Kakadu software package⁷. We also adopt the same training strategy as A+. To test on images degraded at 0.1 bits per pixel (BPP), the training images are compressed at 0.3 BPP instead of 0.1 BPP. As indicated in [29], the regressors can more easily pick up the artifact patterns at a lower compression rate, leading to better performance. We use the same AR-CNN network structure (64(9)-32(7)-16(1)-1(5)) as in the JPEG experiments. Figure 8 shows the patterns of the learned first-layer filters, which differ a lot from those for JPEG images (see Figure 6).

Fig. 8. First-layer filters of AR-CNN learned for JPEG 2000 at 0.3 BPP.

Apart from A+, we compare our results against another two methods – SLGP [18] and FoE [28]. The PSNR gains on the 16 test images are shown in Figure 9. It is observed that our method outperforms the others on most test images. For the average performance, we achieve a PSNR gain of 0.353 dB, better than A+ with 0.312 dB, SLGP with 0.192 dB and FoE with 0.115 dB. Note that the improvement is already significant in such a difficult scenario – JPEG 2000 at 0.1 BPP [29]. Figure 12 shows some qualitative results, where our method achieves better PSNR and SSIM than A+. However, we also notice that AR-CNN is inferior to the other methods on the tenth image in Figure 9. The restoration results of this image are shown in Figure 13. It is observed that the result of AR-CNN is still visually pleasant, and the lower PSNR is mainly due to the chromatic aberration in smooth regions. The above experiments demonstrate the generalization ability of AR-CNN in handling different compression standards.

Fig. 9. PSNR gain comparison of the proposed AR-CNN against A+, SLGP, and FoE. The x axis corresponds to the image index. The average PSNR gains across the dataset are marked with solid lines.

During training, we also find that AR-CNN is hard to converge using the random initialization mentioned in Section VI-A. We solve the problem by adopting the transfer learning strategy. To be specific, we can transfer the first-layer filters of a well-trained three-layer network to the four-layer AR-CNN, or we can reuse the features of AR-CNN trained on the JPEG images. These refer to different "easy-hard transfer" strategies – transfer shallow to deeper model and transfer standard to real use case – which will be detailed in the following section.

C. Experiments on Easy-Hard Transfer

We show the experimental results of different "easy-hard transfer" settings on JPEG-compressed images. The details of

⁷ http://www.kakadusoftware.com
Fig. 10. Results on image “parrots” (q = 10) show that AR-CNN is better than SA-DCT on removing blocking artifacts.
Fig. 11. Results on image “monarch” show that AR-CNN is better than SRCNN on removing ringing effects.
Fig. 12. Results on image “lenna” compressed with JPEG 2000 at 0.1 BPP.
Fig. 13. Results on image “pepper” compressed with JPEG 2000 at 0.1 BPP.
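The "easy-hard transfer" experiments reported below all rely on the same mechanism: copy the weights of one or more shallow layers from a trained base network into the target network and then fine-tune all layers on the harder target data. Below is a minimal sketch of the "transfer 1 layer" setting from Table VI, assuming PyTorch purely for illustration; the helper function is hypothetical and not the authors' code.

```python
import torch
import torch.nn as nn

def make_arcnn() -> nn.Sequential:
    """Four-layer AR-CNN sketch; the first module is the feature-extraction layer."""
    return nn.Sequential(
        nn.Conv2d(1, 64, 9, padding=4), nn.PReLU(64),
        nn.Conv2d(64, 32, 7, padding=3), nn.PReLU(32),
        nn.Conv2d(32, 16, 1), nn.PReLU(16),
        nn.Conv2d(16, 1, 5, padding=2),
    )

base = make_arcnn()     # e.g. "base-q20", already trained on BSDS-q20
target = make_arcnn()   # e.g. "transfer 1 layer", to be trained on BSDS-q10

# "Transfer 1 layer": reuse only the first-layer filters of the base network;
# the remaining layers keep their random Gaussian initialization.
target[0].load_state_dict(base[0].state_dict())

# All layers are then fine-tuned jointly on the new (harder) dataset.
optimizer = torch.optim.SGD(target.parameters(), lr=5e-4)
```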
TABLE VI
EXPERIMENTAL SETTINGS OF "EASY-HARD TRANSFER". THE "9-7-1-5" AND "9-7-3-1-5" ARE SHORT FOR 64(9)-32(7)-16(1)-1(5) AND 64(9)-32(7)-16(3)-16(1)-1(5), RESPECTIVELY.

transfer strategy | short form        | network structure | training dataset | initialization strategy
base network      | base-q10          | 9-7-1-5           | BSDS-q10         | Gaussian (0, 0.001)
base network      | base-q20          | 9-7-1-5           | BSDS-q20         | Gaussian (0, 0.001)
high to low       | transfer 1 layer  | 9-7-1-5           | BSDS-q10         | 1 layer of base-q20
high to low       | transfer 2 layers | 9-7-1-5           | BSDS-q10         | 1,2 layers of base-q20
standard to real  | base-Twitter      | 9-7-1-5           | Twitter          | Gaussian (0, 0.001)
standard to real  | transfer q10      | 9-7-1-5           | Twitter          | 1 layer of base-q10
standard to real  | transfer q20      | 9-7-1-5           | Twitter          | 1 layer of base-q20

settings are shown in Table VI. Take the base network as an example: the "base-q10" is a four-layer AR-CNN 64(9)-32(7)-16(1)-1(5) trained on the BSDS500 [1] dataset (400 images) under the compression quality q = 10. Parameters are initialized by randomly drawing from a Gaussian distribution with zero mean and standard deviation 0.001. Figures 15 - 17 show the convergence curves on the validation set.

Fig. 15. Transfer shallow to deeper model (average test PSNR (dB) versus number of backprops for "transfer deeper", He et al. [11] initialization, and "base-q10").

[Figure 16: average test PSNR (dB) versus number of backprops for "transfer 1 layer", "transfer 2 layers" and "base-q10".]

[Figure 17: average test PSNR (dB) versus number of backprops for "transfer q10", "transfer q20" and "base-Twitter".]

1) Transfer shallow to deeper model: In Table VI, we denote a deeper (five-layer) AR-CNN 64(9)-32(7)-16(3)-16(1)-1(5) as "9-7-3-1-5". Results in Figure 15 show that the transferred features from a four-layer network enable us to train a five-layer network successfully. Note that directly training a five-layer network using conventional initialization methods is unreliable. Specifically, we have exhaustively tried different groups of learning rates, but still could not observe convergence. Furthermore, the "transfer deeper" converges faster and achieves better performance than using He et al.'s method [11], which is also very effective in training a deep model. We have also conducted comparative experiments with the structures 64(9)-32(7)-16(1)-16(1)-1(5) and 64(9)-32(1)-32(7)-16(1)-1(5), and observed the same trend.

2) Transfer high to low quality: Results are shown in Figure 16. Obviously, the two networks with transferred features converge faster than the one trained from scratch. For example, to reach an average PSNR of 27.77dB, the "transfer 1 layer" takes only 1.54×10⁸ backprops, roughly half of that for "base-q10". Moreover, the "transfer 1 layer" also outperforms the "transfer 2 layers" by a slight margin throughout the training phase. One reason for this is that only initializing the first layer provides the network with more flexibility in adapting to a new dataset. This also indicates that a good starting point could help train a better network with higher convergence speed.

3) Transfer standard to real use case – Twitter: Online social media like Twitter are popular platforms for message posting. However, Twitter will compress the uploaded images on the server side. For instance, a typical 8 mega-pixel (MP) image (3264 × 2448) will result in a compressed and re-scaled version with a fixed resolution of 600 × 450. Such re-scaling and compression will introduce very complex artifacts, making restoration difficult for existing deblocking algorithms (e.g., SA-DCT). However, AR-CNN can fit to the new data easily. Further, we want to show that features learned under standard compression schemes can also facilitate training on a completely different dataset. We use 40 photos of resolution 3264 × 2448 taken by mobile phones (335,209 training subimages in total) and their Twitter-compressed version⁸ to train three networks with the initialization settings listed in Table VI.

From Figure 17, we observe that the "transfer q10" and "transfer q20" networks converge much faster than the "base-Twitter" trained from scratch. Specifically, the "transfer q10" takes 6 × 10⁷ backprops to achieve 25.1dB, while the "base-Twitter" uses 10 × 10⁷ backprops. Besides fast convergence, transferred features also lead to higher PSNR values compared with "base-Twitter". This observation suggests that features learned under standard compression schemes are also transferrable to tackle real use case problems. Some restoration results are shown in Figure 14. We can see that both networks achieve satisfactory quality improvements over the compressed version.

D. Experiments on Acceleration Strategies

In this section, we conduct a set of controlled experiments to demonstrate the effectiveness of the proposed acceleration strategies. Following the descriptions in Section IV, we progressively modify the baseline AR-CNN by layer decomposition, adopting large-stride layers and expanding the mapping layer. The networks are trained on JPEG images under the

⁸ We have shared this dataset on our project page http://mmlab.ie.cuhk.edu.hk/projects/ARCNN.html.
quality q = 10. We further test the performance of Fast AR-CNN on different compression qualities (q = 10, 20, 30, 40). As all the modified networks are deeper than the baseline model, we adopt the proposed transfer learning strategy (transfer shallow to deeper model) for fast and stable training. The base network is also "base-q10" as in Section VI-C1. All the quantitative results are listed in Table VII.

TABLE VII
THE EXPERIMENTAL RESULTS OF DIFFERENT SETTINGS.

part         | setting  | PSNR (dB) | SSIM   | PSNR-B (dB)
JPEG quality | base-q20 | 31.40     | 0.8886 | 30.69
JPEG quality | fast-q30 | 32.41     | 0.9124 | 31.43
JPEG quality | base-q30 | 32.69     | 0.9166 | 32.15
JPEG quality | fast-q40 | 33.43     | 0.9306 | 32.51
JPEG quality | base-q40 | 33.63     | 0.9306 | 33.12

1) Layer decomposition: The layer decomposition strategy replaces the "feature enhancement" layer with a "shrinking" layer and an "enhancement" layer, and we arrive at a modified network 64(9)-32(1)-32(7)-16(1)-1(5). The experimental results are shown in Table VII, from which we can see that the "replace deeper" setting achieves almost the same performance as "base-q10" in all the metrics. This indicates that layer decomposition is an effective strategy to reduce the network parameters with almost no performance loss.

2) Stride size: We then introduce the large-stride convolutional and deconvolutional layers, and change the stride size. Generally, a larger stride will lead to much narrower feature maps and faster inference, but at the risk of worse reconstruction quality. In order to find a good trade-off setting, we conduct experiments with different stride sizes as shown in the part "stride" of Table VII. The network settings for s = 1, s = 2 and s = 3 are 64(9)-32(1)-32(7)-16(1)-1(5), 64(9)-32(1)-32(7)-16(1)-1[9]-s2 and 64(9)-32(1)-32(7)-16(1)-1[9]-s3, respectively. From the results in Table VII, we can see that there are only small differences between "s = 1" and "s = 2" in all metrics. But when we further enlarge the stride size, the performance declines dramatically, e.g., the PSNR value drops more than 0.2 dB from "s = 2" to "s = 3". Convergence curves in Figure 18 also exhibit a similar trend, where "s = 3" achieves inferior performance to "s = 1" and "s = 2" on the validation set⁹. With little performance loss yet 7.5 times faster inference, using stride s = 2 clearly balances performance and time complexity. Thus we adopt stride s = 2 in the following experiments.

3) Mapping filters: As mentioned in Section IV, we can increase the number of mapping filters to compensate for the performance loss. In the part "mapping filters" of Table VII, we compare a set of experiments that only differ in the mapping filters. To be specific, the network setting is 64(9)-32(1)-32(7)-n_4(1)-1[9]-s2 with n_4 = 16, 48, 64, 80. The convergence curves shown in Figure 19¹⁰ better reflect their differences. Obviously, using more filters achieves better performance, but the improvement is marginal beyond n_4 = 64. Thus we adopt n_4 = 64, which is also consistent with our comment in Section IV. Finally, we find the optimal network setting – 64(9)-32(1)-32(7)-64(1)-1[9]-s2, namely Fast AR-CNN, which achieves similar performance as the baseline model 64(9)-32(7)-16(1)-1(5) but is 7.5 times faster.

Fig. 19. The performance of using different mapping filters (average test PSNR (dB) versus number of backprops for n_4 = 16, 48, 64, 80).

4) JPEG quality: In the above experiments, we mainly focus on a very low quality q = 10. Here we want to examine the capacity of the new network on different compression qualities. In the part "JPEG quality" of Table VII, we compare the Fast AR-CNN with the baseline AR-CNN on quality q = 10, 20, 30, 40. For example, "fast-q10" and "base-q10" represent 64(9)-32(1)-32(7)-64(1)-1[9]-s2 and 64(9)-32(7)-16(1)-1(5) on quality q = 10, respectively. From the quantitative results, we observe that the Fast AR-CNN is comparable with AR-CNN on low qualities such as q = 10 and q = 20, but it is inferior to AR-CNN on high qualities such as q = 30 and q = 40. This phenomenon is reasonable. As the low quality images contain much less information, extracting features in a sparse way (using a large stride) does little harm to the restoration quality. On the contrary, for high quality images, adjacent image patches may differ a lot. So when we adopt a large stride, we will lose information that is useful for restoration. Nevertheless, the proposed Fast AR-CNN still outperforms the state-of-the-art methods (as presented in Section VI-A) on different compression qualities.

⁹ As the validation set (BSDS500 validation set) is different from the test set (LIVE1 dataset), the results in Table VII and Figure 18 are different.
¹⁰ As the validation set (BSDS500 validation set) is different from the test set (LIVE1 dataset), the results in Table VII and Figure 19 are different.
VII. CONCLUSION

Applying deep models to low-level vision problems requires a deep understanding of the problem itself. In this paper, we carefully study the compression process and propose a four-layer convolutional network, AR-CNN, which is extremely effective in dealing with various compression artifacts. We then propose two acceleration strategies to reduce its time complexity while maintaining good performance. We further systematically investigate three easy-to-hard transfer settings that could facilitate training a deeper or better network, and verify the effectiveness of transfer learning in low-level vision problems.

REFERENCES

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
[2] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. A. Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In British Machine Vision Conference, 2012.
[3] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In IEEE Conference on Computer Vision and Pattern Recognition, pages 2392–2399, 2012.
[4] P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In IEEE International Conference on Computer Vision, pages 1841–1848, 2013.
[5] C. Dong, Y. Deng, C. C. Loy, and X. Tang. Compression artifacts reduction by a deep convolutional network. In IEEE International Conference on Computer Vision, pages 576–584, 2015.
[6] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184–199, 2014.
[7] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2015.
[8] A. Foi, V. Katkovnik, and K. Egiazarian. Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images. IEEE Transactions on Image Processing, 16(5):1395–1411, 2007.
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[10] M. A. Gluck and C. E. Myers. Hippocampal mediation of stimulus representation: A computational theory. Hippocampus, 3(4):491–516, 1993.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1026–1034, 2015.
[12] V. Jain and S. Seung. Natural image denoising with convolutional networks. In Advances in Neural Information Processing Systems, pages 769–776, 2009.
[13] J. Jancsary, S. Nowozin, and C. Rother. Loss-specific training of non-parametric image restoration models: A new state of the art. In European Conference on Computer Vision, pages 112–125, 2012.
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678, 2014.
[15] C. Jung, L. Jiao, H. Qi, and T. Sun. Image deblocking via sparse representation. Signal Processing: Image Communication, 27(6):663–677, 2012.
[16] K. I. Kim and Y. Kwon. Single-image super-resolution using sparse regression and natural image prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(6):1127–1133, 2010.
[17] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[18] Y. Kwon, K. I. Kim, J. Tompkin, J. H. Kim, and C. Theobalt. Efficient learning of image super-resolution and compression artifact removal with semi-local Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1792–1805, 2015.
[19] Y. Li, F. Guo, R. T. Tan, and M. S. Brown. A contrast enhancement framework with JPEG artifacts suppression. In European Conference on Computer Vision, pages 174–188, 2014.
[20] A.-C. Liew and H. Yan. Blocking artifacts suppression in block-coded images using overcomplete wavelet representation. IEEE Transactions on Circuits and Systems for Video Technology, 14(4):450–461, 2004.
[21] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz. Adaptive deblocking filter. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):614–619, 2003.
[22] X. Liu, X. Wu, J. Zhou, and D. Zhao. Data-driven sparsity-based restoration of JPEG-compressed images in dual transform-pixel domain. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[23] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2014.
[24] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, pages 807–814, 2010.
[25] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724, 2014.
[26] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C.-C. Loy, et al. DeepID-Net: Deformable deep convolutional neural networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2403–2412, 2015.
[27] H. C. Reeve III and J. S. Lim. Reduction of blocking effects in image coding. Optical Engineering, 23(1):230134, 1984.
[28] S. Roth and M. J. Black. Fields of experts. International Journal of Computer Vision, 82(2):205–229, 2009.
[29] R. Rothe, R. Timofte, and L. Van Gool. Efficient regression priors for reducing image compression artifacts. In International Conference on Image Processing, pages 1543–1547, 2015.
[30] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013.
[31] H. R. Sheikh, Z. Wang, L. Cormack, and A. C. Bovik. LIVE image quality assessment database release 2, 2005.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[33] D. Sun and W.-K. Cham. Postprocessing of low bit-rate block DCT coded images based on a fields of experts prior. IEEE Transactions on Image Processing, 16(11):2743–2751, 2007.
[34] R. Timofte, V. De Smet, and L. Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision, pages 111–126. Springer, 2014.
[35] C. Wang, J. Zhou, and S. Liu. Adaptive non-local means filter for image deblocking. Signal Processing: Image Communication, 28(5):522–530, 2013.
[36] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[37] Z. Wang and E. P. Simoncelli. Maximum differentiation (MAD) competition: A methodology for comparing computational models of perceptual quantities. Journal of Vision, 8(12):8, 2008.
[38] Z. Xiong, X. Sun, and F. Wu. Image hallucination with feature enhancement. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2074–2081, 2009.
[39] C.-Y. Yang, C. Ma, and M.-H. Yang. Single-image super-resolution: A benchmark. In European Conference on Computer Vision, pages 372–386, 2014.
[40] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
[41] Y. Yang, N. P. Galatsanos, and A. K. Katsaggelos. Projection-based spatially adaptive reconstruction of block-transform compressed images. IEEE Transactions on Image Processing, 4(7):896–908, 1995.
[42] C. Yim and A. C. Bovik. Quality assessment of deblocked images. IEEE Transactions on Image Processing, 20(1):88–98, 2011.
[43] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
[44] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833, 2014.
[45] Z. Wang, D. Liu, S. Chang, Q. Ling, and T. S. Huang. D3: Deep dual-domain based fast restoration of JPEG-compressed images. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.