Deep Attentional Guided Image Filtering
Abstract—Guided filter is a fundamental tool in computer vision and computer graphics that aims to transfer structure information from a guidance image to a target image. Most existing methods construct filter kernels from the guidance itself without considering the mutual dependency between the guidance and the target. However, since there typically exist significantly different edges in the two images, simply transferring all structural information of the guidance to the target would result in various artifacts. To cope with this problem, we propose an effective framework named deep attentional guided image filtering, whose filtering process can fully integrate the complementary information contained in both images. Specifically, we propose an attentional kernel learning module to generate dual sets of filter kernels from the guidance and the target, respectively, and then adaptively combine them by modeling the pixel-wise dependency between the two images. Meanwhile, we propose a multi-scale guided image filtering module to progressively generate the filtering result with the constructed kernels in a coarse-to-fine manner. Correspondingly, a multi-scale fusion strategy is introduced to reuse the intermediate results in the coarse-to-fine process. Extensive experiments show that the proposed framework compares favorably with the state-of-the-art methods in a wide range of guided image filtering applications, such as guided super-resolution, cross-modality restoration, texture removal, and semantic segmentation. Moreover, our scheme achieved first place in the real depth map super-resolution challenge held at ACM ICMR 2021¹.

Index Terms—Guided filter, dual regression, attentional kernel learning, guided super-resolution, cross-modality restoration.

Z. Zhong, X. Liu, J. Jiang and D. Zhao are with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, and also with Peng Cheng Laboratory, Shenzhen 518052, China. E-mail: {zhwzhong,csxm,jiangjunjun}@hit.edu.cn.
X. Ji is with the Department of Automation, Tsinghua University, Beijing 100084, China. E-mail: [email protected].
¹https://icmr21-realdsr-challenge.github.io/#Leaderboard

I. INTRODUCTION

GUIDED filter (GF), also named joint filter, is tailored to transfer structural information from a guidance image to a target one. The popularity of GF can be attributed to its ability to handle visual signals in various domains and modalities, where one modal signal serves as the guidance to improve the quality of the other one. It has been a useful tool for many image processing and computer vision tasks, such as depth map super-resolution [2], [5], scale-space filtering [8], [9], cross-modality image restoration [1], [3], [10], structure-texture separation [11], [12], [13], image semantic segmentation [7], [14], and so on.

In the literature, GF has been extensively studied, ranging from the classical bilateral filter to the emerging deep learning-based ones. The pioneering bilateral filter [15] constructs spatially-varying kernels, where local image structures of the guidance image are explicitly involved in the filtering process through the photometric similarity. The guided image filtering scheme proposed by He et al. [1] exploits the structure information of the guidance in a more rigorous manner by computing a locally linear model over the guidance image for filtering. These filters consider only the information contained in the guidance image. However, since there typically exist significantly different edges in the two images, simply transferring all patterns of the guidance to the target would introduce various artifacts. Some works [3], [8] resort to optimization-based schemes to find mutual structures for propagation while suppressing inconsistent ones. However, it is challenging to select reference structures and propagate them properly with hand-crafted objective functions. In addition, the computational complexity of these methods is usually high.

In recent years, learning-based approaches for GF design have become increasingly popular, deriving GF in a purely data-driven manner. They allow the networks to learn how to adaptively select structures to transfer, and thus have the ability to handle more complicated scenarios. For instance, in [16], a dynamic filter network (DFN) is proposed in which pixel-wise filters are generated dynamically by a separate sub-network conditioned on the guidance. Unlike DFN, Su et al. [6] adapt a standard spatially invariant kernel at each pixel by multiplying it with a spatially varying filter. Although more flexible thanks to their adaptive nature, [16] and [6] still suffer from the same drawback as [1], [15] in that only the guidance information is considered in filter design. Some recent methods attempt to exploit the target and guidance information jointly. For instance, Li et al. [5] propose to leverage two sub-networks to extract informative features from both the target and guidance images, which are then concatenated as inputs for a fusion network to selectively transfer salient structures from the guidance to the target. Instead of regressing the filtering results directly from the network, Kim et al. [7] propose to use spatially variant weighted averages, where the set of neighbors and the corresponding kernel weights are learned in an end-to-end manner. However, the networks of these methods rely on simple concatenation or element-wise multiplication to combine multi-modal information, which is not very effective. There is no mechanism to distinguish the contributions of the guidance and the target to the final filtering result, which can also lead to erroneous structure propagation. In addition, the guidance and target images are treated as independent information since existing methods typically utilize two separate networks for feature extraction, so the complementary information contained in the two images cannot be fully exploited.
Fig. 1. Guided image filtering on an RGB/D image pair for 16× guided super-resolution: (a) Guidance image, (b) GF [1], (c) JBU [2], (d) MuGF [3], (e) CUNet [4], (f) DJFR [5], (g) PAC [6], (h) DKN [7], (i) Ours, (j) Ground truth. The results of (b)-(c) suffer from edge-blurring artifacts and the results of (d)-(h) suffer from texture-copying artifacts. Our result produces much sharper edges. Please enlarge the PDF for more details.
By reviewing existing GF methods, it can be found that most of them concentrate their efforts on how to transfer structural information from the guidance to the target. However, in some scenarios, such as cross-modality image restoration [4] and guided super-resolution [17], the multi-modal data has significantly different characteristics due to the different sensing principles, making the guidance not always trustworthy. In view of this, we argue that the purpose of GF should be two-fold: 1) apply the guidance as a prior for the reconstruction of regions in the target whose contents are consistent with the guidance; and 2) derive a plausible prediction for regions in the target whose contents are inconsistent with the guidance. The latter represents the case in which the guidance is no longer reliable, so we have to rely on the target itself for reconstruction. Most existing GF methods only concern structure transfer from the guidance but neglect structure prediction from the target, leading to erroneous or extraneous artifacts in the output. This implies that, instead of performing regression on the guidance only, as done in [1], [15], we should perform dual regression on both the guidance and the target, and combine them adaptively in a smarter manner than simple concatenation or element-wise multiplication, as done in [5], [7]. "Dual regression" and "smart combination" are the main motivations of our proposed method.

Accordingly, in this paper, we propose an effective deep attentional guided image filtering scheme, which constructs filter kernels by fully considering information from both the guidance and the target images. Specifically, an attentional kernel learning module is proposed to generate dual sets of filter kernels from the guidance and the target, respectively. Moreover, the pixel-wise contributions of the guidance and the target to the final filtering result are automatically learned. In this way, we can adaptively apply the guidance as a prior for the reconstruction of target regions whose contents are consistent with the guidance, and derive a prediction for target regions with inconsistent contents by regression on the target itself. We show an illustrative example in Fig. 1, which presents a visual comparison of the filtering results of our scheme with state-of-the-art guided depth super-resolution methods. It can be found that our proposed method is capable of producing a high-resolution depth image with clear boundaries while avoiding texture-copying artifacts.

The main contributions of the proposed method are summarized as follows:
• We propose an attentional kernel learning (AKL) module for guided image filtering, which generates dual sets of filter kernels from both the guidance and the target, and then adaptively combines these kernels by modeling the pixel-wise dependencies between the two images in a learning manner. Compared with existing kernel generation approaches, the proposed method is more robust when there are inconsistent structures between the guidance and the target.
• We propose a multi-scale guided filtering module, which generates the filtering result in a coarse-to-fine manner. Correspondingly, we propose a multi-scale fusion strategy with deep supervision to fully explore the intermediate results in the coarse-to-fine process. To the best of our knowledge, this is the first guided filter framework that learns multi-scale kernels to filter the target image at different scales in the embedding space.
• We evaluate the performance of the proposed method on various computational photography and computer vision tasks, such as guided image super-resolution, cross-modality image restoration, texture removal, and semantic segmentation. The quantitative and qualitative results demonstrate the effectiveness and universality of the proposed method.
• Considering that there is no standard protocol to train
and evaluate the performance of guided image filtering algorithms, we reimplement eight recently proposed state-of-the-art deep learning-based guided filtering models and unify their settings to facilitate a fair comparison. All of the codes and trained models are publicly available² to encourage reproducible research.
²https://github.com/zhwzhong/DAGF

The remainder of this paper is organized as follows. Sect. II gives a brief introduction to the relevant works on guided filters. Sect. III introduces the proposed method for guided image filtering. Sect. IV provides experimental comparisons with existing state-of-the-art methods on a varied range of guided filtering tasks. Ablation experiments are presented in Sect. V to analyze the network hyper-parameters and verify the advantage of each component proposed in our model. We conclude the paper in Sect. VI.

II. GUIDED FILTERS REVISITING

In this section, we start by revisiting the formal definitions of popular variants of guided filters in the literature, and then explain how we generalize them to derive the proposed deep attentional guided image filter.

A. Classical Guided Filters

Define the guidance image as g and the target image as t; the output f of guided filtering can be represented as

$f_i = \sum_{j} W_{i,j}(g, t)\, t_j$,   (1)

where i and j are pixel coordinates and W_{i,j} is the filter kernel weight, whose arguments (g, t) indicate that it can be derived from either g or t, or both.

In the classical bilateral filter and guided image filter, W_{i,j} depends only on the guidance g. Specifically, the filter weight in the bilateral filter is defined as

$W^{BF}_{i,j} = \frac{1}{C_i} \exp\!\left(-\frac{\|i - j\|}{\sigma_s}\right) \exp\!\left(-\frac{\|g_i - g_j\|}{\sigma_r}\right)$,   (2)

where C_i is the normalization parameter, and σ_s and σ_r are parameters for geometric and photometric similarity, respectively. In the guided image filter (He et al. [1]), the filter kernel weight is defined as

$W^{GIF}_{i,j} = \frac{1}{|N_k|^2} \sum_{k:(i,j)\in N_k} \left(1 + \frac{(g_i - \mu_k)(g_j - \mu_k)}{\sigma_k^2 + \epsilon}\right)$,   (3)

where |N_k| is the number of pixels in a window N_k, and μ_k and σ_k² are the mean and variance of g in N_k.
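To make the classical formulation concrete, the following NumPy sketch evaluates the bilateral weights of Eq. (2) for a single reference pixel over a local window. It is a minimal illustration of the formula (window radius and parameter values are arbitrary), not code from the paper.

```python
import numpy as np

def bilateral_weights(guide, i, radius=3, sigma_s=3.0, sigma_r=0.1):
    """Bilateral kernel of Eq. (2) around reference pixel i on a grayscale guidance."""
    y, x = i
    ys = np.arange(max(0, y - radius), min(guide.shape[0], y + radius + 1))
    xs = np.arange(max(0, x - radius), min(guide.shape[1], x + radius + 1))
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    spatial = np.exp(-np.sqrt((yy - y) ** 2 + (xx - x) ** 2) / sigma_s)   # geometric term
    photometric = np.exp(-np.abs(guide[yy, xx] - guide[y, x]) / sigma_r)  # range term
    w = spatial * photometric
    return w / w.sum()  # the 1 / C_i normalization

# Toy usage: the weights depend only on the guidance g, never on the target.
g = np.tile(np.linspace(0.0, 1.0, 16), (16, 1)).astype(np.float32)
print(bilateral_weights(g, (8, 8)).shape)  # (7, 7)
```

Note that only the guidance enters the weights, which is the common property of the classical filters discussed next.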
B. Learning-based Guided Filters

Among deep learning-based approaches to guided filter design, the dynamic filter network [16] first defines a filter-generating network (FGN) that takes the guidance g as input to obtain location-specific dynamic filters F_θ = FGN(g, θ), which are then applied to the target image t to yield the output f = F_θ(t). Pixel-adaptive convolution [6] defines the filter kernel by multiplying a spatially varying filter with a standard spatially invariant kernel:

$f_i = \sum_{j \in N_i} K(g_i, g_j)\, W[p_i - p_j]\, t_j + b$,   (4)

where W is the spatially invariant kernel, K(·, ·) is a spatially varying kernel function with a fixed form such as a Gaussian, and [p_i − p_j] denotes the index offset of the kernel weights. From the above formulations, it can be found that, similar to the bilateral filter and the guided image filter, the dynamic filter network and pixel-adaptive convolution also depend only on the guidance g in defining the filter kernels. When there are inconsistent structures in the guidance and the target, this approach generates annoying artifacts in the output.

The recent deep joint filtering (DJF) method [5] alleviates this drawback by jointly leveraging features of both the guidance and the target. It designs two-branch sub-networks to extract features from the guidance and the target respectively, which are passed through a fusion sub-network to output the filtering result. The joint filter Φ is learned in an end-to-end manner by the following optimization:

$\min_{\theta} \left\| \Phi(g, t; \theta) - f^{gt} \right\|^2$,   (5)

where f^{gt} is the ground truth of the output. In contrast to the implicit filter learning approach of DJF, the deformable kernel network (DKN) [7] explicitly learns the kernel weights K and offsets s using two-branch sub-networks from the two images. Concretely, the filtering is performed by

$f_i = \sum_{j \in N_i} W_{i,s(j)}(g, t)\, t_{s(j)}$,   (6)

with

$W(g, t) = K(g) \odot K(t)$,   (7)

where K(g) and K(t) are kernel weights learned from the guidance and the target, respectively, and ⊙ denotes element-wise multiplication. Although DJF and DKN achieve better performance than previous methods, they treat the guidance and target images as independent information and utilize separate networks for kernel learning, so the complementary information contained in the two images cannot be fully exploited. In addition, fusing the multi-modal weights through element-wise multiplication is not effective, since the guidance and the target then contribute equally to the final filtering results.

C. Our Strategy

Considering the drawbacks of existing methods, we propose a deep attentional guided image filtering scheme to more effectively leverage multi-modal information. Our method performs dual regression on both the guidance and the target, and combines them adaptively using an attention mechanism. Mathematically, our filtering process can be generally formulated as

$f_i = \sum_{j \in N_i} A_{i,j} W^{g}_{i,j} t_j + \sum_{j \in N_i} (1 - A_{i,j}) W^{t}_{i,j} t_j$,   (8)

where W^{g}_{i,j} and W^{t}_{i,j} are filter kernels computed from the guidance and the target respectively, and A_{i,j} denotes the pixel-wise reliability weight of the guidance image, which is determined automatically by considering both guidance and target information. The above formulation means that, when the guidance information is not trustworthy, we turn to the target information itself for regression, so as to prevent unreliable structure propagation.
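The sketch below illustrates how the dual-kernel combination of Eq. (8) can be applied with PyTorch's unfold operation. It is a simplified, single-scale illustration under our own naming assumptions (the tensors kernel_g, kernel_t and attention stand for W^g, W^t and A); the actual DAGF modules that produce these tensors are described in Sect. III.

```python
import torch
import torch.nn.functional as F

def attentional_filtering(target, kernel_g, kernel_t, attention, k=3):
    """Apply Eq. (8): f_i = sum_j [A_ij W^g_ij + (1 - A_ij) W^t_ij] t_j.

    target:    (B, C, H, W) image or feature map to be filtered
    kernel_g:  (B, k*k, H, W) per-pixel kernels predicted from the guidance
    kernel_t:  (B, k*k, H, W) per-pixel kernels predicted from the target
    attention: (B, 1, H, W) pixel-wise reliability of the guidance, in [0, 1]
    """
    B, C, H, W = target.shape
    # Gather the k*k neighborhood of every pixel: (B, C, k*k, H, W)
    patches = F.unfold(target, kernel_size=k, padding=k // 2).view(B, C, k * k, H, W)
    # Fuse the two kernel sets with the learned attention map.
    kernel = attention * kernel_g + (1.0 - attention) * kernel_t      # (B, k*k, H, W)
    kernel = kernel / (kernel.sum(dim=1, keepdim=True) + 1e-8)        # weighted-average normalization (illustrative)
    return (patches * kernel.unsqueeze(1)).sum(dim=2)                 # (B, C, H, W)

# Toy usage with random kernels and a random attention map.
t = torch.rand(1, 1, 32, 32)
wg, wt = torch.rand(1, 9, 32, 32), torch.rand(1, 9, 32, 32)
out = attentional_filtering(t, wg, wt, torch.rand(1, 1, 32, 32))
print(out.shape)  # torch.Size([1, 1, 32, 32])
```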
Fig. 2. The network architecture of the proposed deep attentional guided image filtering (DAGF) with the number of pyramid levels m = 3. DAGF consists of a kernel generation network for constructing filter kernels and a multi-scale guided image filtering network that filters the target image using the generated kernels.
III. PROPOSED METHOD

An effective guided image filtering scheme should be able to identify the consistent structures contained in the guidance as well as avoid transferring extraneous or erroneous contents to the target. In this section, we introduce in detail the proposed deep attentional guided image filtering (DAGF) framework for this purpose, where the complementary information contained in the two images can be fully explored in both the kernel generation and the image filtering process.

A. Network Architecture

DAGF takes a target image I^t ∈ R^{H×W×C^t} (e.g., a low-resolution depth map) and a guidance image I^g ∈ R^{H×W×C^g} (e.g., a high-resolution color image) as inputs, and generates a reconstructed image I^{out} ∈ R^{H×W×C^t} as output, where H, W and C denote the height, width and the number of channels, respectively.

Fig. 2 illustrates the overall architecture of the proposed network, which is composed of a kernel generation sub-network and a multi-scale guided filtering sub-network. Instead of directly predicting kernels in the image domain and enlarging the receptive field with a deformable sampling strategy as in [7], we employ a pyramid architecture to achieve a large receptive field and conduct filter learning in the feature domain, since deep features are more robust to the appearance differences between the target and the guidance.

• In the filter kernel generation sub-network, the multi-scale features of I^t and I^g are fed into the attentional kernel learning (AKL) module to generate the filter kernels {W_i}. The network architecture of AKL is illustrated in Fig. 3, where an attentional contribution module based on the U-Net architecture is designed to adaptively fuse the filter kernels generated from the guidance and the target.
• In the guided filtering sub-network, with the derived pixel-wise filter kernels, the features of the target image are processed in a coarse-to-fine manner to obtain the upsampled features.

The above process is repeated until the final scale is reached. In the following, we will elaborate these two sub-networks and the loss function design for network training.
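To make the data flow concrete, here is a compact PyTorch-style skeleton of the two sub-networks described above. All layer choices and names (the stub feature extractors, the per-level kernel and attention convolutions, the pyramid construction) are our own placeholders chosen for illustration; the real DAGF configuration follows Fig. 2 and Fig. 3 rather than this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DAGFSketch(nn.Module):
    """Structural sketch of the DAGF pipeline (a simplification, not the paper's exact layers):
    shared feature extraction, per-level attentional kernel prediction, and coarse-to-fine
    filtering of the target features with the generated per-pixel kernels (cf. Eq. (8))."""

    def __init__(self, levels=3, width=16, k=3):
        super().__init__()
        self.levels, self.k = levels, k
        self.enc_t = nn.Conv2d(1, width, 3, padding=1)   # target feature extractor (stub)
        self.enc_g = nn.Conv2d(3, width, 3, padding=1)   # guidance feature extractor (stub)
        self.kernel_g = nn.ModuleList(nn.Conv2d(width, k * k, 3, padding=1) for _ in range(levels))
        self.kernel_t = nn.ModuleList(nn.Conv2d(width, k * k, 3, padding=1) for _ in range(levels))
        self.attn = nn.ModuleList(nn.Conv2d(2 * width, 1, 3, padding=1) for _ in range(levels))

    def _pyramid(self, feat):
        # Coarse-to-fine copies of the features, one per pyramid level.
        return [F.interpolate(feat, scale_factor=1 / 2 ** (self.levels - 1 - l),
                              mode="bilinear", align_corners=False) for l in range(self.levels)]

    def _filter(self, x, kernels):
        # Per-pixel kernel application, as in the attentional filtering of Eq. (8).
        B, C, H, W = x.shape
        patches = F.unfold(x, self.k, padding=self.k // 2).view(B, C, self.k ** 2, H, W)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)

    def forward(self, target, guidance):
        ft = self._pyramid(F.relu(self.enc_t(target)))
        fg = self._pyramid(F.relu(self.enc_g(guidance)))
        outputs, x = [], ft[0]
        for l in range(self.levels):                      # coarse-to-fine filtering
            x = F.interpolate(x, size=ft[l].shape[-2:], mode="bilinear", align_corners=False)
            a = torch.sigmoid(self.attn[l](torch.cat([fg[l], ft[l]], dim=1)))
            kernels = a * self.kernel_g[l](fg[l]) + (1 - a) * self.kernel_t[l](ft[l])
            x = self._filter(x, kernels)
            outputs.append(x)                             # intermediate results, reused by the
        return outputs                                    # multi-scale fusion and losses (heads not shown)

# Toy usage: a 3-level sketch on a 1-channel target and a 3-channel guidance.
out = DAGFSketch()(torch.rand(1, 1, 64, 64), torch.rand(1, 3, 64, 64))
print([o.shape for o in out])
```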
TABLE I
Quantitative comparison for depth image super-resolution on four standard RGB/D datasets in terms of average RMSE values. Following the experimental setting of [7], [26], we calculate the average RMSE values in centimeters for the NYU v2 [25] dataset. For the other datasets, we compute the RMSE values by scaling the depth values to the range [0, 255]. The best performance for each case is highlighted in boldface while the second best is underscored. For the RMSE metric, lower values mean better performance.
Experiment results for depth map super-resolution (Nearest-neighbour down-sampling):

Methods      | Middlebury [34]     | Lu [33]             | NYU v2 [25]         | Sintel [32]
             | 4×    8×    16×     | 4×    8×    16×     | 4×    8×    16×     | 4×    8×    16×
Bicubic      | 4.44  7.58  11.87   | 5.07  9.22  14.27   | 8.16  14.22 22.32   | 10.11 14.51 19.95
MRF [27]     | 4.26  7.43  11.80   | 4.90  9.03  14.19   | 7.84  13.98 22.20   | 9.87  13.45 18.19
GF [1]       | 4.01  7.22  11.70   | 4.87  8.85  14.09   | 7.32  13.62 22.03   | 8.83  12.60 18.78
TGV [28]     | 3.39  5.41  12.03   | 4.48  7.58  17.46   | 6.98  11.23 28.13   | 8.30  13.05 19.96
SDF [8]      | 3.14  5.03  8.83    | 4.65  7.53  11.52   | 5.27  12.31 19.24   | 9.20  13.63 19.36
FBS [14]     | 2.58  4.19  7.30    | 3.03  5.77  8.48    | 4.29  8.94  14.59   | 8.29  10.31 16.18
JBU [2]      | 2.44  3.81  6.13    | 2.99  5.06  7.51    | 4.07  8.29  13.35   | 8.25  11.74 16.02
DGF [29]     | 3.92  6.04  10.02   | 2.73  5.98  11.73   | 4.50  8.98  16.77   | 7.53  11.53 17.50
DJF [26]     | 2.14  3.77  6.12    | 2.54  4.71  7.66    | 3.54  6.20  10.21   | 7.09  9.12  12.36
DMSG [30]    | 1.79  3.39  5.87    | 2.48  4.74  7.51    | 3.48  6.07  10.27   | 6.80  9.09  11.81
DJFR [5]     | 1.98  3.61  6.07    | 2.22  4.54  7.48    | 3.38  5.86  10.11   | 7.05  9.12  12.61
DSRN [31]    | 2.08  3.26  5.78    | 2.57  4.46  6.45    | 3.49  5.70  9.76    | 7.29  9.43  11.62
PAC [6]      | 1.91  3.20  5.60    | 2.48  4.37  6.60    | 2.82  5.01  8.64    | 6.79  8.36  11.02
DKN [7]      | 1.93  3.17  5.49    | 2.35  4.16  6.33    | 2.46  4.76  8.50    | 6.84  8.61  11.21
DAGF (Ours)  | 1.78  2.73  4.75    | 1.96  3.81  6.16    | 2.35  4.62  7.81    | 6.72  8.35  10.64

Experiment results for depth map super-resolution (Bicubic down-sampling):

Methods      | Middlebury [34]     | Lu [33]             | NYU v2 [25]         | Sintel [32]
             | 4×    8×    16×     | 4×    8×    16×     | 4×    8×    16×     | 4×    8×    16×
DGF [29]     | 1.94  3.36  5.81    | 2.45  4.42  7.26    | 3.21  5.92  10.45   | 5.91  8.02  11.17
DJF [26]     | 1.68  3.24  5.62    | 1.65  3.96  6.75    | 2.80  5.33  9.46    | 5.30  7.53  10.41
DMSG [30]    | 1.88  3.45  6.28    | 2.30  4.17  7.22    | 3.02  5.38  9.17    | 4.73  6.26  8.36
DJFR [5]     | 1.32  3.19  5.57    | 1.15  3.57  6.77    | 2.38  4.94  9.18    | 4.90  7.39  10.33
DSRN [31]    | 1.77  3.05  4.96    | 1.77  3.10  5.11    | 3.00  5.16  8.41    | 4.49  6.53  9.28
PAC [6]      | 1.32  2.62  4.58    | 1.20  2.33  5.19    | 1.89  3.33  6.78    | 4.42  6.13  8.42
DKN [7]      | 1.23  2.12  4.24    | 0.96  2.16  5.11    | 1.62  3.26  6.51    | 4.38  5.89  8.40
DAGF (Ours)  | 1.15  1.80  3.70    | 0.83  1.93  4.80    | 1.36  2.87  6.06    | 3.84  5.59  7.44
existing methods, we exploit nearest-neighbour downsampling as the standard downsampling operator to generate the LR target image from the ground truth. Three scales are considered: 4×, 8× and 16×. To show the effectiveness of the proposed method, we further conduct experiments with bicubic downsampling as done in [7]. The performance of the proposed method is evaluated on the following four standard benchmark datasets:
• Sintel dataset [32]: this dataset consists of 1064 image pairs obtained from an animated 3D movie.
• NYU v2 dataset [25]: this dataset contains 1449 image pairs acquired by a Microsoft Kinect. We use the last 449 image pairs to evaluate the performance of our method.
• Lu dataset [33]: it contains 6 image pairs captured by an ASUS Xtion Pro camera.
• Middlebury dataset [34], [36]: this dataset is captured by structured light, and we utilize the 30 image pairs from the 2001-2006 datasets with the missing depth values generated by Lu et al. [33].
We compare our method with 13 state-of-the-art methods, including two local filtering-based methods: GF [1] and JBU [2]; four global optimization-based methods: MRF [27], TGV [28], SDF [8] and FBS [14]; and seven deep learning-based methods: DGF [29], DJF [26], DMSG [30], DJFR [5], DSRN [31], PAC [6] and DKN [7]. We adopt the Root Mean Square Error (RMSE) as the evaluation metric. Lower RMSE values mean higher recovery quality.
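As a concrete reference for this protocol, the snippet below shows one way to synthesize the LR target by nearest-neighbour downsampling and to compute the RMSE metric. It is our own illustration of the stated setting (RMSE in centimeters for NYU v2, assuming the NYU v2 depths are stored in meters; RMSE on a [0, 255] scale for the other datasets), not code released with the paper.

```python
import numpy as np

def make_lr_depth(hr_depth, scale=8):
    """Nearest-neighbour downsampling of a ground-truth depth map of shape (H, W)."""
    return hr_depth[::scale, ::scale]

def rmse(pred, gt, in_meters=False):
    """RMSE metric; for NYU v2 the paper reports centimeters (meters * 100 assumed)."""
    if in_meters:
        pred, gt = pred * 100.0, gt * 100.0
    diff = pred.astype(np.float64) - gt.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy usage: depth scaled to [0, 255] as for Middlebury / Lu / Sintel.
gt = (np.random.rand(256, 256) * 255.0).astype(np.float32)
lr = make_lr_depth(gt, scale=8)                      # the 8x LR target
sr = np.kron(lr, np.ones((8, 8), dtype=np.float32))  # stand-in "prediction" (NN upsampling)
print(lr.shape, round(rmse(sr, gt), 2))
```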
Table I summarizes the quantitative comparison between our method and the other state-of-the-art methods. The best performance is highlighted in bold. As can be seen from this table, our method achieves the best results among all the compared methods on both synthetic and real datasets (e.g., the Sintel and NYU v2 datasets) and at all three scales. The superior performance benefits from the more precise filter kernels learned and from the multi-scale filtering process. Compared with the second best results (underlined), our results obtain gains of 0.12 (4×), 0.24 (8×) and 0.39 (16×) with respect to the average RMSE values.

To further analyze the performance of the proposed method, we present the visual results for 8× depth image super-resolution in Fig. 4. It can be observed that the results of JBU [2] suffer from jaggy artifacts. The results of GF [1] are over-smoothed, which indicates that the local filter is not effective at large scale factors (e.g., 8×). Compared to GF [1] and JBU [2], the learning-based methods are capable of generating results with clearer boundaries. However, for finer details, e.g., the arm in the second image and the rope in the last image, the compared learning-based methods exhibit
Fig. 4. Qualitative comparison of recovered depth maps (8×). (a): Guidance image, (b): JBU [2], (c): GF [1], (d): DMSG [30], (e): DJFR [5], (f): PAC [6], (g): DKN [7], (h): DAGF, and (i): Ground truth depth map. From top to bottom: each two rows present recovered depth maps on the NYU v2 [25], Sintel [32], Lu [33] and Middlebury [34] datasets, respectively. Please enlarge the PDF for more details.
TABLE II
Quantitative comparison of 8× saliency map super-resolution on the DUT-OMRON dataset [35]. Following DJFR [5], we use the F-measure to calculate the difference between the predicted saliency map and the corresponding ground truth. The best performance for each case is highlighted in boldface while the second best is underscored. For the F-measure, higher values mean better performance.
Methods Bicubic GF [1] DMSG [30] DJFR [5] PAC [6] FDKN [7] DKN [7] DAGF (Ours)
obvious artifacts such as blurring on the arm and wrong estimation on the rope, which implies that the downsampling
Fig. 5. Visual comparison of 8× saliency map super-resolution on the DUT-OMRON dataset [35]: (a): Guidance, (b): low-resolution image, (c): DMSG [30], (d): DJFR [5], (e): PAC [6], (f): DKN [7], (g): DAGF, (h): Ground truth. Please enlarge the PDF for more details.
degradation brings significant damage to small objects and therefore makes those regions harder to recover. On the contrary, the results obtained by the proposed method are clearer, sharper, and more faithful to the ground truth image.

RGB-guided Saliency Map Super-resolution. To further demonstrate the generalization ability of the proposed method, we apply the model trained on the NYU v2 dataset directly to the task of saliency map super-resolution without any fine-tuning. Similar to DKN [7], we use 5168 image pairs from the DUT-OMRON dataset [35] to evaluate the SR performance. We use bicubic interpolation (8×) to generate the low-resolution saliency maps and then super-resolve them with the corresponding high-resolution color image as the guidance. The quantitative results in terms of F-measure are listed in Table II. As can be seen from this table, our DAGF achieves the best result among all the compared methods and outperforms the second best method by a large margin, which demonstrates the generalization ability of the proposed method. In addition, we randomly select two images and visualize the recovered high-resolution saliency maps obtained by different methods in Fig. 5. It can be observed that the results of Bicubic are over-smoothed, with the structure details severely damaged. DMSG [30] and DJFR [5] struggle to generate clear boundaries. The results of DKN [7] have certain artifacts around the edge areas. In contrast, our method is able to generate high-quality saliency maps and keep the sharpest boundaries, which indicates that the proposed method can fully take advantage of the guidance image and effectively transfer meaningful structure information.

B. Cross-modality Image Restoration

For the task of cross-modality image restoration, we first conduct experiments on joint depth image super-resolution and denoising to show the superiority of the proposed method. Moreover, to verify the ability of the proposed method in dealing with various visual domains, we apply the trained models to two noise reduction tasks using flash/non-flash and RGB/NIR image pairs. Finally, we conduct experiments on the ToFMark dataset [39]. It contains three real-world depth images acquired by a Time-of-Flight (ToF) camera, which have complicated multi-modality degradation.

Joint Depth Image Super-resolution and Denoising. Depth images acquired by ranging sensors are typically noisy. In order to simulate the data acquisition process of a depth sensor, we add Gaussian noise with variance 25 to the low-resolution target depth images. We use the same experimental settings as for the GSR task in Sect. IV-A to train our model. We compare our method with ten state-of-the-art methods, including GF [1], MUF [3] and SDF [8], which are traditional model-based methods, and DGF [29], DJF [26], DMSG [30], DJFR [5], DSRN [31], PAC [6] and DKN [7], which are deep learning-based methods. Since most of the existing methods do not provide experimental results for this task, we retrain all the deep learning-based methods with the same training and test datasets as ours.

The quantitative results in terms of RMSE values for the four benchmark datasets are reported in Table III, from which we can see that the proposed method obtains consistently better results than the existing state-of-the-art methods, especially for the 8× and 16× cases, which are more difficult to recover. This is mainly because: 1) we employ a pyramid architecture to extract multi-modality features for guided kernel generation, so multi-scale complementary information can be obtained; 2) for guided image filtering, we leverage the coarse-to-fine strategy to filter the low-resolution target image, so the structure details can be progressively recovered; and 3) compared to a single loss at the end of the network, the proposed multi-scale loss brings stronger supervision to our model.

Fig. 6 further demonstrates the visual superiority of the proposed method for joint depth image super-resolution and denoising (16× bicubic downsampling and Gaussian noise). The results of GF [1], MUF [3] and SDF [8] still contain much
Fig. 6. Qualitative comparison of joint depth map super-resolution and denoising. (a): Guidance image, (b): Target image, (c): GF [1], (d): MUF [3], (e): SDF [8], (f): PAC [6], (g): DJFR [5], (h): DKN [7], (i): DAGF and (j): Ground-truth image. Please enlarge the PDF for more details.
noise, and the visual quality of the whole image is poor. This is because these methods are based on the locally linear assumption and employ a mean filter to calculate the coefficients of the pixel-wise linear representations. The methods of PAC [6] and DJFR [5] can remove the noise well, but they cannot preserve sharp edges and introduce ringing artifacts. The results of DKN are clearer and sharper than those of the previous methods. However, they suffer from color distortion, which we attribute to the batch normalization used in DKN [7]. In contrast, our method is able to remove the noise effectively and produces the clearest and sharpest boundaries.

Cross-modality Image Restoration. We further demonstrate that our model trained for depth image denoising can be generalized to address other cross-modality image restoration tasks, such as flash guided non-flash image denoising and NIR guided color image restoration. Fig. 8 shows the visual comparison among existing state-of-the-art methods and ours. All of the deep learning-based methods (e.g., DJFR [5] and
Fig. 7. Visual comparison of realistic depth map super-resolution on two examples (books and devil) from the ToFMark [28] dataset: (a): Guidance image, (b): Target image, (c): SDF [8], (d): DGDIE [37], (e): DKN [7], (f): DAGF, (g): Ground truth. Please enlarge the PDF for more details.
TABLE III
Quantitative comparison for joint depth image super-resolution and denoising on four standard RGB/D datasets in terms of average RMSE values. Following the experimental setting of [7], [26], we calculate the average RMSE values in centimeters for the NYU v2 [25] dataset. For the other datasets, we compute the RMSE values by scaling the depth values to the range [0, 255]. The best performance for each case is highlighted in boldface while the second best is underscored. For the RMSE metric, lower values mean better performance.
Methods      | Middlebury [34]     | Lu [33]             | NYU v2 [25]         | Sintel [32]
             | 4×    8×    16×     | 4×    8×    16×     | 4×    8×    16×     | 4×    8×    16×
DGF [29]     | 2.70  4.13  6.38    | 4.06  5.85  8.39    | 6.52  9.23  13.00   | 6.94  9.03  12.05
DJF [26]     | 1.80  2.99  5.16    | 1.85  3.13  5.39    | 3.74  5.95  9.61    | 4.88  6.93  10.05
DMSG [30]    | 1.79  2.69  4.75    | 1.88  2.79  4.84    | 3.60  5.31  8.07    | 4.74  6.36  8.72
DJFR [5]     | 1.86  3.07  5.27    | 1.91  3.21  5.51    | 4.01  6.21  9.90    | 5.10  7.12  10.23
DSRN [31]    | 1.84  2.99  4.70    | 1.97  2.98  5.94    | 4.36  6.31  9.75    | 5.49  7.21  9.80
PAC [6]      | 1.81  2.94  5.08    | 1.93  3.44  6.18    | 4.23  6.24  9.54    | 5.40  7.32  9.89
DKN [7]      | 1.76  2.68  4.55    | 1.81  2.82  4.81    | 3.39  5.24  8.41    | 4.51  6.25  9.20
DAGF (Ours)  | 1.72  2.61  4.24    | 1.74  2.72  4.51    | 3.25  5.01  7.54    | 4.42  6.09  8.25
Fig. 8. Visual comparison of cross-modality image restoration. Top: flash guided non-flash image denoising. Bottom: NIR guided color image denoising. (a): Guidance image, (b): Target image, (c): SDF [8], (d): RTV [13], (e): DJFR [5], (f): DKN [7], (g): DAGF. Please enlarge the PDF for more details.
DKN [7]) are tested with the same settings as ours. Among the compared methods, SDF [8] and RTV [13] are specially designed for this task. As can be seen from Fig. 8, DJFR [5] cannot remove the noise, and the results of DKN [7] suffer from halo artifacts. On the contrary, the proposed DAGF produces more convincing results with fewer artifacts. The method of RTV [13], which is specially designed for this task, obtains the best performance.

Realistic Depth Image Super-resolution. To further evaluate the robustness of the proposed method, we conduct experiments on the ToFMark dataset [39], which includes real ToF sensor data and thus has complicated multi-modality degradation. Following the experimental protocol of DGDIE [37], we first perform image completion on the acquired depth images
Fig. 9. Visual comparison of texture removal results. (a): Target image, (b): RTV [13], (c): RGF [9], (d): SDF [8], (e): DJFR [5], (f): DKN [7], (g): DAGF. Please enlarge the PDF for more details.
TABLE IV
Quantitative comparison for realistic depth image super-resolution in terms of RMSE values on the ToFMark [28] dataset. The best performance for each case is highlighted in boldface while the second best is underscored.

Methods      | Books  | Devil  | Shark
Bilinear     | 17.10  | 20.17  | 18.66
JBU [2]      | 16.03  | 18.79  | 27.57
GF [1]       | 15.74  | 18.21  | 27.04
TGV [28]     | 12.36  | 15.29  | 14.68
SDF [8]      | 12.66  | 14.33  | 10.68
Yang [38]    | 12.25  | 14.71  | 13.83
DGDIE [37]   | 12.32  | 14.06  | 9.66
DKN [7]      | 11.81  | 13.54  | 9.11
DAGF (Ours)  | 11.80  | 13.47  | 9.07

TABLE V
Quantitative comparison for semantic segmentation in terms of average IoU on the validation set of Pascal VOC 2012. The best performance is highlighted in boldface while the second best is underscored.

Methods          | Mean IoU
Deeplab-V2 [40]  | 70.69
DenseCRF [41]    | 71.98
DGF [29]         | 72.96
DJFR [5]         | 73.30
FDKN [7]         | 73.60
DAGF (Ours)      | 73.76

and then send them to our model (4× super-resolution and denoising) trained on the NYU v2 dataset [25] to obtain the final results. We compare our method with a recently proposed deep learning-based method (i.e., DKN [7]) and several traditional methods (e.g., TGV [28], SDF [8], DGDIE [37]). As shown in Table IV, our method consistently obtains the best objective results on the three test images. Fig. 7 presents visual comparison results for two images (books and devil). From these figures, it is easy to observe that the results of SDF [8] suffer from texture-copying artifacts. The results of DKN [7] are smooth and blurred, since DKN generates filter kernels without considering the inconsistency between the color and depth images. The results of DGDIE [37] are clear but they deviate from the ground truth. By comparison, the results of the proposed method are sharper and much closer to the ground truth, especially in the boundary regions.

C. Texture Removal

Texture removal is the task of extracting semantically meaningful structures from textured surfaces. For this task, we use the textured image itself as the guidance, and apply our model trained for depth image denoising iteratively to remove small-scale textures. We compare our method with RTV [13], RGF [9], SDF [8], DJFR [5] and DKN [7]. For the deep learning-based methods, we follow DKN [7] and set the number of iterations to 4, and for the other methods we carefully tune the parameters to provide the best results. The visual comparisons are presented in Fig. 9. Our method clearly outperforms the other compared methods, and it can cleanly remove small-scale textures while preserving the global color variation and the main edges.

D. Semantic Segmentation

Semantic segmentation is a fundamental computer vision task, which aims at assigning a pre-defined label to each pixel of an image. In DGF [29], the authors proposed to use a guided image filtering layer to replace the time-consuming fully connected conditional random field (CRF) [41] for semantic segmentation. We demonstrate that the proposed DAGF can also be applied to this problem. Following DGF [29], we plug the proposed model into DeepLab-v2 [40] and train the whole network in an end-to-end manner, so the offline post-processing with CRFs can be avoided. We utilize the Pascal VOC 2012 dataset [42] in our experiments, which contains 1464, 1449 and 1456 images for training, validation and testing, respectively. Similar to DGF [29], we augment the training set with the annotations provided by [43], resulting in 10582 images. The 1449 images in the validation set are employed to evaluate the proposed method.

We use the mean intersection-over-union (IoU) score as the evaluation metric and report the quantitative results on the validation set of the Pascal VOC dataset [42] in Table V. The baseline denotes DeepLab-v2 [40] without CRF. As can be seen from this table, our method outperforms the second best model, FDKN [7], by 0.16% mIoU and the other models by a large margin. We visualize the segmentation results of our method and the compared models in Fig. 10.
Fig. 10. Visual comparison of semantic segmentation on the validation set of Pascal VOC 2012 dataset [42]. (a): RGB image, (b): Deeplab-V2 [40], (c):
DGF [29], (d): FDKN [7], (e): DAGF, (f): ground truth image. Please enlarge the PDF for more details.
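The integration described in the semantic segmentation experiment above can be pictured as follows. This is a schematic of the DGF-style usage (a guided filtering layer refining the coarse segmentation scores with the RGB image as guidance, trained end-to-end); the backbone and filter here are placeholders, not the actual DeepLab-v2/DAGF code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedSegHead(nn.Module):
    """Plug a guided-filtering module after a segmentation backbone (DGF-style usage).

    backbone:      maps an RGB image to coarse per-class scores at reduced resolution.
    guided_filter: callable (target, guidance) -> refined target; stands in for DAGF.
    """
    def __init__(self, backbone, guided_filter, num_classes=21):
        super().__init__()
        self.backbone, self.guided_filter, self.num_classes = backbone, guided_filter, num_classes

    def forward(self, rgb):
        coarse = self.backbone(rgb)                                    # (B, K, h, w)
        coarse = F.interpolate(coarse, size=rgb.shape[-2:], mode="bilinear",
                               align_corners=False)                    # bring scores to full size
        return self.guided_filter(coarse, rgb)                         # edge-aware refinement

# Toy usage with stand-ins: a stride-8 conv "backbone" and an identity "filter".
backbone = nn.Conv2d(3, 21, kernel_size=8, stride=8)
model = GuidedSegHead(backbone, lambda target, guide: target)
logits = model(torch.rand(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 21, 64, 64])
```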
TABLE VII
Ablation study. Quantitative comparison of different components for 16× depth image super-resolution. We chose RMSE as the evaluation metric, and lower values indicate better performance. Model7 is our final model (DAGF).
Fig. 11. Ablation study. Visual comparison of an example without and with the proposed attentional kernel learning (AKL) module for depth image super-resolution: (a) Guidance, (b) Model1, (c) Model2, (d) Model3, (e) Model4, (f) Model5. The first row shows the super-resolved depth images and the last row shows the error maps (I^h − I^out). Please enlarge the PDF for more details.
• Model2, which takes (guidance, guidance) as inputs for kernel generation, and is trained with the L1 loss.
• Model3, which takes (target, guidance) as inputs for kernel generation, uses element-wise multiplication to combine the two generated sets of kernels, and is trained with the L1 loss.
• Model4, which takes (target, guidance) as inputs for kernel generation, uses element-wise summation to combine the two generated sets of kernels, and is trained with the L1 loss.
• Model5, which takes (target, guidance) as inputs for kernel generation, uses the learned weight map to adaptively combine the two generated sets of kernels, and is trained with the L1 loss.
• Model6, which is Model5 but trained with the L1 loss and the Lms loss.
• Model7, which is Model5 but trained with the L1 loss, the Lms loss and the Lba loss. This is our full model.

It is noteworthy that we adjust the number of convolutional layers in the multi-scale guided image filtering sub-network to guarantee that each variant has roughly the same number of parameters as our final model. The quantitative results are shown in Table VII, from which we can see that the full model (Model7) achieves the best reconstruction performance across the four testing datasets when compared with the ablated models, and that every component proposed in our model boosts the network performance significantly. In the following, we will give a detailed analysis of each component in our method.
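For reference, the sketch below shows how the three loss terms compared in Model5-Model7 might be combined during training. The exact definitions of the multi-scale loss Lms and the boundary-aware loss Lba are given in the paper's loss-function section, which is not included in this excerpt; here Lms is assumed to be an L1 penalty on every intermediate output against a resized ground truth, and Lba an L1 penalty weighted by an edge map, so the code is illustrative only.

```python
import torch
import torch.nn.functional as F

def training_loss(outputs, gt, w_ms=1.0, w_ba=1.0):
    """L1 + assumed multi-scale (Lms) + assumed boundary-aware (Lba) objective.

    outputs: list of predictions from coarse to fine; outputs[-1] is full resolution.
    gt:      ground-truth image of shape (B, C, H, W).
    """
    l1 = F.l1_loss(outputs[-1], gt)

    # Lms (assumption): supervise every intermediate result against a resized ground truth.
    l_ms = sum(F.l1_loss(o, F.interpolate(gt, size=o.shape[-2:], mode="bilinear",
                                          align_corners=False))
               for o in outputs[:-1])

    # Lba (assumption): emphasize errors around edges using a gradient-magnitude weight.
    gy = (gt[..., 1:, :] - gt[..., :-1, :]).abs().mean(dim=1, keepdim=True)
    gx = (gt[..., :, 1:] - gt[..., :, :-1]).abs().mean(dim=1, keepdim=True)
    edge = F.pad(gy, (0, 0, 0, 1)) + F.pad(gx, (0, 1, 0, 0))
    l_ba = (edge * (outputs[-1] - gt).abs()).mean()

    return l1 + w_ms * l_ms + w_ba * l_ba

# Toy usage with a 3-level pyramid of predictions.
gt = torch.rand(1, 1, 64, 64)
outs = [torch.rand(1, 1, s, s) for s in (16, 32, 64)]
print(float(training_loss(outs, gt)))
```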
Effectiveness of Attentional Kernel Learning (AKL): In this paper, we propose to use AKL to generate filter kernels for guided image filtering. Specifically, it first generates dual sets of kernels using the extracted guidance and target features, respectively, and then adaptively combines the generated kernels with the learned attention maps. To demonstrate the effectiveness of AKL, we implement several variants (e.g., different inputs for kernel construction and different kernel fusion strategies) of the proposed method, namely Model1-Model5. The quantitative results on the four testing datasets are reported in Table VII. As can be seen from this table, Model1 generates kernels from the target image only, so its reconstruction accuracy is relatively low. With the assistance of the guidance image, Model2 obtains a significant improvement over Model1, which implies that the guidance information is helpful for filter kernel generation. However, guidance images are not always reliable, for example color images captured in bad weather or low-light conditions. In view of this, Model3 and Model4 generate dual sets of kernels from the guidance and target images, respectively, and the difference between the two models is the strategy of kernel combination. As shown in Table VII, Model3 and Model4 further improve the accuracy over Model2 (the average RMSE drops from 8.45 to 8.28 and 8.23, respectively),
Fig. 12. Ablation study. Visual comparison of an example without and with the proposed boundary-aware loss for depth image super-resolution: (a) Guidance, (b) Target, (c) w/o Lba, (d) w/ Lba, (e) GT. Please enlarge the PDF for more details.
Fig. 13. Ablation study. Visualization of the learned multi-scale attention maps for kernel combination. We resize the attention maps to the same size for better visualization. Please enlarge the PDF for more details.
Fig. 14. Ablation study. Training and testing RMSE values on the NYU v2 dataset (Silberman et al. [25]) for 16× depth image super-resolution. MS denotes the proposed multi-stage loss Lms.
Fig. 15. Ablation study. Average RMSE values for depth image super-resolution. The low-resolution depth images are obtained by (a): nearest-neighbour downsampling, (b): bicubic downsampling and Gaussian noise.
a multi-scale fusion with deep supervision to regularize and combine multiple filtering results. Finally, a boundary-aware loss is introduced to enhance the high-frequency details of guided filtering. Experimental results on various guided image filtering applications show the superiority and flexibility of the proposed model, and the ablation experiments demonstrate the effectiveness of each component in our method.

ACKNOWLEDGMENT

REFERENCES

[1] K. He, J. Sun, and X. Tang, "Guided image filtering," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 35, no. 6, pp. 1397–1409, 2013.
[2] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, "Joint bilateral upsampling," ACM Transactions on Graphics, vol. 26, no. 3, pp. 96.1–96.4, 2007.
[3] X. Shen, C. Zhou, L. Xu, and J. Jia, "Mutual-structure for joint filtering," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3406–3414.
[4] X. Deng and P. L. Dragotti, "Deep convolutional neural network for multi-modal image restoration and fusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020.
[5] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang, "Joint image filtering with deep convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1909–1923, 2019.
[6] H. Su, V. Jampani, D. Sun, O. Gallo, E. Learned-Miller, and J. Kautz, "Pixel-adaptive convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11166–11175.
[7] B. Kim, J. Ponce, and B. Ham, "Deformable kernel networks for joint image filtering," International Journal of Computer Vision, pp. 1–22, 2020.
[8] B. Ham, M. Cho, and J. Ponce, "Robust guided image filtering using nonconvex potentials," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 1, pp. 192–207, 2018.
[9] Q. Zhang, X. Shen, L. Xu, and J. Jia, "Rolling guidance filter," in Proceedings of the European Conference on Computer Vision, 2014, pp. 815–830.
[10] B. Stimpel, C. Syben, F. Schirrmacher, P. Hoelter, A. Dörfler, and A. Maier, "Multi-modal super-resolution with deep guided filtering," in Bildverarbeitung für die Medizin 2019, Wiesbaden, 2019, pp. 110–115.
[11] L. Xu, Q. Yan, Y. Xia, and J. Jia, "Structure extraction from texture via relative total variation," ACM Transactions on Graphics, vol. 31, no. 6, pp. 1–10, 2012.
[12] L. Karacan, E. Erdem, and A. Erdem, "Structure-preserving image smoothing via region covariances," ACM Transactions on Graphics, vol. 32, no. 6, pp. 1–11, 2013.
[13] L. Xu, Q. Yan, Y. Xia, and J. Jia, "Structure extraction from texture via relative total variation," ACM Transactions on Graphics, vol. 31, no. 6, 2012.
[14] J. T. Barron and B. Poole, "The fast bilateral solver," in Proceedings of the European Conference on Computer Vision, 2016, pp. 617–632.
[15] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," in Proceedings of the Sixth International Conference on Computer Vision, 1998, pp. 839–846.
[16] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool, "Dynamic filter networks," in Advances in Neural Information Processing Systems, vol. 29, 2016, pp. 667–675.
[17] R. de Lutio, S. D'Aronco, J. D. Wegner, and K. Schindler, "Guided super-resolution as pixel-to-pixel transformation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8828–8836.
[18] J. Kwak and D. Son, "Fractal residual network and solutions for real super-resolution," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 2114–2121.
[19] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[22] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
[23] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[24] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in NIPS Workshops, 2017.
[25] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in Proceedings of the 12th European Conference on Computer Vision - Volume Part V, 2012, pp. 746–760.
[26] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang, "Deep joint image filtering," in Proceedings of the European Conference on Computer Vision, 2016, pp. 154–169.
[27] J. Diebel and S. Thrun, "An application of Markov random fields to range sensing," in Advances in Neural Information Processing Systems, 2005, pp. 291–298.
[28] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruether, and H. Bischof, "Image guided depth upsampling using anisotropic total generalized variation," in Proceedings of the 2013 IEEE International Conference on Computer Vision, 2013, pp. 993–1000.
[29] H. Wu, S. Zheng, J. Zhang, and K. Huang, "Fast end-to-end trainable guided filter," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1838–1847.
[30] T.-W. Hui, C. C. Loy, and X. Tang, "Depth map super-resolution by deep multi-scale guidance," in Proceedings of the European Conference on Computer Vision, 2016, pp. 353–369.
[31] C. Guo, C. Li, J. Guo, R. Cong, H. Fu, and P. Han, "Hierarchical features driven residual learning for depth map super-resolution," IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2545–2557, 2019.
[32] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, "A naturalistic open source movie for optical flow evaluation," in Proceedings of the European Conference on Computer Vision, 2012, pp. 611–625.
[33] S. Lu, X. Ren, and F. Liu, "Depth enhancement via low-rank matrix completion," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3390–3397.
[34] D. Scharstein and C. Pal, "Learning conditional random fields for stereo," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[35] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3166–3173.
[36] H. Hirschmüller and D. Scharstein, "Evaluation of cost functions for stereo matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[37] S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang, "Learning dynamic guidance for depth image enhancement," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3769–3778.
[38] J. Yang, X. Ye, K. Li, C. Hou, and Y. Wang, "Color-guided depth recovery from RGB-D data using an adaptive autoregressive model," IEEE Transactions on Image Processing, vol. 23, no. 8, pp. 3443–3458, 2015.
[39] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruether, and H. Bischof, "Image guided depth upsampling using anisotropic total generalized variation," in Proceedings of the 2013 IEEE International Conference on Computer Vision, 2013.
[40] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[41] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Advances in Neural Information Processing Systems, 2011.
[42] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes challenge: A retrospective," International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015.
[43] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik, "Semantic contours from inverse detectors," in Proceedings of the International Conference on Computer Vision, 2011, pp. 991–998.
Zhiwei Zhong received the B.S. degree in computer science from Heilongjiang University, Harbin, China, in 2017. He is currently pursuing the Ph.D. degree in computer science at the Harbin Institute of Technology (HIT), Harbin, China. His research interests include image processing, computer vision and deep learning.

Xiangyang Ji received the B.S. degree in materials science and the M.S. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 1999 and 2001, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He joined Tsinghua University, Beijing, in 2008, where he is currently a Professor with the Department of Automation, School of Information Science and Technology. He has authored over 100 refereed conference and journal papers. His current research interests include signal processing, image/video compression, and intelligent imaging.