0% found this document useful (0 votes)
67 views19 pages

Deep Attentional Guided Image Filtering

Uploaded by

x835254
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
67 views19 pages

Deep Attentional Guided Image Filtering

Uploaded by

x835254
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 19

JOURNAL OF LATEX CLASS FILES, VOL. 18, NO.

9, SEPTEMBER 2020 1

Deep Attentional Guided Image Filtering


Zhiwei Zhong, Xianming Liu, Member, IEEE, Junjun Jiang, Member, IEEE, Debin Zhao, Member, IEEE,
Xiangyang Ji, Member, IEEE

Abstract—Guided filter is a fundamental tool in computer guidance image are explicitly involved into filtering process
vision and computer graphics which aims to transfer structure through the photometric similarity. The guided image filtering
information from guidance image to target image. Most existing scheme proposed by He et al. [1] takes a more rigorous manner
methods construct filter kernels from the guidance itself without
considering the mutual dependency between the guidance and to exploit the structure information of the guidance, which
target. However, since there typically exist significantly different computes a locally linear model over the guidance image for
arXiv:2112.06401v2 [cs.CV] 28 Feb 2022

edges in two images, simply transferring all structural infor- filtering. These filters consider only the information contained
mation of the guidance to the target would result in various in the guidance image in filtering. However, since there
artifacts. To cope with this problem, we propose an effective typically exist significantly different edges in the two images,
framework named deep attentional guided image filtering, the
filtering process of which can fully integrate the complementary simply transferring all patterns of the guidance to the target
information contained in both images. Specifically, we propose an would introduce various artifacts. Some works [3], [8] pro-
attentional kernel learning module to generate dual sets of filter pose to utilize the optimization-based manner to find mutual
kernels from the guidance and the target, respectively, and then structures for propagation while suppressing inconsistent ones.
adaptively combine them by modeling the pixel-wise dependency However, it is challenging to select reference structures and
between the two images. Meanwhile, we propose a multi-scale
guided image filtering module to progressively generate the propagate them properly by hand-crafted objective functions.
filtering result with the constructed kernels in a coarse-to- In addition, the computational complexity of these methods is
fine manner. Correspondingly, a multi-scale fusion strategy is usually high.
introduced to reuse the intermediate results in the coarse-to-fine In recent years, learning-based approaches for GF design are
process. Extensive experiments show that the proposed frame- becoming increasingly popular, which derive GF in a purely
work compares favorably with the state-of-the-art methods in a
wide range of guided image filtering applications, such as guided data-driven manner. They allow the networks to learn how
super-resolution, cross-modality restoration, texture removal, and to adaptively select structures to transfer, and thus have the
semantic segmentation. Moreover, our scheme achieved the first ability to handle more complicated scenarios. For instance,
place in real depth map super-resolution challenge held in ACM in [16], a dynamic filter network (DFN) is proposed where
ICMR’20211 . pixel-wise filters are generated dynamically using a separate
Index Terms—Guided filter, dual regression, attentional kernel sub-network conditioned on the guidance. Unlike DFN, Su et
learning, guided super-resolution, cross-modality restoration. al. [6] adapts a standard spatially invariant kernel at each pixel
by multiplying it with a spatially varying filter. Although with
increased flexibility thanks to their adaptive nature, [16] and
I. I NTRODUCTION
[6] still suffer from the same drawback as [1], [15] that only
UIDED filter (GF), also named joint filter, is tailored
G to transfer structural information from a guidance image
to a target one. The popularity of GF can be attributed to
the guidance information is considered in filters design. Some
recent methods attempt to exploit the target and guidance
information jointly. For instance, Li et al. [5] propose to
its ability in handling visual signals in various domains and leverage two sub-networks to extract informative features from
modalities, where one modal signal serves as the guidance both the target and guidance images, which are then concate-
to improve the quality of the other one. It has been a useful nated as inputs for the fusion network to selectively transfer
tool for many image processing and computer vision tasks, salient structures from the guidance to the target. Instead of
such as depth map super-resolution [2], [5], scale-space fil- regressing the filtering results directly from the network, Kim
tering [8], [9], cross-modality image restoration [1], [3], [10], et al. [7] proposes to use spatially variant weighted averages,
structure-texture separation [11], [12], [13], image semantic where the set of neighbors and the corresponding kernel
segmentation [7], [14] and so on. weights are learned in an end-to-end manner. However, in the
In the literature, GF has been extensively studied, rang- designed networks of these methods, the simple concatenation
ing from the classical bilateral filter to the emerging deep or element-wise multiplication is exploited to combine multi-
learning-based ones. The pioneer bilateral filter [15] constructs modal information, which is not that effective. There is no
spatially-varying kernels, where local image structures of the mechanism to distinguish the contributions of the guidance and
the target to the final filtering result, and thus would also lead
Z. Zhong, X. Liu, J. Jiang and D. Zhao are with the School of Computer to erroneous structure propagation. In addition, the guidance
Science and Technology, Harbin Institute of Technology, Harbin 150001,
China, and also with Peng Cheng Laboratory, Shenzhen 518052, China E- and target images are treated as independent information
mail: {zhwzhong,csxm,jiangjunjun}@hit.edu.cn. since existing methods typically utilize two separate networks
X. Ji is with the Department of Automation, Tsinghua University, Beijing for feature extraction, thus the complementary information
100084, China. E-mail: [email protected].
contained in the two images cannot be fully exploited.
1 https://icmr21-realdsr-challenge.github.io/#Leaderboard By reviewing existing GF methods, it can be found that most
JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 2

(a) Guidance Image (b) GF (c) JBU (d) MuGF (e) CUNet

(f) DJFR (g) PAC (h) DKN (i) Ours (j) Ground Truth

Fig. 1. Guided image filtering on a RGB/D image pair for 16 × guided super-resolution: (a) Guidance image, (b) GF [1], (c) JBU [2], (d) MuGF [3], (e)
CUNet [4], (f) DJFR [5], (g) PAC [6], (h) DKN [7], (i) Ours, (j) Ground truth. The results of (b)-(c) suffer from edge blurring artifact and the results of
(d)-(h) suffer from texture-copying artifacts. Our result produces much sharper edges. Please enlarge the PDF for more details.

of them concentrate their efforts on how to transfer structural with inconsistent contents by regression on the target itself.
information from the guidance to the target. However, for We show an illustrated example in Fig. 1, which presents
some scenarios, such as cross-modality image restoration [4] the visual filtering results comparison of our scheme with the
and guided super-resolution [17], multi-modal data has signifi- state-of-the-art guided depth super-resolution methods. It can
cantly different characteristics due to the difference of sensing be found that our proposed method is capable of producing
principle, making the guidance not always trustworthy. In view high-resolution depth image with clear boundaries as well as
of this, we argue that the purpose of GF should be two-folds: avoiding texture-copying artifacts.
1) apply the guidance as a prior for reconstruction of regions The main contributions of the proposed method are sum-
in the target where there are structure-consistent contents; and marized as follows:
2) derive a plausible prediction for regions in the target with • We propose an attentional kernel learning (AKL) module
inconsistent contents of the guidance. The latter represents the for guided image filtering, which generates dual sets of
case that the guidance is no longer reliable, so we have to filter kernels from both guidance and target, and then
rely on the target itself for reconstruction. Most existing GF adaptively combines these kernels by modeling the pixel-
methods only concern structure transferring from the guidance, wise dependencies between the two images in a learning
but neglect structure prediction from the target, leading to manner. Compared with existing kernel generation ap-
erroneous or extraneous artifacts in the output. It implies that proaches, the proposed method is more robust when there
instead of performing regression on guidance only, as done are inconsistent structures between the guidance and the
in [1], [15], we should perform dual regression on both the target.
guidance and the target, and combine them adaptively in a • We propose a multi-scale guided filtering module, which
smarter manner instead of simple concatenation or element- generates the filtering result in a coarse-to-fine manner.
wise multiplication, as done in [5], [7]. “Dual regression” Correspondingly, we propose a multi-scale fusion strategy
and “smart combination” bring the main motivations of our with deep supervision to fully explore the intermediate
proposed method. results in the coarse-to-fine process. To the best of our
Accordingly, in this paper, we propose an effective deep at- knowledge, this is the first guided filter framework that
tentional guided image filtering scheme, which constructs filter learns the multi-scale kernels to filter the target image at
kernels by fully considering information from both guidance different scales in the embedding space.
and target images. Specifically, an attentional kernel learning • We evaluate the performance of the proposed method
module is proposed to generate dual sets of filter kernels from on various computational photography and computer vi-
the guidance and the target, respectively. Moreover, pixel- sion tasks, such as guided image super-resolution, cross-
wise contributions of the guidance and the target to the final modality image restoration, texture removal, and semantic
filtering result are automatically learned. In this way, we can segmentation. The quantitative and qualitative results
adaptively apply the guidance as a prior for reconstruction demonstrate the effectiveness and universality of the
of target regions where there are structure-consistent contents proposed method.
with the guidance; and derive a prediction for target regions • Considering that there is no standard protocol to train
JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 3

and evaluate the performance of guided image filtering kernel by multiplying a spatially varying filter on standard
algorithms, we reimplement eight recently proposed state- spatially invariant kernel:
of-the-art deep learning-based guided filtering models and X
unify their settings to facilitate fair comparison. All of fi = K(gi , gj )W [pi − pj ]tj + b, (4)
the codes and trained models are publicly available2 to j∈Ni

encourage reproducible research. where W is the spatially invariant kernel; K(·, ·) is a varying
The remainder of this paper is organized as follows. Sect. II filter kernel function that has a fixed form such as Gaussian,
gives a brief introduction to the relevant works of guided [pi − pj ] denotes the index offset of kernel weights. From the
filter. Sect. III introduces the proposed method for guided above formulation, it can be found that, similar to bilateral
image filtering. Sect. IV provides experimental comparisons filter and guided image filter, dynamic filter network and
with existing state-of-the-art methods for a varied range of pixel-adaptive convolution also only depend on the guidance
guided filtering tasks. Ablation experiments are presented in g in defining the filter kernels. When there are inconsistent
Sect. V to analyze the network hyper-parameters and verify structures in the guidance and the target, this approach would
the advantage of each components proposed in our model. We generate annoying artifacts in the output.
conclude the paper in Sect. VI. The recent deep joint filtering (DJF) method [5] alleviates
this drawback by jointly leveraging features of both the
II. G UIDED F ILTERS R EVISITING guidance and the target. It designs two-branch sub-networks to
In this section, we start with a revisiting of formal defini- extract features from the guidance and the target respectively,
tions of popular variants of guided filters in the literature, and which are passed through a fusion sub-network to output the
then explain our generalization of them to derive the proposed filtering result. The joint filter Φ is learned in an end-to-end
deep attentional guided image filter. manner by the following optimization:

Φ∗ = arg min kf gt − Φ(g, t)k2 , (5)


A. Classical Guided Filters Φ

Define the guidance image as g and the target image as t, where f gt is the ground truth of the output. In contrast to
the output f of guided filtering can be represented as: the implicit filter learning approach of DJF, deformable kernel
X networks (DKN) [7] explicitly learns the kernel weights K
fi = Wi,j (g, t)tj , (1) and offsets s using two-branch sub-networks from the two
j
images. Concretely, the filtering is performed by
where i and j are pixel coordinates; Wi,j is the filter kernel X
weight, whose parameters (g, t) mean that it can be derived fi = Wi,s(j) (g, t)ts(j) , (6)
from either g or t, or both. j∈Ni

In the classical bilateral filter and guided image filter, Wi,j with
is only dependent on the guidance g. Specifically, the filter
W (g, t) = K(g) K(t), (7)
weight in bilateral filter is defined as:
where K(g) and K(t) are kernel weights learned from the
   
BF 1 ki − jk kgi − gj k
Wi,j = exp − exp − , (2) guidance and the target, respectively, denotes element-
Ci σs σr
wise multiplication. Although DJF and DKN achieve better
where Ci is the normalization parameter; σs and σr are param- performance than previous methods, they treat the guidance
eters for geometric and photometric similarity, respectively. In and target images as independent information and utilize
guided image filter (He et al., [1]), the filter kernel weight is separate networks for kernel learning, thus the complementary
defined as: information contained in the two images cannot be fully
1 X  (gi − µk )(gj − µk )

exploited. In addition, the fusion approach of multi-modal
GIF
Wi,j = 1+ ,
|Nk |2 σk2 +  weights through element-wise multiplication is not effective,
k:(i,j)∈Nk
(3) in which the guidance and the target contribute equally to the
where |Nk | is the number of pixels in a window Nk ; µk and final filtering results.
σk2 are the mean and variance of g in Nk .
C. Our Strategy
B. Learning-based Guided Filters
Considering the drawbacks of existing methods, we propose
Among deep learning based approaches for guided filter a deep attentional guided image filtering scheme to more effec-
design, dynamic filter network [16] first defines a filter- tively leverage multi-modal information. Our method performs
generating network (FGN) that takes the guidance g as input dual regression on both guidance and target, and combines
to obtain location-specific dynamic filters Fθ = FGN(g, θ), them adaptively using an attention mechanism. Mathemati-
which are then applied to the target image t to yield the output cally, our filtering process can be generally formulated as
f = Fθ (t). Pixel-adaptive convolution [6] defines the filter X X
g t
fi = Ai,j Wi,j tj + (1 − Ai,j )Wi,j tj , (8)
2 https://github.com/zhwzhong/DAGF
j∈Ni j∈Ni
JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 4

Fig. 2. The network architecture of the proposed deep attentional guided image filtering (DAGF) with the number of pyramid level m = 3. DAGF consists
of a kernel generation network for constructing filter kernels and a multi-scale guided image filtering network with the purpose of filtering target image by
using the generated kernels.

g t
where Wi,j and Wi,j are filter kernels computed from the Fig. 2 illustrates the overall architecture of the proposed
guidance and the target respectively; Ai,j denotes the pixel- network, which is composed of kernel generation sub-network
wise reliability weight of the guidance image, which is deter- and multi-scale guided filtering sub-network. Instead of di-
mined automatically by considering both guidance and target rectly predicting kernels in image domain and enlarging its
information. The above formulation means that, when the receptive field by using the deformable sampling strategy
guidance information is not trustworthy, we should turn to use as [7], we employ a pyramid architecture to achieve a large
the target information itself for regression, so as to prevent the receptive field, and conduct filter learning in the feature
unreliable structure propagation. domain since deep features are more robust with respect to
appearance difference of the target and the guidance.
III. P ROPOSED M ETHOD
An effective guided image filtering scheme should be able to
identify the consistent structures contained in the guidance as • In the filter kernel generation sub-network, the multi-scale
well as avoid transferring extraneous or erroneous contents to features of I t and I g are fed into the attentional kernel
the target. In this section, we introduce in detail the proposed learning (AKL) module to generate filter kernels {Wi }.
deep attentioanl guided image filtering (DAGF) framework for The network architecture of AKL is illustrated in Fig. 3,
this purpose, where the complementary information contained where an attentional contribution module based on U-
in the two images can be fully explored in both kernel Net architecture is designed to adaptively fuse the filter
generation and image filtering process. kernels generated by the guidance and the target.
• In the guided filtering sub-network, with the derived
pixel-wise filter kernels, features of the target image are
A. Network Architecture processed in a coarse-to-fine manner to get the upsampled
t
The DAGF takes a target image I t ∈ RH×W×C (e.g., features.
g
low-resolution depth) and a guidance image I g ∈ RH×W×C
(e.g., high-resolution color image) as inputs, and generates a
t
reconstructed image I out ∈ RH×W×C as output, where H, W The above process is repeated until arriving the final scale. In
and C denote the height, width and the number of channels the following, we will elaborate these two sub-networks and
respectively. the loss function design for network training.
JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 5

are k 2 where k is the desired filter kernel size. However,


these kernels generated by the target or guidance information
alone cannot explore the dependencies among them, making
the filtering outputs suffer from blurring or texture copying
artifacts. To alleviate this problem, we introduce adaptive
kernels combination module based on a light-weight UNet
architecture, which takes both guidance and target features as
inputs and models the pixel-wise dependencies among them
in a learning manner. This process is formulated as:
Ai = UNet([Fit , Fig ]), 0 ≤ i < m, (13)
where UNet is a five-layer U-like [19] network, [·, ·] denotes
concatenation operation; Ai is the output of this module,
which can be considered as an attention map to adaptively
combine kernels constructed from guidance and target fea-
tures. The final guided filter kernels can be derived as:
Fig. 3. The network architecture of the proposed attentional kernel learning Wi = Ai Wig + (1 − Ai ) Wit , 0 ≤ i < m, (14)
(AKL) module, where denotes the element-wise multiplication and } is
concatenation operation. where Wi is the generated i-th filter kernel; 1 denotes the
all-1 matrix; means element-wise multiplication.

B. Filter Kernel Generation C. Multi-scale Guided Filtering


The filter kernel generation sub-network is tailored to After generating the guided filter kernels, the following step
generate spatial-variant kernels by considering the mutual is to perform filtering on the target image, which is done
dependency between the target and the guidance. As illustrated by the guided filtering sub-network. As shown in the right
in the left part of Fig. 2, given I t and I g as inputs, we first part of Fig. 2, it takes the target image I t as the input, and
employ two-branch pyramid network to extract multi-scale progressively filters the input target image by using the learned
features {Fit , 0 ≤ i < m} and {Fig , 0 ≤ i < m} from the filter kernels {W0 , · · · , Wm−1 } in a coarse-to-fine manner.
target and the guidance, respectively. We take the target branch Specifically, given I t as input, we first utilize Bicubic to
as an example, which is done by: resize it to the same resolution as its corresponding filter
F0t = Conv(Conv(I t )), (9) kernels:
Fit = Down(Fi−1
t
), 0 < i < m, (10) Iˆt = Bicubic(I t ). (15)
where m denotes the level of pyramid network and Conv(·) Then the filtering process can be formulated as:
is the convolution operator; Down(·) represents the down-
F0 = ResNet(GIF(Conv(Iˆt )), W0 , ), (16)
sample block with scale factor 2, which is implemented by two
↑ ˆt
convolution layers and a inverse pixel-shuffle [18] operation. Fi = ResNet(GIF([Fi−1 , Conv(I )], Wi )), 0 < i < m,
For guided image filtering, the prior information for re- (17)
construction is either from the guidance image if there are where [·, ·] means concatenation operation and ResNet(·) is
consistent structures between the guidance and target images, the function including three residual blocks (He et al., [20]); Fi
or from the target image itself if there is no reliable guidance is i-th filtered target feature; ↑ is upsampling operation. GIF(·)
information. This inspires us to design dual regression over the is a filtering operation that conducts filtering operation on the
guidance and the target, respectively, instead of only relying corresponding target features. Specifically, we first reshape the
on the guidance as done in most existing methods. To this third dimension of the filter from k 2 to k ×k, then the filtering
end, as shown in Fig. 3, we propose an attentional kernel process for a pixel {(u, v)|0 ≤ u < H, 0 ≤ v < W } can be
learning (AKL) module. It takes the extracted guidance and defined as following:
target features as inputs and consists of two steps: dual kernels σ σ
generation and adaptive kernels combination. X X
F (u, v) = Wu,v (x, y) · F̃ (u − x, v − y), (18)
The first step is the dual kernels generation, which is x=−σ y=−σ
formulated as:
where σ = bk/2c; F̂ is the output of the GIF module.
Wit = Conv(Conv(Fit )), 0 ≤ i < m, (11) m−2
Based on {Fi }i=0 , we can obtain the filter results of DAGF
Wig = Conv(Conv(Fig )), 0 ≤ i < m, (12) by using the proposed the multi-scale fusion strategy:
where Wit and Wig are the i-th constructed filter kernels F̂0 = Conv(F0 ), (19)
from the target and guidance features respectively. The spatial- F̂i = Conv(Fi ) + λi−1 · ↑
F̂i−1 ,0 < i < m, (20)
resolution of i-th kernels is the same as the one of its
corresponding input features while the number of channels Iiout t
= Conv(F̂i ) + I , 0 ≤ i < m − 1, (21)
JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 6

where λi is a learnable parameter that is initialized as 0. The E. Implementation Details


parameter enables the output layer first to rely on features of In our model, we set the number of pyramid levels as m = 3
the current layer and then gradually learn to combine high- and the size of generated kernel in AKL modules as 3 × 3.
level features from previous layers. Therefore, the output of The ablation study presented blow will verify the effectiveness
the last layer can enjoy the merit of preserving both high-level of our configuration. The hyper-parameters of our model are
contextual details and low-level spatial information. {Iiout }m−2
i=0 ω1 = 1, ω2 = 0.001 and ω3 = 1 . All the convolution layers
out
are the intermediate multi-scale results and Im−1 is the final within the proposed methods are sized of 3×3 and the channels
filtering result of the proposed scheme. of intermediate features are 32. We use PReLU [21] as the
default activation function. We utilize PixelShuffle [22] and
D. Loss function InvPixelShuffle [18] as the up-sampling and down-sampling
We adopt the residual learning strategy to train the proposed operators to resize the features in our model.
method. Let I g and I t be the input guidance and target image, In the training phase, the batch size is set as 32 and we
I h be the corresponding ground-truth image. The proposed random crop 256 × 256 image patches from the target and
DAGF network aims to learn the residual between I h and I t . guidance images as inputs. We augment the training data with
The overall all loss function is composed of three terms: a L1 random flipping and rotation. Adam [23] with β1 = 0.9 and
loss L1 , a multi-stage loss Lms and a boundary-aware loss β2 = 0.999 is employed as optimizer. The initial learning
Lb : rate is set as 1 × 10−4 and we halve it every 80 epochs, stop
• L1 loss. L1 measures the pixel-wise errors between the the training after 100 epochs. Our model is implemented by
out
output image Im−1 and its corresponding residual image Pytorch [24] and trained on one RTX 1080ti GPU. Training
r
I: the proposed method roughly takes 2 day for NYU v2 [25]
L1 = ||I h − Im−1
out
||1 . (22) datasets.
Our network takes three channels guidance and one channel
• Multi-stage loss. To stabilize the network training pro- target images as inputs. For the multi-channels target images,
cess and promote the multi-stage guided filtering module we apply the trained model separately for each channel and
to learn more effective parameters, we propose a multi- the outputs are combined to obtain the final result. For the
stage loss to enforce all intermediate results to be close single-channel guidance image, we copy the single-channel
to the ground truth residual image: three times to generate three-channels guidance image.
m−2
1 X h
Lms = ||I − Bicubic(Iiout )||1 , (23) IV. E XPERIMENTS
m − 1 i=0
In this section, we conduct extensive experiments to evaluate
where m is the number of pyramid levels. We use Bicubic the performance of the proposed method on a wide range of
interpolation to resize the output image Iiout to the same guided image filtering tasks, including guided image super-
resolution as the ground truth target image I h . resolution (e.g. depth image super-resolution and saliency map
• Boundary-aware loss. Optimizing the pixel-wise loss super-resolution, Sect. IV-A), cross-modality image restoration
(e.g., L1 and L2 ) typically cannot preserve high- (e.g. joint depth image super-resolution and denoising, and
frequency structure information well, and tends to pro- flash/non-flash image denoising, Sect. IV-B), texture removal
duce blurry images as all pixels are treated equally. To (Sect. IV-C) and image semantic segmentation (Sect. IV-D).
mitigate this problem and encourage the network to give For fair comparison, the results for the compared methods
more emphasis on the high-frequency parts, we propose are generated by using the source codes released by their
a boundary-aware loss to promote our model to generate authors with the default parameter settings, and all learning
sharper boundaries. Specifically, we first employ Sobel based methods are trained and tested on the same datasets.
operator ∇ to detect the boundary information of the
ground truth and the network output, and obtain the
A. Guided Image Super-resolution
boundary mask M :
Guided image super-resolution (GSR) is a classic computer
M = (∇x I h − ∇x Im−1
out
) (∇y I h − ∇y Im−1
out
), (24) vision task which aims to reconstruct a high-resolution (HR)
then the boundary-aware loss is defined as: image from a low-resolution (LR) one with the help of a HR
image from another modality. For example, we can obtain a
Lba = ||M Ih − M out
Im−1 ||1 , (25) HR depth by GSR using a LR depth and a HR RGB image
where denotes element-wise multiplication. as inputs, where the HR RGB image serves as the guidance.
With these three losses, the total loss is then formulated as: Following the experimental settings of [5], [7], we train our
model on the task of depth image super-resolution, and then
L = ω1 · L1 + ω2 · Lba + ω3 · Lms , (26)
evaluate the performance of the model on tasks of depth image
where ω1 , ω2 and ω3 are hyper-parameters to balance these super-resolution and saliency map super-resolution, the latter
loss functions. We set ω3 = 1 to stabilize the training one is used to verify the generalization ability of our model.
procedure at early stage and then progressively decay to zero RGB-guided Depth Super-resolution. For this task, we use
with the training progress to boost the performance of final the first 1000 RGB-D image pairs from NYU v2 dataset [25]
output. We set ω1 = 1, ω2 = 10, respectively. as the training set. In order to make fair comparison with
JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 7

TABLE I
Q UANTITATIVE COMPARISON FOR DEPTH IMAGE SUPER - RESOLUTION ON FOUR STANDARD RGB/D DATASETS IN TERMS OF AVERAGE RMSE VALUES .
F OLLOWING THE EXPERIMENTAL SETTING OF [7], [26], WE CALCULATE THE AVERAGE RMSE VALUES IN CENRIMETER FOR NYU V 2 [25] DATASET.
F OR OTHER DATASETS , WE COMPUTE THE RMSE VALUES BY SCALING THE DEPTH VALUE TO THE RANGE [0, 255]. T HE BEST PERFORMANCE FOR EACH
CASE ARE HIGHLIGHTED IN BOLDFACE WHILE THE SECOND BEST ONES ARE UNDERSCORED . F OR RMSE METRIC , THE LOWER VALUES MEAN THE
BETTER PERFORMANCE .

Datasets Middlebury Lu NYU v2 Sintel


Method 4× 8× 16× 4× 8× 16× 4× 8× 16× 4× 8× 16×

Bicubic 4.44 7.58 11.87 5.07 9.22 14.27 8.16 14.22 22.32 10.11 14.51 19.95
MRF [27] 4.26 7.43 11.80 4.90 9.03 14.19 7.84 13.98 22.20 9.87 13.45 18.19
GF [1]) 4.01 7.22 11.70 4.87 8.85 14.09 7.32 13.62 22.03 8.83 12.60 18.78
TGV [28]) 3.39 5.41 12.03 4.48 7.58 17.46 6.98 11.23 28.13 8.30 13.05 19.96
SDF [8]) 3.14 5.03 8.83 4.65 7.53 11.52 5.27 12.31 19.24 9.20 13.63 19.36
FBS [14]) 2.58 4.19 7.30 3.03 5.77 8.48 4.29 8.94 14.59 8.29 10.31 16.18
JBU [2]) 2.44 3.81 6.13 2.99 5.06 7.51 4.07 8.29 13.35 8.25 11.74 16.02
Experiment results for depth map super-resolution (Nearest-neighbour down-sampling).
DGF [29]) 3.92 6.04 10.02 2.73 5.98 11.73 4.50 8.98 16.77 7.53 11.53 17.50
DJF [26]) 2.14 3.77 6.12 2.54 4.71 7.66 3.54 6.20 10.21 7.09 9.12 12.36
DMSG [30]) 1.79 3.39 5.87 2.48 4.74 7.51 3.48 6.07 10.27 6.80 9.09 11.81
DJFR [5]) 1.98 3.61 6.07 2.22 4.54 7.48 3.38 5.86 10.11 7.05 9.12 12.61
DSRN [31]) 2.08 3.26 5.78 2.57 4.46 6.45 3.49 5.70 9.76 7.29 9.43 11.62
PAC [6]) 1.91 3.20 5.60 2.48 4.37 6.60 2.82 5.01 8.64 6.79 8.36 11.02
DKN [7]) 1.93 3.17 5.49 2.35 4.16 6.33 2.46 4.76 8.50 6.84 8.61 11.21
DAGF(Ours) 1.78 2.73 4.75 1.96 3.81 6.16 2.35 4.62 7.81 6.72 8.35 10.64
Experiment results for depth map super-resolution (Bicubic down-sampling).
DGF [29]) 1.94 3.36 5.81 2.45 4.42 7.26 3.21 5.92 10.45 5.91 8.02 11.17
DJF [26]) 1.68 3.24 5.62 1.65 3.96 6.75 2.80 5.33 9.46 5.30 7.53 10.41
DMSG [30]) 1.88 3.45 6.28 2.30 4.17 7.22 3.02 5.38 9.17 4.73 6.26 8.36
DJFR [5]) 1.32 3.19 5.57 1.15 3.57 6.77 2.38 4.94 9.18 4.90 7.39 10.33
DSRN [31]) 1.77 3.05 4.96 1.77 3.10 5.11 3.00 5.16 8.41 4.49 6.53 9.28
PAC [6]) 1.32 2.62 4.58 1.20 2.33 5.19 1.89 3.33 6.78 4.42 6.13 8.42
DKN [7]) 1.23 2.12 4.24 0.96 2.16 5.11 1.62 3.26 6.51 4.38 5.89 8.40
DAGF(Ours) 1.15 1.80 3.70 0.83 1.93 4.80 1.36 2.87 6.06 3.84 5.59 7.44

existing methods, we exploit the nearest-neighbour down- DSRN [31], PAC [6] and DKN [7]. We adopt Root Mean
sampling as the standard downsampling operator to generate Square Error (RMSE) as the evaluation metric. Lower RMSE
LR target image from the ground-truth. Three scales are values mean higher recovery quality.
considered, including 4×, 8×, 16×. To show the effectiveness Table I summarizes the quantitative comparison results
of the proposed method, we further conduct experiments on between ours and other state-of-the-art methods. The best
Bicubic downsampling as done in [7]. The performance of the performance is highlighted in bold. As can be seen from
proposed method is evaluated on the following four standard this table, our method achieves the best results among all
benchmark datasets: the compared methods on both synthetic and real datasets
• Sintel dataset [32]: this dataset consists of 1064 image (e.g. the Sintel and NYU v2 dataset) and on three scales.
pairs which are obtained by an animated 3D movie. The superior performance benefits from the more precise filter
• NYU v2 dataset [25]: this dataset contains 1449 image kernels learned and the multi-scale filtering process. Compared
pairs acquired by Microsoft Kinect. We use the last of 449 with the second best results (underlined), our results obtain the
image pairs to evaluate the performance of our method. gains of 0.12(4×), 0.24(8×) and 0.39(16×) withe respect to
• Lu dataset [33]: it contains 6 image pairs captured by average RMSE values.
ASUS Xtion Pro camera. To further analyze the performance of the proposed method,
• Middlebury dataset [34], [36]: this dataset is captured we present the visual results for 8× depth image super-
by structure light, and we utilize the 30 image pairs resolution in Fig. 4. It can be observed that the results of
from 2001-2006 datasets with the missing depth values JBU [2] suffer from jaggy artifacts. The results of GF [1]
generated by Lu et al. [33]. are over-smoothed, which indicates that the local filter is
We compare our method with 13 state-of-the-art meth- not effective at large scale factors (e.g. 8×). Compared to
ods, including two local filtering-based methods: GF [1] and GF [1] and JBU [2], the learning-based methods are capable
JBU [2]; four global optimization-based methods: MRF [27], of generating results with clearer boundaries. However, for
TGV [28], SDF [8] and FBS [14]; seven deep learning- finer details, e.g., the arm in the second image and the rope in
based methods: DGF [29], DJF [26], DMSG [30], DJFR [5], the last image, the compared learning-based methods exhibit
JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 8

(a) Guidance (b) JBU (c) GF (d) DMSG (e) DJFR (f) PAC (g) DKN (h) DAGF (i) GT

Fig. 4. Qualitative comparison for recovered depth maps (8×). (a) Guidace image, (b): JBU [2], (c): GF [1], (d): DMSG [30], (e): DJFR [5], (f): PAC [6],
(g): DKN [7], (h): DAGF, and (i) Ground truth depth map. Top to bottom: Each two rows present recovered depth maps on the NYU v2 [25], Sintel [32],
Lu [33] and Middlebury [34] datasets respectively. Please enlarge the PDF for more details.

TABLE II
Q UANTITATIVE COMPARISON OF 8× SALIENCY MAP SUPER - RESOLUTION ON THE DUT-OMRON DATASET [35]. F OLLOWING DJFR [5], WE USE
F- MEASURE TO CALCULATE THE DIFFERENCE BETWEEN THE PREDICTED SALIENCY MAP AND THE CORRESPONDING GROUND - TRUTH . T HE BEST
PERFORMANCE FOR EACH CASE IS HIGHLIGHTED IN BOLDFACE WHILE THE SECOND ONE IS UNDERSCORED F OR F- MEASURE , THE HIGHER VALUES
MEAN THE BETTER PERFORMANCE .

Methods Bicubic GF [1] DMSG [30] DJFR [5] PAC [6] FDKN [7] DKN [7] DAGF (Ours)

Fscore 0.853 0.821 0.910 0.901 0.922 0.921 0.926 0.932

obvious artifacts such as blurring on the arm and wrong estimation on the rope, which implies that the downsampling
JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 9

(a) Guidance (b) Bicubic (c) DMSG (d) DJFR (e) PAC (f) DKN (g) DAGF (h) GT

Fig. 5. Visual comparison of 8× saliency map super-resolution on the DUT-OMRON dataset [35]: (a): Guidance, (b): low-resolution image, (c): DMSG [30],
(d): DJFR [5], (e): PAC [6], (f): DKN [7], (e): DAGF, (h): Ground truth. Please enlarge the PDF for more details.

degradation brings significantly damage on the small objects models on two noise reduction tasks using flash/non-flash and
and therefore makes those regions harder to recover. On the RGB/NIR image pairs. Finally, we conduct experiments on
contrary, the results obtained by the proposed method are ToF Mark dataset [39]. It contains three real world depth
clearer, sharper, and more faithful to the ground truth image. images acquired by Time of Flight (ToF) camera, which have
RGB-guided Saliency Map Super-resolution. To further complicated multi-modality degradation
demonstrate the generalization ability of the proposed method, Joint Depth Image Super-resolution and Denoising.
we apply the model that is trained on NYU v2 dataset directly Depth images acquired by ranging sensors are typically noisy.
to the task of saliency map super-resolution without any In order to simulate the data acquisition process of the depth
fine-tuning step. Similar to DKN [7], we use 5168 image sensor, we add Gaussian noise with variance as 25 to the low-
pairs from DUT-OMRON dataset [35] to evaluate the SR resolution target depth images. We use the same experimental
performance. We use bicubic interpolation (8×) to generate settings as the task of GSR in Sect IV-A to train our model.
the low-resolution saliency maps and then super-resolve them We compare our method with ten state-of-the-art methods,
with the corresponding high-resolution color image as the including GF [1], MUF [3] and SDF [8], which are traditional
guidance. The quantitative results in terms of F-measure are model-based methods; and DGF [29], DJF [26], DMSG [30],
listed in Table II. As can be seen from this table, our DAGF DJFR [5], DSRN [31], PAC [6], DKN [7], which are deep
achieves the best result among all the compared methods learning-based methods. Since most of the existing methods
and outperforms the second best method by a large margin, do not provide experimental results for this task, we retrain all
which demonstrates the generalization ability of the proposed the deep learning-based methods with the same training and
method. In addition, we random select two images and visu- test dataset as ours.
alize the recovered high-resolution saliency map obtained by
The quantitative results in terms of RMSE values for four
different methods in Fig. 5. It can be observed that the results
benchmark datasets are reported in Table III, from which we
of Bicubic are over-smoothed, in which the structure details
can see that the proposed method can obtain consistently better
are severely damaged. DMSG [30] and DJFR [5] struggle to
results than the existing state-of-the-art methods, especially for
generate clear boundaries. The results of DKN [7] have certain
the 8× and 16× cases which are more difficult to recover. This
artifacts around the edge area. In contrast, our method is able
is mainly because that: 1) we employ a pyramid architecture
to generate high-quality saliency maps as well as keep the
to extract multi-modality features for guided kernel genera-
sharpest boundaries, which indicates that the proposed method
tion, thus the multi-scale complementary information can be
can fully take advantage of the guidance image and effectively
obtained; 2) for guided image filtering, we leverage the coarse-
transfer meaningful structure information.
to-fine strategy to filter the low-resolution target image and
thus the structure details can be progressively recovered; 3)
B. Cross-modality Image Restoration compared to single loss at the end of network, the proposed
For the task of cross-modality image restoration, we first multi-scale loss can bring stronger supervision to our model.
conduct experiments on joint depth image super-resolution and Fig. 6 further demonstrates the visual superiority of the
denoising to show the superiority of the proposed method. proposed method for joint depth image super-resolution and
Moreover, to verify the ability of the proposed method on denoising (16× Bicubic downsampling and Gaussian noise).
dealing with various visual domains, we apply the trained The results of GF [1], MUF [3] and SDF [8] still contain much
JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 10

(a) Guidance (b) Target (c) GF (d) MUF (e) SDF

(f) PAC (g) DJFR (h) DKN (i) DAGF (j) GT

Fig. 6. Qualitative comparison of joint depth map super-resolution and denoising. Please enlarge the PDF for more details. (a): Guidance Image, (b): Target
image, (c): GF [1], (d): MUF [3], (e): SDF [8], (f): PAC [6], (g): DJFR [5], (h): DKN [7], (i): DAGF and (j): Ground-truth image. ease enlarge the PDF for
more details.

noise, and the visual quality of the whole image is poor. This contrast, our method is able to remove the noise effectively
is due to that these methods are based on the locally linear and produces the clearest and sharpest boundaries.
assumption and they employ the mean filter to calculate the
Cross-modality Image Restoration. We further demon-
coefficients for pixel-wise linear representations. The methods
strate that our model trained for depth image denoising can be
of PAC [6] and DJFR [5] can remove noise well, while they
generalized to address other cross-modality image restoration
cannot preserve the sharp edge and introduce ringing artifacts.
tasks, such as flash guided non-flash image denoising and
The results of DKN are clearer and sharper than previous
NIR guided color image restroation. Fig. 8 shows the visual
methods. However, they suffer from color distortion, which
comparison among existing state-of-the-art methods and ours.
attributes to the batch normalization used in DKN [7]. In
All of the deep learning-based methods (e.g. DJFR [5] and
JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 11

(a) Guidance (b) Target (c) SDF (d) DGDIE (e) DKN (f) DAGF (g) GT

Fig. 7. Visual comparison of realistic depth map super-resolution on two examples (books and devil) from ToFMark [28] dataset: (a): Guidance image, (b):
Target image, (c): SDF [8], (d): DGDIE [37], (e): DKN [7], (f): DAGF, (e): Ground truth. Please enlarge the PDF for more details.

TABLE III
Q UANTITATIVE COMPARISON FOR JOINT DEPTH IMAGE SUPER - RESOLUTION AND DENOISING ON FOUR STANDARD RGB/D DATASETS IN TERMS OF
AVERAGE RMSE VALUES . F OLLOWING THE EXPERIMENTAL SETTING OF [7], [26]), WE CALCULATE THE AVERAGE RMSE VALUES IN CENRIMETER FOR
NYU V 2 [25] DATASET. F OR OTHER DATASETS , WE COMPUTE THE RMSE VALUES BY SCALING THE DEPTH VALUE TO THE RANGE [0, 255]. T HE BEST
PERFORMANCE FOR EACH CASE ARE HIGHLIGHTED IN BOLDFACE WHILE THE SECOND BEST ONES ARE UNDERSCORED . F OR RMSE METRIC , THE
LOWER VALUES MEAN THE BETTER PERFORMANCE .

Datasets Middlebury Lu NYU v2 Sintel


Method 4× 8× 16× 4× 8× 16× 4× 8× 16× 4× 8× 16×

DGF [29] 2.70 4.13 6.38 4.06 5.85 8.39 6.52 9.23 13.00 6.94 9.03 12.05
DJF [26] 1.80 2.99 5.16 1.85 3.13 5.39 3.74 5.95 9.61 4.88 6.93 10.05
DMSG [30] 1.79 2.69 4.75 1.88 2.79 4.84 3.60 5.31 8.07 4.74 6.36 8.72
DJFR [5] 1.86 3.07 5.27 1.91 3.21 5.51 4.01 6.21 9.90 5.10 7.12 10.23
DSRN [31] 1.84 2.99 4.70 1.97 2.98 5.94 4.36 6.31 9.75 5.49 7.21 9.80
PAC [6] 1.81 2.94 5.08 1.93 3.44 6.18 4.23 6.24 9.54 5.40 7.32 9.89
DKN [7] 1.76 2.68 4.55 1.81 2.82 4.81 3.39 5.24 8.41 4.51 6.25 9.20
DAGF (Ours) 1.72 2.61 4.24 1.74 2.72 4.51 3.25 5.01 7.54 4.42 6.09 8.25

(a) Guidance (b) Target (c) SDF (d) RTV (e) DJFR (f) DKN (g) DAGF

Fig. 8. Visual comparison of cross-modality image restoration. Top: flash guided non-flash image denoising. Bottom: NIR guided color image denoising. (a):
Guidance image, (b): Target image, (c): SDF [8], (d): RTV [13], (e): DJFR [5], (f): DKN [7], (f): DAGF. Please enlarge the PDF for more details.

DKN [7]) are tested with the same setting as ours. Among the best performance.
the compared methods, SDF [8] and RTV [13] are specially Realistic Depth Image Super-resolution. To further eval-
designed for this task. As can be seen from Fig. 8, DJFR [5] uate the robustness of the proposed method, we conduct
cannot remove noise, and the results of DKN [7] suffer experiments on ToFmark dataset [39], which include real ToF
from halo artifacts. On the contrary, the proposed DAGF can sensor data and thus have complicated multi-modality degrada-
produce more convincing results with less artifact. The method tion. Following the experimental protocol of DGDIE [37], we
of RTV [13] which is specially designed for this task, obtains first perform image completion on the acquired depth images
JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 12

(a) Target (b) RTV (c) RGF (d) SDF (e) DJFR (f) DKN (g) DAGF

Fig. 9. Visual comparisons of texture remove results. (a): Target image, (b): RTV [13], (c): RGF [9], (d): SDF [8], (e): DJFR [5], (f): DKN [7], (g): DAGF.
Please enlarge the PDF for more details.

TABLE IV they deviate from the ground truth. By comparison, the results
Q UANTITATIVE COMPARISON FOR REALISTIC DEPTH IMAGE of the proposed method are sharper and much closer to the
SUPER - RESOLUTION IN TERMS OF RMSE VALUES ON THE T O FM ARK [28]
DATASET. T HE BEST PERFORMANCE FOR EACH CASE ARE HIGHLIGHTED ground truth, especially at the boundary regions.
IN BOLDFACE WHILE THE SECOND ONES ARE UNDERSCORED .

C. Texture Removal
Methods Books Devil Shark
Texture removal is the task of extracting semantically
Bilinear 17.10 20.17 18.66 meaningful structures from textured surfaces. For this task,
JBU [2] 16.03 18.79 27.57 we use the textured image itself as the guidance, and apply
GF [1] 15.74 18.21 27.04
our model trained for depth image denoising iteratively to
TGV [28] 12.36 15.29 14.68
SDF [8] 12.66 14.33 10.68
remove small-scale textures. We compare our method with
Yang [38] 12.25 14.71 13.83 RTV [13], RGF [9], SDF [8], DJFR [5] and DKN [7]. For deep
DGDIE [37] 12.32 14.06 9.66 learning-based methods, we follow DKN [7], set the number of
DKN [7] 11.81 13.54 9.11 iterations as 4, and for other methods we carefully fine-tune the
DAGF (Ours) 11.80 13.47 9.07 parameters to provide the best results. The visual comparison
are presented in Fig. 9. Obviously, our method outperforms
other compared methods, and it can painlessly remove small-
TABLE V scale textures as well as preserve the global color variation
Q UANTITATIVE COMPARISON FOR SEMANTIC SEGMENTATION IN TERMS and main edges.
OF AVERAGE I O U ON THE VALIDATION SET OF PASCAL VOC 2012. T HE
BEST PERFORMANCE FOR EACH CASE ARE HIGHLIGHTED IN BOLDFACE
WHILE THE SECOND ONES ARE UNDERSCORED . D. Semantic Segmentation
Semantic segmentation is a fundamental computer vision
Methods Mean IoU
task, which aims at assigning pre-defined labels to each pixel
Deeplab-V2 [40] 70.69 of an image. In DGF [29], the author proposed to use guided
DenseCRF [41] 71.98 image filtering as a layer to replace the time-consuming fully
DGF [29] 72.96
connected conditional random filed (CFR) [41] for semantic
DJFR [5] 73.30
FDKN [7] 73.60
segmentation. We demonstrate that the proposed DAGF can
DAGF (Ours) 73.76 be applied to this problem. Following DGF [29], we plug the
proposed model into DeepLab-v2 [40] and train the whole
network in an end-to-end manner, and thus the offline post-
processing of CRFs can be avoided. We utilize the Pascal VOC
and then send them to our model (4× super-resolution and 2012 dataset [42] in our experiment, which contains 1264,
denoising) trained on NYU v2 dataset [25] to obtain the final 1229 and 1456 images for training, validation and testing,
results. We compare our method with a recently proposed respectively. Similar to DGF [29], we augment the training
deep learning-based method (e.g. DKN [7]) and some tradi- set with the annotations provided by [43], resulting in 10582
tional methods (e.g. TGV [28], SDF [8], DGDIE [37]). As images. The 1449 images in the validation set are employed
shown in Table IV, our method constantly obtains the best to evaluated the proposed method.
objective results for the three test images. Fig. 7 presents We use the mean intersection-over-union (IoU) score as
visual comparison results for two images (books and devil). evaluation metric and report the quantitative results for the
Form these figures, it is easy to observe that the results of validation set of Pacal VOC dataset [42] in Table V. The
SDF [8] suffer from texture-copying artifacts. The results of baseline denotes DeepLab-v2 [40] without CRF. As can be
DKN [7] are smooth and blurred, since DKN generates filter seen from this table, our method outperforms the second best
kernels without considering the inconsistence between color model DKN [7] by 0.16% mIoU and other models by a
and depth image. The results of DGDIE [37] are clear but large margin. We visualize the segmentation results among our
JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 13

(a) RGB Image (b) Deeplab-V2 (c) DGF (d) FDKN (e) DAGF (f) GT

Fig. 10. Visual comparison of semantic segmentation on the validation set of Pascal VOC 2012 dataset [42]. (a): RGB image, (b): Deeplab-V2 [40], (c):
DGF [29], (d): FDKN [7], (e): DAGF, (f): ground truth image. Please enlarge the PDF for more details.

TABLE VI k or m can increase the receptive field of our model but at


A BLATION STUDY . Q UANTITATIVE COMPARISON OF DIFFERENT SIZE OF the expense of higher computational complexity. To seek an
KERNEL (k × k) AND THE NUMBER OF PYRAMID LEVEL (m).
appropriate trade-off between complexity and performance, we
conduct experiments on the task of depth map super-resolution
Kernel Size m=1 m=2 m=3 m=4
with different k and m, and the results are summarized in
1×1 12.8 11.72 11.41 11.26 Table VI. From this table, we can see that the reconstruction
3×3 8.97 8.12 7.81 7.75 performance is significantly improved when the number of
5×5 8.60 8.02 7.73 7.98
pyramid levels m increased from 1 to 3. However, when m
7×7 8.67 7.94 7.78 7.99
is too large, e.g., m = 4, the improvements are small or even
We report average RMSE values of the last 449 image pairs in NYU v2 worse. We can draw the same conclusion for the size of filter
dataset [25].
kernels k × k. The possible reason for this phenomenon is that
the receptive filed is enough for this task when m = 3, k = 3
method and other compared methods in Fig 10, from which and larger m or k will burden the optimization process of
we can see that our method is capable of generating results network. Therefore, we set m = 3, k = 3 in our experiments.
with accurate and complete object boundaries.

V. A BLATION S TUDY B. Ablation Experiments


In this section, we first present the hyper-parameters setting As shown in Fig. 2, our model consists of two part: kernel
in our model, and then conduct a series of ablation experiments generation sub-network and multi-scale guided image filtering
to investigate the effectiveness of our main contributions, e.g., sub-network. For kernel generation sub-network, we propose
attentional kernel learning module (mentioned in Sect. III-B), to generate dual sets of kernels from the guidance and target
multi-scale fusion (mentioned in Sect. III-C) with deep su- images, and employ a tiny network to learn a weight map to
pervision (mentioned in Sect. III-D) and boundary-aware loss adaptively combine the two sets of kernels. For guided image
(mentioned in Sect. III-D). In this study, we train different filtering sub-network, we progressively filter the target image
variants of our model on the commonly used NYU v2 dataset with the learned multi-scale kernels. In order to fully integrate
(Silberman et al., [25]) with 16× nearest-neighbour downsam- the intermediate filtered results, we propose a multi-scale
pling and evaluate the performance of them on four benchmark feature fusion strategy and a multi-stage loss. To encourage
datasets. The experimental settings are the same as Sect. IV-A. our model to give more emphasis to the high-frequency and to
generate visual pleasing results, we propose to train our model
A. Hyper-parameters setting with hybird loss functions, e.g., pixel-wise loss L1 , multi-
For network hyper-parameters setting, we investigate the scale loss Lms , and boundary-aware loss Lba . To analyze the
influence the size k × k of learned filter kernels in our contribution of each component of our model, we implement
kernel generation sub-network (e.g., W0 , W1 , W2 in Fig. 2) seven variants of our model:
and the number of pyramid level m in our model for multi- • Model1, which takes (target, target) as inputs for kernel
modality feature extraction to the final performance. Enlarging generation, and is trained with L1 loss.
JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 14

TABLE VII
A BLATION S TUDY . Q UANTITATIVE COMPARISON OF DIFFERENT COMPONENTS FOR 16× DEPTH IMAGE SUPER - RESOLUTION . W E CHOSE RMSE AS THE
EVALUATION METRIC , AND THE LOWER VALUES INDICATE BETTER PERFORMANCE . M ODEL 7 IS OUR FINAL MODEL (DAGF).

Kernel Generation Kernel Combination


Model Lms Lba Middlebury Lu NYU v2 Sintel Average
Target Guidance MUL SUM AKL

Model1 X 7.08 7.87 11.99 13.67 10.15


Model2 X 5.68 7.19 9.09 11.82 8.45
Model3 X X X 5.47 6.84 9.07 11.74 8.28
Model4 X X X 5.36 6.90 8.99 11.65 8.23
Model5 X X X 5.06 6.57 8.49 11.18 7.82
Model6 X X X X 4.88 6.19 7.92 10.89 7.47
Model7 X X X X X 4.75 6.16 7.81 10.64 7.34

(a) Guidance (b) Model1 (c) Model2 (d) Model3 (e) Model4 (f) Model5

Fig. 11. Ablation Study. Visual comparison of an example without and with the proposed attentional kernel learning module (AKL) for depth image
super-resolution. The first row shows the super-resolved depth images and the last row shows the error map (I h − I out ). Please enlarge the PDF for more
details.

• Model2, which takes (guidance, guidance) as inputs for kernel generation, and is trained with L1 loss.
• Model3, which takes (target, guidance) as inputs for kernel generation, uses element-wise multiplication to combine the two generated sets of kernels, and is trained with L1 loss.
• Model4, which takes (target, guidance) as inputs for kernel generation, uses element-wise summation to combine the two generated sets of kernels, and is trained with L1 loss.
• Model5, which takes (target, guidance) as inputs for kernel generation, uses the learned weight map to adaptively combine the two generated sets of kernels, and is trained with L1 loss.
• Model6, which is Model5 but trained with the L1 loss and the Lms loss.
• Model7, which is Model5 but trained with the L1 loss, the Lms loss and the Lba loss. This is our full model.

It is noteworthy that we adjust the number of convolutional layers in the multi-scale guided image filtering sub-network to guarantee that each variant has roughly the same number of parameters as our final model. The quantitative results are shown in Table VII, from which we can see that the full model (Model7) achieves the best reconstruction performance across the four testing datasets when compared with the ablated models, and every component proposed in our model can boost the network performance significantly. In the following, we give a detailed analysis of each component in our method.

Effectiveness of Attentional Kernel Learning (AKL): In this paper, we propose to use AKL to generate filter kernels for guided image filtering. Specifically, it first generates dual sets of kernels by using the extracted guidance and target features, respectively, and then adaptively combines the generated kernels by the learned attention maps. To demonstrate the effectiveness of AKL, we implement several variants (e.g., different inputs for kernel construction and different kernel fusion strategies) of the proposed method, including Model1–Model5. The quantitative results on the four testing datasets are reported in Table VII. As can be seen from this table, Model1 generates kernels from the target image only, and thus its reconstruction accuracy is relatively low. With the assistance of the guidance image, Model2 obtains a significant improvement over Model1, which implies that the guidance information is helpful for filter kernel generation. However, the guidance images are not always reliable, e.g., color images captured in bad weather or low-light conditions. In view of this, Model3 and Model4 generate dual sets of kernels from the guidance and target images, respectively, and the difference between the two models is the strategy of kernel combination.
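To make the kernel-generation and kernel-combination variants above concrete, the following PyTorch-style sketch shows one possible realization of the attentional combination. The layer names (kernel_head_t, kernel_head_g, att_head), the fixed 3×3 kernel size, and the convex form A·K_target + (1 − A)·K_guidance (suggested by the A and 1 − A panels of Fig. 13) are our assumptions rather than the exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionalKernelCombination(nn.Module):
    """Sketch of attentional kernel learning: per-pixel 3x3 kernels are predicted
    from target and guidance features and fused with a learned attention map."""
    def __init__(self, feat_ch=64, ksize=3):
        super().__init__()
        self.ksize = ksize
        self.kernel_head_t = nn.Conv2d(feat_ch, ksize * ksize, 3, padding=1)  # kernels from target features
        self.kernel_head_g = nn.Conv2d(feat_ch, ksize * ksize, 3, padding=1)  # kernels from guidance features
        self.att_head = nn.Conv2d(2 * feat_ch, 1, 3, padding=1)               # pixel-wise attention map A

    def forward(self, feat_t, feat_g, target):
        k_t = F.softmax(self.kernel_head_t(feat_t), dim=1)       # (B, k*k, H, W), normalized kernel weights
        k_g = F.softmax(self.kernel_head_g(feat_g), dim=1)
        a = torch.sigmoid(self.att_head(torch.cat([feat_t, feat_g], dim=1)))  # (B, 1, H, W), in (0, 1)
        k = a * k_t + (1.0 - a) * k_g                             # attention-weighted fusion (Model5)
        # k = k_t * k_g      # element-wise multiplication variant (Model3)
        # k = k_t + k_g      # element-wise summation variant (Model4)
        # Apply the spatially-varying kernels to the target image.
        b, c, h, w = target.shape
        patches = F.unfold(target, self.ksize, padding=self.ksize // 2)       # (B, c*k*k, H*W)
        patches = patches.view(b, c, self.ksize * self.ksize, h, w)
        out = (patches * k.unsqueeze(1)).sum(dim=2)               # per-pixel filtering, (B, c, H, W)
        return out

Replacing the attention-weighted sum with the commented-out product or sum reproduces the Model3/Model4 fusion strategies, while Model6 and Model7 differ from Model5 only in the training losses.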

(a) Guidance (b) Target (c) w/o Lba (d) w/ Lba (e) GT

Fig. 12. Ablation Study. Visual comparison of an example without and with the proposed boundary-aware loss for depth image super-resolution. Please enlarge the PDF for more details.

(a) Guidance (b) A0 (c) A1 (d) A2
(e) Target (f) 1 − A0 (g) 1 − A1 (h) 1 − A2

Fig. 13. Ablation Study. Visualization of the learned multi-scale attention maps for kernel combination. We resize the attention maps to the same size for better visualization. Please enlarge the PDF for more details.

(a) Train (b) Test

Fig. 14. Ablation Study. Training and testing RMSE values on the NYU v2 dataset (Silberman et al. [25]) for 16× depth image super-resolution. MS denotes the proposed multi-stage loss Lms.

As shown in Table VII, Model3 and Model4 further improve the accuracy over Model2 (the average RMSE drops from 8.45 to 8.28 and 8.23, respectively), which indicates that constructing kernels from both the target and guidance images enjoys some benefits over using only the guidance. Nevertheless, using element-wise multiplication or summation to combine the generated kernels limits the capacity of the network, since both strategies ignore the inconsistency between the guidance and target images. To solve this problem, we first learn an attention map and then utilize it to selectively combine the dual kernels as in Eq. 14. As depicted in Table VII, equipped with AKL, Model5 reduces the average RMSE from 8.28 (Model3) to 7.82.

To visually show the effect of AKL, we present in Fig. 11 the super-resolved depth images (first row) and error maps (last row) obtained with different configurations. The error map is computed as I^h − I^out. As shown in Fig. 11, the result of Model1 is blurry and lacks high-frequency details. In the error map of Model1, most of the values at the image boundaries are positive, which means that the boundaries generated by Model1 are weaker than those of the ground truth. The reason is that the kernels generated from the target image alone cannot recover the high-frequency details lost in the image degradation process. On the contrary, most values in the error map of Model2 are negative: although the depth boundaries are enhanced, texture-copying artifacts seriously degrade the super-resolved depth maps. Thanks to the proposed attentional kernel learning scheme, which constructs kernels by fully integrating the complementary information contained in both the guidance and target images, the visual effect and reconstruction accuracy of Model5 are substantially improved.

Moreover, we visualize the attention maps in Fig. 13 to further validate the capability of the proposed AKL. We can see that the kernels generated from the target and guidance images are both important for guided filtering, as most of the pixel values in the attention maps lie in the range [0.4, 0.6]. In addition, as shown in the first row of Fig. 13, the structure regions are lighter than the texture regions, which indicates that our model can adaptively select relevant information from the guidance image while avoiding texture over-transfer.

Effectiveness of Multi-scale Fusion and Deep Supervision: In this paper, we propose a multi-scale framework for guided image filtering. Specifically, in order to obtain both high-level structure information and low-level details, we propose to fuse the multi-level filtered outputs. Moreover, a multi-stage loss is introduced to enforce the intermediate results to be close to the ground-truth target image. The quantitative results are illustrated in Table VII. As expected, Model6, trained with a hybrid loss of L1 and Lms, further improves the reconstruction accuracy. Fig. 14 further shows the training (left) and testing (right) RMSE plots. We observe that the multi-stage loss (Lms) is able to accelerate convergence and produce results with lower RMSE values.
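As a minimal sketch of the deep supervision just described, and assuming equal weights across pyramid levels (the exact weighting may differ from the paper's formulation), the multi-stage loss can be written as an L1 penalty on every intermediate output after upsampling it to the ground-truth resolution:

import torch.nn.functional as F

def multi_stage_loss(outputs, gt, weights=None):
    # outputs: list of intermediate predictions at increasing resolution
    # gt: ground-truth target, shape (B, C, H, W)
    weights = weights if weights is not None else [1.0] * len(outputs)
    loss = 0.0
    for w, out in zip(weights, outputs):
        up = F.interpolate(out, size=gt.shape[-2:], mode='bilinear', align_corners=False)
        loss = loss + w * F.l1_loss(up, gt)
    return loss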

Fig. 15. Ablation Study. Average RMSE values for depth image super-resolution. The low-resolution depth images are obtained by (a) nearest-neighbour downsampling and (b) bicubic downsampling with Gaussian noise.
Effectiveness of Boundary-aware Loss: To encourage the network to pay more attention to high-frequency information, we propose to train our model with the boundary-aware loss (Lba). Table VII demonstrates that the Lba loss is helpful in improving the reconstruction accuracy (Model7). Fig. 12 presents an example visual comparison with and without the Lba loss. Obviously, Lba further improves the visual quality, yielding more precise edges. The boundaries on the doorframe and the corner of the mattress are sharper and clearer, which verifies the effectiveness of the proposed boundary-aware loss.
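The exact form of Lba is defined earlier in the paper; purely as an illustration of the underlying idea, one hedged sketch is to weight the per-pixel L1 error by an edge mask derived from the ground-truth depth. The Sobel-based mask below is our assumption and not necessarily the authors' formulation.

import torch
import torch.nn.functional as F

def boundary_aware_loss(pred, gt, eps=1e-6):
    # pred, gt: single-channel depth maps of shape (B, 1, H, W)
    sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                           device=gt.device).view(1, 1, 3, 3)
    sobel_y = sobel_x.transpose(2, 3)
    gx = F.conv2d(gt, sobel_x, padding=1)
    gy = F.conv2d(gt, sobel_y, padding=1)
    weight = torch.sqrt(gx ** 2 + gy ** 2)
    weight = weight / (weight.max() + eps)     # normalize edge weights to [0, 1]
    return (weight * (pred - gt).abs()).mean() # emphasize errors near depth boundaries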
Effectiveness of Guidance Branch: The general principle of guided image filtering is that we can transfer the valuable structure information contained in the guidance image to the target image. Recently, various approaches have been proposed for guided image filtering. Nevertheless, most of them focus on designing advanced algorithms for efficiently transferring structures from the guidance to the target image, while the contributions of guidance images under different conditions are rarely explored. Here, we conduct experiments on several applications of guided image super-resolution, e.g., depth image super-resolution (nearest-neighbour downsampling) and noisy depth super-resolution (bicubic downsampling with Gaussian noise). As shown in Fig. 15, we evaluate the role of the guidance image and compute the average RMSE value for each upsampling factor. Model1 takes the target image as input for kernel generation, while Model2 takes the guidance image as input for kernel generation. The results show that the guidance image provides significant assistance for the 8× and 16× cases, and the model (DAGF) equipped with the proposed AKL further improves the performance. However, for the 4× case, which is easy to recover, the guidance information has a negligible effect. The main reason is that the target image is not severely damaged by the downsampling degradation and can therefore be easily recovered by Model1. For the more difficult cases (8× and 16×), the target image is badly degraded, and the guidance image plays an important role in the reconstruction process.

Fig. 16. Average runtime (in seconds) and root mean square error (RMSE) comparison for 16× depth image super-resolution on the NYU v2 dataset [25]. All runtimes are evaluated on the same NVIDIA 1080Ti GPU with a depth image size of 480 × 640.

Performance vs. Complexity Analysis: In Fig. 16, we compare the running time of our method with that of the comparison methods on NYU v2 (Silberman et al. [25]) for 16× depth image super-resolution. For a fair comparison, all running times are obtained on the same machine with one NVIDIA 1080Ti GPU. As shown in Fig. 2, our method produces multiple results I_0^out, I_1^out, and I_2^out; we first resize them to the same resolution as the ground-truth target image by simple bilinear interpolation and then calculate the RMSE values. As illustrated in Fig. 16, the final result I_2^out achieves a better RMSE than DKN (Kim et al. [7]) and DSRN (Guo et al. [31]) while requiring less time. The time cost of I_0^out is the lowest, and its performance is comparable to the other methods. If the goal is to achieve the best possible performance, we can increase the number of pyramid levels; otherwise, we can reduce it. Overall, our method achieves a better trade-off between reconstruction performance and computational complexity.
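For reference, the evaluation protocol described above (bilinear upsampling of each intermediate output to the ground-truth resolution followed by RMSE computation, with runtime measured on the GPU) can be sketched as follows; the function and variable names are hypothetical, and the model is assumed to return the list of pyramid outputs [I_0^out, I_1^out, I_2^out].

import time
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate_pyramid_outputs(model, target_lr, guidance, gt):
    # Run the network once and time it (synchronize when running on GPU).
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    outputs = model(target_lr, guidance)   # assumed to return [I_0^out, I_1^out, I_2^out]
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    runtime = time.time() - start
    # Resize every intermediate output to the ground-truth resolution and compute RMSE.
    rmses = []
    for out in outputs:
        up = F.interpolate(out, size=gt.shape[-2:], mode='bilinear', align_corners=False)
        rmses.append(torch.sqrt(F.mse_loss(up, gt)).item())
    return rmses, runtime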

VI. CONCLUSION

In this paper, we present an effective network architecture for guided image filtering, which can automatically select and transfer important structures from the guidance to the target image. Specifically, an attentional kernel learning module (AKL) is proposed to generate dual sets of filter kernels from the guidance and target images, respectively, and to adaptively combine the learned kernels in a data-driven manner. Furthermore, a multi-scale guided image filtering framework is introduced, which takes the generated kernels and the target image as inputs and progressively filters the target image in a coarse-to-fine manner. Moreover, to fully exploit the intermediate results of the coarse-to-fine process, we propose a multi-scale fusion with deep supervision to regularize and combine the multiple filtering results. Finally, a boundary-aware loss is introduced to enhance the high-frequency details of guided filtering. Experimental results on various guided image filtering applications show the superiority and flexibility of the proposed model, and the ablation experiments demonstrate the effectiveness of each component of our method.

ACKNOWLEDGMENT

REFERENCES

[1] K. He, J. Sun, and X. Tang, "Guided image filtering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 6, pp. 1397–1409, 2013.
[2] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, "Joint bilateral upsampling," ACM Transactions on Graphics, vol. 26, no. 3, pp. 96.1–96.4, 2007.
[3] X. Shen, C. Zhou, L. Xu, and J. Jia, "Mutual-structure for joint filtering," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3406–3414.
[4] X. Deng and P. L. Dragotti, "Deep convolutional neural network for multi-modal image restoration and fusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020.
[5] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang, "Joint image filtering with deep convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1909–1923, 2019.
[6] H. Su, V. Jampani, D. Sun, O. Gallo, E. Learned-Miller, and J. Kautz, "Pixel-adaptive convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11166–11175.
[7] B. Kim, J. Ponce, and B. Ham, "Deformable kernel networks for joint image filtering," International Journal of Computer Vision, pp. 1–22, 2020.
[8] B. Ham, M. Cho, and J. Ponce, "Robust guided image filtering using nonconvex potentials," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 1, pp. 192–207, 2018.
[9] Z. Qi, X. Shen, X. Li, and J. Jia, "Rolling guidance filter," in Proceedings of the European Conference on Computer Vision, 2014, pp. 815–830.
[10] B. Stimpel, C. Syben, F. Schirrmacher, P. Hoelter, A. Dörfler, and A. Maier, "Multi-modal super-resolution with deep guided filtering," in Bildverarbeitung für die Medizin 2019, Wiesbaden, 2019, pp. 110–115.
[11] L. Xu, Q. Yan, Y. Xia, and J. Jia, "Structure extraction from texture via relative total variation," ACM Transactions on Graphics, vol. 31, no. 6, pp. 1–10, 2012.
[12] L. Karacan, E. Erdem, and A. Erdem, "Structure-preserving image smoothing via region covariances," ACM Transactions on Graphics, vol. 32, no. 6, pp. 1–11, 2013.
[13] L. Xu, Q. Yan, Y. Xia, and J. Jia, "Structure extraction from texture via relative total variation," ACM Transactions on Graphics, vol. 31, no. 6, 2012.
[14] J. T. Barron and B. Poole, "The fast bilateral solver," in Proceedings of the European Conference on Computer Vision, 2016, pp. 617–632.
[15] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," in Proceedings of the Sixth International Conference on Computer Vision, 1998, pp. 839–846.
[16] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool, "Dynamic filter networks," in Advances in Neural Information Processing Systems, vol. 29, 2016, pp. 667–675.
[17] R. D. Lutio, S. D'aronco, J. D. Wegner, and K. Schindler, "Guided super-resolution as pixel-to-pixel transformation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8828–8836.
[18] J. Kwak and D. Son, "Fractal residual network and solutions for real super-resolution," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 2114–2121.
[19] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[22] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
[23] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[24] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in NIPS-W, 2017.
[25] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in Proceedings of the 12th European Conference on Computer Vision - Volume Part V, 2012, pp. 746–760.
[26] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang, "Deep joint image filtering," in Proceedings of the European Conference on Computer Vision, 2016, pp. 154–169.
[27] J. Diebel and S. Thrun, "An application of Markov random fields to range sensing," in Advances in Neural Information Processing Systems, 2005, pp. 291–298.
[28] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruether, and H. Bischof, "Image guided depth upsampling using anisotropic total generalized variation," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 993–1000.
[29] H. Wu, S. Zheng, J. Zhang, and K. Huang, "Fast end-to-end trainable guided filter," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1838–1847.
[30] T.-W. Hui, C. C. Loy, and X. Tang, "Depth map super-resolution by deep multi-scale guidance," in Proceedings of the European Conference on Computer Vision, 2016, pp. 353–369.
[31] C. Guo, C. Li, J. Guo, R. Cong, H. Fu, and P. Han, "Hierarchical features driven residual learning for depth map super-resolution," IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2545–2557, 2019.
[32] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, "A naturalistic open source movie for optical flow evaluation," in Proceedings of the European Conference on Computer Vision, 2012, pp. 611–625.
[33] S. Lu, X. Ren, and F. Liu, "Depth enhancement via low-rank matrix completion," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3390–3397.
[34] D. Scharstein and C. Pal, "Learning conditional random fields for stereo," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[35] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. Yang, "Saliency detection via graph-based manifold ranking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3166–3173.
[36] H. Hirschmüller and D. Scharstein, "Evaluation of cost functions for stereo matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[37] S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang, "Learning dynamic guidance for depth image enhancement," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3769–3778.
[38] J. Yang, X. Ye, K. Li, C. Hou, and Y. Wang, "Color-guided depth recovery from RGB-D data using an adaptive autoregressive model," IEEE Transactions on Image Processing, vol. 23, no. 8, pp. 3443–3458, 2015.
[39] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruether, and H. Bischof, "Image guided depth upsampling using anisotropic total generalized variation," in Proceedings of the IEEE International Conference on Computer Vision, 2013.
[40] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[41] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Advances in Neural Information Processing Systems, 2011.
[42] M. Everingham, S. Eslami, L. V. Gool, C. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes challenge: A retrospective," International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015.
[43] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik, "Semantic contours from inverse detectors," in Proceedings of the International Conference on Computer Vision, 2011, pp. 991–998.

Zhiwei Zhong received the B.S. degree in computer science from Heilongjiang University, Harbin, China, in 2017. He is currently pursuing the Ph.D. degree in computer science at the Harbin Institute of Technology (HIT), Harbin, China. His research interests include image processing, computer vision and deep learning.

Xiangyang Ji received the B.S. degree in materials science and the M.S. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 1999 and 2001, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He joined Tsinghua University, Beijing, in 2008, where he is currently a Professor with the Department of Automation, School of Information Science and Technology. He has authored over 100 refereed conference and journal papers. His current research interests include signal processing, image/video compression, and intelligent imaging.

Xianming Liu received the B.S., M.S., and Ph.D.


degrees in computer science from the Harbin Insti-
tute of Technology (HIT), Harbin, China, in 2006,
2008, and 2012, respectively. In 2011, he spent half
a year at the Department of Electrical and Computer
Engineering, McMaster University, Canada, as a Vis-
iting Student, where he was a Post-Doctoral Fellow
from 2012 to 2013. He was a Project Researcher
with the National Institute of Informatics (NII),
Tokyo, Japan, from 2014 to 2017. He is currently
a Professor with the School of Computer Science
and Technology, HIT. He has published over 50 international conference and
journal publications, including top IEEE journals, such as T-IP, T-CSVT, T-
IFS, and T-MM, and top conferences, such as ICML, CVPR, IJCAI. He was
a recipient of the IEEE ICME 2016 Best Student Paper Award.

Junjun Jiang received the B.S. degree from the


Department of Mathematics, Huaqiao University,
Quanzhou, China, in 2009, and the Ph.D. degree
from the School of Computer, Wuhan University,
Wuhan, China, in 2014. From 2015 to 2018, he
was an Associate Professor at China University of
Geosciences, Wuhan. Since 2016, he has been a
Project Researcher with the National Institute of
Informatics, Tokyo, Japan. He is currently a Pro-
fessor with the School of Computer Science and
Technology, Harbin Institute of Technology, Harbin,
China. He won the Finalist of the World’s FIRST 10K Best Paper Award at
ICME 2017, and the Best Student Paper Runner-up Award at MMM 2015. He
received the 2016 China Computer Federation (CCF) Outstanding Doctoral
Dissertation Award and 2015 ACM Wuhan Doctoral Dissertation Award. His
research interests include image processing and computer vision.

Debin Zhao received the B.S., M.S., and Ph.D.


degrees in computer science from Harbin Institute of
Technology, China in 1985, 1988, and 1998, respec-
tively. He is now a professor in the Department of
Computer Science, Harbin Institute of Technology.
He has published over 200 technical articles in refer-
eed journals and conference proceedings in the areas
of image and video coding, video processing, video
streaming and transmission, and pattern recognition.
