
Pattern Recognition 143 (2023) 109819

Pattern Recognition
journal homepage: www.elsevier.com/locate/pr

Haar wavelet downsampling: A simple but effective downsampling module for semantic segmentation

Guoping Xu a, Wentao Liao a, Xuan Zhang a, Chang Li b, Xinwei He c, Xinglong Wu a,*

a School of Computer Science and Engineering, Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan, 430205, Hubei, PR China
b Department of Biomedical Engineering, Hefei University of Technology, Hefei, 230009, Anhui, PR China
c School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, 430074, Hubei, PR China

A R T I C L E  I N F O

Keywords:
Semantic segmentation
Downsampling
Haar wavelet
Information entropy

A B S T R A C T

Downsampling operations such as max pooling or strided convolution are ubiquitously utilized in Convolutional Neural Networks (CNNs) to aggregate local features, enlarge the receptive field, and minimize computational overhead. However, for a semantic segmentation task, pooling features over a local neighbourhood may result in the loss of important spatial information, which is conducive to pixel-wise predictions. To address this issue, we introduce a simple yet effective pooling operation called the Haar Wavelet-based Downsampling (HWD) module. This module can be easily integrated into CNNs to enhance the performance of semantic segmentation models. The core idea of HWD is to apply the Haar wavelet transform to reduce the spatial resolution of feature maps while preserving as much information as possible. Furthermore, to investigate the benefits of HWD, we propose a novel metric, named the feature entropy index (FEI), which measures the degree of information uncertainty after downsampling in CNNs. Specifically, the FEI can be used to indicate the ability of downsampling methods to preserve essential information in semantic segmentation. Our comprehensive experiments demonstrate that the proposed HWD module can (1) effectively improve segmentation performance across image datasets of different modalities with various CNN architectures, and (2) efficiently reduce information uncertainty compared to conventional downsampling methods. Our implementation is available at https://github.com/apple1986/HWD.

1. Introduction

Semantic segmentation is a fundamental task in computer vision that involves assigning a label to each pixel of an input image [1]. It has many real-world applications, including vehicle navigation [2] and medical image analysis [3]. The development of deep convolutional neural networks (DCNNs), such as AlexNet [4] and ResNet [5], has led to significant progress in semantic segmentation. These networks consist of multiple layers of convolutions, normalizations, and activation functions. As a basic component of DCNNs, the downsampling operation is frequently applied to alter the resolution of inputs, helping reduce computation overhead and enlarge the receptive field for neural networks.

DCNNs commonly employ standard downsampling operations, such as max pooling, average pooling, and strided convolution, which may lead to information loss. The lost information, like boundary and texture, is likely essential for semantic segmentation [6,7]. To alleviate this problem, several approaches have been proposed, which can generally be summarized as follows: (1) feeding more information by skip connections to a decoder sub-network, as in U-Net [8], LCU-Net [9], CENet [10], LinkNet [11], and RefineNet [12]. In U-Net, the downsampled feature maps from the encoder are passed to the corresponding stage of the decoder via skip connections; in RefineNet, the information at different stages of the decoder is fully exploited and fused with skip connections to obtain a high-resolution prediction. (2) Extracting multi-scale feature maps with spatial pyramid pooling or dilated convolution for a fusion module, as in DeepLab [13], PSPNet [14], PCPLP-Net [15], BiSeNet [16], and ICNet [17]. Specifically, in DeepLab, the encoder module captures multi-scale information with the proposed atrous spatial pyramid pooling (ASPP) block to refine segmentation results by gradually recovering spatial information; in PSPNet, a pyramid pooling module harvests representations at different scales to compensate for lost information.

* Corresponding author.
E-mail address: [email protected] (X. Wu).

https://doi.org/10.1016/j.patcog.2023.109819
Received 4 August 2022; Received in revised form 15 June 2023; Accepted 12 July 2023
Available online 13 July 2023
0031-3203/© 2023 Elsevier Ltd. All rights reserved.

Fig. 1. Downsampling examples of average pooling, max pooling, strided convolution and the HWD in DeepLabv3+ [13]. Compared with conventional downsampling methods, the feature after HWD preserves more boundary, texture and detail information, as marked by the four red squares, where tree branches are better preserved in (d). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

(3) Providing multi-modality images to the encoder, like DiSegNet [18], MMADT [19], CANet [20] and CCFFNet [21]. Due to the complementary properties of images from different modalities, this can also mitigate the effect of information loss after downsampling. (4) Increasing prior information. A contour-enhanced attention module was proposed in [22], aiming to extract boundary and shape cues from CT images for refining the segmented areas. In [23], superpixels are used to help guide local and global consistency by considering rule-based appearance information. Besides, ensemble adversarial attacks are introduced in [24] for generating highly diverse adversarial examples, aiming to enhance the diversity of perturbations over the class distribution. In [25] and [26], category prior information is brought into CNNs to increase the robustness of DCNNs.

In a nutshell, the main purpose of these methods is to help establish a good relationship between downsampled features and segmentation labels by providing more learned information or features based on various strategies, such as multi-scale, prior guidance, and multi-modality.

Although the previous works can partly alleviate the problem of information loss after downsampling and help to extract representative feature maps by aggregating more information, the main problem is that simple aggregation of features (e.g., summation or concatenation) may introduce unrelated information into the network, which hinders discriminative feature learning [27,28]. In other words, conventional downsampling methods in DCNNs lead to information loss. The lost information, such as boundary, scale, and texture, is important for object segmentation, especially for small-scale objects or objects obscured between the background and foreground [7,29]. Despite the incorporation of various techniques such as skip connections, multi-scale features, and prior knowledge, recovering the essential lost information through subsequent layers remains a significant challenge. Hence, our first question is: if regular downsampling methods cause information loss, can we design an information-preserving downsampling module that keeps as much information as possible in DCNNs for semantic segmentation?

Inspired by lossless information transformation methods [30], we introduce the Haar wavelet transform into our downsampling module. The theory and practice of the Haar wavelet transform have been extensively studied [31]. Early works on the Haar wavelet transform in image processing focused on image decomposition [32], compression [33], denoising [34], reconstruction [35], etc. The main advantage of the Haar wavelet is its multi-scale, lossless, and fast signal decomposition. In this paper, we exploit this advantage of the Haar wavelet transform and propose the Haar wavelet downsampling (HWD) module, a new downsampling method for segmentation tasks. Our main motivation is that if the information of an image can be preserved as much as possible in the encoder part of a deep neural network, it becomes easier to extract discriminative features, which may help improve segmentation performance. Hence, in contrast to conventional downsampling methods that directly decrease the spatial resolution of feature maps while losing information, we propose an alternative approach. First, we explicitly increase the channel number of the feature maps and reduce their resolution using the Haar wavelet transform, without loss of information. Then, we employ convolution operations for representative feature learning to filter out redundant information. Figure 1 shows several downsampling examples using average pooling, max pooling, strided convolution and the proposed HWD in DeepLabv3+ [13]. The HWD preserves more detail than the other three downsampling methods.

With the integration of HWD into DCNNs, a second question arises: how can we confirm or measure the amount of essential features learned by the subsequent convolution layers? Here, essential features are those that help the DCNN generate predictions closer to the corresponding ground truth. In other words, if the extracted features are representative, the neural network will have more confidence (certainty) in predicting the desired results, suggesting that the uncertainty of information is relatively low.


Fig. 2. Illustration of four pooling methods.

Previous studies have widely utilized the concept of information entropy [36] to assess the degree of uncertainty or randomness present in signals or images in a communication system [37]. Considering a neural network as a communication system, the primary purpose of this system in segmentation tasks is to reduce the uncertainty between input images and their corresponding semantic labels [38,39]. Inspired by the concept of information entropy for evaluating the uncertainty of information, we propose a novel metric called the Feature Entropy Index (FEI) to quantify the uncertainty between the features and the prediction output in DCNNs. In particular, the FEI can be employed to estimate the degree of uncertainty of the downsampled features, which reflects the importance of the features relative to the ground truth.

In summary, our main contributions are as follows:

(1) We propose a novel wavelet-based downsampling module (HWD) for CNNs. To the best of our knowledge, our method is the first attempt to explore the feasibility of impeding information loss in the downsampling stage of DCNNs for the semantic segmentation task.
(2) We explore the measurement of information uncertainty across layers in CNNs, and propose a novel index (FEI) to evaluate the information uncertainty, or feature importance, between the downsampled feature maps and the prediction results.
(3) The proposed HWD can directly replace strided convolution or pooling layers without significantly increasing the computational overhead, and can be easily integrated into current segmentation architectures. Extensive experiments demonstrate the effectiveness of the HWD module in comparison with seven state-of-the-art (SOTA) segmentation methods.

The rest of this paper is organized as follows: Section 2 reviews related work on downsampling in DCNNs. Section 3 describes the proposed HWD module and the definition of the FEI. Experiments and results are presented in Section 4. Section 5 gives the discussion and conclusion of this study and points out its limitations and future work.

2. Related work

2.1. Wavelet transform in CNNs

Several studies have explored the utilization of wavelet transform in CNNs to improve feature representation in various tasks such as classification, super-resolution, denoising, and more. In [40], a wavelet CNN that integrates multi-scale resolution analysis into CNNs was presented and applied to the tasks of texture classification and image annotation, yielding superior accuracy compared to prior approaches. In [41], a multi-level wavelet CNN (MWCNN) model was proposed for image restoration, aiming to strike a better balance between receptive field size and computational efficiency. In [42], the MWCNN was utilized as a denoiser prior for restoring blurred images corrupted by Cauchy noise. Similarly, a wavelet-based CNN (Wavelet-SRNet) was introduced for multi-scale face super-resolution, where wavelet coefficients of high-resolution feature maps were learned before reconstructing high-resolution images [43]. In [44], a wavelet-like transform was integrated into a CNN for image compression, using an update-first lifting scheme to support multi-resolution analysis. These studies leverage the benefits of feature learning from CNNs and the multi-scale resolution analysis offered by wavelet techniques. In contrast to these studies, we investigate the Haar wavelet transform as an alternative to the downsampling method in CNNs, and focus on the image segmentation task.

2.2. Downsampling methods for feature maps

Downsampling operations offer inherent advantages, leading to their widespread utilization in diverse tasks, including ordinal classification [26], deep face recognition [45], semantic re-ranking systems [25], and autonomous driving [24]. Numerous downsampling methods have been proposed to reduce the spatial resolution of feature maps, which can decrease the computational requirements of CNNs while increasing the receptive field of subsequent convolutions, as in ResNet [5] and U-Net [8]. Currently, two types of downsampling methods are frequently used in CNNs, namely the pooling operation and the strided convolution operation. Figure 2 illustrates four types of pooling methods. Pooling operations are based on a neighborhood approach and have no additional learning parameters. Max pooling and average pooling are the two main downsampling methods adopted in many segmentation architectures, like FCN, U-Net and PSPNet. There are also some works on adaptive pooling, like stochastic pooling [46] and SoftPool [47]. In [48], a wavelet pooling for CNNs was introduced. This method performs a two-level wavelet decomposition of features and discards the first-level subbands to reduce feature dimensions, which shows superiority in addressing the overfitting problem in image classification tasks. The second category of downsampling uses a strided convolution operation to reduce the size of feature maps. It aims to preserve perceptually important details and align local image features in a learning manner [49,50].


Fig. 3. The architecture of the proposed HWD module, which consists of two main blocks: a lossless feature encoding block and a feature representation learning block. Note that the channel number of the feature maps can be adjusted by the representation learning block.

For example, Chen et al. [51] proposed a convolutional block for learning fractional downsampling. If we consider the downsampling operation as information encoding, the above downsampling methods (pooling or strided convolution) will result in information loss. The lost information, such as boundary, scale, and texture, plays an essential role in semantic segmentation. In other words, subsequent layers will have a better ability to learn representative features if more information is preserved after downsampling. Therefore, unlike previous studies, we propose a simple yet effective downsampling module, HWD, in this paper, aiming to preserve information while reducing the spatial resolution of feature maps. Any kind of downsampling operation, like strided convolution or max pooling, can be directly replaced with HWD at a minimal cost.

2.3. Image quality assessment

Extensive research has been conducted in the field of Image Quality Assessment (IQA) to evaluate image distortion and human observers' perception. Based on the availability of an original reference image, IQA methods can be broadly categorized into two approaches: full-reference (FR) and no-reference (NR). As FR IQA, the peak signal-to-noise ratio, the mean square error and the structural similarity index measure (SSIM) [52] have become commonly used metrics. In particular, SSIM, which models a combination of correlation loss, luminance distortion and contrast distortion, is widely accepted because of its reasonable correlation with the perception of the human visual system (HVS). Several variants based on SSIM have been proposed, such as the Multi-Scale Structural Similarity Index (MSSIM) [53], which incorporates multi-scale spatial information, and the Information-Weighted Structural Similarity Index (IWSSIM) [54], which integrates information-weighted entropy. Furthermore, researchers found that the HVS is more sensitive to visual information [55,56]. The Gradient and Structure Similarity Model (GSM) [55] integrates gradient and luminance information to provide a comprehensive assessment of image quality. Similarly, the Feature Similarity Index Measure (FSIM) [56] employs phase congruency and gradient magnitude to assess image quality. With the development of deep learning, some NR metrics based on deep neural networks have emerged for IQA. For instance, the Inception Score (IS) [57] was proposed to evaluate image quality in generative adversarial networks. It directly applies a pretrained Inception model to generated images to obtain the conditional label distribution, based on the assumption that a high-quality image should contain meaningful objects. The Fréchet Inception Distance (FID) [58] is a metric that calculates the distance between feature vectors computed for real and generated images.

In summary, the majority of IQA metrics are employed to evaluate the quality of original images. This study focuses on the development of a metric for assessing the quality of downsampled feature maps. The metric aims to quantify the degree of information uncertainty or feature importance by comparing the downsampled feature maps with the prediction results.

3. Methods

As illustrated in Fig. 3, the proposed HWD module comprises two blocks: (1) the lossless feature encoding block and (2) the feature representation learning block. The lossless feature encoding block is responsible for transforming features and reducing spatial resolution. To accomplish this, we utilize the Haar wavelet transform, a method that efficiently decreases the resolution of feature maps while retaining all information. The representation learning block consists of a standard convolution layer, batch normalization, and a ReLU activation layer. It is employed to extract discriminative features. Each block is explained in the following subsections.

3.1. Lossless feature encoding block

The lossless feature encoding block utilizes a Haar wavelet transform layer to effectively reduce the spatial resolution of feature maps while preserving all information. The Haar wavelet transform is a widely recognized, compact, dyadic, and orthonormal transform that finds extensive application in image coding, edge extraction, and binary logic design [31]. The wavelet basis function and scale function for the 1-stage, one-dimensional Haar transform can be defined as follows:


Fig. 4. Illustration of the wavelet transform method, which encodes the input image into four components with reduced spatial resolution compared to the input.

Fig. 5. Illustration of HWD and max pooling downsampling methods for an RGB image. Top row: downsampled feature with the max pooling operation. Bottom row: feature extraction with HWD. (Note that the displayed feature is obtained by taking the maximum over all downsampled feature maps along the channel direction.)


$$
\begin{cases}
\phi_1(x) = \dfrac{1}{\sqrt{2}}\,\phi_{1,0}(x) + \dfrac{1}{\sqrt{2}}\,\phi_{1,1}(x)\\[4pt]
\psi_1(x) = \dfrac{1}{\sqrt{2}}\,\phi_{1,0}(x) - \dfrac{1}{\sqrt{2}}\,\phi_{1,1}(x).
\end{cases}
\tag{1}
$$

Here, $\phi_{j,k}(x)$ is defined as:

$$
\phi_{j,k}(x) = \sqrt{2^{j}}\,\phi\left(2^{j}x - k\right),\quad k = 0, 1, \cdots, 2^{j}-1.
\tag{2}
$$

In this context, the parameters j and k denote the stage (or scale, in the image processing domain) and the order (or direction, for a 2D image) of the Haar basis function, respectively. Furthermore, $\phi_{0,0}(x)$ is defined as:

$$
\phi_{0,0}(x) = \phi_0(x) =
\begin{cases}
0, & x < 0\\
1, & 0 \leq x < 1\\
0, & x \geq 1.
\end{cases}
\tag{3}
$$

Therefore, the 1-stage Haar transform can be expressed using the 0-stage Haar basis functions:

$$
\begin{cases}
\phi_1(x) = \phi_0(2x) + \phi_0(2x - 1)\\
\psi_1(x) = \phi_0(2x) - \phi_0(2x - 1).
\end{cases}
\tag{4}
$$

This implies that a signal of length L can be divided into two parts of length L/2, which can be interpreted as the outputs of low-pass and high-pass decomposition filters, respectively. Applying the Haar wavelet transform to a two-dimensional signal, such as a grayscale image, yields four components, each with half the spatial resolution of the original signal. Figure 4 depicts the decomposition process of an image with resolution H × W using the Haar wavelet transform. Here, H0 and H1 represent the low-pass and high-pass decomposition filters, respectively, which extract the approximate and high-frequency information from an image. The symbol ↓2 denotes downsampling applied to the approximate and detail components. The Haar wavelet transform generates four components: the approximate (low-frequency) component (A), as well as the detail (high-frequency) components in the horizontal (H), vertical (V), and diagonal (D) directions. Each component has a size of H/2 × W/2. It should be noted that while the resolution of each component is reduced to H/2 × W/2, the channel number of the feature maps is quadrupled. In other words, the Haar wavelet transform can encode partial information from the spatial dimension into the channel dimension without any loss of information. Consequently, the subsequent layers of DCNNs can extract representative features from the transformed components, which have reduced spatial resolution.
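To make Eqs. (1)-(4) concrete, the following is a minimal sketch of the 1-stage, one-dimensional Haar transform on a toy signal. PyTorch is assumed, as in the rest of the paper; the helper name haar_1d is ours, not part of the authors' implementation.

```python
import math
import torch

def haar_1d(x):
    # Pair up neighbouring samples: a length-L signal becomes two
    # length-L/2 signals, matching the low-pass/high-pass split of Eq. (4).
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / math.sqrt(2.0)  # low-pass (scaling) coefficients
    detail = (even - odd) / math.sqrt(2.0)  # high-pass (wavelet) coefficients
    return approx, detail

x = torch.tensor([4., 6., 10., 12., 8., 6., 5., 5.])
a, d = haar_1d(x)
# The transform is lossless: even = (a + d)/sqrt(2) and odd = (a - d)/sqrt(2)
# recover x exactly, so nothing is discarded by the decomposition itself.
```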

3.2. The representation feature learning block

The feature representation learning block consists of a standard 1 × 1 convolution layer, a batch normalization layer, and a ReLU activation function.


In this block, standard convolution is employed to adjust the channel number of the feature maps. The block serves two main purposes: (1) to adjust the channel number of the feature maps to align with the subsequent layer, and (2) to filter redundant information as much as possible, enabling subsequent layers to learn representative features more effectively. This design allows the HWD module to be substituted for various downsampling methods, such as max pooling or strided convolution. In Fig. 5, we compare the outputs of the proposed HWD module and the max pooling operation; the output feature map from HWD retains more details than that from max pooling.

In summary, the proposed HWD module consists of two main blocks. The first block reduces the spatial resolution of feature maps using the Haar transform, while the second block filters redundant information through standard 1 × 1 convolution, batch normalization, and ReLU operations. Assume the input feature maps have size H × W × C and the downsampled feature maps have size H/2 × W/2 × C. Compared to a traditional 3 × 3 convolution with stride 2 and to average pooling as downsampling, the total parameters and computational overhead are presented in Table 1. Average pooling performs best in terms of parameters and floating point operations (FLOPs). In comparison, our HWD module requires fewer than half the parameters of strided convolution. Additionally, when the channel number C is greater than one, the computational overhead of strided convolution exceeds that of HWD. As a result, the HWD module provides a trade-off between average pooling and strided convolution in parameters and FLOPs.

Table 1
Comparison of parameters and FLOPs for three downsampling methods.

Module               Parameters  FLOPs
HWD                  4C²         2HWC² + 3.75HWC
Average Pooling      0           HWC
Strided Convolution  9C²         4.5HWC² − 0.25HWC
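The authors' implementation is linked in the abstract; as a self-contained illustration only, the following is a minimal PyTorch sketch of the two blocks described above, with the 1-level 2D Haar transform written out via even/odd pixel slices. The class and argument names are ours.

```python
import torch
import torch.nn as nn

class HWD(nn.Module):
    """Haar wavelet downsampling: lossless feature encoding (1-level 2D Haar
    transform) followed by a 1x1 conv + BN + ReLU representation learning block."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(4 * in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Four polyphase components of the (B, C, H, W) input.
        x00 = x[..., 0::2, 0::2]
        x01 = x[..., 0::2, 1::2]
        x10 = x[..., 1::2, 0::2]
        x11 = x[..., 1::2, 1::2]
        # Orthonormal 2D Haar combinations: one approximation, three details.
        a = (x00 + x01 + x10 + x11) / 2.0  # A: low-frequency component
        h = (x10 + x11 - x00 - x01) / 2.0  # H: horizontal detail
        v = (x01 + x11 - x00 - x10) / 2.0  # V: vertical detail
        d = (x00 + x11 - x01 - x10) / 2.0  # D: diagonal detail
        # Spatial information moves losslessly into the channel dimension
        # (C -> 4C, H x W -> H/2 x W/2) before the learned 1x1 conv filters it.
        return self.proj(torch.cat([a, h, v, d], dim=1))
```

Because the four components fully determine the input, the encoding step loses nothing; only the learned 1 × 1 convolution decides what to discard.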
3.3. The feature entropy index

To evaluate the uncertainty of the feature maps after downsampling, we introduce a novel metric called the feature entropy index (FEI), which is inspired by the concept of information entropy. The definition of feature entropy is given by:

$$
\mathrm{FEI} = -\sum_{m=1}^{M}\left[\left(\frac{1}{C}\sum_{c=1}^{C} F_c\right)\log_{10}(P)\right],
\tag{5}
$$

where F denotes the feature maps after downsampling, with C channels and spatial resolution H × W after bilinear interpolation, and P is the final prediction result of size H × W from the segmentation model. Each pixel value of P is the maximum prediction probability over the classes, and M is the total number of pixels of P, which equals H × W. The FEI enables us to quantify the uncertainty of feature maps generated by convolutional neural networks (CNNs). A smaller FEI value indicates reduced uncertainty in the feature maps, implying that they offer more information for uncertainty reduction. When the FEI reaches zero, the model can accurately infer the segmentation from the current feature maps.
the segmentation using the current feature maps. the other two datasets, we evaluate the segmentation performance using
2D predictions. The evaluation metrics, namely intersection-over-union
(IoU), Dice similarity coefficient (DSC), and Hausdorff distance (HD),
4. Experimental results
are used to evaluate the performance. The definitions of each metric are
as follows:
To evaluate the effectiveness of the HWD, extensive experiments are
conducted on three different image modalities, including natural image, IoU =
TP
, (7)
Computed Tomography (CT), and Micro-optical sectioning tomography TP + FP + FN
(MOST) images. In this section, we provide a brief introduction to the
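A minimal PyTorch sketch of Eq. (6) as written (function and variable names are ours; eps is a small constant we add to avoid division by zero):

```python
import torch
import torch.nn.functional as F

def ce_dice_loss(logits, target, eps=1e-6):
    # logits: (B, C, H, W) raw network outputs; target: (B, H, W) class indices
    ce = F.cross_entropy(logits, target)         # first term of Eq. (6)
    probs = torch.softmax(logits, dim=1)         # p_i^c
    onehot = F.one_hot(target, num_classes=probs.shape[1])
    onehot = onehot.permute(0, 3, 1, 2).float()  # g_i^c, shape (B, C, H, W)
    intersection = (probs * onehot).sum()
    denom = (probs ** 2).sum() + (onehot ** 2).sum()
    dice = 1.0 - 2.0 * intersection / (denom + eps)  # second term of Eq. (6)
    return ce + dice
```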
The evaluation metrics, namely intersection over union (IoU), Dice similarity coefficient (DSC), and Hausdorff distance (HD), are used to evaluate the performance. The definitions of each metric are as follows:

$$
\mathrm{IoU} = \frac{TP}{TP + FP + FN},
\tag{7}
$$

$$
\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN},
\tag{8}
$$

Table 2
Performance of seven SOTA segmentation architectures with three types of ResNet as backbones on the Camvid dataset.
Method Backbone mIoU↑ Sky BLD Pol Roa Pav Tre Sig Fen Car Ped Bik Time(ms)

PSPNet R-18 57.50 90.25 77.71 18.82 93.46 79.63 71.77 29.10 20.09 77.76 40.23 33.73 2.18
R-34 60.23 90.35 77.87 20.86 93.59 79.06 71.56 33.95 26.47 81.92 43.38 43.55 2.73
R-50 58.63 90.88 78.23 20.73 93.95 80.22 72.85 33.53 17.66 80.69 38.45 37.76 3.26
LinkNet R-18 60.30 91.71 78.01 24.72 94.35 81.35 70.55 31.25 23.63 80.26 43.21 44.29 3.88
R-34 61.78 91.73 78.85 25.91 94.22 81.31 71.05 35.44 23.11 81.76 45.56 50.61 5.35
R-50 63.04 92.23 80.22 28.15 94.22 81.12 72.90 36.02 31.14 82.03 46.08 49.37 6.39
FPN R-18 62.71 91.28 79.94 27.94 95.04 82.92 71.67 33.13 27.38 84.86 45.52 50.13 3.75
R-34 64.36 91.40 80.74 26.46 94.79 82.43 73.57 36.26 34.38 84.81 49.88 53.19 5.24
R-50 63.82 91.44 80.13 26.71 95.16 83.83 72.47 35.63 29.66 84.37 49.55 53.03 6.25
PAN R-18 61.47 90.64 79.72 25.33 94.66 81.95 70.88 34.08 27.58 82.73 44.96 43.66 4.41
R-34 62.94 90.73 80.70 27.19 94.84 82.41 72.47 37.31 28.40 85.79 46.24 46.31 5.82
R-50 60.66 89.86 78.66 25.70 94.38 81.15 71.93 30.85 22.52 82.75 46.70 42.72 6.98
DeepLabv3+ R-18 61.27 90.72 79.16 24.48 94.41 81.60 71.93 34.13 20.79 83.73 47.42 45.55 3.40
R-34 61.39 91.23 79.48 24.76 94.72 82.59 72.29 34.78 22.36 83.32 46.21 43.59 4.88
R-50 61.74 91.19 79.43 26.60 94.64 81.69 72.31 33.36 20.82 83.21 49.12 46.79 5.93
U-Net R-18 62.54 91.91 79.20 27.96 94.51 82.51 71.59 33.34 27.47 84.17 47.75 47.54 3.64
R-34 64.61 92.23 80.55 30.20 95.28 84.16 73.00 39.12 28.24 84.24 51.35 52.36 5.16
R-50 64.33 91.93 80.68 29.38 94.83 82.88 73.69 37.56 28.85 83.54 52.54 51.77 6.28
Unet++ R-18 63.50 92.18 80.17 28.34 94.66 82.61 72.04 37.21 30.33 84.40 49.85 46.66 5.38
R-34 64.89 92.11 80.94 30.99 95.06 83.90 73.55 38.84 26.96 85.82 53.59 52.00 6.91
R-50 64.86 92.24 80.93 29.81 94.96 82.75 74.38 40.61 28.74 83.18 53.25 52.63 8.24
Mean / 62.22 91.34 79.59 26.24 94.56 82.00 72.31 35.02 26.03 83.11 47.18 47.01 5.05

Table 3
Performance of seven SOTA segmentation architectures with three types of ResNet equipped with the HWD module as backbones on the Camvid dataset.
Method Backbone mIoU↑ Sky BLD Pol Roa Pav Tre Sig Fen Car Ped Bik Time(ms)

PSPNet_HWD R-18 58.83 90.38 77.38 20.47 94.18 80.12 72.46 29.42 28.50 79.49 39.67 35.02 2.86
R-34 61.15 90.53 78.99 21.93 94.57 81.58 72.77 36.32 27.10 81.52 43.40 43.97 3.44
R-50 60.08 90.91 78.32 22.43 94.34 80.55 73.61 32.14 25.41 81.34 42.13 39.74 3.90
LinkNet_HWD R-18 63.02 91.91 80.32 28.98 94.77 82.74 71.34 35.33 29.24 83.67 49.06 45.88 4.45
R-34 64.07 92.28 79.99 29.87 94.34 81.61 72.05 36.13 31.94 85.70 49.44 51.46 6.02
R-50 64.63 92.20 81.09 30.43 93.91 80.83 74.34 39.56 31.22 82.69 50.75 53.93 7.15
FPN_HWD R-18 64.66 91.57 80.40 28.90 95.14 82.87 73.36 35.95 34.07 84.53 50.00 54.50 4.44
R-34 65.15 91.49 81.33 28.47 95.20 83.66 73.77 38.11 33.47 86.51 49.98 54.70 6.04
R-50 64.74 91.46 81.44 28.39 94.80 82.54 73.59 38.86 31.22 85.80 50.56 53.51 7.04
PAN_HWD R-18 62.69 90.93 80.00 25.86 94.63 82.18 72.20 33.19 32.69 83.68 47.41 46.85 5.02
R-34 64.11 90.98 80.28 28.18 95.39 84.10 72.13 37.49 31.08 85.42 50.44 49.71 6.54
R-50 61.10 91.19 78.69 24.99 94.13 79.97 70.39 34.05 23.53 82.82 45.84 46.54 7.86
DeepLabv3+_HWD R-18 62.75 91.33 80.24 26.54 94.42 80.64 72.06 35.49 30.26 84.53 49.09 45.61 4.00
R-34 63.07 90.99 80.71 28.93 94.43 81.14 72.96 36.17 23.78 85.25 49.56 49.88 5.55
R-50 62.31 91.56 80.02 27.80 94.20 80.78 73.19 33.02 21.84 82.60 50.99 49.42 6.62
U-Net_HWD R-18 64.31 91.83 80.51 30.40 94.93 83.34 71.87 36.54 37.14 82.44 50.13 48.27 4.29
R-34 65.62 92.23 80.93 30.91 95.41 84.68 73.52 38.98 37.97 83.43 53.83 49.92 5.84
R-50 64.28 92.35 80.82 29.97 94.49 82.23 74.07 39.03 29.51 83.51 52.87 48.25 6.87
Unet++_HWD R-18 65.98 92.30 81.48 32.30 95.04 83.40 73.76 38.95 34.48 86.27 53.83 54.01 6.03
R-34 66.10 92.48 81.63 31.61 95.66 84.77 73.93 41.67 32.08 86.33 52.67 54.30 7.87
R-50 66.48 92.42 82.11 33.09 94.94 83.23 74.43 45.55 37.98 83.61 52.39 51.56 8.98
Mean / 63.58 91.59 80.32 28.12 94.71 82.24 72.94 36.76 30.69 83.86 49.24 48.91 5.75

$$
\mathrm{HD} = \max_{x \in X}\,\min_{y \in Y}\,\|x - y\|_2,
\tag{9}
$$

where TP, FP, and FN denote true positives, false positives, and false negatives, and x and y are points from the two finite point sets X and Y.
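For a binary mask pair, Eqs. (7)-(8) reduce to simple pixel counts; a sketch (the function name is ours):

```python
import numpy as np

def iou_dsc(pred, gt):
    # pred, gt: boolean arrays of the same shape
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, np.logical_not(gt)).sum()
    fn = np.logical_and(np.logical_not(pred), gt).sum()
    iou = tp / (tp + fp + fn)          # Eq. (7)
    dsc = 2 * tp / (2 * tp + fp + fn)  # Eq. (8)
    return iou, dsc
```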
Considering that ResNets have become the prevailing backbone in numerous segmentation architectures, we employ ResNet-18, ResNet-34, and ResNet-50 as the backbones for feature learning in this study. Additionally, in the Appendix, we present the segmentation performance achieved by adopting MobileNetv2 as the backbone, as well as other SOTA segmentation architectures based on convolution or transformer operations. A standard ResNet comprises a stem with two downsampling operations, followed by four stage blocks that consist of convolution layers and residual connections. Our primary objective is to maximize the retention of information at the stem of the ResNet. Therefore, we straightforwardly substitute the initial convolution layer (with a stride of 2) and the first max pooling layer in the stem of ResNet with HWD, while keeping all subsequent layers identical to the original ResNet architecture. It is worth noting that all models are trained from scratch.
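This substitution can be sketched with torchvision's ResNet, reusing the HWD class from the example in Section 3. Layer names follow torchvision's ResNet; whether to keep the stem's original bn1/relu is a design choice we leave as in torchvision, where they are redundant but harmless here since HWD already ends in BN + ReLU.

```python
from torchvision.models import resnet18

# Replace the two stem downsampling operations with HWD (as defined in the
# earlier sketch), keeping all subsequent layers unchanged; the model is
# then trained from scratch, as in the paper.
backbone = resnet18(weights=None)  # use pretrained=False on older torchvision
backbone.conv1 = HWD(3, 64)        # was: 7x7 convolution, stride 2
backbone.maxpool = HWD(64, 64)     # was: 3x3 max pooling, stride 2
```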
4.3. Experimental results on Camvid dataset

(1) Comparison with SOTA methods

We evaluate our method on seven state-of-the-art (SOTA) segmentation architectures: DeepLabv3+, PSPNet, LinkNet, FPN, PAN, U-Net, and Unet++. ResNet-18 (R-18), ResNet-34 (R-34), and ResNet-50 (R-50) are selected as the backbones for feature learning in each architecture. Further details about the training loss can be found in the Appendix.

Evaluation: Tables 2 and 3 compare the effectiveness of our plug-in, plug-off HWD module against the baselines in terms of mean IoU. The SOTA segmentation architectures equipped with the HWD module exhibit a performance improvement of 1∼2% for each respective architecture. Specifically, our method demonstrates its effectiveness on this dataset by improving the mIoU of all 21 models by 1.21%. In particular, it significantly improves the performance on small-scale objects, such as pedestrian (+2.06%), bicycle (+1.9%), fence (+4.66%), and sign symbol (+1.74%). This suggests that segmenting large-scale objects is relatively easy, even when downsampling operations may lead to information loss.


Fig. 6. Visualized segmentation results on Camvid test set. The first column contains the input image and its corresponding ground truth. The second, fourth, and
sixth columns display the outputs of DeepLabv3+, LinkNet, and U-Net, respectively, with ResNet-34 as the backbone. The third, fifth, and seventh columns show
zoomed-in images that correspond to the regions highlighted in red boxes. More specifically, the even rows represent the segmentation results obtained with HWD,
while the odd rows correspond to the results using the original downsampling in ResNet-34. (For interpretation of the references to colour in this figure legend, the
reader is referred to the web version of this article.)

Table 4
Results of SSIM (%), PSNR (dB), FEI, and FEI on the boundary region (FEI_B), evaluated on the Camvid test set.
Model Backbone SSIM↑ PSNR↑ FEI↓ FEI_B↓ Model SSIM↑ PSNR↑ FEI↓ FEI_B↓

DeepLabv3+ R-18 70.66 9.29 219.70 176.32 DeepLabv3+_HWD 78.42 11.07 94.63 73.22
R-34 70.92 8.95 240.39 191.90 76.23 10.26 90.10 70.00
R-50 71.87 8.74 266.47 212.83 75.38 10.87 121.74 92.44
FPN R-18 67.33 7.71 244.62 197.79 FPN_HWD 79.35 10.94 100.22 78.97
R-34 67.84 7.49 255.78 205.79 79.30 10.84 96.98 77.07
R-50 71.67 8.46 253.36 204.54 78.94 11.33 105.83 82.89
Linknet R-18 72.12 11.36 200.28 152.82 Linknet_HWD 78.71 10.84 92.37 70.15
R-34 72.26 10.63 227.73 172.98 79.80 11.03 98.56 74.90
R-50 71.51 9.83 219.57 170.36 80.33 11.91 107.27 82.81
PAN R-18 70.00 8.56 237.82 191.30 PAN_HWD 77.44 10.54 95.52 73.29
R-34 70.39 8.54 241.81 193.92 77.83 10.77 100.73 78.02
R-50 72.31 9.05 262.54 206.88 78.85 11.30 120.69 92.36
PSPNet R-18 68.36 7.86 377.00 284.23 PSPNet_HWD 77.93 10.74 153.56 111.35
R-34 70.63 8.55 325.80 250.17 78.15 10.84 133.43 101.98
R-50 73.00 8.71 319.11 247.27 79.08 11.50 134.00 100.24
Unet++ R-18 66.91 7.81 224.88 174.71 Unet++_HWD 76.72 10.20 85.42 65.27
R-34 66.86 7.51 229.25 181.18 75.22 9.94 89.85 68.27
R-50 69.39 7.69 234.76 184.64 74.84 10.13 101.74 77.89
U-Net R-18 67.11 8.07 238.13 182.41 U-Net_HWD 75.64 9.81 94.52 71.72
R-34 68.45 7.98 239.87 189.82 79.01 10.80 94.59 71.46
R-50 71.55 8.34 235.51 186.38 77.38 10.59 101.84 77.41
Mean / 70.05 8.63 252.11 198.01 Mean 77.83 10.77 105.41 80.56

Additionally, we recorded the prediction time for each model on the CPU. The mean inference time increased by 0.7 ms when using HWD in the stem of ResNet, which is acceptable considering the improved segmentation performance.

Qualitative results: Figure 6 visualizes the segmentation results of three architectures (DeepLabv3+, U-Net, and LinkNet) with ResNet-34 (R-34) as the backbone, both with and without the HWD module. The three architectures equipped with the proposed HWD downsampling module exhibit improved performance, which can be summarized in three aspects: 1) pure ResNet-34-based methods tend to under-segment objects such as sign symbol, bicycle, and fence, highlighting the ability of HWD to alleviate the under-segmentation issue; 2) comparing with and without HWD, our method demonstrates improved results for small-scale objects; 3) compared with the original ResNet-based models, the predictions using HWD exhibit smoother boundaries and shapes (e.g., tree, sidewalk, and building) than those using conventional downsampling methods.

(2) Evaluation of feature effectiveness after downsampling


Fig. 7. Comparison of the performance when using the low-frequency component A (denoted as HWD_LL) in HWD versus the high-frequency components (H, V and D, denoted as HWD_HH) in HWD. The "+" in the figure indicates the improvement in terms of mIoU.

Table 5
Mean performance of all 21 segmentation models with ResNet as the backbone, using various numbers of HWD modules as the downsampling operation on the Camvid dataset.
No. of HWD mIoU↑ Sky BLD Pol Roa Pav Tre Sig Fen Car Ped Bik

0 62.22 91.34 79.59 26.24 94.56 82.00 72.31 35.02 26.03 83.11 47.18 47.01
2 63.58 91.59 80.32 28.12 94.71 82.24 72.94 36.76 30.69 83.86 49.24 48.91
6 63.40 91.57 80.13 27.89 94.62 81.96 72.86 36.19 30.53 83.94 49.10 48.54
8 63.42 91.47 80.23 27.89 94.69 82.01 72.81 36.44 30.41 83.86 48.95 48.87
10 63.03 91 80 28 95 83 73 36 29 85 49 48

Table 6
Performance of seven SOTA models without HWD on the Synapse test set.
Methods Backbone mDSC↑ mHD↓ Aorta Gall KidL KidR Liver Pancreas Spleen Stomach

U-Net R-18 77.37 27.43 85.88 62.08 81.27 72.73 93.91 60.6 90.26 72.24
R-34 79.21 27.94 87.39 66.81 82.81 76.69 94.02 60.04 90.41 75.51
R-50 77.84 31.27 88.17 62.39 82.51 78.57 94.03 55.15 88.73 73.18
DeepLabv3+ R-18 75.92 24.23 85.08 61.48 83.32 74.16 93.44 54.46 85.39 70
R-34 77.37 22.08 85.56 62.69 81.99 75.45 93.17 59.53 87.31 73.27
R-50 73.87 31.94 83.78 57.23 76.91 71.42 92.91 50.22 87.03 71.43
FPN R-18 76.56 21.84 85.69 58.88 80.66 73.6 92.81 53.92 90.93 76.02
R-34 77.89 15.86 86.62 61.42 84.94 76.25 93.32 55.45 90.2 74.94
R-50 77.08 21.19 85.84 59.56 81.63 78.4 93.55 52.82 89.57 75.28
PSPNet R-18 74.19 20.65 82.05 58.58 80.86 73.69 92.16 46.22 86.44 73.52
R-34 75.89 16.43 82.39 62.65 82.38 77.24 93.22 49.54 85.66 74.05
R-50 76.3 21.3 82.74 65.13 81.58 74.82 92.71 51.1 88.05 74.3
PAN R-18 76.63 19.13 84.55 60.17 82.76 75.58 93.54 51.58 89.68 75.2
R-34 78.17 14.27 86.35 60.22 84.46 80.42 93.23 55.21 90.22 75.27
R-50 77.67 15.36 85.89 60.52 84.08 79.73 93.32 53.06 89.47 75.26
LinkNet R-18 77.56 32.26 86.66 60.49 79.26 75.37 94.05 60.33 89.13 75.23
R-34 79.09 21.46 88.14 62.49 86.3 82.3 94.59 58.94 87.74 72.23
R-50 78.44 28.14 87.61 64.5 80.83 74.51 94.34 58.67 89.76 77.31
Unet++ R-18 78.63 23.27 87.56 64.94 81.11 77.94 94.43 59.3 89.7 74.02
R-34 80.76 21.27 88.8 70.01 84.35 80.75 94.52 64.76 89.66 73.24
R-50 79.61 28.85 89.24 65.64 83.47 77.54 94.06 63 87.96 75.93
Mean / 77.43 23.15 86 62.28 82.26 76.53 93.59 55.9 88.73 74.16

In this study, we utilize the structural similarity (SSIM), the peak signal-to-noise ratio (PSNR), and the proposed feature entropy index (FEI) to assess the effectiveness of downsampling on the feature maps. Specifically, SSIM and PSNR evaluate the structural similarity and information fidelity between the input image and the feature maps. Additionally, the FEI and the boundary FEI (FEI_B) are computed to quantify the information uncertainty in the downscaled feature maps and prediction outputs. In this study, a morphological dilation with a kernel size of 5 is applied to extract object edges from the ground truth. The results in Table 4 demonstrate an improvement in SSIM (7.78%) and PSNR (2.14 dB) across all 21 models. Furthermore, introducing HWD into each architecture reduces information uncertainty: across all 21 models, HWD reduces feature uncertainty by 58.2% (FEI) and 46.8% (FEI_B) compared to the original downsampling methods. We also observe that both the standard downsampling and HWD approaches exhibit higher uncertainty near the object boundaries, indicating the difficulty of accurately segmenting these regions.
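One plausible implementation of the boundary extraction used for FEI_B, assuming OpenCV and a 5 × 5 kernel; the exact procedure beyond "dilation with kernel size 5" is our reading, and the function name is ours.

```python
import cv2
import numpy as np

def boundary_mask(label, ksize=5):
    # label: (H, W) uint8 class-index map (the ground truth)
    kernel = np.ones((ksize, ksize), np.uint8)
    dilated = cv2.dilate(label, kernel)  # grayscale dilation = local maximum
    return dilated != label              # True on pixels next to a class change
```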


Table 7
Performance of seven SOTA models with HWD on the Synapse test set.
Methods Backbone mDSC↑ mHD↓ Aorta Gall KidL KidR Liver Pancreas Spleen Stomach

U-Net_HWD R-18 79.52 21.22 87.8 63.44 84.97 79.62 93.91 60.58 89.76 76.12
R-34 80.61 16.23 87.06 65.76 85.92 83.09 94.34 60.75 90.6 77.37
R-50 78.6 26.95 87.32 64.14 83.02 79.99 94.06 58.69 88.52 73.08
DeepLabv3+_HWD R-18 77.11 23.02 86.19 63.63 81.9 76.04 93.36 51.87 88.83 75.1
R-34 77.82 22.09 87.3 61.56 82.61 78.69 93.69 53.39 89.74 75.57
R-50 75.69 26.76 85.32 61.9 80.33 70.32 93.48 51.93 89.19 73.09
FPN_HWD R-18 79.22 21.41 86.42 68.1 82.84 77.2 92.97 60.49 89.83 75.95
R-34 79.38 16.6 86.16 69.67 82.83 76.92 92.89 60.38 89.59 76.63
R-50 79.62 18.52 86.73 65.95 83.21 81.28 93.66 57.05 90.69 78.37
PSPNet_HWD R-18 76.15 19.55 83.05 61.17 82.73 77.92 93.47 49.62 84.03 77.21
R-34 76.94 18.3 82.51 65.29 82.83 77.46 93.49 51.25 87.13 75.52
R-50 76.92 17.39 84.19 60.63 83.22 78.85 93.28 51.29 87.46 76.42
PAN_HWD R-18 78.17 15.18 85.26 63.41 85.14 79.88 93.17 53.05 89.95 75.48
R-34 77.78 20.17 85.88 64.41 81.77 75.77 93.01 55.02 90.15 76.22
R-50 77.14 22.23 86.56 64.42 79.24 76.41 93 49.76 90.09 77.66
LinkNet_HWD R-18 79.15 22.12 87.63 68.03 81.31 76.66 94.67 58.33 87.84 78.75
R-34 79.59 24.1 87.72 62.23 83.19 79.97 94.29 63.23 89.07 76.97
R-50 79.29 25.17 87.44 64.53 84.55 80.22 93.46 55.43 91.53 77.13
Unet++_HWD R-18 80.24 22.73 87.52 72.52 83.75 78.95 94.11 61.39 89.57 74.14
R-34 82.14 18.58 88.56 73.8 86.55 83.11 94.47 64.91 90.8 74.93
R-50 80.85 24.07 88.35 67.31 84.02 79.61 94.04 65.05 91.52 76.89
Mean / 78.66 21.07 86.43 65.33 83.14 78.47 93.66 56.83 89.33 76.12

Table 8
Comparison of SOTA results with and without HWD on the MOST test set.
Model Backbone mDSC Vessel Soma Model mDSC Vessel Soma

U-Net R-18 84.39 82.47 86.30 U-Net_HWD 87.40 81.54 93.25


R-34 84.01 79.72 88.30 87.97 82.03 93.91
R-50 88.30 83.27 93.33 88.38 83.60 93.15
DeepLabv3+ R-18 82.91 76.50 89.32 DeepLabv3+_HWD 86.32 80.72 91.92
R-34 83.59 78.23 88.94 85.24 79.17 91.30
R-50 84.00 77.57 90.43 86.29 80.46 92.11
FPN R-18 86.86 82.39 91.32 FPN_HWD 86.75 82.60 90.89
R-34 86.11 80.67 91.54 86.58 81.76 91.40
R-50 86.77 82.04 91.49 86.82 81.94 91.69
PSPNet R-18 83.86 78.96 88.76 PSPNet_HWD 83.86 79.03 88.68
R-34 82.61 78.47 86.75 83.32 79.10 87.54
R-50 83.46 78.35 88.56 82.12 75.88 88.35
PAN R-18 87.96 83.75 92.17 PAN_HWD 88.53 83.74 93.32
R-34 85.57 80.33 90.80 87.03 82.13 91.93
R-50 86.37 80.29 92.45 88.65 85.00 92.30
LinkNet R-18 85.67 77.57 93.76 LinkNet_HWD 86.48 79.02 93.94
R-34 86.09 79.06 93.12 87.73 81.74 93.71
R-50 85.36 77.23 93.49 89.10 84.32 93.87
Unet++ R-18 88.59 83.46 93.71 Unet++_HWD 89.17 84.87 93.47
R-34 86.99 80.20 93.78 87.75 81.70 93.79
R-50 88.39 82.96 93.81 88.85 83.77 93.92
Mean / 85.61 80.17 91.05 Mean 86.87 81.62 92.12

(3) Ablation study

We observed that state-of-the-art (SOTA) architectures equipped with HWD yield superior segmentation results compared to conventional methods such as max pooling or strided convolution. Based on the decomposition ability of HWD, which separates a feature map into a low-frequency part (A) and three high-frequency components (H, V, and D) (Fig. 4), we conduct ablation studies to investigate the relative importance of each part in the semantic segmentation task.

Low frequency vs. high frequency: In Fig. 7, we assess which component after the Haar wavelet transform is dominant for achieving optimal segmentation performance. The low-frequency component (A) is found to be crucial for enhancing segmentation performance: it improves the mean Dice Similarity Coefficient (DSC) by 0.91% compared to using only high-frequency information (H, V, and D), despite the high-frequency components having three times as many feature maps as their low-frequency counterpart.

The number of HWD modules in ResNet backbones: As demonstrated by the aforementioned experiments, replacing the downsampling module in the stem of ResNet with HWD significantly improves segmentation performance by feeding additional information into the models. The objective of this ablation study is to assess the effect of replacing the remaining downsampling layers in ResNets with HWD. In particular, the ResNet architecture comprises a stem and four stage blocks, consisting of convolution layers, batch normalization, and ReLU operations, along with skip connections. The stem involves two downsampling operations, namely strided convolution and max pooling. Stages 2, 3, and 4 each include two strided convolution operations, one in the skip connection and the other in the residual block. In this study, we replace the downsampling operations in various blocks of ResNet with HWD. Compared with the original backbone, it is evident from Table 5 that utilizing HWD as the downsampling operation leads to improved segmentation performance. Furthermore, we observe that it is more effective to use HWD solely in the stem of ResNet. Therefore, we implement two HWD modules to replace the downsampling operations in the stem of ResNet, achieving a favorable balance between efficiency and accuracy in this study.


4.4. Experimental results on Synapse dataset

We perform experiments using the same seven SOTA models as those employed on the Camvid dataset. The results demonstrate that integrating our HWD module into the architectures improves the DSC by 1.23% and reduces the HD by 2.08 mm across the 21 models (Tables 6 and 7). Specifically, our HWD module enhances the mean DSC by 1.57% when R-18 and R-34 are used as backbones, but only by 0.93% with R-50, indicating that our proposed method is more effective for architectures with fewer parameters.

4.5. Experimental results on MOST dataset

The segmentation architectures integrated with our proposed HWD module demonstrate a 1.26% improvement in terms of DSC (Table 8). Importantly, the HWD module significantly enhances segmentation performance when ResNet-18 and ResNet-34 are used as backbones for feature extraction. For instance, the DSC values for U-Net_R18 and U-Net_R34 are 84.39% and 84.01%, respectively; upon integration of our HWD module, the DSC improves by 3.01% and 3.96%, respectively. This demonstrates that shallow CNNs have a higher demand for information compared to relatively deep networks.

Fig. B.1. Comparison of training loss with/without HWD in DeepLabv3+ architecture with ResNet34 as backbone.

Table C.1
Mean performance of 21 segmentation models with ResNet as the backbone, using max pooling, average pooling and HWD as the downsampling operation on the Camvid dataset.
Method mIoU↑ Sky BLD Pol Roa Pav Tre Sig Fen Car Ped Bik

MaxPooling in HWD 60.77 90.32 78.92 23.33 94.31 81.05 69.66 33.31 25.31 82.15 45.91 44.19
AveragePooling in HWD 61.84 91.40 79.12 26.13 94.47 81.42 71.62 34.45 24.45 82.91 47.82 46.46
HWD 63.58 91.59 80.32 28.12 94.71 82.24 72.94 36.76 30.69 83.86 49.24 48.91

Table D.1
Performance of seven SOTA segmentation architectures with MobileNetv2 as the backbone on the Camvid dataset.
Method mIoU↑ Sky BLD Pol Roa Pav Tre Sig Fen Car Ped Bik

PSP 53.04 89.70 73.75 16.00 90.90 73.39 67.94 29.50 16.48 73.17 30.29 22.39
LinkNet 53.70 90.92 74.47 2.87 93.07 77.84 66.90 18.89 19.24 79.90 29.31 37.34
FPN 57.76 89.86 75.85 23.28 94.09 80.10 68.19 30.09 17.35 78.97 40.25 37.37
PAN 55.54 90.53 75.24 19.46 93.29 77.13 67.30 24.47 17.24 78.92 35.76 31.61
DeepLabv3+ 56.07 91.04 75.26 20.16 93.42 78.63 68.15 26.04 18.18 74.66 34.89 36.37
U-Net 57.43 90.13 75.46 22.29 93.46 79.85 68.48 27.89 17.32 74.13 43.58 39.17
Unet++ 59.00 90.49 76.93 25.09 93.46 79.49 69.60 30.38 21.30 79.30 42.07 40.89
Mean 56.08 90.38 75.28 18.45 93.10 78.06 68.08 26.75 18.16 77.01 36.59 35.02


Table D.2
Performance of seven SOTA segmentation architectures with MobileNetv2 as the backbone equipped with the HWD module on the Camvid dataset.
Method mIoU↑ Sky BLD Pol Roa Pav Tre Sig Fen Car Ped Bik

PSP 53.81 89.60 74.25 16.21 91.41 73.69 69.85 27.23 24.90 72.71 31.65 20.38
LinkNet 55.72 89.97 75.15 8.26 93.13 77.62 68.06 22.09 19.61 82.93 34.81 41.34
FPN 58.38 91.07 77.54 22.59 93.16 77.58 69.51 28.85 19.76 80.03 40.53 41.57
PAN 55.76 90.60 73.94 19.50 93.59 78.46 66.94 26.44 15.09 76.54 32.93 39.28
DeepLabv3+ 56.44 90.89 75.45 19.76 93.09 77.92 66.05 29.87 14.85 77.30 37.39 38.30
U-Net 58.89 91.71 77.01 24.03 93.48 79.94 68.10 30.59 24.83 76.16 41.59 40.38
Unet++ 58.30 90.81 76.52 26.18 93.85 80.95 68.24 29.38 17.59 79.59 41.26 36.93
Mean 56.76 90.66 75.69 19.50 93.10 78.02 68.11 27.78 19.52 77.89 37.17 36.88

Table E.1
Performance of additional SOTA models with and without HWD on the Synapse test set.
Method mDSC↑ mHD↓ Aorta Gall KidL KidR Liver Pancreas Spleen Stomach

HRNet 77.10 21.36 87.61 62.69 79.71 75.65 93.49 54.20 89.79 73.65
HRNet_HWD 78.86 18.51 88.01 65.28 83.05 78.81 93.76 55.18 89.47 77.32
ConvNext 57.39 81.03 70.27 37.65 60.01 63.81 85.94 18.95 70.09 52.37
ConvNext_HWD 65.11 27.63 70.22 44.74 73.30 72.07 89.97 30.53 80.70 59.37
UNext 70.83 42.45 82.10 60.04 72.56 64.31 91.64 43.39 82.83 69.79
UNext_HWD 72.94 32.45 84.23 55.90 79.95 68.83 92.81 45.43 86.00 70.33
TransUNet 75.21 41.82 87.59 55.03 82.00 73.39 92.52 55.13 86.01 70.04
TransUNet_HWD 75.64 45.36 87.78 62.22 82.07 74.23 91.64 54.75 83.59 68.86
Swin-Unet 59.87 55.58 63.29 51.95 57.98 60.29 89.16 30.37 71.39 54.54
Swin-Unet_HWD 69.77 28.26 73.00 59.26 77.27 71.84 90.58 39.60 83.11 63.49
SegFormer 74.82 22.79 81.54 63.62 80.90 78.28 92.05 49.80 83.24 69.16
SegFormer_HWD 77.33 24.64 84.44 59.33 86.88 81.42 91.49 52.20 85.81 77.09
Mean Improved↑ 4.07 14.69 2.55 2.63 8.23 5.24 0.91 4.31 4.22 4.49

deep networks.

5. Conclusion

In conclusion, we present a general downsampling module (HWD) for semantic segmentation in this paper. The goal of the HWD module is to retain as much essential information as possible during downsampling. Extensive experiments and ablation studies conducted on three different image datasets with varying modalities demonstrate the effectiveness of the proposed HWD module and the FEI metric. This work has implications for various CNN-based computer vision tasks, including instance segmentation, object detection, and pose estimation.

Furthermore, to assess the quality of downsampled feature maps, we introduce a new metric called the Feature Entropy Index (FEI). The FEI metric effectively reflects the degree of information uncertainty by considering the downsampled feature maps and the prediction results. Experimental results further indicate that the HWD module provides more information for object segmentation than conventional downsampling methods.
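To make the entropy intuition concrete, a loosely related illustration follows. This is not the exact FEI definition, which is given earlier in the paper; the channel-wise normalization and the mean aggregation below are simplifying assumptions made for the sketch:

```python
import torch

def feature_entropy(fmap: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Shannon entropy of a downsampled feature map (B, C, H, W).

    Each channel's spatial activations are normalized into a probability
    distribution; higher entropy indicates more residual uncertainty.
    Illustrative only -- not the exact FEI formulation.
    """
    flat = fmap.flatten(2).abs()                    # (B, C, H*W), non-negative
    p = flat / (flat.sum(dim=-1, keepdim=True) + eps)
    entropy = -(p * (p + eps).log()).sum(dim=-1)    # (B, C) per-channel entropy
    return entropy.mean()                           # aggregate over batch and channels
```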
The proposed HWD downsampling module and the FEI assessment metric have two main advantages. Firstly, the HWD module can seamlessly integrate into existing segmentation architectures due to its generality. It can directly replace existing downsampling methods, such as max pooling, average pooling, or strided convolution, without introducing additional complexity, and it significantly improves segmentation performance. Secondly, the FEI can be applied to assess the quality of feature maps, serving as a quantitative indicator of the amount of essential information preserved after downsampling in segmentation architectures.
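To illustrate this drop-in property, the following is a minimal PyTorch sketch of an HWD-style block: a lossless Haar encoding step that halves the spatial resolution while quadrupling the channels, followed by a convolutional representation-learning step. The kernel size, normalization, and activation choices below are illustrative assumptions rather than a verbatim transcription of our implementation:

```python
import torch
import torch.nn as nn

class HaarDownsample(nn.Module):
    """Orthonormal 2D Haar transform: (B, C, H, W) -> (B, 4C, H/2, W/2).

    Lossless for even H and W: the input is exactly recoverable
    from the four subbands.
    """
    def forward(self, x):
        a = x[..., 0::2, 0::2]  # top-left of each 2x2 block
        b = x[..., 0::2, 1::2]  # top-right
        c = x[..., 1::2, 0::2]  # bottom-left
        d = x[..., 1::2, 1::2]  # bottom-right
        ll = (a + b + c + d) / 2  # low-frequency approximation
        lh = (a + b - c - d) / 2  # detail along rows
        hl = (a - b + c - d) / 2  # detail along columns
        hh = (a - b - c + d) / 2  # diagonal detail
        return torch.cat([ll, lh, hl, hh], dim=1)

class HWD(nn.Module):
    """HWD-style block: lossless Haar encoding followed by a conv-BN-ReLU
    representation feature learning step. Usable as a drop-in replacement
    for nn.MaxPool2d(2) or a stride-2 convolution.
    """
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.encode = HaarDownsample()
        self.learn = nn.Sequential(
            nn.Conv2d(4 * in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.learn(self.encode(x))
```

For example, an encoder stage written as nn.MaxPool2d(2) could be swapped for HWD(c, c) with matching channel counts, leaving the rest of the network untouched.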
The primary limitations of the proposed HWD module lie in its efficiency and locality. Therefore, we intend to continue our research in two directions. First, it is crucial to extract representative features from input images for semantic segmentation; however, source images contain a significant amount of redundant information, which may hinder the extraction of representative features. We aim to incorporate prior knowledge, such as boundaries and textures, into the HWD module to efficiently filter out irrelevant information during the downsampling process. Second, the proposed HWD offers great potential for improving segmentation performance on the current SOTA convolutional neural network benchmarks, like U-Net, DeepLabv3+ and PSPNet. However, the HWD module lacks the ability to capture global context and establish long-range spatial relations due to the localized nature of convolution operations. We are considering integrating both local and global features from convolution and Transformer operations into the HWD module in our future work.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work is supported by the Guangdong Provincial Key Laboratory of Human Digital Twin (No. 2022B1212010004), the Open-Fund of WNLO (No. 2018WNLOKF027), the Hubei Key Laboratory of Intelligent Robot in Wuhan Institute of Technology (No. HBIRL202202 and No. HBIR202206), and the Chongqing Science and Technology Bureau (No. 2022TIAD-KPX0190). We thank the Optical Bioimaging Core Facility of WNLO-HUST for the support in MOST data acquisition.

Appendix A. Mathematical notations table

Appendix B. Training Loss with/without HWD


This section presents the training loss with and without HWD for the ResNet backbones. As shown in Fig. B.1, the training loss decreases slightly faster when the HWD module is used in the Unet++ and DeepLabv3+ architectures than with the original ResNet backbone. Specifically, the training loss is consistently lower when employing the HWD module than when using conventional downsampling operations such as strided convolution and max pooling.

Appendix C. Replacing the lossless feature encoding block of HWD with max pooling or average pooling

We also conduct experiments in which the lossless feature encoding block of HWD (the Haar wavelet transform) is replaced with a max pooling or average pooling layer, in order to validate that the performance improvement of HWD does not stem solely from the added representation feature learning block. Table C.1 reports the mean IoU of 21 segmentation models in total, comprising seven semantic segmentation architectures that each employ three different types of ResNet as their backbones. It can be observed that the Haar wavelet transform plays a more significant role than max pooling or average pooling, yielding improvements of 2.81% and 1.74% in mean IoU, respectively, across all 21 segmentation models.
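As a sketch of these ablation variants (reusing HaarDownsample and the block structure from the HWD sketch in the Conclusion; the factory function and its names are hypothetical), note that pooling keeps the channel count at C rather than 4C, so the subsequent convolution must be sized accordingly:

```python
import torch.nn as nn

def make_downsample_block(in_ch: int, out_ch: int, mode: str = "haar") -> nn.Module:
    """Ablation variants for the HWD encoding step: 'haar' is lossless
    (C -> 4C), while 'max' and 'avg' pooling are lossy (C -> C). All
    variants share the same representation feature learning block.
    """
    if mode == "haar":
        encode, enc_ch = HaarDownsample(), 4 * in_ch  # from the earlier sketch
    elif mode == "max":
        encode, enc_ch = nn.MaxPool2d(2), in_ch
    elif mode == "avg":
        encode, enc_ch = nn.AvgPool2d(2), in_ch
    else:
        raise ValueError(f"unknown mode: {mode}")
    learn = nn.Sequential(
        nn.Conv2d(enc_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
    return nn.Sequential(encode, learn)
```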
Appendix D. MobileNetv2 as backbone under seven segmentation architectures

This section contains detailed results on the CamVid dataset with MobileNetv2 [65] as the backbone under seven semantic segmentation frameworks. MobileNetv2 was originally developed for mobile devices, striking a balance between accuracy, latency, and parameter count for classification tasks. There are five downsampling operations in MobileNetv2: the first (a strided convolution) occurs in the stem, and the others occur in its bottleneck blocks. Here, we use HWD to replace only the first strided convolution, in consideration of both accuracy and efficiency; a sketch of this replacement is given after the results discussion.

Tables D.1 and D.2 show the segmentation results of the seven segmentation architectures with and without HWD in MobileNetv2 on the CamVid test set in terms of mean IoU. We find that MobileNetv2 equipped with HWD achieves a performance improvement of 0.8% mIoU across the seven segmentation architectures, demonstrating that our method can be adapted to other backbones as well. In particular, our method improves the performance on small-scale objects by a large margin, such as pole (1.05%), sign symbol (1.03%), fence (1.36%) and bicycle (1.86%).
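A minimal sketch of this stem replacement follows, assuming torchvision's MobileNetV2 layout (where features[0] is the 3x3 stride-2 stem convolution block; the weights= keyword is the torchvision >= 0.13 API, older releases use pretrained=) and reusing the HWD class from the sketch in the Conclusion:

```python
import torch
import torchvision

# Standard MobileNetV2; features[0] is the stem: a 3x3 stride-2
# convolution block mapping 3 -> 32 channels at half resolution.
backbone = torchvision.models.mobilenet_v2(weights=None)

# Swap only the first strided convolution for an HWD block with the
# same channel counts, so the following inverted residual blocks and
# the remaining four downsampling stages are left untouched.
backbone.features[0] = HWD(3, 32)

x = torch.randn(1, 3, 224, 224)
print(backbone.features(x).shape)  # expected: torch.Size([1, 1280, 7, 7])
```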
Appendix E. Performance of SOTA architectures based on convolution and transformer on the Synapse dataset

In this section, we test six SOTA segmentation methods on the Synapse dataset: HRNet [66], ConvNeXt [67], UNext [68], TransUNet [60], Swin-Unet [69], and SegFormer [70]. Note that HRNet and ConvNeXt are pure convolutional neural networks, UNext and TransUNet are hybrids of convolution and transformer operations, and Swin-Unet and SegFormer are based on transformers. The comparative results on the Synapse dataset are shown in Table E.1. Here, we simply replace the downsampling operation in each architecture, such as max pooling or strided convolution, with our proposed HWD to reduce the resolution of the feature maps.

References

[1] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[2] Z. Lv, L. Qiao, S. Yang, J. Li, H. Lv, F. Piccialli, Memory-augmented neural networks based dynamic complex image segmentation in digital twins for self-driving vehicle, Pattern Recognit. 132 (2022) 108956.
[3] J. Wu, H. Xu, S. Zhang, X. Li, J. Chen, J. Zheng, Y. Gao, Y. Tian, Y. Liang, R. Ji, Joint segmentation and detection of COVID-19 via a sequential region generation network, Pattern Recognit. 118 (2021) 108006.
[4] A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. 25 (2) (2012).
[5] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[6] F. Cheng, C. Chen, Y. Wang, H. Shi, Y. Cao, D. Tu, C. Zhang, Y. Xu, Learning directional feature maps for cardiac MRI segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2020, pp. 108–117.
[7] T. Cheng, X. Wang, L. Huang, W. Liu, Boundary-preserving mask R-CNN, arXiv e-prints (2020).
[8] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[9] J. Zhang, C. Li, S. Kosov, M. Grzegorzek, K. Shirahama, T. Jiang, C. Sun, Z. Li, H. Li, LCU-Net: a novel low-cost U-Net for environmental microorganism image segmentation, Pattern Recognit. 115 (2021) 107885.
[10] Q. Zhou, X. Wu, S. Zhang, B. Kang, Z. Ge, L.J. Latecki, Contextual ensemble network for semantic segmentation, Pattern Recognit. 122 (2022) 108290.
[11] A. Chaurasia, E. Culurciello, LinkNet: exploiting encoder representations for efficient semantic segmentation. 2017 IEEE Visual Communications and Image Processing (VCIP), IEEE, 2017, pp. 1–4.
[12] G. Lin, A. Milan, C. Shen, I. Reid, RefineNet: multi-path refinement networks for high-resolution semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1925–1934.
[13] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation. Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
[14] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.
[15] N. Mu, H. Wang, Y. Zhang, J. Jiang, J. Tang, Progressive global perception and local polishing network for lung infection segmentation of COVID-19 CT images, Pattern Recognit. 120 (2021) 108168.
[16] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, N. Sang, BiSeNet: bilateral segmentation network for real-time semantic segmentation. Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 325–341.
[17] H. Zhao, X. Qi, X. Shen, J. Shi, J. Jia, ICNet for real-time semantic segmentation on high-resolution images. Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 405–420.
[18] G. Xu, H. Cao, J.K. Udupa, Y. Tong, D.A. Torigian, DiSegNet: a deep dilated convolutional encoder-decoder architecture for lymph node segmentation on PET/CT images, Comput. Med. Imaging Graph. 88 (2021) 101851.
[19] S. Hu, F. Bonardi, S. Bouchafa, D. Sidibé, Multi-modal unsupervised domain adaptation for semantic image segmentation, Pattern Recognit. (2023) 109299.
[20] H. Zhou, L. Qi, H. Huang, X. Yang, Z. Wan, X. Wen, CANet: co-attention network for RGB-D semantic segmentation, Pattern Recognit. 124 (2022) 108468.
[21] W. Wu, T. Chu, Q. Liu, Complementarity-aware cross-modal feature fusion network for RGB-T semantic segmentation, Pattern Recognit. 131 (2022) 108881.
[22] R. Karthik, R. Menaka, M. Hariharan, D. Won, Contour-enhanced attention CNN for CT-based COVID-19 segmentation, Pattern Recognit. 125 (2022) 108538.
[23] F.Z. Xing, E. Cambria, W.-B. Huang, Y. Xu, Weakly supervised semantic segmentation with superpixel embedding. 2016 IEEE International Conference on Image Processing (ICIP), IEEE, 2016, pp. 1269–1273.
[24] J. Shen, N. Robertson, BBAS: towards large scale effective ensemble adversarial attacks against deep neural network learning, Inf. Sci. (Ny) 569 (2021) 469–478.
[25] L. Wang, X. Qian, Y. Zhang, J. Shen, X. Cao, Enhancing sketch-based image retrieval by CNN semantic re-ranking, IEEE Trans. Cybern. 50 (7) (2019) 3330–3342.
[26] V.M. Vargas, P.A. Gutiérrez, C. Hervás-Martínez, Unimodal regularisation based on beta distribution for deep ordinal regression, Pattern Recognit. 122 (2022) 108310.
[27] O. Oktay, J. Schlemper, L.L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N.Y. Hammerla, B. Kainz, et al., Attention U-Net: learning where to look for the pancreas, arXiv preprint arXiv:1804.03999 (2018).
[28] C. Li, B. Wang, S. Zhang, Y. Liu, R. Song, J. Cheng, X. Chen, Emotion recognition from EEG based on multi-task learning with capsule network and attention mechanism, Comput. Biol. Med. 143 (2022) 105303.
[29] Y. Yuan, J. Xie, X. Chen, J. Wang, SegFix: model-agnostic boundary refinement for segmentation. European Conference on Computer Vision, Springer, 2020, pp. 489–506.
[30] R.N. Bracewell, The Fourier Transform and its Applications, Vol. 31999, McGraw-Hill, New York, 1986.
[31] R.S. Stanković, B.J. Falkowski, The Haar wavelet transform: its status and achievements, Comput. Electr. Eng. 29 (1) (2003) 25–44.
[32] C.H. Ma, Y. Li, Y. Wang, Image analysis based on the Haar wavelet transform. Applied Mechanics and Materials, Vol. 391, Trans Tech Publ, 2013, pp. 564–567.
[33] A. Belov, Comparison of the efficiencies of image compression algorithms based on separable and nonseparable two-dimensional Haar wavelet bases, Pattern Recognit. Image Anal. 18 (4) (2008) 602–605.
[34] F. Luisier, C. Vonesch, T. Blu, M. Unser, Fast Haar-wavelet denoising of multidimensional fluorescence microscopy data. 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, IEEE, 2009, pp. 310–313.
[35] R. Duits, M. Felsberg, G. Granlund, B.t.H. Romeny, Image analysis and reconstruction using a wavelet transform constructed from a reducible representation of the Euclidean motion group, Int. J. Comput. Vis. 72 (1) (2007) 79–102.


[36] J. Liang, Z. Shi, D. Li, M.J. Wierman, Information entropy, rough entropy and knowledge granulation in incomplete information systems, Int. J. Gen. Syst. 35 (6) (2006) 641–654.
[37] A. Namdari, Z. Li, A review of entropy measures for uncertainty quantification of stochastic processes, Adv. Mech. Eng. 11 (6) (2019) 1687814019857350.
[38] A. Kendall, Y. Gal, What uncertainties do we need in Bayesian deep learning for computer vision? Adv. Neural Inf. Process. Syst. 30 (2017).
[39] A.K. Balan, L. Boyles, M. Welling, J. Kim, H. Park, Statistical optimization of non-negative matrix factorization. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2011, pp. 128–136.
[40] S. Fujieda, K. Takayama, T. Hachisuka, Wavelet convolutional neural networks, arXiv preprint arXiv:1805.08620 (2018).
[41] P. Liu, H. Zhang, K. Zhang, L. Lin, W. Zuo, Multi-level wavelet-CNN for image restoration. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 773–782.
[42] T. Wu, W. Li, S. Jia, Y. Dong, T. Zeng, Deep multi-level wavelet-CNN denoiser prior for restoring blurred image with Cauchy noise, IEEE Signal Process. Lett. 27 (2020) 1635–1639.
[43] H. Huang, R. He, Z. Sun, T. Tan, Wavelet-SRNet: a wavelet-based CNN for multi-scale face super resolution. Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1689–1697.
[44] H. Ma, D. Liu, R. Xiong, F. Wu, iWave: CNN-based wavelet-like transform for image compression, IEEE Trans. Multimed. 22 (7) (2019) 1667–1679.
[45] H. Ling, J. Wu, J. Huang, J. Chen, P. Li, Attention-based convolutional neural network for deep face recognition, Multimed. Tools Appl. 79 (2020) 5595–5616.
[46] M.D. Zeiler, R. Fergus, Stochastic pooling for regularization of deep convolutional neural networks (2013).
[47] A. Stergiou, R. Poppe, G. Kalliatakis, Refining activation downsampling with SoftPool. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10357–10366.
[48] T. Williams, R. Li, Wavelet pooling for convolutional neural networks. International Conference on Learning Representations, 2018.
[49] D.-H. Jang, S. Chu, J. Kim, B. Han, Pooling revisited: your receptive field is suboptimal. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 549–558.
[50] D. Marin, Z. He, P. Vajda, P. Chatterjee, S. Tsai, F. Yang, Y. Boykov, Efficient segmentation: learning downsampling near semantic boundaries. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2131–2141.
[51] L.-H. Chen, C.G. Bampis, Z. Li, C. Chen, A.C. Bovik, Convolutional block design for learned fractional downsampling. 2022 56th Asilomar Conference on Signals, Systems, and Computers, IEEE, 2022, pp. 640–644.
[52] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612.
[53] Z. Wang, E.P. Simoncelli, A.C. Bovik, Multiscale structural similarity for image quality assessment. The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, IEEE, 2003, pp. 1398–1402.
[54] Z. Wang, Q. Li, Information content weighting for perceptual image quality assessment, IEEE Trans. Image Process. 20 (5) (2010) 1185–1198.
[55] A. Liu, W. Lin, M. Narwaria, Image quality assessment based on gradient similarity, IEEE Trans. Image Process. 21 (4) (2011) 1500–1512.
[56] L. Zhang, L. Zhang, X. Mou, D. Zhang, FSIM: a feature similarity index for image quality assessment, IEEE Trans. Image Process. 20 (8) (2011) 2378–2386.
[57] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training GANs, Adv. Neural Inf. Process. Syst. 29 (2016).
[58] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, Adv. Neural Inf. Process. Syst. 30 (2017).
[59] V. Badrinarayanan, A. Kendall, R. Cipolla, SegNet: a deep convolutional encoder-decoder architecture for image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 39 (12) (2017) 2481–2495.
[60] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A.L. Yuille, Y. Zhou, TransUNet: transformers make strong encoders for medical image segmentation, arXiv preprint arXiv:2102.04306 (2021).
[61] A. Li, H. Gong, B. Zhang, Q. Wang, C. Yan, J. Wu, Q. Liu, S. Zeng, Q. Luo, Micro-optical sectioning tomography to obtain a high-resolution atlas of the mouse brain, Science 330 (6009) (2010) 1404–1408.
[62] G. Xu, X. Wu, X. Zhang, W. Liao, S. Chen, LGNet: local and global representation learning for fast biomedical image segmentation, J. Innov. Opt. Health Sci. (2022).
[63] G. Xu, X. Wu, X. Zhang, X. He, LeViT-UNet: make faster encoders with transformer for medical image segmentation, arXiv preprint arXiv:2107.08623 (2021).
[64] X. Li, M. He, H. Li, H. Shen, A combined loss-based multiscale fully convolutional network for high-resolution remote sensing image change detection, IEEE Geosci. Remote Sens. Lett. 19 (2021) 1–5.
[65] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861 (2017).
[66] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al., Deep high-resolution representation learning for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 43 (10) (2020) 3349–3364.
[67] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A ConvNet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11976–11986.
[68] J.M.J. Valanarasu, V.M. Patel, UNeXt: MLP-based rapid medical image segmentation network. Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part V, Springer, 2022, pp. 23–33.
[69] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, M. Wang, Swin-UNet: UNet-like pure transformer for medical image segmentation. European Conference on Computer Vision, Springer, 2022, pp. 205–218.
[70] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J.M. Alvarez, P. Luo, SegFormer: simple and efficient design for semantic segmentation with transformers, Adv. Neural Inf. Process. Syst. 34 (2021) 12077–12090.

Guoping Xu received his PhD in Communication and System from the School of Electronic Information and Communications, Huazhong University of Science and Technology. He is a lecturer at the School of Computer Science and Engineering, Wuhan Institute of Technology; his research interests include medical image analysis and computer vision.

Wentao Liao is a master student in Computer Application Technology at Wuhan Institute of Technology. His main research interests are computer vision and medical image processing.

Xuan Zhang received her BS degree in computer science from Wuhan Institute of Technology. Her research interests are deep learning and computer vision.

Chang Li received the PhD degree in circuits and systems from the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan. He is currently an associate professor with the Department of Biomedical Engineering, Hefei University of Technology, Hefei, China. His research interests include biomedical signal processing, hyperspectral image analysis, computer vision, and machine learning.

Xinwei He received his PhD in Communication and System from the School of Electronic Information and Communications, Huazhong University of Science and Technology. His research interests include computer vision and machine learning.

Xinglong Wu received his PhD from the University of Miami and was a Post-Doctoral Researcher at the Center of Computational Science at the University of Miami for three years. As an Associate Professor at the School of Computer Science and Engineering, Wuhan Institute of Technology, China, he is actively involved in machine learning/deep learning and biomedical image analysis.
