Haar Wavelet Downsampling
Keywords: Semantic segmentation; Downsampling; Haar wavelet; Information entropy

Abstract

Downsampling operations such as max pooling or strided convolution are ubiquitously used in convolutional neural networks (CNNs) to aggregate local features, enlarge the receptive field, and reduce computational overhead. However, for a semantic segmentation task, pooling features over a local neighbourhood may discard important spatial information that is crucial for pixel-wise prediction. To address this issue, we introduce a simple yet effective pooling operation called the Haar Wavelet-based Downsampling (HWD) module. This module can be easily integrated into CNNs to enhance the performance of semantic segmentation models. The core idea of HWD is to apply the Haar wavelet transform to reduce the spatial resolution of feature maps while preserving as much information as possible. Furthermore, to investigate the benefits of HWD, we propose a novel metric, named the feature entropy index (FEI), which measures the degree of information uncertainty after downsampling in CNNs. Specifically, the FEI can be used to indicate the ability of downsampling methods to preserve essential information in semantic segmentation. Our comprehensive experiments demonstrate that the proposed HWD module (1) effectively improves segmentation performance across image datasets of different modalities with various CNN architectures, and (2) efficiently reduces information uncertainty compared with conventional downsampling methods. Our implementation is available at https://github.com/apple1986/HWD.
Fig. 1. Downsampling examples of average pooling, max pooling, strided convolution, and the HWD in DeepLabv3+ [13]. Compared with conventional downsampling methods, the feature after HWD preserves more boundary, texture, and detail information, as marked by the four red squares where tree branches are better preserved in (d). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
like DiSegNet [18], MMADT [19], CANet [20] and CCFFNet [21]. Due to the complementary properties of images from different modalities, it could also mitigate the effect of information loss after downsampling; (4) increasing prior information. A contour-enhanced attention module was proposed in [22], aiming to extract boundary and shape cues from CT images for refining the segmented areas. In [23], superpixels are used to help guide local and global consistency by considering rule-based appearance information. Besides, ensemble adversarial attacks are introduced in [24] for generating highly diverse adversarial examples, aiming to enhance the diversity of perturbation over the class distribution. In [25] and [26], category prior information is brought into CNNs to increase the robustness of DCNNs.

In a nutshell, the main purpose of these methods is to help establish a good relationship between downsampled features and segmentation labels by providing more learned information or features based on various strategies, such as multi-scale features, prior guidance, and multi-modality.

Although the previous works can partly alleviate the problem of information loss after downsampling and help extract representative feature maps by aggregating more information, the main problem remains that simple aggregation of features (e.g., summation or concatenation) may introduce unrelated information into the network, which hinders discriminative feature learning [27,28]. In other words, conventional downsampling methods in DCNNs lead to information loss. The lost information, such as boundary, scale, and texture, is important for object segmentation, especially for small-scale objects or objects obscured between the background and foreground [7,29]. Despite the incorporation of various techniques such as skip connections, multi-scale features, and prior knowledge, recovering the essential lost information through subsequent layers remains a significant challenge. Hence, our first question is: if regular downsampling methods cause information loss, can we design an information-preserving downsampling module that keeps as much information as possible in DCNNs for semantic segmentation?

Inspired by lossless information transformation methods [30], we introduce the Haar wavelet transform into our downsampling module. The theory and practice of the Haar wavelet transform have been extensively studied [31]. Early works on the Haar wavelet transform in image processing focused on image decomposition [32], compression [33], denoising [34], and reconstruction [35]. The main advantage of the Haar wavelet is its multi-scale, lossless, fast signal decomposition. In this paper, we exploit this advantage and propose the Haar wavelet downsampling (HWD) module, a new downsampling method for segmentation tasks. Our main motivation is that if the information of an image can be preserved as much as possible in the encoder part of a deep neural network, it becomes easier to extract discriminative features, which may help improve the segmentation performance. Hence, in contrast to conventional downsampling methods that directly decrease the spatial resolution of feature maps while losing information, we propose an alternative approach. First, we explicitly increase the channel number of the feature maps and reduce their resolution using the Haar wavelet transform, without loss of information. Then, we employ convolution operations for representative feature learning to filter out redundant information. Figure 1 shows several downsampling examples using average pooling, max pooling, strided convolution, and the proposed HWD in DeepLabv3+ [13]. We can see that HWD preserves more detail information compared with the other three downsampling methods.

With the integration of HWD in DCNNs, a second question arises: how can we measure the amount of essential features learned by the subsequent convolution layers? Here, essential features are those that help the DCNN generate better predictions with respect to the corresponding ground truth. In other words, if the extracted features are representative, the neural network will have more confidence (certainty) in predicting the desired results, suggesting that the uncertainty of the information is relatively low.
Previous studies have widely utilized the concept of information entropy [36] to assess the degree of uncertainty or randomness present in signals or images in a communication system [37]. Considering a neural network as a communication system, the primary purpose of this system in segmentation tasks is to reduce the uncertainty between input images and their corresponding semantic labels [38,39]. Inspired by the use of information entropy to evaluate the uncertainty of information, we propose a novel metric called the Feature Entropy Index (FEI) to quantify the uncertainty between the features and the prediction output in DCNNs. In particular, we can employ the FEI to estimate the degree of uncertainty of the downsampled features, which reflects the importance of the features relative to the ground truth.

In summary, our main contributions are as follows:

(1) We propose a novel wavelet-based downsampling module (HWD) for CNNs. To the best of our knowledge, our method is the first attempt to explore the feasibility of impeding information loss in the downsampling stage of DCNNs for the semantic segmentation task.
(2) We explore the measurement of information uncertainty across layers in CNNs, and propose a novel index (FEI) to evaluate the information uncertainty, or feature importance, between the downsampled feature maps and the prediction results.
(3) The proposed HWD can directly replace the strided convolution or pooling layer without significantly increasing the computational overhead, and can be easily integrated into current segmentation architectures. Extensive experiments demonstrate the effectiveness of the HWD module in comparison with seven state-of-the-art (SOTA) segmentation methods.

The rest of this paper is organized as follows: Section 2 reviews the related work on downsampling in DCNNs. Section 3 describes the proposed HWD module and the definition of the FEI. Experiments and results are presented in Section 4. Section 5 gives the discussion and conclusion of this study and points out its limitations and future works.

2. Related work

2.1. Wavelet transform in CNNs

Several studies have explored the utilization of the wavelet transform in CNNs to improve feature representation in various tasks such as classification, super-resolution, and denoising. In [40], a wavelet CNN that integrates multi-scale resolution analysis into CNNs was presented and applied to texture classification and image annotation, yielding superior accuracy compared with prior approaches. In [41], a multi-level wavelet CNN (MWCNN) model was proposed for image restoration, aiming to strike a better balance between receptive field size and computational efficiency. In [42], the MWCNN was utilized as a denoiser prior for restoring blurred images corrupted by Cauchy noise. Similarly, a wavelet-based CNN (Wavelet-SRNet) was introduced for multi-scale face super-resolution, where wavelet coefficients of high-resolution feature maps were learned before reconstructing high-resolution images [43]. In [44], a wavelet-like transform was integrated into a CNN for image compression, using an update-first lifting scheme to support multi-resolution analysis. These studies leverage the benefits of feature learning from CNNs and the multi-scale resolution analysis offered by wavelet techniques. In contrast to these studies, we investigate the Haar wavelet transform as an alternative downsampling method in CNNs, and focus on the image segmentation task.

2.2. Downsampling methods for feature maps

Downsampling operations offer inherent advantages, leading to their widespread utilization in diverse tasks, including ordinal classification [26], deep face recognition [45], semantic re-ranking systems [25], and autonomous driving [24]. Numerous downsampling methods have been proposed to reduce the spatial resolution of feature maps, which decreases the computational requirements of CNNs while increasing the receptive field of subsequent convolutions, as in ResNet [5] and U-Net [8]. Currently, two types of downsampling methods are frequently used in CNNs: the pooling operation and the strided convolution operation. Figure 2 illustrates four types of pooling methods. Pooling operations are based on a neighborhood approach and have no additional learnable parameters. Max pooling and average pooling are the two main downsampling methods adopted in many segmentation architectures, such as FCN, U-Net, and PSPNet. There are also works on adaptive pooling, such as stochastic pooling [46] and SoftPool [47]. In [48], a wavelet pooling for CNNs was introduced. This method computes a two-level wavelet decomposition of the features and discards the first-level subbands to reduce the feature dimensions, showing superiority in addressing the overfitting problem in image classification. The second category of downsampling uses a strided convolution operation to reduce the size of feature maps. It aims to preserve perceptually important details and align local image features in a learnable manner [49,50].
Fig. 3. The architecture of the proposed HWD module, which consists of two main blocks: a lossless feature encoding block and a feature representation learning block. Note that the channel number of the feature maps can be adjusted by the representation learning block.
For example, Chen et al. [51] proposed a convolutional block for learning fractional downsampling. If we consider the downsampling operation as information encoding, the above downsampling methods (pooling or strided convolution) result in information loss. The lost information, such as boundary, scale, and texture, plays an essential role in semantic segmentation. In other words, subsequent layers have a better ability to learn representative features if more information is preserved after downsampling. Therefore, unlike previous studies, we propose a simple yet effective downsampling module, HWD, aiming to preserve information while reducing the spatial resolution of feature maps. Any kind of downsampling operation, such as strided convolution or max pooling, can be directly replaced with HWD at minimal cost.

2.3. Image quality assessment

Extensive research has been conducted in the field of Image Quality Assessment (IQA) to evaluate image distortion and human observers' perception. Based on the availability of an original reference image, IQA methods can be broadly categorized into two approaches: full-reference (FR) and no-reference (NR). As FR IQA, the peak signal-to-noise ratio, the mean square error, and the structural similarity index measure (SSIM) [52] have become commonly used metrics. In particular, the SSIM, which models a combination of correlation loss, luminance distortion, and contrast distortion, is widely accepted because of its reasonable correlation with the perception of the human visual system (HVS). Several variants based on SSIM have been proposed, such as the Multi-Scale Structural Similarity Index (MSSIM) [53], which incorporates multi-scale spatial information, and the Information-Weighted Structural Similarity Index (IWSSIM) [54], which integrates information-weighted entropy. Furthermore, researchers found that the HVS is more sensitive to low-level visual information such as gradients [55,56]. The Gradient and Structure Similarity Model (GSM) [55] integrates gradient and luminance information to provide a comprehensive assessment of image quality. Similarly, the Feature Similarity Index Measure (FSIM) [56] employs phase congruency and gradient magnitude to assess image quality.

With the development of deep learning, some NR metrics based on deep neural networks have emerged for IQA. For instance, the Inception Score (IS) [57] was proposed to evaluate image quality in generative adversarial networks. It directly applies a pretrained Inception model to generated images to obtain the conditional label distribution, based on the assumption that a high-quality image should contain meaningful objects. The Fréchet Inception Distance (FID) [58] calculates the distance between feature vectors computed for real and generated images.

In summary, the majority of IQA metrics are employed to evaluate the quality of original images. This study focuses on the development of a metric for assessing the quality of downsampled feature maps. The metric aims to quantify the degree of information uncertainty, or feature importance, by comparing the downsampled feature maps with the prediction results.

3. Methods

As illustrated in Fig. 3, the proposed HWD module comprises two blocks: (1) the lossless feature encoding block and (2) the feature representation learning block. The lossless feature encoding block is responsible for transforming features and reducing spatial resolution. To accomplish this, we utilize the Haar wavelet transform, a method that efficiently decreases the resolution of feature maps while retaining all information. The representation learning block consists of a standard convolution layer, batch normalization, and a ReLU activation layer. It is employed to extract discriminative features. Each block is explained in the following subsections.

3.1. Lossless feature encoding block

The lossless feature encoding block utilizes a Haar wavelet transform layer to effectively reduce the spatial resolution of feature maps while preserving all information. The Haar wavelet transform is a widely recognized, compact, dyadic, and orthonormal transform that finds extensive application in image coding, edge extraction, and binary logic design [31].
Fig. 4. Illustration of the wavelet transform, which encodes the input image into four components with reduced spatial resolution compared with the input.

Fig. 5. Illustration of HWD and max pooling downsampling for an RGB image. Top row: downsampled feature with the max pooling operation. Bottom row: feature extraction with HWD. (Note that the displayed feature is obtained by taking the maximum over all downsampled feature maps along the channel direction.)
The wavelet basis function and scale function for the 1-stage, one-dimensional Haar transform can be defined as follows:

\[
\begin{cases}
\phi_1(x) = \dfrac{1}{\sqrt{2}}\,\phi_{1,0}(x) + \dfrac{1}{\sqrt{2}}\,\phi_{1,1}(x)\\[6pt]
\psi_1(x) = \dfrac{1}{\sqrt{2}}\,\phi_{1,0}(x) - \dfrac{1}{\sqrt{2}}\,\phi_{1,1}(x).
\end{cases}
\tag{1}
\]

Here, \(\phi_{j,k}(x)\) is defined as:

\[
\phi_{j,k}(x) = \sqrt{2^{j}}\,\phi\!\left(2^{j}x - k\right), \quad k = 0, 1, \dots, 2^{j} - 1.
\tag{2}
\]

In this context, the parameters j and k denote the stage (or scale, in the image processing domain) and the order (or direction, for a 2D image) of the Haar basis function, respectively. Furthermore, \(\phi_{0,0}(x)\) is defined as:

\[
\phi_{0,0}(x) = \phi_0(x) =
\begin{cases}
0, & x < 0\\
1, & 0 \leq x < 1\\
0, & x \geq 1.
\end{cases}
\tag{3}
\]

Therefore, the 1-stage Haar transform can be expressed using the 0-stage Haar basis functions:

\[
\begin{cases}
\phi_1(x) = \phi_0(2x) + \phi_0(2x - 1)\\
\psi_1(x) = \phi_0(2x) - \phi_0(2x - 1).
\end{cases}
\tag{4}
\]

This implies that a signal of length L can be divided into two parts of length L/2, which can be interpreted as the outputs of low-pass and high-pass decomposition filters, respectively. Applying the Haar wavelet transform to a two-dimensional signal, such as a grayscale image, yields four components, each with half the spatial resolution of the original signal. Figure 4 depicts the decomposition of an image of resolution H × W using the Haar wavelet transform. Here, H0 and H1 represent the low-pass and high-pass decomposition filters, which extract the approximate and high-frequency information from an image, respectively. The symbol ↓2 denotes downsampling applied to the approximate and detail components. The Haar wavelet transform generates four components: the approximate (low-frequency) component (A), and the detail (high-frequency) components in the horizontal (H), vertical (V), and diagonal (D) directions. Each component has a size of H/2 × W/2, whereas the channel number of the feature maps is quadrupled. In other words, the Haar wavelet transform encodes information from the spatial dimension into the channel dimension without any loss. Consequently, the subsequent layers of DCNNs can extract representative features from the transformed components, which have reduced spatial resolution.

3.2. The representation feature learning block

The feature representation learning block consists of a standard 1 × 1 convolution layer, a batch normalization layer, and a ReLU activation function. In this block, the standard convolution is employed to adjust the channel number of the feature maps.
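To make the two blocks concrete, below is a minimal PyTorch sketch of an HWD module; the class and method names are ours for illustration, not the authors' released code (the official implementation is available at the repository linked in the abstract). The 1-level orthonormal 2D Haar transform of Fig. 4 is computed by splitting every 2 × 2 block of the input into its four polyphase samples:

```python
# A minimal sketch of HWD (Fig. 3): a lossless Haar encoding followed by a
# 1x1 convolution + batch normalization + ReLU. Names are illustrative.
import torch
import torch.nn as nn

class HaarWaveletDownsampling(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # After the Haar transform the channel count quadruples (A, H, V, D),
        # so the 1x1 convolution maps 4*in_channels to the desired width.
        self.learn = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    @staticmethod
    def haar_dwt(x: torch.Tensor) -> torch.Tensor:
        # Four polyphase samples of each 2x2 block (H and W must be even).
        a = x[..., 0::2, 0::2]
        b = x[..., 1::2, 0::2]
        c = x[..., 0::2, 1::2]
        d = x[..., 1::2, 1::2]
        ll = (a + b + c + d) / 2  # approximate component A
        lh = (a - b + c - d) / 2  # detail component (orientation naming varies)
        hl = (a + b - c - d) / 2  # detail component
        hh = (a - b - c + d) / 2  # diagonal detail component D
        # (B, C, H, W) -> (B, 4C, H/2, W/2): spatial info moves into channels.
        return torch.cat([ll, lh, hl, hh], dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.learn(self.haar_dwt(x))
```

Because the transform is orthonormal, the input is exactly recoverable from the four components (e.g., a = (ll + lh + hl + hh) / 2), which is precisely the lossless property the encoding block relies on. For an input of shape (2, 64, 128, 128), HaarWaveletDownsampling(64, 128) returns a tensor of shape (2, 128, 64, 64).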
5
G. Xu et al. Pattern Recognition 143 (2023) 109819
Here, P is the prediction map produced by the segmentation model. Each pixel value of P is the maximum prediction probability over the classes, and M is the total number of pixels in P, which is equal to H × W. The FEI enables us to quantify the uncertainty of feature maps generated by convolutional neural networks (CNNs). A smaller FEI value indicates reduced uncertainty in the feature maps, implying that they offer more information for uncertainty reduction. When the FEI reaches zero, the model can accurately infer the segmentation from the current feature maps.

4. Experimental results

To evaluate the effectiveness of HWD, extensive experiments are conducted on three different image modalities: natural images, Computed Tomography (CT), and Micro-Optical Sectioning Tomography (MOST) images. In this section, we provide a brief introduction to the datasets and discuss the implementation details. Subsequently, we report the segmentation results and compare them with state-of-the-art (SOTA) methods on these three datasets.

where g, p, N and C denote the ground truth, the prediction results, the number of image elements, and the number of label classes, respectively. The first and second terms refer to the definitions of the cross-entropy and generalized Dice losses, respectively. According to Chen et al. [60], all 3D volume datasets are trained slice by slice, and the predicted 2D slices are then stacked together to form a 3D prediction for evaluation on Synapse. For the other two datasets, we evaluate the segmentation performance using 2D predictions. The evaluation metrics, namely intersection-over-union (IoU), Dice similarity coefficient (DSC), and Hausdorff distance (HD), are used to evaluate the performance. The definitions of each metric are as follows:

\[
\mathrm{IoU} = \frac{TP}{TP + FP + FN},
\tag{7}
\]

\[
\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN}.
\tag{8}
\]
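Eqs. (7) and (8) translate directly into code; the following helper (its name is ours, for illustration) computes both metrics from binary masks for a single class:

```python
# A small sketch computing IoU (Eq. 7) and DSC (Eq. 8) from binary masks.
import torch

def iou_dsc(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """pred, target: boolean tensors of the same shape for one class."""
    tp = (pred & target).sum().float()       # true positives
    fp = (pred & ~target).sum().float()      # false positives
    fn = (~pred & target).sum().float()      # false negatives
    iou = tp / (tp + fp + fn + eps)          # Eq. (7)
    dsc = 2 * tp / (2 * tp + fp + fn + eps)  # Eq. (8)
    return iou.item(), dsc.item()
```

The small eps guards against empty masks; the Hausdorff distance over the point sets X and Y is typically computed with an off-the-shelf implementation rather than by hand.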
Table 2
Performance of seven SOTA segmentation architectures with three types of ResNet as backbones on the CamVid dataset.
Method Backbone mIoU↑ Sky BLD Pol Roa Pav Tre Sig Fen Car Ped Bik Time(ms)
PSPNet R-18 57.50 90.25 77.71 18.82 93.46 79.63 71.77 29.10 20.09 77.76 40.23 33.73 2.18
R-34 60.23 90.35 77.87 20.86 93.59 79.06 71.56 33.95 26.47 81.92 43.38 43.55 2.73
R-50 58.63 90.88 78.23 20.73 93.95 80.22 72.85 33.53 17.66 80.69 38.45 37.76 3.26
LinkNet R-18 60.30 91.71 78.01 24.72 94.35 81.35 70.55 31.25 23.63 80.26 43.21 44.29 3.88
R-34 61.78 91.73 78.85 25.91 94.22 81.31 71.05 35.44 23.11 81.76 45.56 50.61 5.35
R-50 63.04 92.23 80.22 28.15 94.22 81.12 72.90 36.02 31.14 82.03 46.08 49.37 6.39
FPN R-18 62.71 91.28 79.94 27.94 95.04 82.92 71.67 33.13 27.38 84.86 45.52 50.13 3.75
R-34 64.36 91.40 80.74 26.46 94.79 82.43 73.57 36.26 34.38 84.81 49.88 53.19 5.24
R-50 63.82 91.44 80.13 26.71 95.16 83.83 72.47 35.63 29.66 84.37 49.55 53.03 6.25
PAN R-18 61.47 90.64 79.72 25.33 94.66 81.95 70.88 34.08 27.58 82.73 44.96 43.66 4.41
R-34 62.94 90.73 80.70 27.19 94.84 82.41 72.47 37.31 28.40 85.79 46.24 46.31 5.82
R-50 60.66 89.86 78.66 25.70 94.38 81.15 71.93 30.85 22.52 82.75 46.70 42.72 6.98
DeepLabv3+ R-18 61.27 90.72 79.16 24.48 94.41 81.60 71.93 34.13 20.79 83.73 47.42 45.55 3.40
R-34 61.39 91.23 79.48 24.76 94.72 82.59 72.29 34.78 22.36 83.32 46.21 43.59 4.88
R-50 61.74 91.19 79.43 26.60 94.64 81.69 72.31 33.36 20.82 83.21 49.12 46.79 5.93
U-Net R-18 62.54 91.91 79.20 27.96 94.51 82.51 71.59 33.34 27.47 84.17 47.75 47.54 3.64
R-34 64.61 92.23 80.55 30.20 95.28 84.16 73.00 39.12 28.24 84.24 51.35 52.36 5.16
R-50 64.33 91.93 80.68 29.38 94.83 82.88 73.69 37.56 28.85 83.54 52.54 51.77 6.28
Unet++ R-18 63.50 92.18 80.17 28.34 94.66 82.61 72.04 37.21 30.33 84.40 49.85 46.66 5.38
R-34 64.89 92.11 80.94 30.99 95.06 83.90 73.55 38.84 26.96 85.82 53.59 52.00 6.91
R-50 64.86 92.24 80.93 29.81 94.96 82.75 74.38 40.61 28.74 83.18 53.25 52.63 8.24
Mean / 62.22 91.34 79.59 26.24 94.56 82.00 72.31 35.02 26.03 83.11 47.18 47.01 5.05
Table 3
Performance of seven SOTA segmentation architectures with three types of ResNet equipped with the HWD module as backbones on the CamVid dataset.
Method Backbone mIoU↑ Sky BLD Pol Roa Pav Tre Sig Fen Car Ped Bik Time(ms)
PSPNet_HWD R-18 58.83 90.38 77.38 20.47 94.18 80.12 72.46 29.42 28.50 79.49 39.67 35.02 2.86
R-34 61.15 90.53 78.99 21.93 94.57 81.58 72.77 36.32 27.10 81.52 43.40 43.97 3.44
R-50 60.08 90.91 78.32 22.43 94.34 80.55 73.61 32.14 25.41 81.34 42.13 39.74 3.90
LinkNet_HWD R-18 63.02 91.91 80.32 28.98 94.77 82.74 71.34 35.33 29.24 83.67 49.06 45.88 4.45
R-34 64.07 92.28 79.99 29.87 94.34 81.61 72.05 36.13 31.94 85.70 49.44 51.46 6.02
R-50 64.63 92.20 81.09 30.43 93.91 80.83 74.34 39.56 31.22 82.69 50.75 53.93 7.15
FPN_HWD R-18 64.66 91.57 80.40 28.90 95.14 82.87 73.36 35.95 34.07 84.53 50.00 54.50 4.44
R-34 65.15 91.49 81.33 28.47 95.20 83.66 73.77 38.11 33.47 86.51 49.98 54.70 6.04
R-50 64.74 91.46 81.44 28.39 94.80 82.54 73.59 38.86 31.22 85.80 50.56 53.51 7.04
PAN_HWD R-18 62.69 90.93 80.00 25.86 94.63 82.18 72.20 33.19 32.69 83.68 47.41 46.85 5.02
R-34 64.11 90.98 80.28 28.18 95.39 84.10 72.13 37.49 31.08 85.42 50.44 49.71 6.54
R-50 61.10 91.19 78.69 24.99 94.13 79.97 70.39 34.05 23.53 82.82 45.84 46.54 7.86
DeepLabv3+_HWD R-18 62.75 91.33 80.24 26.54 94.42 80.64 72.06 35.49 30.26 84.53 49.09 45.61 4.00
R-34 63.07 90.99 80.71 28.93 94.43 81.14 72.96 36.17 23.78 85.25 49.56 49.88 5.55
R-50 62.31 91.56 80.02 27.80 94.20 80.78 73.19 33.02 21.84 82.60 50.99 49.42 6.62
U-Net_HWD R-18 64.31 91.83 80.51 30.40 94.93 83.34 71.87 36.54 37.14 82.44 50.13 48.27 4.29
R-34 65.62 92.23 80.93 30.91 95.41 84.68 73.52 38.98 37.97 83.43 53.83 49.92 5.84
R-50 64.28 92.35 80.82 29.97 94.49 82.23 74.07 39.03 29.51 83.51 52.87 48.25 6.87
Unet++_HWD R-18 65.98 92.30 81.48 32.30 95.04 83.40 73.76 38.95 34.48 86.27 53.83 54.01 6.03
R-34 66.10 92.48 81.63 31.61 95.66 84.77 73.93 41.67 32.08 86.33 52.67 54.30 7.87
R-50 66.48 92.42 82.11 33.09 94.94 83.23 74.43 45.55 37.98 83.61 52.39 51.56 8.98
Mean / 63.58 91.59 80.32 28.12 94.71 82.24 72.94 36.76 30.69 83.86 49.24 48.91 5.75
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively, and x and y are points from the two finite point sets X and Y.

Considering that ResNets have become the prevailing backbone in numerous segmentation architectures, we employ ResNet-18, ResNet-34, and ResNet-50 as the backbones for feature learning in this study. Additionally, in the Appendix, we present the segmentation performance achieved by adopting MobileNetv2 as the backbone, as well as other SOTA segmentation architectures based on convolution or transformer operations. Basically, a standard ResNet comprises a stem with two downsampling operations, followed by four stage blocks that consist of convolution layers and residual connections. Our primary objective is to maximize the retention of information at the stem of the ResNet. Therefore, we straightforwardly substitute the initial convolution layer (with a stride of 2) and the first max pooling layer in the stem of the ResNet with HWD, while keeping all subsequent layers identical to the original ResNet architecture (a sketch of this substitution follows below). It is worth noting that all models are trained from scratch.

(1) Comparison with SOTA methods

We evaluate our method on seven state-of-the-art (SOTA) segmentation architectures: DeepLabv3+, PSPNet, LinkNet, FPN, PAN, U-Net, and Unet++. ResNet-18 (R-18), ResNet-34 (R-34), and ResNet-50 (R-50) are selected as the backbones for feature learning in each architecture. Further details about the training loss can be found in the Appendix.

Evaluation: Tables 2 and 3 compare the effectiveness of our plug-in-plug-off HWD module against the baselines in terms of mean IoU. The SOTA segmentation architectures equipped with the HWD module exhibit a performance improvement of 1∼2% for each respective architecture. Specifically, our method demonstrates its effectiveness on this dataset by improving the mIoU of all 21 models by 1.21%. In particular, our method significantly improves the performance on small-scale objects, such as pedestrian (+2.06%), bicycle (+1.9%), fence (+4.66%), and sign symbol (+1.74%).
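The stem substitution described above can be sketched as follows for torchvision's ResNet-18; the attribute names conv1 and maxpool are torchvision's, while HaarWaveletDownsampling refers to the sketch in Section 3.1 (an illustration under stated assumptions, not the authors' exact code):

```python
# A hedged sketch of the ResNet stem substitution, assuming torchvision's
# ResNet layout and the HaarWaveletDownsampling sketch from Section 3.1.
import torch.nn as nn
from torchvision.models import resnet18

def resnet18_with_hwd() -> nn.Module:
    model = resnet18(weights=None)  # all models are trained from scratch
    # Replace the stride-2 7x7 convolution (3 -> 64 channels, halves H and W).
    model.conv1 = HaarWaveletDownsampling(3, 64)
    # Replace the stride-2 max pooling, keeping the 64-channel width.
    # torchvision still applies bn1/relu between the two; since HWD already
    # ends in BN + ReLU, that extra pair is redundant but dimensionally safe.
    model.maxpool = HaarWaveletDownsampling(64, 64)
    return model  # layer1..layer4 and the head are left untouched
```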
Fig. 6. Visualized segmentation results on the CamVid test set. The first column contains the input image and its corresponding ground truth. The second, fourth, and
sixth columns display the outputs of DeepLabv3+, LinkNet, and U-Net, respectively, with ResNet-34 as the backbone. The third, fifth, and seventh columns show
zoomed-in images that correspond to the regions highlighted in red boxes. More specifically, the even rows represent the segmentation results obtained with HWD,
while the odd rows correspond to the results using the original downsampling in ResNet-34. (For interpretation of the references to colour in this figure legend, the
reader is referred to the web version of this article.)
Table 4
Results of SSIM (%), PSNR (dB), FEI, and FEI on the boundary (FEI_B), evaluated on the CamVid test set.
Model Backbone SSIM↑ PSNR↑ FEI↓ FEI_B↓ Model SSIM↑ PSNR↑ FEI↓ FEI_B↓
DeepLabv3+ R-18 70.66 9.29 219.70 176.32 DeepLabv3+_HWD 78.42 11.07 94.63 73.22
R-34 70.92 8.95 240.39 191.90 76.23 10.26 90.10 70.00
R-50 71.87 8.74 266.47 212.83 75.38 10.87 121.74 92.44
FPN R-18 67.33 7.71 244.62 197.79 FPN_HWD 79.35 10.94 100.22 78.97
R-34 67.84 7.49 255.78 205.79 79.30 10.84 96.98 77.07
R-50 71.67 8.46 253.36 204.54 78.94 11.33 105.83 82.89
Linknet R-18 72.12 11.36 200.28 152.82 Linknet_HWD 78.71 10.84 92.37 70.15
R-34 72.26 10.63 227.73 172.98 79.80 11.03 98.56 74.90
R-50 71.51 9.83 219.57 170.36 80.33 11.91 107.27 82.81
PAN R-18 70.00 8.56 237.82 191.30 PAN_HWD 77.44 10.54 95.52 73.29
R-34 70.39 8.54 241.81 193.92 77.83 10.77 100.73 78.02
R-50 72.31 9.05 262.54 206.88 78.85 11.30 120.69 92.36
PSPNet R-18 68.36 7.86 377.00 284.23 PSPNet_HWD 77.93 10.74 153.56 111.35
R-34 70.63 8.55 325.80 250.17 78.15 10.84 133.43 101.98
R-50 73.00 8.71 319.11 247.27 79.08 11.50 134.00 100.24
Unet++ R-18 66.91 7.81 224.88 174.71 Unet++_HWD 76.72 10.20 85.42 65.27
R-34 66.86 7.51 229.25 181.18 75.22 9.94 89.85 68.27
R-50 69.39 7.69 234.76 184.64 74.84 10.13 101.74 77.89
U-Net R-18 67.11 8.07 238.13 182.41 U-Net_HWD 75.64 9.81 94.52 71.72
R-34 68.45 7.98 239.87 189.82 79.01 10.80 94.59 71.46
R-50 71.55 8.34 235.51 186.38 77.38 10.59 101.84 77.41
Mean / 70.05 8.63 252.11 198.01 Mean 77.83 10.77 105.41 80.56
This suggests that segmenting large-scale objects is relatively easy, even when downsampling operations may lead to information loss. Additionally, we recorded the prediction time for each model on the CPU. The mean inference time increased by 0.7 ms when using HWD in the stem of the ResNet, which is acceptable considering the improved segmentation performance.

Qualitative results: Figure 6 visualizes the segmentation results of three architectures (DeepLabv3+, U-Net, and LinkNet) with ResNet-34 (R-34) as the backbone, both with and without the HWD module. The three architectures equipped with the proposed HWD downsampling module exhibit improved performance, which can be summarized in three aspects: 1) pure ResNet-34-based methods tend to under-segment objects such as sign symbols, bicycles, and fences, highlighting the ability of HWD to alleviate the under-segmentation issue; 2) when comparing results with and without HWD, our method demonstrates improved results for small-scale objects; 3) compared with the original ResNet-based models, the predictions using HWD exhibit smoother boundaries and shapes (e.g., tree, sidewalk, and building) than those using conventional downsampling methods.

(2) Evaluation of feature effectiveness after downsampling
Fig. 7. Comparison of the performance obtained by using the low-frequency component A (denoted HWD_LL) and the high-frequency components (H, V, and D; denoted HWD_HH) in HWD. The "+" in the figure indicates the improvement in terms of mIoU.
Table 5
Mean performance of 21 segmentation models with ResNet backbones, using various numbers of HWD modules as the downsampling operation on the CamVid dataset.
No. of HWD mIoU↑ Sky BLD Pol Roa Pav Tre Sig Fen Car Ped Bik
0 62.22 91.34 79.59 26.24 94.56 82.00 72.31 35.02 26.03 83.11 47.18 47.01
2 63.58 91.59 80.32 28.12 94.71 82.24 72.94 36.76 30.69 83.86 49.24 48.91
6 63.40 91.57 80.13 27.89 94.62 81.96 72.86 36.19 30.53 83.94 49.10 48.54
8 63.42 91.47 80.23 27.89 94.69 82.01 72.81 36.44 30.41 83.86 48.95 48.87
10 63.03 91 80 28 95 83 73 36 29 85 49 48
Table 6
Performance of seven SOTA models without HWD on the Synapse test set.
Methods Backbone mDSC↑ mHD↓ Aorta Gall KidL KidR Liver Pancreas Spleen Stomach
U-Net R-18 77.37 27.43 85.88 62.08 81.27 72.73 93.91 60.6 90.26 72.24
R-34 79.21 27.94 87.39 66.81 82.81 76.69 94.02 60.04 90.41 75.51
R-50 77.84 31.27 88.17 62.39 82.51 78.57 94.03 55.15 88.73 73.18
DeepLabv3+ R-18 75.92 24.23 85.08 61.48 83.32 74.16 93.44 54.46 85.39 70
R-34 77.37 22.08 85.56 62.69 81.99 75.45 93.17 59.53 87.31 73.27
R-50 73.87 31.94 83.78 57.23 76.91 71.42 92.91 50.22 87.03 71.43
FPN R-18 76.56 21.84 85.69 58.88 80.66 73.6 92.81 53.92 90.93 76.02
R-34 77.89 15.86 86.62 61.42 84.94 76.25 93.32 55.45 90.2 74.94
R-50 77.08 21.19 85.84 59.56 81.63 78.4 93.55 52.82 89.57 75.28
PSPNet R-18 74.19 20.65 82.05 58.58 80.86 73.69 92.16 46.22 86.44 73.52
R-34 75.89 16.43 82.39 62.65 82.38 77.24 93.22 49.54 85.66 74.05
R-50 76.3 21.3 82.74 65.13 81.58 74.82 92.71 51.1 88.05 74.3
PAN R-18 76.63 19.13 84.55 60.17 82.76 75.58 93.54 51.58 89.68 75.2
R-34 78.17 14.27 86.35 60.22 84.46 80.42 93.23 55.21 90.22 75.27
R-50 77.67 15.36 85.89 60.52 84.08 79.73 93.32 53.06 89.47 75.26
LinkNet R-18 77.56 32.26 86.66 60.49 79.26 75.37 94.05 60.33 89.13 75.23
R-34 79.09 21.46 88.14 62.49 86.3 82.3 94.59 58.94 87.74 72.23
R-50 78.44 28.14 87.61 64.5 80.83 74.51 94.34 58.67 89.76 77.31
Unet++ R-18 78.63 23.27 87.56 64.94 81.11 77.94 94.43 59.3 89.7 74.02
R-34 80.76 21.27 88.8 70.01 84.35 80.75 94.52 64.76 89.66 73.24
R-50 79.61 28.85 89.24 65.64 83.47 77.54 94.06 63 87.96 75.93
Mean / 77.43 23.15 86 62.28 82.26 76.53 93.59 55.9 88.73 74.16
In this study, we utilize the Structural Similarity (SSIM), the Peak Signal-to-Noise Ratio (PSNR), and the proposed Feature Entropy Index (FEI) to assess the effectiveness of downsampling on the feature maps. Specifically, SSIM and PSNR are employed to evaluate the structural similarity and information fidelity between the input image and the feature maps. Additionally, the FEI and the boundary FEI (FEI_B) are computed to quantify the information uncertainty between the downscaled feature maps and the prediction outputs. For FEI_B, a morphological dilation with a kernel size of 5 is applied to extract object edges from the ground truth (a sketch of this step follows below). The results in Table 4 demonstrate an improvement in SSIM (7.78%) and PSNR (2.14 dB) across all 21 models. Furthermore, the introduction of HWD in each architecture leads to a reduction in information uncertainty: across all 21 models, HWD reduces the feature uncertainty by 58.2% (FEI) and 46.8% (FEI_B) compared with the original downsampling methods.
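Our reading of the FEI_B edge extraction above is that the boundary band consists of the pixels changed by a 5 × 5 morphological dilation of the ground-truth mask; the sketch below follows that assumption (the helper name is ours, and the authors' exact procedure may differ):

```python
# Sketch of a dilation-based boundary band for FEI_B (assumed procedure).
import torch
import torch.nn.functional as F

def boundary_band(mask: torch.Tensor) -> torch.Tensor:
    """mask: (B, 1, H, W) binary {0, 1} ground-truth mask for one class."""
    m = mask.float()
    # Morphological dilation of a binary mask == max pooling with stride 1.
    dilated = F.max_pool2d(m, kernel_size=5, stride=1, padding=2)
    return (dilated - m).clamp(min=0)  # pixels newly covered by the dilation
```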
Table 7
Performance of seven SOTA models with HWD on the Synapse test set.
Methods Backbone mDSC↑ mHD↓ Aorta Gall KidL KidR Liver Pancreas Spleen Stomach
U-Net_HWD R-18 79.52 21.22 87.8 63.44 84.97 79.62 93.91 60.58 89.76 76.12
R-34 80.61 16.23 87.06 65.76 85.92 83.09 94.34 60.75 90.6 77.37
R-50 78.6 26.95 87.32 64.14 83.02 79.99 94.06 58.69 88.52 73.08
DeepLabv3+_HWD R-18 77.11 23.02 86.19 63.63 81.9 76.04 93.36 51.87 88.83 75.1
R-34 77.82 22.09 87.3 61.56 82.61 78.69 93.69 53.39 89.74 75.57
R-50 75.69 26.76 85.32 61.9 80.33 70.32 93.48 51.93 89.19 73.09
FPN_HWD R-18 79.22 21.41 86.42 68.1 82.84 77.2 92.97 60.49 89.83 75.95
R-34 79.38 16.6 86.16 69.67 82.83 76.92 92.89 60.38 89.59 76.63
R-50 79.62 18.52 86.73 65.95 83.21 81.28 93.66 57.05 90.69 78.37
PSPNet_HWD R-18 76.15 19.55 83.05 61.17 82.73 77.92 93.47 49.62 84.03 77.21
R-34 76.94 18.3 82.51 65.29 82.83 77.46 93.49 51.25 87.13 75.52
R-50 76.92 17.39 84.19 60.63 83.22 78.85 93.28 51.29 87.46 76.42
PAN_HWD R-18 78.17 15.18 85.26 63.41 85.14 79.88 93.17 53.05 89.95 75.48
R-34 77.78 20.17 85.88 64.41 81.77 75.77 93.01 55.02 90.15 76.22
R-50 77.14 22.23 86.56 64.42 79.24 76.41 93 49.76 90.09 77.66
LinkNet_HWD R-18 79.15 22.12 87.63 68.03 81.31 76.66 94.67 58.33 87.84 78.75
R-34 79.59 24.1 87.72 62.23 83.19 79.97 94.29 63.23 89.07 76.97
R-50 79.29 25.17 87.44 64.53 84.55 80.22 93.46 55.43 91.53 77.13
Unet++_HWD R-18 80.24 22.73 87.52 72.52 83.75 78.95 94.11 61.39 89.57 74.14
R-34 82.14 18.58 88.56 73.8 86.55 83.11 94.47 64.91 90.8 74.93
R-50 80.85 24.07 88.35 67.31 84.02 79.61 94.04 65.05 91.52 76.89
Mean / 78.66 21.07 86.43 65.33 83.14 78.47 93.66 56.83 89.33 76.12
Table 8
Comparison of the SOTA results with and without HWD on the MOST test set.
Model Backbone mDSC Vessel Soma Model mDSC Vessel Soma
Furthermore, we observe that both the standard downsampling and HWD approaches exhibit higher uncertainty near the object boundaries, indicating the difficulty of accurately segmenting these regions.

(3) Ablation study

We observed that state-of-the-art (SOTA) architectures equipped with HWD yield superior segmentation results compared with conventional methods such as max pooling or strided convolution. Based on the decomposition ability of HWD, which separates a feature map into a low-frequency part (A) and three high-frequency components (D, V, and H) (Fig. 4), we conduct ablation studies to investigate the relative importance of each part in the semantic segmentation task.

Low frequency vs. high frequency: In Fig. 7, we assess which component after the Haar wavelet transform is dominant for achieving optimal segmentation performance. The low-frequency component (A) is found to be crucial: it improves the mean Dice Similarity Coefficient (DSC) by 0.91% compared with using only the high-frequency information (D, V, and H), despite the high-frequency components having three times as many feature maps as their low-frequency counterpart.

The number of HWD modules in ResNet backbones: As demonstrated by the aforementioned experiments, replacing the downsampling module in the stem of the ResNet with HWD significantly improves the segmentation performance by incorporating additional information into the models. The objective of this ablation study is to assess the effect of replacing the remaining downsampling layers in ResNets with HWD. In particular, the ResNet architecture comprises a stem and four stage blocks, consisting of convolution layers, batch normalization, and ReLU operations, along with skip connections. The stem involves two downsampling operations, namely strided convolution and max pooling. Stages 2, 3, and 4 each include two strided convolution operations: one in the skip connection and the other in the residual block. In this study, we replace the downsampling operations in various blocks of the ResNet with HWD (one mechanical way to perform these swaps is sketched below). Compared with the original backbone, it is evident from Table 5 that utilizing HWD as the downsampling operation leads to improved segmentation performance. Furthermore, we observe that it is more effective to use HWD solely in the stem of the ResNet.
Therefore, we implement two HWD modules to replace the downsampling operations in the stem of the ResNet, achieving a favorable balance between efficiency and accuracy in this study.

4.4. Experiment results on the Synapse dataset

We perform experiments using the same seven SOTA models as in the CamVid experiments. The results demonstrate that integrating our HWD module into the architectures improves the DSC by 1.23% and reduces the HD by 2.08 mm across the 21 models (Tables 6 and 7). Specifically, our HWD module enhances the mean DSC by 1.57% when R-18 and R-34 are used as backbones, but only by 0.93% when R-50 is used, indicating that our proposed method is more effective for architectures with fewer parameters.

4.5. Experiment results on the MOST dataset

The segmentation architectures integrated with our proposed HWD module demonstrate a 1.26% improvement in terms of DSC (Table 8). Importantly, the HWD module significantly enhances the segmentation performance when ResNet-18 and ResNet-34 are used as backbones for feature extraction. For instance, the DSC values for U-Net_R18 and U-Net_R34 are 84.39% and 84.01%, respectively; upon integration of our HWD module, the DSC improves by 3.01% and 3.96% for U-Net_R18 and U-Net_R34, respectively. This demonstrates that shallow CNNs have a higher demand for information compared with relatively deep networks.
5. Conclusion

In conclusion, we present a general downsampling module (HWD) for semantic segmentation in this paper. The goal of the HWD module is to retain as much essential information as possible during downsampling. Extensive experiments and ablation studies conducted on three image datasets of varying modalities demonstrate the effectiveness of the proposed HWD module and the FEI metric. This work has implications for various CNN-based computer vision tasks, including instance segmentation, object detection, and pose estimation. Furthermore, to assess the quality of downsampled feature maps, we introduce a new metric called the Feature Entropy Index (FEI). The FEI metric effectively reflects the degree of information uncertainty by considering the downsampled feature maps and the prediction results. Experimental results further indicate that the HWD module provides more information for object segmentation than conventional downsampling methods.

The proposed HWD downsampling module and the FEI assessment metric have two main advantages. First, the HWD module can seamlessly integrate into existing segmentation architectures due to its generality. It can directly replace existing downsampling methods, such as max pooling, average pooling, or strided convolution, without introducing additional complexity, and it significantly improves the segmentation performance. Second, the FEI can be applied to assess the quality of feature maps, serving as a quantitative indicator of the amount of essential information preserved after downsampling in segmentation architectures.

The primary limitations of the proposed HWD module lie in its efficiency and locality. Therefore, we intend to continue our research in two directions. First, it is crucial to extract representative features from input images for semantic segmentation; however, source images contain a significant amount of redundant information, which may hinder the extraction of representative features. We aim to incorporate prior knowledge, such as boundaries and textures, into the HWD module to efficiently filter out irrelevant information during the downsampling process. Second, the proposed HWD offers great potential for improving segmentation performance on the current SOTA convolutional neural network benchmarks, such as U-Net, DeepLabv3+, and PSPNet. However, the HWD module lacks the ability to capture global context and establish long-range spatial relations due to the localized nature of convolution operations. We are considering integrating both local and global features from convolution and Transformer operations into the HWD module in our future work.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work is supported by the Guangdong Provincial Key Laboratory of Human Digital Twin (No. 2022B1212010004), the Open-Fund of WNLO (No. 2018WNLOKF027), the Hubei Key Laboratory of Intelligent Robot at Wuhan Institute of Technology (No. HBIRL202202 and No. HBIR202206), and the Chongqing Science and Technology Bureau (No. 2022TIAD-KPX0190). We thank the Optical Bioimaging Core Facility of WNLO-HUST for the support in MOST data acquisition.

Appendix A. Mathematical notations table

Table A.1
The definitions of the mathematical notations used in this paper.

α̃: the value after pooling
αi: the values in a region R of the input data
|R|: the cardinality (number of elements) of the region R
ρi: the probability of αi being selected in stochastic pooling
H: the height of the feature map
W: the width of the feature map
H0: the low-pass decomposition filter
H1: the high-pass decomposition filter
ϕ: the wavelet basis function
ψ: the scale function
g: the ground truth
p: the prediction results
gi: the pixels in g
pi: the pixels in p
N: the number of image pixels
C: the number of label classes
TP: the number of positive instances that are correctly predicted
FP: the number of negative instances that are incorrectly predicted
FN: the number of positive instances that are incorrectly predicted
x: a point in set X
y: a point in set Y

Appendix B. Training loss with/without HWD

This section presents the training loss with and without HWD in the ResNet backbones.
slightly faster decrease when utilizing the HWD module in the Unet++ [4] A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep
convolutional neural networks, Adv. Neural Inf. Process. Syst. 25 (2) (2012).
and DeepLabv3+ architectures compared to the original ResNet back
[5] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition.
bone. Specifically, the training loss is consistently lower when employ Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
ing the HWD module compared to using conventional downsampling 2016, pp. 770–778.
operations such as strided convolution and max pooling. [6] F. Cheng, C. Chen, Y. Wang, H. Shi, Y. Cao, D. Tu, C. Zhang, Y. Xu, Learning
directional feature maps for cardiac MRI segmentation. International Conference
on Medical Image Computing and Computer-Assisted Intervention, Springer, 2020,
Appendix C. Replace lossless feature encoding block of HWD pp. 108–117.
with max pooling or average pooling [7] T. Cheng, X. Wang, L. Huang, W. Liu, Boundary-preserving mask R-CNN, arXiv e-
prints (2020b).
[8] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical
We also conduct experiments by replacing the lossless feature image segmentation. International Conference on Medical Image Computing and
encoding block (haar wavelet transform) of HWD with max pooling Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[9] J. Zhang, C. Li, S. Kosov, M. Grzegorzek, K. Shirahama, T. Jiang, C. Sun, Z. Li, H. Li,
layer or average pooling, in order to evalidate the performance LCU-Net: a novel low-cost U-Net for environmental microorganism image
improvement of HWD not only by increasing the representation feature segmentation, Pattern Recognit. 115 (2021) 107885.
learning block of HWD. Table C.1 shows the performance of mean IoU of [10] Q. Zhou, X. Wu, S. Zhang, B. Kang, Z. Ge, L.J. Latecki, Contextual ensemble
network for semantic segmentation, Pattern Recognit. 122 (2022) 108290.
total 21 segmentation models. These models comprise seven semantic [11] A. Chaurasia, E. Culurciello, LinkNet: exploiting encoder representations for
segmentation architectures that employ three different types of ResNet efficient semantic segmentation. 2017 IEEE Visual Communications and Image
as their backbones. It can be observed that the Haar wavelet transform Processing (VCIP), IEEE, 2017, pp. 1–4.
[12] G. Lin, A. Milan, C. Shen, I. Reid, RefineNet: multi-path refinement networks for
plays a more significant role compared to max pooling and average high-resolution semantic segmentation. Proceedings of the IEEE Conference on
pooling, resulting in improvements of 2.81% and 1.74% in mean IoU Computer Vision and Pattern Recognition, 2017, pp. 1925–1934.
across all 21 segmentation models. [13] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with
atrous separable convolution for semantic image segmentation. Proceedings of the
European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
Appendix D. MobileNetv2 as backbone under seven [14] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network. Proceedings
segmentation architectures of the IEEE Conference on Computer Vision and Pattern Recognition, 2017,
pp. 2881–2890.
[15] N. Mu, H. Wang, Y. Zhang, J. Jiang, J. Tang, Progressive global perception and
This section contains detailed results for CamVid dataset with local polishing network for lung infection segmentation of COVID-19 CT images,
Mobilenetv2 [65] as backbone under seven semantic segmentation Pattern Recognit. 120 (2021) 108168.
frameworks. MobileNetv2 was originally developed for mobile devices, [16] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, N. Sang, BiSeNet: bilateral segmentation
network for real-time semantic segmentation. Proceedings of the European
striking a balance between accuracy, latency, and parameter count for Conference on Computer Vision (ECCV), 2018, pp. 325–341.
classification tasks. There are 5 times downsampling operations in the [17] H. Zhao, X. Qi, X. Shen, J. Shi, J. Jia, ICNet for real-time semantic segmentation on
MobileNetv2. The fisrt downsampling (strided convolution) happens in high-resolution images. Proceedings of the European Conference on Computer
Vision (ECCV), 2018, pp. 405–420.
the stem of MobileNetv2, and the others are in the bottleneck of it. Here, [18] G. Xu, H. Cao, J.K. Udupa, Y. Tong, D.A. Torigian, DiSegNet: a deep dilated
we only use HWD to replace the first strided convolution operation by convolutional encoder-decoder architecture for lymph node segmentation on PET/
the considering of the accuracy and efficiency. CT images, Comput. Med. Imaging Graph. 88 (2021) 101851.
[19] S. Hu, F. Bonardi, S. Bouchafa, D. Sidibé, Multi-modal unsupervised domain
Tables D.1 and D.2 show the segmentation results on seven seg adaptation for semantic image segmentation, Pattern Recognit. (2023) 109299.
mentation architectures with and without HWD in MobileNetv2 on [20] H. Zhou, L. Qi, H. Huang, X. Yang, Z. Wan, X. Wen, CANet: co-attention network
CamVid testset in terms of mean IoU. We can find that MobileNetv2 for RGB-D semantic segmentation, Pattern Recognit. 124 (2022) 108468.
[21] W. Wu, T. Chu, Q. Liu, Complementarity-aware cross-modal feature fusion network
equipped with HWD could achieve the performance improvement of for RGB-T semantic segmentation, Pattern Recognit. 131 (2022) 108881.
0.8% mIoU on seven segmentation architectures, demonstrating our [22] R. Karthik, R. Menaka, M. Hariharan, D. Won, Contour-enhanced attention CNN for
method could be adapted in other backbones too. Specifically, our CT-based COVID-19 segmentation, Pattern Recognit. 125 (2022) 108538.
[23] F.Z. Xing, E. Cambria, W.-B. Huang, Y. Xu, Weakly supervised semantic
method could improve the performance of small-scale objects in a large
segmentation with superpixel embedding. 2016 IEEE International Conference on
margin, such as pole (1.05%), sign symbol (1.03%), fence (1.36%) and Image Processing (ICIP), IEEE, 2016, pp. 1269–1273.
bicycle (1.86%). [24] J. Shen, N. Robertson, BBAS: towards large scale effective ensemble adversarial
attacks against deep neural network learning, Inf. Sci. (Ny) 569 (2021) 469–478.
[25] L. Wang, X. Qian, Y. Zhang, J. Shen, X. Cao, Enhancing sketch-based image
Appendix E. Performance with the SOTA architectures based retrieval by CNN semantic re-ranking, IEEE Trans. Cybern. 50 (7) (2019)
convolution and transformer on synapse dataset 3330–3342.
[26] V.M. Vargas, P.A. Gutiérrez, C. Hervás-Martínez, Unimodal regularisation based on
beta distribution for deep ordinal regression, Pattern Recognit. 122 (2022) 108310.
In this section, we test six SOTA segmentation methods on Synapse [27] O. Oktay, J. Schlemper, L.L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S.
dataset. They are HRNet [66], ConvNeXt [67], UNext [68], TransUNet McDonagh, N.Y. Hammerla, B. Kainz, et al., Attention U-Net: learning where to
[60], Swin-Unet [69], and SegFormer [70], respectively. Note that the look for the pancreas, arXiv preprint arXiv:1804.03999 (2018).
[28] C. Li, B. Wang, S. Zhang, Y. Liu, R. Song, J. Cheng, X. Chen, Emotion recognition
HRNet and ConvNeXt are pure convolution neural network, the UNext from eeg based on multi-task learning with capsule network and attention
and TransUNet are hybrid with convolution and transformer operations. mechanism, Comput. Biol. Med. 143 (2022) 105303.
Besides, the Swin-Unet and SegFore are based on transformer. We tested [29] Y. Yuan, J. Xie, X. Chen, J. Wang, SegFix: model-agnostic boundary refinement for
segmentation. European Conference on Computer Vision, Springer, 2020,
these methods on Synapse dataset, and the comparative results are pp. 489–506.
shown in Table E.1. Here, we simply replace the downsampling opera [30] R.N. Bracewell, R.N. Bracewell, The Fourier Transform and its Applications Vol.
tion, like maxpooling or convolution stride, with our proposed HWD to 31999, McGraw-hill New York, 1986.
[31] R.S. Stanković, B.J. Falkowski, The haar wavelet transform: its status and
reduce the resolution of feature maps on each architecture. achievements, Comput. Electr. Eng. 29 (1) (2003) 25–44.
References

[1] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[2] Z. Lv, L. Qiao, S. Yang, J. Li, H. Lv, F. Piccialli, Memory-augmented neural networks based dynamic complex image segmentation in digital twins for self-driving vehicle, Pattern Recognit. 132 (2022) 108956.
[3] J. Wu, H. Xu, S. Zhang, X. Li, J. Chen, J. Zheng, Y. Gao, Y. Tian, Y. Liang, R. Ji, Joint segmentation and detection of COVID-19 via a sequential region generation network, Pattern Recognit. 118 (2021) 108006.
[19] S. Hu, F. Bonardi, S. Bouchafa, D. Sidibé, Multi-modal unsupervised domain adaptation for semantic image segmentation, Pattern Recognit. (2023) 109299.
[20] H. Zhou, L. Qi, H. Huang, X. Yang, Z. Wan, X. Wen, CANet: co-attention network for RGB-D semantic segmentation, Pattern Recognit. 124 (2022) 108468.
[21] W. Wu, T. Chu, Q. Liu, Complementarity-aware cross-modal feature fusion network for RGB-T semantic segmentation, Pattern Recognit. 131 (2022) 108881.
[22] R. Karthik, R. Menaka, M. Hariharan, D. Won, Contour-enhanced attention CNN for CT-based COVID-19 segmentation, Pattern Recognit. 125 (2022) 108538.
[23] F.Z. Xing, E. Cambria, W.-B. Huang, Y. Xu, Weakly supervised semantic segmentation with superpixel embedding. 2016 IEEE International Conference on Image Processing (ICIP), IEEE, 2016, pp. 1269–1273.
[24] J. Shen, N. Robertson, BBAS: towards large scale effective ensemble adversarial attacks against deep neural network learning, Inf. Sci. (Ny) 569 (2021) 469–478.
[25] L. Wang, X. Qian, Y. Zhang, J. Shen, X. Cao, Enhancing sketch-based image retrieval by CNN semantic re-ranking, IEEE Trans. Cybern. 50 (7) (2019) 3330–3342.
[26] V.M. Vargas, P.A. Gutiérrez, C. Hervás-Martínez, Unimodal regularisation based on beta distribution for deep ordinal regression, Pattern Recognit. 122 (2022) 108310.
[27] O. Oktay, J. Schlemper, L.L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N.Y. Hammerla, B. Kainz, et al., Attention U-Net: learning where to look for the pancreas, arXiv preprint arXiv:1804.03999 (2018).
[28] C. Li, B. Wang, S. Zhang, Y. Liu, R. Song, J. Cheng, X. Chen, Emotion recognition from EEG based on multi-task learning with capsule network and attention mechanism, Comput. Biol. Med. 143 (2022) 105303.
[29] Y. Yuan, J. Xie, X. Chen, J. Wang, SegFix: model-agnostic boundary refinement for segmentation. European Conference on Computer Vision, Springer, 2020, pp. 489–506.
[30] R.N. Bracewell, The Fourier Transform and its Applications, McGraw-Hill, New York, 1986.
[31] R.S. Stanković, B.J. Falkowski, The Haar wavelet transform: its status and achievements, Comput. Electr. Eng. 29 (1) (2003) 25–44.
[32] C.H. Ma, Y. Li, Y. Wang, Image analysis based on the Haar wavelet transform. Applied Mechanics and Materials Vol. 391, Trans Tech Publ, 2013, pp. 564–567.
[33] A. Belov, Comparison of the efficiencies of image compression algorithms based on separable and nonseparable two-dimensional Haar wavelet bases, Pattern Recognit. Image Anal. 18 (4) (2008) 602–605.
[34] F. Luisier, C. Vonesch, T. Blu, M. Unser, Fast Haar-wavelet denoising of multidimensional fluorescence microscopy data. 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, IEEE, 2009, pp. 310–313.
[35] R. Duits, M. Felsberg, G. Granlund, B.t.H. Romeny, Image analysis and reconstruction using a wavelet transform constructed from a reducible representation of the Euclidean motion group, Int. J. Comput. Vis. 72 (1) (2007) 79–102.
[36] J. Liang, Z. Shi, D. Li, M.J. Wierman, Information entropy, rough entropy and knowledge granulation in incomplete information systems, Int. J. Gen. Syst. 35 (6) (2006) 641–654.
[37] A. Namdari, Z. Li, A review of entropy measures for uncertainty quantification of stochastic processes, Adv. Mech. Eng. 11 (6) (2019) 1687814019857350.
[38] A. Kendall, Y. Gal, What uncertainties do we need in Bayesian deep learning for computer vision? Adv. Neural Inf. Process. Syst. 30 (2017).
[39] A.K. Balan, L. Boyles, M. Welling, J. Kim, H. Park, Statistical optimization of non-negative matrix factorization. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2011, pp. 128–136.
[40] S. Fujieda, K. Takayama, T. Hachisuka, Wavelet convolutional neural networks, arXiv preprint arXiv:1805.08620 (2018).
[41] P. Liu, H. Zhang, K. Zhang, L. Lin, W. Zuo, Multi-level wavelet-CNN for image restoration. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 773–782.
[42] T. Wu, W. Li, S. Jia, Y. Dong, T. Zeng, Deep multi-level wavelet-CNN denoiser prior for restoring blurred image with Cauchy noise, IEEE Signal Process. Lett. 27 (2020) 1635–1639.
[43] H. Huang, R. He, Z. Sun, T. Tan, Wavelet-SRNet: a wavelet-based CNN for multi-scale face super resolution. Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1689–1697.
[44] H. Ma, D. Liu, R. Xiong, F. Wu, iWave: CNN-based wavelet-like transform for image compression, IEEE Trans. Multimed. 22 (7) (2019) 1667–1679.
[45] H. Ling, J. Wu, J. Huang, J. Chen, P. Li, Attention-based convolutional neural network for deep face recognition, Multimed. Tools Appl. 79 (2020) 5595–5616.
[46] M.D. Zeiler, R. Fergus, Stochastic pooling for regularization of deep convolutional neural networks (2013).
[47] A. Stergiou, R. Poppe, G. Kalliatakis, Refining activation downsampling with SoftPool. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10357–10366.
[48] T. Williams, R. Li, Wavelet pooling for convolutional neural networks. International Conference on Learning Representations, 2018.
[49] D.-H. Jang, S. Chu, J. Kim, B. Han, Pooling revisited: your receptive field is suboptimal. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 549–558.
[50] D. Marin, Z. He, P. Vajda, P. Chatterjee, S. Tsai, F. Yang, Y. Boykov, Efficient segmentation: learning downsampling near semantic boundaries. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2131–2141.
[51] L.-H. Chen, C.G. Bampis, Z. Li, C. Chen, A.C. Bovik, Convolutional block design for learned fractional downsampling. 2022 56th Asilomar Conference on Signals, Systems, and Computers, IEEE, 2022, pp. 640–644.
[52] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612.
[53] Z. Wang, E.P. Simoncelli, A.C. Bovik, Multiscale structural similarity for image quality assessment. The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Vol. 2, IEEE, 2003, pp. 1398–1402.
[54] Z. Wang, Q. Li, Information content weighting for perceptual image quality assessment, IEEE Trans. Image Process. 20 (5) (2010) 1185–1198.
[55] A. Liu, W. Lin, M. Narwaria, Image quality assessment based on gradient similarity, IEEE Trans. Image Process. 21 (4) (2011) 1500–1512.
[56] L. Zhang, L. Zhang, X. Mou, D. Zhang, FSIM: a feature similarity index for image quality assessment, IEEE Trans. Image Process. 20 (8) (2011) 2378–2386.
[57] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training GANs, Adv. Neural Inf. Process. Syst. 29 (2016).
[58] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, Adv. Neural Inf. Process. Syst. 30 (2017).
[59] V. Badrinarayanan, A. Kendall, R. Cipolla, SegNet: a deep convolutional encoder-decoder architecture for image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 39 (12) (2017) 2481–2495.
[60] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A.L. Yuille, Y. Zhou, TransUNet: transformers make strong encoders for medical image segmentation, arXiv preprint arXiv:2102.04306 (2021).
[61] A. Li, H. Gong, B. Zhang, Q. Wang, C. Yan, J. Wu, Q. Liu, S. Zeng, Q. Luo, Micro-optical sectioning tomography to obtain a high-resolution atlas of the mouse brain, Science 330 (6009) (2010) 1404–1408.
[62] G. Xu, X. Wu, X. Zhang, W. Liao, S. Chen, LGNet: local and global representation learning for fast biomedical image segmentation, J. Innov. Opt. Health Sci. (2022).
[63] G. Xu, X. Wu, X. Zhang, X. He, LeViT-UNet: make faster encoders with transformer for medical image segmentation, arXiv preprint arXiv:2107.08623 (2021).
[64] X. Li, M. He, H. Li, H. Shen, A combined loss-based multiscale fully convolutional network for high-resolution remote sensing image change detection, IEEE Geosci. Remote Sens. Lett. 19 (2021) 1–5.
[65] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861 (2017).
[66] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al., Deep high-resolution representation learning for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 43 (10) (2020) 3349–3364.
[67] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A ConvNet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11976–11986.
[68] J.M.J. Valanarasu, V.M. Patel, UNeXt: MLP-based rapid medical image segmentation network. Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part V, Springer, 2022, pp. 23–33.
[69] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, M. Wang, Swin-UNet: UNet-like pure transformer for medical image segmentation. European Conference on Computer Vision, Springer, 2022, pp. 205–218.
[70] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J.M. Alvarez, P. Luo, SegFormer: simple and efficient design for semantic segmentation with transformers, Adv. Neural Inf. Process. Syst. 34 (2021) 12077–12090.

Guoping Xu received his PhD in Communication and System from the School of Electronic Information and Communications, Huazhong University of Science and Technology. As a lecturer at the School of Computer Science and Engineering, Wuhan Institute of Technology, his research interests include medical image analysis and computer vision.

Wentao Liao is a master's student in Computer Application Technology at Wuhan Institute of Technology. His main research interests are computer vision and medical image processing.

Xuan Zhang received her BS degree in computer science from Wuhan Institute of Technology. Her research interests are deep learning and computer vision.

Chang Li received the PhD degree in circuits and systems from the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan. He is currently an associate professor with the Department of Biomedical Engineering, Hefei University of Technology, Hefei, China. His research interests include biomedical signal processing, hyperspectral image analysis, computer vision, and machine learning.

Xinwei He received his PhD in Communication and System from the School of Electronic Information and Communications, Huazhong University of Science and Technology. His research interests include computer vision and machine learning.

Xinglong Wu received his PhD from the University of Miami and was a postdoctoral researcher at the Center for Computational Science at the University of Miami for three years. As an Associate Professor at the School of Computer Science and Engineering, Wuhan Institute of Technology, China, he is actively involved in machine learning/deep learning and biomedical image analysis.