0% found this document useful (0 votes)

19 views13 pages

Information Fusion: D. Sun, F. Dornaika

Uploaded by

keerthana

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

19 views13 pages

Information Fusion: D. Sun, F. Dornaika

Uploaded by

keerthana

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

Information Fusion 107 (2024) 102308

Contents lists available at ScienceDirect

Information Fusion
journal homepage: www.elsevier.com/locate/inffus

Full length article

Data augmentation for deep visual recognition using superpixel based

pairwise image fusion
D. Sun a , F. Dornaika a,b,c ,∗
a University of the Basque Country UPV/EHU, San Sebastian, Spain
b
IKERBASQUE, Basque Foundation for Science, Bilbao, Spain
c
Ho Chi Minh City Open University, Ho Chi Minh City, Viet Nam

ARTICLE INFO ABSTRACT

Keywords: Data augmentation is an important paradigm for boosting the generalization capability of deep learning in
Superpixel image classification tasks. Image augmentation using cut-and-paste strategies has shown very good performance
Image fusion improvement for deep learning. However, these existing methods often overlook the image’s discriminative
Data augmentation
local context and rely on ad hoc regions consisting of square or rectangular local regions, leading to
Weighted contrastive learning loss
the loss of complete semantic object parts. In this work, we attempt to overcome these limitations and
Local context
propose a superpixel-wise local-context-aware efficient image fusion approach for data augmentation. Our
approach requires only one forward propagation using a superpixel attention-based label fusion with less
computational complexity. The model is trained using a combination of a global classification of the fused
(augmented) image loss, a superpixel-wise weighted local classification loss, and a superpixel-based weighted
contrastive learning loss. The last two losses are based on the superpixel-aware attentive embeddings. Thus,
the resulting deep encoder can learn both local and global features of the images while capturing object-
part local context and information. Experiments on diverse benchmark image datasets indicate that the
proposed method out-performs many region-based augmentation methods for visual recognition. We have
demonstrated its effectiveness not only on CNN models but also on transformer models. The codes are accessible
at https://github.com/DanielaPlusPlus/SAFuse.

1. Introduction suffers from poor interpretability. CutMix [10] first proposes data aug-
mentation with the cut-and-paste technique based on pairwise images,
Deep learning has advanced image classification [1,2], image seg- which can provide more information by the fusion within two images.
mentation [3,4], and object detection [5,6] by extracting information Nevertheless, there are three drawbacks to existing data augmentation
from data effectively. As the quantity of data grows, deep learning methods with cutmix strategy. (I) Most methods only utilize the global
gains more prominence, particularly with Vision Transformers [7,8]. semantics along with the image-level constraints and overlook the local
However, the cost and impracticality of manual data annotation present context constraints. (II) Existing methods perform cutting and pasting
ongoing challenges. Overfitting can occur in supervised deep learning with square patches, leading to incomplete object-part information.
when there is insufficient data, resulting in limited performance. (III) Fused labels should be consistent with the fused images. Other-
Data augmentation is frequently used to prevent overfitting [13]. In
wise, a mismatch problem between the fused augmented image and
this paper, we investigate data augmentation from the perspective of
its fused label occurs. Some existing methods address the mismatch
image fusion. Traditional data augmentation methods operate on a sin-
problem by object centering with forward propagation twice, which
gle image, which applies various transformations to the original data,
is computation-consuming, and may compromise the diversification of
such as rotating, flipping or cropping. CutOut [14] randomly masks
data augmentation.
a square region with zero. Random Erasing [15] randomly masks a
square region with a random value. However, the supplementary infor- To mitigate the above shortcomings, we propose SAFuse, an effi-
mation provided by traditional data augmentation through operations cient Superpixel Attentive image Fusion approach for data augmenta-
within a single image remains restricted. Mixup [16] proposes pixel- tion and a framework for training a strong classifier. We aim to enhance
by-pixel image fusion between two images for data augmentation but feature representation through image fusion.

∗ Corresponding author at: University of the Basque Country UPV/EHU, San Sebastian, Spain.
E-mail address: [email protected] (F. Dornaika).

https://doi.org/10.1016/j.inffus.2024.102308
Received 31 July 2023; Received in revised form 25 January 2024; Accepted 15 February 2024
Available online 16 February 2024
1566-2535/© 2024 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-
nc-nd/4.0/).
D. Sun and F. Dornaika Information Fusion 107 (2024) 102308

Fig. 1. Comparison of label fusion methods and augmented images. (a) AdvMix [9] generates augmented samples by randomly removing square regions from sensitive points in
an image, leading to a loss of contour information. (b) CutMix [10] randomly fuses the images in square form with area-based label fusion, resulting in discrepancies between the
fused image and its label, as well as loss of contour. (c) Attentive CutMix [11] fuses the discriminative square regions with a pretrained external object centering network, which
is not efficient and loses contour information. (d) OcCaMix [12] preserves the contour information by fusing discriminative superpixel regions, but needs forward propagation
twice for object centering, which is inefficient. (e) SAFuse generates object-part-aware augmented images using superpixel attention-based label fusion after only one forward
propagation, ensuring both diversification and efficiency.

Our proposed SAFuse facilitates the extraction of more object-part- The rest of the paper is structured as follows: we give a brief
aware and context-aware information in an efficient way. First, We fuse overview of related work on data augmentation and the use of superpix-
the images randomly by cutting and pasting superpixels to generate els in deep learning in Section 2. We elaborate on the proposed SAFuse
augmented images with diversification, which is object-part-aware. method in detail in Section 3. Section 4 compares the classification ac-
Second, We fuse the labels with superpixel attention to keep the seman- curacy of various data augmentation methods on different models and
tic consistency of the augmented images and labels with only a single image benchmark datasets. In Section 5, ablation studies are conducted
forward propagation, which is efficient. Third, We fuse the high-level on SAFuse. Finally, our work is concluded in Section 6.
feature and low-level feature for better feature representation. We se-
lect discriminative superpixel features for weighted local classification 2. Related work
and weighted contrastive learning, which are context-aware.
Fig. 1 visually compares various representative data augmentation 2.1. Data augmentation
methods. The target image is denoted by 𝑥1 , and the source image by
𝑥2 . Their corresponding labels are represented by 𝑦1 and 𝑦2 , respec- Traditional data augmentation is based on a single image and
tively. AdvMix [9], which removes random square regions in sensitive applies various transformations to the original data, such as rotating,
points within a single image, is demonstrated in Fig. 1(a). Fig. 1(b) flipping, or cropping. CutOut [14] randomly removes a square region
displays an augmented image of CutMix [10], which fuses the target of an image. AutoAugment [17], Fast AutoAugment [18], Random
image and source image with a non-semantic square region. In Fig. 1(c), Augment [17], and Trivial Augment [19] are the automatic augmen-
the augmented image of Attentive CutMix [11] shows the fusion of tation methods that jointly explore the augmentation spaces of multi-
two images using attentional square patches guided by an additional augmentation strategies to achieve optimal performance. The auto-
pre-trained network. Fig. 1(d) corresponds to OcCaMix [12], which matic augmentation methods usually have to make trade-offs between
fuses the two images with discriminative superpixels. Fig. 1(b)(c)(d) complexity, cost, and performance. The aforementioned methods that
depict augmentation methods that use area-based label fusion. Our augment samples based on a single image usually suffer from the in-
approach, illustrated in Fig. 1(e), produces the augmented image by sufficient information provided by the augmented images. Mixup [16]
pairwise fusion of random superpixels, and generates the fused label and CutMix [10] can provide more augmented data information from
using superpixel attention. As a result, our approach requires only pairwise image fusion. However, Mixup [20] fuses two images pixel
one single forward propagation making it more efficient compared to to pixel, making it difficult to interpret. CutMix [10] randomly fuses
the methods depicted in Fig. 1(a)(c)(d). Additionally, we are able to two images with a square region, and fuses the pairwise image labels
maintain more complete object-part information when compared to the with the proportion of area. CutMix can cause a mismatch between
methods depicted in Fig. 1(a)(b)(c). the augmented image and its fused label when the fusion region is
Our main contributions are as follows: the background, instead of the object. Additionally, the real object-part
information can be lost when the image fusion regions are square. Oc-
• We discuss the potential shortcomings of existing cutmix-based CaMix [12], Attentive CutMix [11], PuzzleMix [21], SaliencyMix [22],
data augmentation methods from the viewpoint of image fusion. and AutoMix [20] have proposed solutions to overcome the label and
• We introduce a novel data augmentation method that employs image mismatch problem by selecting regions guided by saliency or
superpixel fusion for the augmented image, and for the first time, attention. But they either require double forward propagations or an
we put forward superpixel attention-based label fusion, which is extra network, leading to inefficiency. ResizeMix [23], PatchUp [24],
object-part-aware and efficient. GridMix [25] and Random SuperpixelGridMix [26] conduct data aug-
• We propose a pioneering training framework for a strong clas- mentation totally randomly with area-based label fusion, which also
sifier, incorporating feature fusion and sparse superpixel feature potentially lead to discrepancies of the augmented image and the fused
constraints. To the best of our knowledge, it is the first time labels. Saliency Grafting [27] generates mixed labels through saliency-
weighted superpixel-wise contrastive loss and weighted local su- based semantic label fusion. However, it grafts square regions, losing
perpixel classification loss are proposed, which is context-aware. object-part information. Our proposed SAFuse fuses the images using
• We present extensive evaluations on various benchmarks and randomly selected superpixels for the largest diversification. We fuse
backbones, which provide evidence of SAFuse’s superiority. the labels using superpixel attention semantics with a single forward

2
D. Sun and F. Dornaika Information Fusion 107 (2024) 102308

Fig. 2. The overall framework of SAFuse. Image Fusion (Section 3.3) produces the augmented object-part-aware sample and its corresponding superpixel map from two training
images. Feature Fusion (Section 3.3) concatenates the high-level and low-level feature vectors after GAP (Global Average Pooling). Superpixel pooling and Self-Attention module
(Section 3.4) aggregates feature 𝐙̂ into superpixel vectors and learns the contextual information through self-attention module. 𝜆𝑎𝑡𝑡 is used for attention-based label fusion
(Section 3.3). Then we perform global classification (Section 3.5) with the fused feature and the fused label, conduct weighted local superpixel classification (Section 3.6)
on the selected top discriminative superpixel vectors, and execute weighted superpixel contrastive learning (Section 3.7) on the discriminative superpixel vectors selected across
images.

propagation. Our method is object-part-aware and efficient. Further- 3. Proposed approach

more, we propose a framework of superpixel-based weighted local
classification and weighted contrastive learning alongside global classi- 3.1. Overview
fication to capture both the self and related contextual information, to
extract both the global and local features, which is context-aware and
The proposed SAFuse is illustrated in Fig. 2. The main steps of the
effective.
proposed fusion and training are shown in Algorithm 1. Two random
2.2. The use of superpixels in deep learning images in the current training batch are fused using their superpixels
(more details are provided in Section 3.3). Image Fusion aims to create
Superpixels are clusters of pixels that share similar visual proper- augmented images and the corresponding fused superpixel maps. 𝐙
ties. Compared to a single pixel, superpixels contain more semantic represents the encoded high-level feature, while 𝐙̂ denotes the decoded
information. Superpixels fit the image boundaries better than square low-level feature. After global average pooling (GAP), we separately
or rectangular image patches, thus preserving local object-part infor- obtain the high-level feature vector 𝐞𝐡𝐢𝐠𝐡 and the low-level feature
mation better. Superpixels have been extensively researched and used vector 𝐞𝐥𝐨𝐰 . 𝐞𝐡𝐢𝐠𝐡 and 𝐞𝐥𝐨𝐰 are concatenated in Feature Fusion to extract
in vision tasks. Here we focus on work that uses precomputed superpix- comprehensive features for global classification. Simultaneously, the
els [28] in Deep Learning. Supercnn [29] uses superpixels to convert decoded feature 𝐙̂ inputs the Superpixel Pooling and Self-Attention
2D image patterns into sequential 1D representations that enable a module to generate the sequence of attentive superpixel vectors based
DNN to efficiently explore the long-range context for saliency detection.
on the fused superpixel map of the augmented sample. The proportion
Several studies [4,30,31] use superpixels to summarize features for
of attention weights of superpixel vectors, denoted 𝜆𝑎𝑡𝑡 , serves as the
weakly supervised semantic segmentation or semantic segmentation.
fusion coefficient in label fusion for global classification. We select
The works described in [32,33] present superpixel-based feature ex-
the most discriminative superpixel vectors with the largest semantic
traction and fusion for hyperspectral image classification. Random
SuperpixelMix [26] and OcCaMix [12] use superpixel grids for cutmix- attention weights to use in weighted local superpixel classification and
based data augmentation, which are similar to our method. However, weighted contrastive superpixel learning.
our method solves the mismatch issue between the fused image and Next, in Section 3.2, we introduce the background knowledge of
its fused label that exists with Random SuperpixelMix [26]. Our aug- traditional cutmix-based data augmentation methods. In Section 3.3,
mentation method does not require object centering and requires single we elaborate on image fusion, feature fusion and label fusion. We
forward propagation with more efficiency than OcCaMix [12]. present the process and effects of superpixel pooling and self-attention

3
D. Sun and F. Dornaika Information Fusion 107 (2024) 102308

in Section 3.4. We separately explain the global classification in Sec- Algorithm 1: SAFuse
tion 3.5, weighted local superpixel classification in Section 3.6 and Input : Batch of 𝐵 images 𝐱 = [𝑥1 , 𝑥2 , ..., 𝑥𝐵 ] with image size
weighted superpixel contrastive learning in Section 3.7. Finally we 𝑊 × 𝐻, One-hot labels corresponding to the batch of
summarize the training and inference in Section 3.8. images 𝐲; Minimum and maximum number of
superpixels 𝑞𝑚𝑖𝑛 , 𝑞𝑚𝑎𝑥 ; Superpixel selection probability
3.2. Background 𝑝
Output: Global classification loss 𝑔𝑙𝑜𝑏𝑎𝑙 , Weighted local
Derived from CutMix [10], selected local regions are typically cut superpixel classification loss 𝑙𝑜𝑐𝑎𝑙 , Weighted
out of one image and pasted into another image in cutmix-based data superpixel contrastive loss 𝑐𝑜𝑛𝑡𝑟𝑎𝑠𝑡 , Total loss 𝑡𝑜𝑡𝑎𝑙
augmentation, which is image fusion in Eq. (1). Labels of the two
images are fused with a certain proportion for the label corresponding 1 𝐼𝑑𝑥1 = [1, 2, 3, ..., 𝐵 − 1, 𝐵], 𝐼𝑑𝑥2 = shuffle(𝐼𝑑𝑥1 )
to the augmented images, which is label fusion in Eq. (2). 2 for 𝑚 ← 1 to 𝐵 do
3 𝑥1 = 𝐱[𝑚], 𝑦1 = 𝐲[𝑚]
𝑥𝑚𝑖𝑥 = (𝟏 − 𝐌) ⊙ 𝑥1 + 𝐌 ⊙ 𝑥2 (1) 4 𝑥2 = 𝐱[𝐼𝑑𝑥2 ][𝑚], 𝑦2 = 𝐲[𝐼𝑑𝑥2 ][𝑚]
/* Randomly choose the pairwise images and
𝑦𝑚𝑖𝑥 = (1 − 𝜆) 𝑦1 + 𝜆 𝑦2 (2) corresponding labels */
5 𝑞1 ∼ 𝑈 (𝑞𝑚𝑖𝑛 , 𝑞𝑚𝑎𝑥 ), 𝑞2 ∼ 𝑈 (𝑞𝑚𝑖𝑛 , 𝑞𝑚𝑎𝑥 )
where 𝑥 ∈ R𝑊 ×𝐻×𝐶 denotes any training sample and 𝑦 is its annotated 6 Superpixel map 𝐒𝟏 ← Superpixel algorithm(𝑥1 ,𝑞1 )
label. 𝐻, 𝑊 and 𝐶 are the height, width, channel number of the 7 Superpixel map 𝐒𝟐 ← Superpixel algorithm(𝑥2 ,𝑞2 )
image separately. A new augmented training image (𝑥𝑚𝑖𝑥 , 𝑦𝑚𝑖𝑥 ) is gen- 8 𝑃 {𝑋 = 𝑘} = 𝑝𝑘 (1 − 𝑝)1−𝑘 , 𝑋 ∼ 𝐵(1, 𝑝), 𝑘 = 0, 1
erated from two random distinct training images (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ). 𝐌 ∈ 9 𝐌 ← 𝐼𝑛𝑑(𝑆𝑒𝑙𝑒𝑐𝑡(𝐒𝟐 , 𝑋)) /* Select superpixels in 𝐒𝟐
{0, 1}𝑊 ×𝐻 denotes a binary mask whose pixels are from 𝑥2 , 𝟏 denotes a by Bernoulli Distribution for 𝐌 */
mask of value ones, and ⊙ represents multiplication by elements. The 10 Generate 𝐒𝐦𝐢𝐱 and 𝑥𝑚𝑖𝑥 with Eq. (4) and 𝐌 /* Image
traditional cutmix-based method uses area-based proportion for label fusion */
fusion. Usually, the blending parameter 𝜆 is set to the proportion of 11 Encoded feature 𝐙 ∈ R𝑤×ℎ×𝑐 ← 𝜃𝑒𝑛𝑐 (𝑥𝑚𝑖𝑥 )
the pixel number from image 𝑥2 to the total pixel number in image 𝑥1 , 12 Decoded feature 𝐙̂ ∈ R𝑊 ×𝐻×𝐷 ← 𝜃𝑑𝑒𝑐 (𝐙)
which is area-based and described in Eq. (3). 13 Vector sequence 𝐅 ← Average pooling(𝐙) ̂ by 𝐒𝐦𝐢𝐱
∑𝑊 ∑𝐻 14 𝐂 ∈ R𝐿×𝑑 ← self-attention(𝐅)
𝑖=1 𝑗=1 𝑀𝑖𝑗
𝜆𝑎𝑟𝑒𝑎 = (3) 15 𝐰𝐦 = {𝑤1 , 𝑤2 , ..., 𝑤𝐿 } ← Sigmoid (𝐂.𝑠𝑢𝑚(𝑑𝑖𝑚 = 1))
𝑊 ×𝐻
16 Calculate 𝜆𝑎𝑡𝑡𝑚 with Eq. (5) and 𝐰
It is worthy of note that the pixels in the background contribute less
17 Generate 𝑦𝑚𝑖𝑥𝑚 with Eq. (2) and 𝜆𝑎𝑡𝑡 /* Label Fusion */
to the semantic label compared to those in the object regions. Conse-
18 𝐜𝑠𝑚 ∈ R𝑁×𝑑 ← top-N(𝐂), 𝑁 = 𝑖𝑛𝑡(𝐿 × 𝑡)
quently, conventional cutmix-based data augmentation with area-based
label fusion often faces mismatch issues between the fused augmented 19 High-level feature vector 𝐞𝐡𝐢𝐠𝐡 ← Global Average Pooling(𝐙)
20 ̂
Low-level feature vector 𝐞𝐥𝐨𝐰 ← Global Average Pooling(𝐙)
image and its fused label, as illustrated in Fig. 1(b). Numerous existing
methods have addressed the issue of mismatch between the fused aug- 21 Fused feature vector 𝐞𝐦 ← Concatenate(𝐞𝐡𝐢𝐠𝐡 , 𝐞𝐥𝐨𝐰 )
mented image and its fused label by first centering the discriminative /* Feature Fusion */
regions before creating the augmented images (e.g. Fig. 1(c)(d)), which 22 𝐰, 𝜆𝑎𝑡𝑡 , 𝐜𝑠 , 𝐞 ← Record 𝐰𝐦 , 𝜆𝑎𝑡𝑡𝑚 , 𝐜𝑠𝑚 , 𝐞𝐦 in a batch
typically requires forward propagation twice, for object centering and 23 Update 𝑔𝑙𝑜𝑏𝑎𝑙 with Eq. (9) and 𝐞, 𝐲, 𝐲[𝐼𝑑𝑥2 ], 𝜆𝑎𝑡𝑡
training separately. Hence, exiting methods utilizing object centering 24 Update 𝑙𝑜𝑐𝑎𝑙 with Eq. (10) and 𝐜𝑠 , 𝐲, 𝐲[𝐼𝑑𝑥2 ], 𝐰
for the consistency between the fused label and the fused augmented 25 Update 𝑐𝑜𝑛𝑡𝑟𝑎𝑠𝑡 with Eq. (11) and 𝐜𝑠 , 𝐲, 𝐲[𝐼𝑑𝑥2 ], 𝐰
image is inefficient due to increased computational complexity. Worse, 26 Update 𝑡𝑜𝑡𝑎𝑙 with Eq. (12)
object centering leads to a lack of augmentation diversification, which
in turn will harm the performance.

3.3. Image fusion, feature fusion and label fusion Note that fusion with superpixels for the augmented image 𝑥𝑚𝑖𝑥 may
slightly cut off some superpixels in 𝑥1 due to potential overlap of su-
We start with two images from the current training batch in order perpixels from both images. Nevertheless, it is not troublesome because
to create an augmented image to replace the target image. As shown the superpixels of the object parts from image 𝑥2 are fully inserted. And
in Fig. 2 and Algorithm 1, the objective of Image Fusion is to generate the random occlusion and clipping of superpixels in image 𝑥1 increases
the augmented sample 𝑥𝑚𝑖𝑥 and the associated superpixel grid map 𝐒𝐦𝐢𝐱 . the generalization.
Feature Fusion aims to combine the high-level feature vector 𝐞𝐡𝐢𝐠𝐡 and To overcome the limitations of area-based label fusion mentioned
low-level feature vector 𝐞𝐥𝐨𝐰 for a more comprehensive feature vector 𝐞 in Section 3.2, we introduce superpixel attention-based label fusion.
for classification. We randomly select the number of superpixels 𝑞1 , 𝑞2 We mitigate the discrepancy between the fused image and label by
generating the fused label with superpixel attention in label space,
separately from the uniform distribution 𝑈 (𝑞𝑚𝑖𝑛 , 𝑞𝑚𝑎𝑥 ) to achieve greater
rather than by generating the fused image with image centering in
diversification. For the target image 𝑥1 and the source image 𝑥2 , we
image space. Specifically, we augment images by randomly cutting and
obtain the pre-computed associated superpixel maps 𝐒𝟏 and 𝐒𝟐 . Then,
pasting superpixels. We do not require object centering, only one single
superpixels from image 𝑥2 are randomly selected in 𝐒𝟐 by Bernoulli
forward propagation is necessary for training. To ensure consistency
Distribution parameterized by 𝑝 = 0.5 for the largest diversification.
in image and label, we fuse labels with superpixel attention weights
We cut the selected superpixels from image 𝑥2 to paste onto image
{𝑤1 , 𝑤2 , … , 𝑤𝐿 }. 𝐿 presents the total amount of superpixels of the fused
𝑥1 for image augmentation, which is described in Eq. (4) and Line
image 𝑥𝑚𝑖𝑥 . The details for the superpixel attention weights are in Sec-
10 in Algorithm 1. We generate the fused superpixel map 𝐒𝐦𝐢𝐱 for the
tion 3.4 and in Algorithm 1, Line 15. Then superpixel attention-based
augmented image 𝑥𝑚𝑖𝑥 simultaneously. proportion 𝜆𝑎𝑡𝑡 is calculated by the ratio of the superpixel semantics
𝑥𝑚𝑖𝑥 = (𝟏 − 𝐌) ⊙ 𝑥1 + 𝐌 ⊙ 𝑥2 from 𝑥2 to the total superpixel semantics in 𝑥𝑚𝑖𝑥 in Eq. (5). Multi-
(4) plying the number of pixels by the associated attention weight of the
𝐒𝐦𝐢𝐱 = (𝟏 − 𝐌) ⊙ 𝐒𝟏 + 𝐌 ⊙ 𝐒𝟐

4
D. Sun and F. Dornaika Information Fusion 107 (2024) 102308

Fig. 3. (a) Our method boost capturing local context information and complete object-part information; (b) Outline of superpixel pooling and self-attention, with top selection.

superpixel gives the semantics of the superpixels. The description of the final output feature vector 𝐂 = {𝐂𝟏 , 𝐂𝟐 , … ,
∑ 𝐂𝐋 } ∈ R𝐿×𝑑 after layer normalization can be found in Eq. (8).
𝑖∈𝐈𝐱 𝑤𝑖 ⋅ |𝐒𝐦𝐢𝐱 [𝑖]|
𝜆𝑎𝑡𝑡 = ∑𝐿 𝟐 (5) 𝐂 = 𝐿𝑎𝑦𝑒𝑟𝑁𝑜𝑟𝑚(𝐅 + 𝑆𝐴(𝐐, 𝐊, 𝐕)) (8)
𝑗=1 𝑤𝑗 ⋅ |𝐒𝐦𝐢𝐱 [𝑗]|
We capture local contextual information as well as object-part in-
where 𝐈𝐱𝟐 = [𝐼𝑥1 , 𝐼𝑥2 , … , 𝐼𝑥𝑚 ] are the indices of superpixels from 𝑥2 to
2 2 2 formation with this step, as indicated in Fig. 3(a).
𝑥1 , 𝐿 is the total amount of superpixels in the fused image 𝑥𝑚𝑖𝑥 .
Compared to traditional cutmix-based data augmentation meth- 3.4.3. Attention-based superpixel selection
ods, our approach fuses images based on superpixel grids to preserve After applying superpixel pooling and self-attention, attention
complete object-part information. We fuse the high-level features and weights can be obtained for each superpixel based on attentional
low-level features for better feature representation. Moreover, we fuse superpixel vectors (detailed in line 15 in Algorithm 1), which is the
the labels with superpixel attention weights, which alleviates the mis- superpixel weights vector 𝐰. Each weight is the sum of the super-
match issue between the fused image and its fused label with less pixel features followed by the application of the sigmoid function.
computational complexity. All the fusion operations aim at training a First, we compute the proportion 𝜆𝑎𝑡𝑡 from superpixel attention with
strong classifier by data augmentation in Section 3.5. superpixel attention-based label fusion (discussed in Section 3.3). At
the same time, we select the top most discriminative superpixels (as
3.4. Superpixel pooling, self-attention and selection described in line 18 of Algorithm 1) for downstream weighted local
superpixel classification and weighted superpixel contrastive learning.
To further maintain the complete object-part information of su- Superpixel attention-based selection drives the models to focus on the
perpixels and learn the discriminative contextual information, three most discriminative and informative superpixels and reduce noise.
steps are conducted: (i) Superpixel pooling, (ii) Self-attention, and (iii)
Selection, as shown in Fig. 3. 3.5. Global classification

3.4.1. Superpixel pooling We work on image-level classification, so global classification is

Superpixel pooling converts the decoded features of each super- essential, as shown in Fig. 2. Section 3.2 explains the necessity of label
pixel in the map 𝐙̂ (using the spatial footprint of that superpixel) fusion in Eq. (2). Eq. (9) formulates the global classification loss with
into a vector, resulting in feature vectors sequence corresponding to the augmented images.
[ ( )]
the superpixels of the augmented image. Superpixel pooling applies 𝑔𝑙𝑜𝑏𝑎𝑙 = E  𝑓𝑔𝑙𝑜𝑏𝑎𝑙 (𝑒𝑗 ), (1 − 𝜆𝑎𝑡𝑡𝑗 ) 𝑦1𝑗 + 𝜆𝑎𝑡𝑡𝑗 𝑦2𝑗 (9)
average pooling in each superpixel of the fused image, instead of in the
traditional square patches, as indicated in Fig. 3(b). The details can be where  represents the cross entropy loss, 𝑓𝑔𝑙𝑜𝑏𝑎𝑙 represents the fully
seen in line 13 of Algorithm 1. Thus, pooling on superpixels preserves connected layer providing the global projection. 𝑒𝑗 denotes the fused
more contour information and more complete object part information feature vector of the 𝑗th augmented image. 𝑦1𝑗 and 𝑦2𝑗 are the ground-
than traditional square-based average pooling. truth labels of the original images corresponding to the augmented
image. 𝜆𝑎𝑡𝑡 is the calculated superpixel attention-based proportion in
3.4.2. Self-attention representation Eq. (5) for label fusion.
The vector sequence 𝐅 = {𝐅1 , 𝐅2 , … , 𝐅𝐿 } ∈ R𝐿×𝐷 after pooling
from the superpixel sequence inputs the self-attention module for the 3.6. Weighted local superpixel classification
contextual relationships, presented in Eq. (6).
To enhance the model’s local focus on the discriminative superpix-
𝐐 = 𝐅 ⋅ 𝐖𝑞 , 𝐊 = 𝐅 ⋅ 𝐖𝑘 , 𝐕 = 𝐅 ⋅ 𝐖𝑣 (6) els, local superpixel classification constraint is applied. Furthermore,
we apply weighted cross entropy loss to penalize the model for being
here 𝐖𝑞 , 𝐖𝑘 , 𝐖𝑣 ∈ R𝐷×𝑑 are the parameters to learn and 𝑑 is the
more sensitive to local superpixels with higher semantics. As demon-
feature dimension of (𝐐, 𝐊, 𝐕). Finally, Self-attention is formulated as
strated in Fig. 2, we perform weighted local superpixel classification
Eq. (7).
on the selected top attentional vectors of the discriminative superpixels
𝐐 𝐊𝑇 (as described in Section 3.4.3). We normalize the selected superpixel
𝑆𝐴(𝐐, 𝐊, 𝐕) = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥( √ ) 𝐕 (7)
𝑑 attention weights so that they sum up to one in an image. The local

5
D. Sun and F. Dornaika Information Fusion 107 (2024) 102308

classification loss is multiplied by the normalized superpixel attention Table 1

weights. Here, we use a shared FC (fully connected) layer for all the Input size and standard split of datasets.

superpixel vectors. The FC layer used for superpixel-level classification Datasets Number of classes Input size Standard split
differs from the one used for image-level classification. Training set Test set
The weighted local superpixel classification loss is formulated in CIFAR100 100 32 × 32 50,000 10,000
Eq. (10). TinyImageNet 200 64 × 64 100,000 10,000
𝑁𝑚
CUB-200-2011 200 224 × 224 5,994 5771⋆
1 ∑
𝐵 ∑
Stanford Dogs 120 224 × 224 12,000 8580
𝑙𝑜𝑐𝑎𝑙 = ∑𝐵 𝑤̂ 𝑖 (𝑓𝑙𝑜𝑐𝑎𝑙 (𝑐𝑖 ), 𝑦𝑠(𝑖) ) (10) ImageNet1K 1000 224 × 224 1,281,167 50,000
𝑚=1 𝑁𝑚 𝑚=1 𝑖=1

where {𝑤̂ 1 , 𝑤̂ 2 , … , 𝑤̂ 𝑁𝑚 } is the normalized superpixel attention weights.

𝐵 is the number of augmented images in a batch, 𝑚 = 1, 2, 3, … , 𝐵.
𝑁𝑚 denotes the amount of selected superpixels in the 𝑚th image, with 4.1. Datasets, encoders, and competing methods
𝑖 = 1, 2, 3, … , 𝑁𝑚 .  denotes typical cross entropy loss. 𝑐𝑖 presents the
feature vector of the 𝑖𝑡ℎ selected discriminative superpixel. 𝑦𝑠(𝑖) is the SAFuse is evaluated using ImageNet-1K [35], CIFAR100 [1], Tiny-
ground-truth label of the 𝑖𝑡ℎ superpixel, known as 𝑦1 or 𝑦2 . 𝑓𝑙𝑜𝑐𝑎𝑙 is the ImageNet [36], CUB-200-2011 [37], Stanford Dogs [38]. The datasets
FC layer which supplies local projection. are split in the standard manner as depicted in Table 1, with the input
image size and number of classes also detailed. ⋆ means that we remove
3.7. Weighted superpixel contrastive learning the possible overlapped 23 test images mentioned in [39].
The backbone encoders involved are ResNet18, ResNet50 [40],
To achieve better feature representation in both the encoder and ResNeXt50 [41], Wide ResNet [42] in CNN structure and ViT [7],
decoder, we apply supervised contrastive learning [34] on the selected TinyViT [43] in transformer structure. The kernel of the first convolu-
top attentional vectors of the discriminative superpixels (as described
tion layer in ResNet encoders is changed as 3×3 from 7×7 for CIFAR100
in Section 3.4.3). We extend the instance of contrastive learning from
and TinyImageNet. The last residual module stride of ResNet is changed
image-level features to attentional local superpixel features. The core
as 1 from 2 for Stanford Dogs and CUB-200-2011.
idea is to pull together the feature embeddings of intra-class discrimi-
native superpixels and push away the feature embeddings of inter-class
discriminative superpixels. Simultaneously, we propose weighted su- 4.2. Experimental setup
perpixel contrastive learning by multiplying the normalized superpixel
attention weights. In this way, the anchor superpixel with greater The superpixel map is generated with skimage.segmentation.slic,1
semantics will be more focused. Since a large amount of negative which comes from the superpixel algorithm of SLIC [28].
instances is beneficial for contrastive learning, we perform weighted For CIFAR100, the batch size is set as 32 and the initial learning
contrastive learning on the selected discriminative superpixels across rate as 0.02. The base augmentations are random cropping and random
images in a batch, not limited to one image. We normalize the selected horizontal flipping. For TinyImageNet, the batch size is set as 100 and
superpixel attention weights so that they sum up to one in a batch. the initial learning rate as 0.02. The base augmentations are random
The weighted superpixel-wise contrastive loss is formed in Eq. (11). horizontal flipping followed by random clipping. The batch size is fixed
as 8 with the initial learning rate 0.001 for CUB-200-2011. We first
1 ∑ 𝑤̃ 𝑖 ∑
𝑁𝐵
𝑒𝑥𝑝(𝐜𝑖 ⋅ 𝐜𝑗 ∕𝜏) resize the training images in CUB-200-2011 to 256 × 256 and then ran-
𝑐𝑜𝑛𝑡𝑟𝑎𝑠𝑡 = − 𝑙𝑜𝑔 ∑ (11)
𝑁𝐵 𝑖=1 ||𝑃𝑖 || 𝑗∈𝑃 𝑒𝑥𝑝(𝐜𝑖 ⋅ 𝐜𝑗 ∕𝜏) + 𝑘∈𝑁𝑖 𝑒𝑥𝑝(𝐜𝑖 ⋅ 𝐜𝑘 ∕𝜏) domly crop them to 224 × 224, and horizontally flip them. For Stanford
𝑖

where 𝐜𝑖 is the unit-normalized features for superpixel 𝑖 in a batch. Dogs, the batch size is set as 16 and the initial learning rate as 0.01. The
𝑃𝑖 and 𝑁𝑖 are the positive set (intra-class superpixels) and negative base augmentations are random cropping and horizontal flipping. For
set (inter-class superpixels). {𝑤̃ 1 , 𝑤̃ 2 , … , 𝑤̃ 𝑁𝐵 } is the normalized super- ImageNet-1K, we follow the setting of Saliency Grafting [27]. The batch
pixels weights in a batch. 𝑁𝐵 denotes the selected superpixel number size is 256 and the initial learning rate is 0.1. The base augmentation is
across all images during a batch. We fix the temperature 𝜏 as 0.7. random cropping and random horizontal flipping for training images,
and center cropping for the test images. We use the SGD optimizer with
3.8. Training and inference the momentum value 0.9 and weight decay value 0.0005. The baseline
results are trained only with the aforementioned base augmentation.
For training, the objective of the global classification loss in Eq. (9) Following CutMix [10], the proposed augmentation scheme and all the
is to extract the global semantic feature of the training images; the competing methods are combined with the base augmentation with the
objective of the weighted local superpixel classification loss in Eq. (10) probability value 0.5 . We use bold and underlined to mark the best and
is to enhance the focus and sensitivity on the discriminative local second best results.
superpixels; the objective of weighted superpixel contrastive loss in
Eq. (11) is to optimize an embedding representation with enhanced
4.3. Experimental results
intra-class superpixel-wise compactness and inter-class superpixel-wise
separation. The overall training loss is denoted as Eq. (12).
Tables 2–4 illustrate the top-1 classification accuracies with
𝑡𝑜𝑡𝑎𝑙 = 𝑔𝑙𝑜𝑏𝑎𝑙 + 𝛾1 𝑙𝑜𝑐𝑎𝑙 + 𝛾2 𝑐𝑜𝑛𝑡𝑟𝑎𝑠𝑡 (12) ResNet18 and ResNeXt50 as encoders on CIFAR100, TinyImageNet,
CUB-200-2011, respectively. Table 5 presents the top-1 classification
where 𝛾1 > 0, 𝛾2 > 0 are the two coefficients.
accuracy with ResNet50 as the encoder on Stanford Dogs dataset.
The training of the model is carried out using back-propagation.
Table 6 shows the top-1 classification accuracies of CUB-200-2011
The inference is performed only with the model for global classifi-
cation. with TinyViT and ViT as the encoders. Table 7 presents the top-1
classification accuracy with ResNet50 as the encoder on ImageNet-1K
4. Performance evaluation dataset. These tables also illustrate the values of the hyperparameters
used in each method. In our proposed method, we randomly select
SAFuse is evaluated with Top.1 classification accuracy. We first the number of superpixels 𝑞 from a uniform distribution (𝑞𝑚𝑖𝑛 , 𝑞𝑚𝑎𝑥 ). 𝑡
introduce the used datasets and models in Section 4.1 and experimental
details in Section 4.2, then present the results in Section 4.3.
All of the experiments have been implemented in PyTorch. Source 1
https://scikit-image.org/docs/stable/api/skimage.segmentation.html#
code can be found in https://github.com/DanielaPlusPlus/SAFuse. skimage.segmentation.slic.

6
D. Sun and F. Dornaika Information Fusion 107 (2024) 102308

Table 2
Performance on CIFAR100 with ResNet18, ResNeXt50 as encoders.
Method Hyperparameters Top-1 Acc.
R18 RX50
Baseline – 78.58% 80.67%
CutMix [10] 𝛼=1 79.69% 83.23%
Attentive CutMix [11] 𝑁 =3 79.29% 82.51%
SaliencyMix [22] – 79.57% 82.56%
ResizeMix [23] 𝛼 = 0.1, 𝛽 = 0.8 79.71% 82.34%
GridMix [25] 𝑔𝑟𝑖𝑑 = 4 × 4, 𝑝 = 0.8, 𝛾 = 0.15 79.45% 82.47%
Random SuperpixelGridMix [26] 𝑞 = 200, 𝑁 = 50 79.06% 82.22%
Random SuperpixelGridMix [26] 𝑞 = 16, 𝑁 = 3 80.30% 83.25%
OcCaMix† [12] 𝑞 ∼ 𝑈 (15, 50), 𝑁 = 3 81.42% 84.01%
PatchUp (input space) [24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 80.13% 83.46%
PatchUp (hidden space)[24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 80.91% 83.65%
Saliency Grafting [27] 𝛼 = 2, 𝑡𝑒𝑚𝑝𝑒𝑟𝑎𝑡𝑢𝑟𝑒 = 0.2 80.83% 83.10%
AutoMix† [20] 𝛼 = 2, 𝑙 = 3 82.04% 83.64%
SAFuse(Ours) 𝑞 ∼ 𝑈 (25, 30), 𝑡 = 76.4%, 𝛾1 = 0.8, 𝛾2 = 0.08 82.54% 84.33%

Table 3
Performance on TinyImageNet with ResNet18, ResNetXt50 as encoders.
Method Hyperparameters Top-1 Acc.
R18 RX50
Baseline – 61.66% 65.69%
CutMix [10] – 64.35% 66.97%
Attentive CutMix [11] 𝑁 =7 64.01% 66.84%
SaliencyMix [22] – 63.52% 66.52%
ResizeMix [23] 𝛼 = 0.1, 𝛽 = 0.8 64.63% 67.33%
GridMix [25] 𝑔𝑟𝑖𝑑 = 8 × 8, 𝑝 = 0.8, 𝛾 = 0.15 64.79% 67.43%
Random SuperpixelGridMix [26] 𝑞 = 200, 𝑁 = 50 65.59% 69.37%
Random SuperpixelGridMix [26] 𝑞 = 64, 𝑁 = 7 66.46% 71.53%
OcCaMix† [12] 𝑞 ∼ 𝑈 (30, 70), 𝑁 = 7 67.35% 72.23%
PatchUp (input space) [24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 66.14% 70.49%
PatchUp (hidden space) [24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 67.06% 71.51%
Saliency Grafting [27] 𝛼 = 2, 𝑡𝑒𝑚𝑝𝑒𝑟𝑎𝑡𝑢𝑟𝑒 = 0.2 64.96% † 67.83%
AutoMix † [20] 𝛼 = 2, 𝑙 = 3 67.33% 70.72%
SAFuse(Ours) 𝑞 ∼ 𝑈 (20, 35), 𝑡 = 76.4%, 𝛾1 = 0.8, 𝛾2 = 0.08 68.31% 73.12%

Table 4
Performance on CUB-200-2011 with ResNet18 and ResNeXt50 as encoders.
Method Hyperparameters Top-1 Acc.
R18 RX50
Baseline – 75.56% 81.41%
CutMix [10] 𝛼=1 76.90% 82.63%
Attentive CutMix [11] 𝑁 =9 76.73% 82.34%
SaliencyMix [22] – 76.88% 82.81%
ResizeMix [23] 𝛼 = 0.1, 𝛽 = 0.8 76.23% 81.94%
GridMix [25] 𝑔𝑟𝑖𝑑 = 14 × 14, 𝑝 = 0.8, 𝛾 = 0.15 77.13% 82.17%
Random SuperpixelGridMix [26] 𝑞 = 200, 𝑁 = 50 77.58% 83.03%
Random SuperpixelGridMix [26] 𝑞 = 196, 𝑁 = 9 76.98% 82.19%
OcCaMix† [12] 𝑞 ∼ 𝑈 (30, 100), 𝑁 = 9 78.40% 83.69%
PatchUp (Input space) [24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 77.05% 82.66%
PatchUp (Hidden space) [24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 77.96% 83.27%
Saliency Grafting [27] 𝛼 = 2, 𝑡𝑒𝑚𝑝𝑒𝑟𝑎𝑡𝑢𝑟𝑒 = 0.2 77.43% 82.93%
AutoMix [20] 𝛼 = 2, 𝑙 = 3 78.17% 83.52%
SAFuse(Ours) 𝑞 ∼ 𝑈 (30, 40), 𝑡 = 85.4%, 𝛾1 = 0.85, 𝛾2 = 0.15 79.24% 84.61%

denotes the top percentage for the selection of local superpixels. 𝛾1 and Dogs and CUB-200-2011. Table 6 indicates that our SAFuse performs
𝛾2 are two loss coefficients to weight different loss values in Eq. (12). the best not only when the encoder is based on the CNN structure,
We tuned the Attentive CutMix [11] and Rand SuperpixelGridMix [26] but also when the encoder is based on the transformer structure. Our
for better results. The hyperparameters of the other competing methods SAFuse outperforms the baseline by 1.08% with TinyViT as the encoder
are set according to the suggestions in the corresponding paper. All and by 2.03% with ViT-B/16 as the encoder for CUB-200-2011. As can
the experiments on CUB-200-2011 load the models pre-trained on be seen in Table 7, SAFuse consistently shows the best performance on
the ImageNet-1K dataset with the ResNet50 encoder.
ImageNet. † marks the results which are published in the corresponding
Table 8 displays our method’s results compared to representative
paper.
data augmentation methods based on a single image. On the same
Our method outperforms the baseline by 3.96% with ResNet18 as dataset and with the same encoder, our approach outperforms the
the encoder, and by 3.66% with ResNeXt50 as the encoder on CI- second best method AdvMask [9] by 5.17% when ResNet50 is used as
FAR100, as shown in Table 2. In Table 3 on TinyImageNet, our method the encoder and by 1.07% when WRN-28-10 is used as the encoder.
outperforms the second best by 0.96% with ResNet18 as encoder, and These results confirm our hypothesis that data augmentation by image
by 0.89% with ResNeXt50 as encoder. According to Tables 4 and 5, our fusion, especially fusion with additional object-part information, can
SAFuse still outperforms on the fine-grained datasets, such as Stanford significantly improve performance.

7
D. Sun and F. Dornaika Information Fusion 107 (2024) 102308

Table 5
Performance on Stanford Dogs with encoder ResNet50.
Method Hyperparameters Top-1 Acc. with R50
Baseline – 61.46%
CutMix [10] 𝛼=1 63.92%
Attentive CutMix [11] 𝑁 = 12 62.87%
SaliencyMix [22] – 64.28%
ResizeMix [23] 𝛼 = 0.1, 𝛽 = 0.8 64.58%
GridMix [25] 𝑔𝑟𝑖𝑑 = 14 × 14, 𝑝 = 0.8, 𝛾 = 0.15 62.55%
Random SuperpixelGridMix [26] 𝑞 = 200, 𝑁 = 50 68.79%
Random SuperpixelGridMix [26] 𝑞 = 196, 𝑁 = 12 67.76%
OcCaMix† [12] 𝑞 ∼ 𝑈 (50, 95), 𝑁 = 12 69.34%
PatchUp (Input space) [24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 64.03%
PatchUp (Hidden space)[24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 65.19%
Saliency Grafting [27] 𝛼 = 2, 𝑡𝑒𝑚𝑝𝑒𝑟𝑎𝑡𝑢𝑟𝑒 = 0.2 66.32%
AutoMix [20] 𝛼 = 2, 𝑙 = 3 69.12%
SAFuse(Ours) 𝑞 ∼ 𝑈 (40, 60), 𝑡 = 76.4%, 𝛾1 = 0.8, 𝛾2 = 0.08 70.36%

Table 6
Performance on CUB-200-2011 with encoder ViT-B/16 and TinyViT-11m-224.
Method Hyperparameters Top-1 Acc.
ViT-B/16 TinyViT11m
Baseline – 86.96% 80.45%
Random SuperpixelGridMix [26] 𝑞 = 200, 𝑁 = 50 81.32% 87.19%
OcCaMix [12] 𝑞 ∼ 𝑈 (30, 100), 𝑁 = 9 81.70% 87.88%
SAFuse(Ours) 𝑈 (30, 40), 𝑡 = 76.4%, 𝛾1 = 0.85, 𝛾2 = 0.15 82.48% 88.04%

Table 7 data augmentation methods that rely on complete randomness, which

Performance on ImageNet-1K using the ResNet50 encoder.
can lead to a mismatch between fused images and their generated
Method Hyperparameters Acc. with R50 labels. These methods include CutMix [10], GridMix [25], Random
Baseline – 76.47% SuperpixelGridMix [26] and PatchUp [24]. Attentive CutMix [11],
Saliency Grafting† [27] 𝛼 = 2, 𝑡𝑒𝑚𝑝𝑒𝑟𝑎𝑡𝑢𝑟𝑒 = 0.2 77.65%
SaliencyMix [22] and Saliency Grafting [27] employ data augmentation
OcCaMix [12] 𝑞 ∼ 𝑈 (30, 50), 𝑁 = 9 78.03%
SAFuse(Ours) 𝑈 (25, 30), 𝑡 = 70%, 𝛾1 = 0.9, 𝛾2 = 0.1 78.31%
guided by attention or saliency but compromise complete object-part
information by fusing with square regions. In contrast, our method
preserves comprehensive object-part information through superpixel
Table 8 fusion. ResizeMix [23] inefficiently fuses pairwise images with the
Performance on CIFAR100 using ResNet50 and WRN-28-10 as encoders compared to
data augmentation based on one single image.
entire image in different scales. AutoMix [20] trains both the encoder
Method Acc. with R50 Acc. with WRN-28-10
and a parameterized mix block in a momentum pipeline, involving two
forward propagations and four losses. OcCaMix [12] fuses images with
Baselinea 77.41% 77.78%
CutOuta [14] 78.62% 78.13%
attention-guided superpixels, leading to high computational complexity
AdvMaska [9] 78.99% 80.70% with two forward propagations. Our SAFuse method addresses the
SAFuse(Ours) 84.16% 81.77% image and label mismatch issue not through image fusion but label
a Marks the results published in AdvMaks [9].
fusion, resulting in a more efficient and effective approach that requires
only one forward propagation with reduced computational complexity.

5. Ablation studies
Table 9 illustrates the performance when using ResNet50 as the
encoder. Note that for various datasets, the input image sizes differ, and
5.1. Effect of superpixel grid-based fusion
the encoders have slight variations (Detailed in Section 4.1). However,
the datasets and encoders remain entirely consistent across different
Superpixel grid-based fusion involves generating augmented images
methods. At the same time, we compare the model size, inference using a superpixel grid map rather than a square grid map. We can
speed, computational complexity on different datasets. Our method out- see the performance improvement resulting from superpixel grid-based
performs OcCaMix [12] in all three datasets, which also uses superpixel fusion in Table 10, with an increase from 80.49% to 81.33%. This is
grid-based image fusion. OcCaMix [12] performs second best most of due to the ability preserving object-part information when employing
the time, but needs one forward propagation for object concentration the superpixel grid. Superpixel grid-based fusion drives our model to
and one for training. We only need one forward propagation by using become object-part-aware.
superpixel attention based label fusion. Therefore, our proposal is more
efficient and has a lower computational cost, as shown in Table 9. 5.2. Effect of weighted local superpixel classification
We emphasize that our approach does not require any superpixel-
based operation in the inference phase. As can be seen in Table 9, our Weighted local superpixel classification involves performing local
method achieves better performance despite the larger model size and classifications on the superpixel-based local regions considering the
lower inference speed due to feature fusion during inference. Moreover, semantic attention weights of superpixels. As shown in Table 10, the
compared to many sophisticated augmentation models, our method has results show that weighted local superpixel mapping increases the
lower computational complexity, which makes it both effective and performance from 81.33% to 82.10%. Compared to local superpixel
efficient. classification, which does not consider superpixel attention weight-
In summary, our pairwise fusion method surpasses single-image ing, using weighted local superpixel classification improves perfor-
data augmentation methods, such as CutOut [14] and AdvMask [9], mance from 81.72% to 82.10%. The improvements result from the
by providing richer information. Our approach outperforms comparison fact that local superpixel classification drives the model to concentrate

8
D. Sun and F. Dornaika Information Fusion 107 (2024) 102308

Table 9
Comparison of model size (Param.), computational complexity (FLOPs), inference speed (FPS), and top-1 accuracy. Measured
with NVIDIA GeForce RTX 2070 Super. ↑ means the larger the better, ↓ means the smaller the better. Bold marks the best.
Dataset (Input size) Method Param.(M)↓ FLOPs(G)↓ FPS↑ Top-1 Acc.↑
Baseline 23.71 1.30 211 80.19%
CIFAR100 (32 × 32) OcCaMix [12] 23.71 2.62 211 83.69%
SAFuse(Ours) 39.73 1.95 178 84.16%
Baseline 23.89 5.24 206 64.83%
TinyImageNet (64 × 64) OcCaMix [12] 23.89 10.48 206 69.22%
SAFuse(Ours) 38.24 7.81 165 71.49%
Baseline 23.91 6.26 195 79.47%
CUB-200-2011 (224 × 224) OcCaMix [12] 23.91 12.54 195 82.94%
SAFuse(Ours) 31.82 9.51 130 83.71%

Table 10
Ablation study of the proposed SAFuse. CIFAR100 was evaluated with the model of ResNet18 encoder. ‘‘Square.’’ denotes square grid-based
fusion. ‘‘Superpixel.’’ denotes superpixel grid-based fusion. ‘‘Local-cls.’’ denotes local superpixel classification. ‘‘Weighted Local-cls.’’ denotes
weighted local superpixel classification. ‘‘Local-con.’’ denotes local superpixel contrastive learning. ‘‘Weighted Local-con.’’ denotes weighted
local superpixel contrastive learning.
Square. Superpixel. Local-cls. Weighted Local-cls. Local-con. Weighted Local-con. Acc.
✔ ✘ ✘ ✘ ✘ ✘ 80.49%
✘ ✔ ✘ ✘ ✘ ✘ 81.33%
✘ ✔ ✔ ✘ ✘ ✘ 81.72%
✘ ✔ ✘ ✔ ✘ ✘ 82.10%
✘ ✔ ✘ ✘ ✔ ✘ 81.49%
✘ ✔ ✘ ✘ ✘ ✔ 81.84%
✘ ✔ ✔ ✘ ✔ ✘ 82.37%
✘ ✔ ✘ ✔ ✘ ✔ 82.54%

Fig. 4. Visualization of the accuracy varying with the combination of loss coefficients 𝛾1 and 𝛾2 . (a) Acc. for CIFAR100 with ResNet18 as the encoder. Best result is obtained
when 𝛾1 = 0.8 and 𝛾2 = 0.08; (b) Acc. for CUB-200-2011 with ResNet18 as the encoder. Best result is achieved when 𝛾1 = 0.8 and 𝛾2 = 0.15.

more on local superpixel regions. Furthermore, weighted local super- contrastive loss, the performance increases from 81.33% to 81.84%.
pixel classification penalizes the model to increase sensitivity towards Cross-image weighted superpixel contrastive learning can pull close
more semantically significant superpixels. In our approach, we conduct the intra-class discriminative superpixel-level feature embeddings, and
weighted local superpixel classification on the selected discriminative push apart the inter-class discriminative superpixel-level feature em-
superpixels with the highest attention weights. This enables the model beddings, achieving better alignment of embeddings. In our method, we
to be effectively penalized to focus on the discriminative local regions. perform weighted contrastive learning only on superpixel features with
Our method becomes locally context-aware through weighted local su- the highest attention weights for better global semantic consistency.
perpixel classification, which takes into account the semantic attention
weights of superpixels during local classification.
5.4. Sensitivity to loss coefficients
5.3. Effect of weighted superpixel contrastive learning
Here, we study the performance of the proposed method as a func-
Weighted superpixel contrastive learning means performing con- tion of the loss coefficients 𝛾1 and 𝛾2 in Eq. (12). The results presented
trastive learning on superpixel-based local regions across images in in Fig. 4 indicate that the best result is obtained when 𝛾1 = 0.8 and
a batch with a weighted superpixel-wise contrastive loss. The results 𝛾2 = 0.08 for CIFAR100 with models using ResNet18 as encoders. For
in Table 10 demonstrate that weighted superpixel contrastive learning CUB-200-2011, with models also using ResNet18 as encoders, we can
improves the performance from 81.33% to 81.84% when not combined achieve the best result when 𝛾1 = 0.8 and 𝛾2 = 0.15. It is worth noting
with weighted local superpixel classification, and from 82.10% to that focusing on local regions too much with a large loss coefficient 𝛾1
82.54% when combined with the weighted local superpixel classifica- can lead the model failing in capturing global semantic information and
tion. By incorporating the superpixel attention weights with supervised poor performance. Similarly, while the primary purpose of contrastive

9
D. Sun and F. Dornaika Information Fusion 107 (2024) 102308

Fig. 5. (a) (b) Pairwise images; (c) (d) (e) The fused (augmented) images from different amount of superpixels..

Table 11
Influence of number of superpixels 𝑞 on CIFAR100 and CUB-200-2011 accuracy with models using ResNet18
encoder.
𝑞 Acc. for CIFAR100 𝑞 Acc. for CUB-200-2011
𝐶𝑜𝑛𝑠𝑡𝑎𝑛𝑡 ∶ 𝑞 = 28 81.57% 𝐶𝑜𝑛𝑠𝑡𝑎𝑛𝑡 ∶ 𝑞 = 35 78.23%
𝑞 ∼ 𝑈 (10, 20) 81.92% 𝑞 ∼ 𝑈 (20, 30) 78.75%
𝑞 ∼ 𝑈 (20, 30) 82.31% 𝑞 ∼ 𝑈 (20, 40) 78.94%
𝑞 ∼ 𝑈 (25, 30) 82.54% 𝑞 ∼ 𝑈 (30, 40) 79.24%
𝑞 ∼ 𝑈 (25, 35) 82.27% 𝑞 ∼ 𝑈 (30, 50) 78.87%
𝑞 ∼ 𝑈 (35, 40) 81.85% 𝑞 ∼ 𝑈 (50, 60) 78.49%

loss is to optimize superpixel embeddings, a large loss coefficient 𝛾2 Table 12

Performance of different label fusion methods on CIFAR100 with models using
could lead to a weak classifier.
ResNet18 encoder.
Label fusion method Accuracy
5.5. Effect of number of superpixels, top percentage for choosing discrimi-
Label fusion with area 81.43%
native superpixels Label fusion with pixel attention 82.37%
Label fusion with superpixel attention (𝐎𝐮𝐫𝐬) 82.54%
The number of superpixels is crucial for a superpixel grid map. The
loss of contour information can occur if the number of superpixels
is insufficient, while an excess of superpixels may lead to semantic
on the attention proportion of the fused superpixels for label fusion, as
information losing. Our approach chooses the number of superpixels described in Eq. (5).
per image individually and randomly, enabling us to obtain augmented As indicated in Table 12, our superpixel attention-based label fu-
samples with greater diversification while also controlling the number sion surpasses the area-based label fusion with a 1.11% performance
of superpixels within an appropriate range. increase, as it considers superpixel semantic information and resolves
Given constant numbers of superpixels for pairwise images, Fig. 5 discrepancies between the fused image and its corresponding label. Our
illustrates the impact of the number of superpixels. Fig. 5(c) shows that method has also demonstrated superior performance in comparison to
when the number of superpixels for both images is 10, some contour pixel attention-based label fusion, as our approach considers both the
details cannot be captured. On the other hand, Fig. 5(e) demonstrates semantic and contour information of local regions.
that when the number of superpixels for both images is 80, some
overly detailed contour information without semantics is captured. In 5.7. Visualization of deep features
Fig. 5(d), the capture of appropriate contour information occurs when
the superpixel amounts are 35. Our approach can better extract the discriminative features and en-
hance the embeddings of the encoded features. As discussed above, the
Table 11 displays the effect of the quantity of superpixels on classi-
results in Table 10 show that using both weighted local classification
fication performance. It is evident that randomly selecting the number
loss and weighted local contrastive loss results in an accuracy improve-
of superpixels from a uniform distribution during online training sig-
ment from 81.33% to 82.54%. Fig. 7 indicates that the ResNeXt50
nificantly enhances performance by incorporating more diversification
encoder trained with our SAFuse method can better extract the dis-
compared to a fixed number of superpixels. And too few or too many criminative features. Fig. 8 displays that the features extracted by
superpixels may hinder performance. ResNet50 trained with our method in Fig. 8(c) achieve a more effective
As described in Section 3.4.3, the most discriminative superpixels embedding than those trained with OcCaMix [12] in Fig. 8(b).
with the highest attention weights are selected for weighted local con-
trastive learning and weighted local classification. The hyperparameter 6. Conclusion
𝑡 denotes the top percentage for selection. Fig. 6 illustrates the process
of searching for the optimal top percentage 𝑡 using the golden-section We propose SAFuse, an efficient local-context-aware and complete
search policy [44]. object-part-preserved superpixel grid-based image fusion approach for
data augmentation. We analyze the possible disadvantages of existing
cutmix-based data augmentation methods from the viewpoint of image
5.6. Comparing different label fusion methods
fusion. We also propose a pioneering training framework for a strong
classifier, incorporating weighted superpixel-wise contrastive loss and
We investigate the scheme for label fusion with three variants: (i) weighted local superpixel classification loss. Extensive experiments
Label fusion with area, which fuses the labels with the area proportion have demonstrated superior performance on different benchmarks and
of images, as depicted in Eq. (3). (ii) Label fusion with pixel attention, both CNN and transformer models. In the future, we will explore
which fuses labels based on the attention proportion of the fused pixels, more advanced schemes allowing the selection of the optimal hyperpa-
considering only pixels but overlooking the local regions. (iii) Label rameter values. Our research will next be extended to self-supervised
fusion with superpixel attention, which is our proposed method based learning and medical segmentation.

10
D. Sun and F. Dornaika Information Fusion 107 (2024) 102308

Fig. 6. Visualization of golden-section search for top percentage 𝑡. (a) Accuracy for CIFAR100 using ResNet18 encoder. The best result is obtained when 𝑡 = 76.4%; (b) Accuracy
for CUB-200-2011 using ResNet18 encoder. The best result is obtained when 𝑡 = 85.4%.

Fig. 7. Visual representation of the heatmaps from trained ResNeXt50. (a) Original images. (b) Heatmaps generated by ResNeXt50 trained using the baseline method. (c) Heatmaps
generated by ResNeXt50 trained using our SAFuse.

Fig. 8. Visualization of t-SNE on CUB-200-2011 (labels from 0 to 9) features extracted by ResNet50. (a) Features obtained from baseline; (b) Features obtained from OcCaMix [12]
; (c) Features obtained from our SAFuse method.

11
D. Sun and F. Dornaika Information Fusion 107 (2024) 102308

CRediT authorship contribution statement [14] Terrance DeVries, Graham W. Taylor, Improved regularization of convolutional
neural networks with cutout, 2017, arXiv preprint arXiv:1708.04552.
[15] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, Yi Yang, Random eras-
D. Sun: Data curation, Formal analysis, Methodology, Software,
ing data augmentation, in: Proceedings of the AAAI Conference on Artificial
Validation, Visualization, Writing – original draft, Writing – review Intelligence, Vol. 34, 2020, pp. 13001–13008.
& editing. F. Dornaika: Conceptualization, Formal analysis, Investi- [16] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz, Mixup:
gation, Methodology, Project administration, Supervision, Validation, Beyond empirical risk minimization, in: International Conference on Learning
Writing – original draft, Writing – review & editing. Representations, 2018.
[17] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, Quoc V. Le,
Autoaugment: Learning augmentation strategies from data, in: Proceedings of
Declaration of competing interest
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019,
pp. 113–123.
The authors declare that they have no known competing finan- [18] Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, Sungwoong Kim, Fast
cial interests or personal relationships that could have appeared to autoaugment, Adv. Neural Inf. Process. Syst. 32 (2019).
influence the work reported in this paper. [19] Samuel G. Müller, Frank Hutter, Trivialaugment: Tuning-free yet state-of-the-art
data augmentation, in: Proceedings of the IEEE/CVF International Conference on
Computer Vision, 2021, pp. 774–782.
Data availability [20] Zicheng Liu, Siyuan Li, Di Wu, Zihan Liu, Zhiyuan Chen, Lirong Wu, Stan Z.
Li, Automix: Unveiling the power of mixup for stronger classifiers, in: Computer
Data will be made available on request. Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27,
2022, Proceedings, Part XXIV, Springer, 2022, pp. 441–458.
[21] Jang-Hyun Kim, Wonho Choo, Hyun Oh Song, Puzzle mix: Exploiting saliency
Acknowledgments
and local statistics for optimal mixup, in: International Conference on Machine
Learning, PMLR, 2020, pp. 5275–5285.
This work is supported in part by grant PID2021-126701OB-I00 [22] AFM Shahab Uddin, Mst Sirazam Monira, Wheemyung Shin, TaeChoong Chung,
funded by MCIN/AEI/10.13039/501100011033, Spain and by ‘‘ERDF Sung-Ho Bae, Saliencymix: A saliency guided data augmentation strategy for
A way of making Europe’’. It is also partially supported by grant better regularization, in: International Conference on Learning Representations,
2020.
GIU23/022 funded by the University of the Basque Country (UPV/EHU).
[23] Jie Qin, Jiemin Fang, Qian Zhang, Wenyu Liu, Xingang Wang, Xinggang Wang,
Open Access funding is provided by the University of the Basque Resizemix: Mixing data with preserved object information and true labels, 2020,
Country. All authors approved the version of the manuscript to be arXiv preprint arXiv:2012.11101.
published. [24] Mojtaba Faramarzi, Mohammad Amini, Akilesh Badrinaaraayanan, Vikas Verma,
Sarath Chandar, Patchup: A feature-space block-level regularization technique
for convolutional neural networks, in: Proceedings of the AAAI Conference on
References
Artificial Intelligence, Vol. 36, 2022, pp. 589–597.
[25] Kyungjune Baek, Duhyeon Bang, Hyunjung Shim, Gridmix: Strong regularization
[1] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, Imagenet classification with through local context mapping, Pattern Recognit. 109 (2021) 107594.
deep convolutional neural networks, Commun. ACM 60 (6) (2017) 84–90. [26] Karim Hammoudi, Adnane Cabani, Bouthaina Slika, Halim Benhabiles, Fadi Dor-
[2] Sen Qiu, Hongkai Zhao, Nan Jiang, Zhelong Wang, Long Liu, Yi An, Hongyu naika, Mahmoud Melkemi, Superpixelgridmasks data augmentation: Application
Zhao, Xin Miao, Ruichen Liu, Giancarlo Fortino, Multi-sensor information fusion to precision health and other real-world data, J. Healthc. Inform. Res. 6 (4)
based on machine learning for real applications in human activity recognition: (2022) 442–460.
State-of-the-art and research challenges, Inf. Fusion 80 (2022) 241–265. [27] Joonhyung Park, June Yong Yang, Jinwoo Shin, Sung Ju Hwang, Eunho Yang,
[3] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Saliency grafting: Innocuous attribution-guided mixup with calibrated label
Adam, Encoder–decoder with atrous separable convolution for semantic image mixing, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol.
segmentation, in: Proceedings of the European Conference on Computer Vision, 36, 2022, pp. 7957–7965.
ECCV, 2018, pp. 801–818. [28] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal
[4] Li Guo, Pengfei Shi, Long Chen, Chenglizhao Chen, Weiping Ding, Pixel and Fua, Sabine Süsstrunk, Slic superpixels compared to state-of-the-art superpixel
region level information fusion in membership regularized fuzzy clustering for methods, IEEE Trans. Pattern Anal. Mach. Intell. 34 (11) (2012) 2274–2282.
image segmentation, Inf. Fusion 92 (2023) 479–497. [29] Shengfeng He, Rynson W.H. Lau, Wenxi Liu, Zhe Huang, Qingxiong Yang,
[5] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, Faster r-cnn: Towards real- Supercnn: A superpixelwise convolutional neural network for salient object
time object detection with region proposal networks, Adv. Neural Inf. Process. detection, Int. J. Comput. Vis. 115 (2015) 330–344.
Syst. 28 (2015). [30] Suha Kwak, Seunghoon Hong, Bohyung Han, Weakly supervised semantic
[6] Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, Jieping Ye, Object segmentation using superpixel pooling network, in: Proceedings of the AAAI
detection in 20 years: A survey, Proc. IEEE (2023). Conference on Artificial Intelligence, Vol. 31, 2017.
[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi- [31] Teppei Suzuki, Shuichi Akizuki, Naoki Kato, Yoshimitsu Aoki, Superpixel convo-
aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg lution for segmentation, in: 2018 25th IEEE International Conference on Image
Heigold, Sylvain Gelly, et al., An image is worth 16x16 words: Transform- Processing, ICIP, IEEE, 2018, pp. 3249–3253.
ers for image recognition at scale, in: International Conference on Learning [32] Ting Lu, Shutao Li, Leyuan Fang, Xiuping Jia, Jón Atli Benediktsson, From
Representations, 2021. subpixel to superpixel: A novel fusion framework for hyperspectral image
[8] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fa- classification, IEEE Trans. Geosci. Remote Sens. 55 (8) (2017) 4398–4411.
had Shahbaz Khan, Mubarak Shah, Transformers in vision: A survey, ACM [33] Maryam Imani, Hassan Ghassemian, An overview on spectral and spatial informa-
Comput. Surv. (CSUR) 54 (10s) (2022) 1–41. tion fusion for hyperspectral image classification: Current trends and challenges,
[9] Suorong Yang, Jinqiao Li, Tianyue Zhang, Jian Zhao, Furao Shen, Adv- Inf. Fusion 59 (2020) 59–83.
mask: A sparse adversarial attack-based data augmentation method for image [34] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip
classification, Pattern Recognit. 144 (2023) 109847. Isola, Aaron Maschinot, Ce Liu, Dilip Krishnan, Supervised contrastive learning,
[10] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, Adv. Neural Inf. Process. Syst. 33 (2020) 18661–18673.
Youngjoon Yoo, Cutmix: Regularization strategy to train strong classifiers with [35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean
localizable features, in: Proceedings of the IEEE/CVF International Conference Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al.,
on Computer Vision, 2019, pp. 6023–6032. Imagenet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3)
[11] Devesh Walawalkar, Zhiqiang Shen, Zechun Liu, Marios Savvides, Attentive (2015) 211–252.
cutmix: An enhanced data augmentation approach for deep learning based image [36] Patryk Chrabaszcz, Ilya Loshchilov, Frank Hutter, A downsampled variant of
classification, in: ICASSP 2020-2020 IEEE International Conference on Acoustics, imagenet as an alternative to the cifar datasets, 2017, arXiv preprint arXiv:
Speech and Signal Processing, ICASSP, IEEE, 2020, pp. 3642–3646. 1707.08819.
[12] F. Dornaika, D. Sun, K. Hammoudi, J. Charafeddine, A. Cabani, C. Zhang, [37] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, Serge Belongie,
Object-centric contour-aware data augmentation using superpixels of varying The caltech-ucsd birds-200–2011 dataset, 2011.
granularity, Pattern Recognit. (2023) 109481. [38] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, Fei-Fei Li, Novel
[13] Connor Shorten, Taghi M. Khoshgoftaar, A survey on image data augmentation dataset for fine-grained image categorization: Stanford dogs, in: Proc. CVPR
for deep learning, J. Big Data 6 (1) (2019) 1–48. Workshop on Fine-Grained Visual Categorization, FGVC, 2011.

12
D. Sun and F. Dornaika Information Fusion 107 (2024) 102308

[39] Pei Guo, Overlap between imagenet and cub. https://guopei.github.io/2016/ [42] Sergey Zagoruyko, Nikos Komodakis, Wide residual networks, in: British Machine
Overlap-Between-Imagenet-And-CUB/. Vision Conference 2016, British Machine Vision Association, 2016.
[40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep residual learning for [43] Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu,
image recognition, in: Proceedings of the IEEE Conference on Computer Vision Lu Yuan, Tinyvit: Fast pretraining distillation for small vision transformers,
and Pattern Recognition, 2016, pp. 770–778. in: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel,
[41] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He, Aggregated October 23–27, 2022, Proceedings, Part XXI, Springer, 2022, pp. 68–85.
residual transformations for deep neural networks, in: Proceedings of the IEEE [44] Jack Kiefer, Sequential minimax search for a maximum, Proc. Amer. Math. Soc.
Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500. 4 (3) (1953) 502–506.

INTE 30103 Information Processing and Handling in Libraries and Information Centers
No ratings yet
INTE 30103 Information Processing and Handling in Libraries and Information Centers
4 pages
Manual de RM1
No ratings yet
Manual de RM1
75 pages
MODULE 12 GMOs AND GENE THERAPY
100% (2)
MODULE 12 GMOs AND GENE THERAPY
37 pages
Dissertation On Investment Analysis
100% (2)
Dissertation On Investment Analysis
5 pages
Circlet For Edinburgh
100% (1)
Circlet For Edinburgh
2 pages
Brushless DC Motors
No ratings yet
Brushless DC Motors
21 pages
Mendel and Heredity Worksheet
No ratings yet
Mendel and Heredity Worksheet
11 pages
Khosla 2020
No ratings yet
Khosla 2020
7 pages
Florida Department of Children and Families Legislative Budget Request FY 2010-11
No ratings yet
Florida Department of Children and Families Legislative Budget Request FY 2010-11
419 pages
Gul Nawaz CV
No ratings yet
Gul Nawaz CV
2 pages
Chemical Catalog
No ratings yet
Chemical Catalog
58 pages
LG 50PM4700-TA Chassis PA22A
No ratings yet
LG 50PM4700-TA Chassis PA22A
73 pages
A Survey On Image Data Augmentation For Deep Learn
No ratings yet
A Survey On Image Data Augmentation For Deep Learn
49 pages
NARI Phaltan Rural Visit Report
100% (1)
NARI Phaltan Rural Visit Report
3 pages
Embedding Expansion: Augmentation in Embedding Space For Deep Metric Learning
No ratings yet
Embedding Expansion: Augmentation in Embedding Space For Deep Metric Learning
14 pages
A Triple Deep Image Prior Model For Image Denoising Based On Mixed Priors and Noise Learning
No ratings yet
A Triple Deep Image Prior Model For Image Denoising Based On Mixed Priors and Noise Learning
19 pages
1 s2.0 S0031320323000481 Main
No ratings yet
1 s2.0 S0031320323000481 Main
12 pages
Jimaging 09 00046 v2
No ratings yet
Jimaging 09 00046 v2
26 pages
Data Science Interview Preparation
No ratings yet
Data Science Interview Preparation
16 pages
2 - Self-Supervised Learning For Anomaly Detection and Localization
No ratings yet
2 - Self-Supervised Learning For Anomaly Detection and Localization
28 pages
131.15 Equations of Continuity
No ratings yet
131.15 Equations of Continuity
25 pages
Visión Computalizada
No ratings yet
Visión Computalizada
27 pages
Tata Motors
No ratings yet
Tata Motors
38 pages
A Comparative Analysis of The
No ratings yet
A Comparative Analysis of The
15 pages
Rethinking Data Augmentation For Image Super-Resolution - A Comprehensive Analysis and A New Strategy
No ratings yet
Rethinking Data Augmentation For Image Super-Resolution - A Comprehensive Analysis and A New Strategy
18 pages
Albumentation
No ratings yet
Albumentation
20 pages
Image Data Augmentation Approaches: A Comprehensive Survey and Future Directions
No ratings yet
Image Data Augmentation Approaches: A Comprehensive Survey and Future Directions
32 pages
Infrared and Visible Image Fusion With Resnet and Zero-Phase Component Analysis
No ratings yet
Infrared and Visible Image Fusion With Resnet and Zero-Phase Component Analysis
22 pages
1 s2.0 S1566253518305505 Main
No ratings yet
1 s2.0 S1566253518305505 Main
20 pages
Learning Enriched Features For Real Image Restoration and Enhancement
No ratings yet
Learning Enriched Features For Real Image Restoration and Enhancement
20 pages
Group 4 Travel Device
No ratings yet
Group 4 Travel Device
8 pages
Choi Self-Ensembling With GAN-Based Data Augmentation For Domain Adaptation in Semantic ICCV 2019 Paper
No ratings yet
Choi Self-Ensembling With GAN-Based Data Augmentation For Domain Adaptation in Semantic ICCV 2019 Paper
11 pages
300 PDF
No ratings yet
300 PDF
8 pages
Unit 5
No ratings yet
Unit 5
50 pages
Quadtree Based Image Fusion-Information Fusion 2015
No ratings yet
Quadtree Based Image Fusion-Information Fusion 2015
14 pages
Spectrum of Imaging Findings in Pulmonary Infections Part 1&2
No ratings yet
Spectrum of Imaging Findings in Pulmonary Infections Part 1&2
19 pages
Url
No ratings yet
Url
4 pages
Image Augmentation - Docx: Document Details
No ratings yet
Image Augmentation - Docx: Document Details
13 pages
1 s2.0 S0278612522001054 Main
No ratings yet
1 s2.0 S0278612522001054 Main
14 pages
Hu2021 Article AnImprovedMulti-focusImageFusi
No ratings yet
Hu2021 Article AnImprovedMulti-focusImageFusi
17 pages
Disentangling Light Fields For Super-Resolution and Disparity Estimation
No ratings yet
Disentangling Light Fields For Super-Resolution and Disparity Estimation
19 pages
Deep Coarse-to-Fine Dense Light Field Reconstruction With Flexible Sampling and Geometry-Aware Fusion
No ratings yet
Deep Coarse-to-Fine Dense Light Field Reconstruction With Flexible Sampling and Geometry-Aware Fusion
18 pages
Financial Accounting Q and A Easy
No ratings yet
Financial Accounting Q and A Easy
18 pages
A Comprehensive Survey of Recent Trends in Deep Learning For Digital Images Augmentation
No ratings yet
A Comprehensive Survey of Recent Trends in Deep Learning For Digital Images Augmentation
27 pages
DiffuseMix CVPR 24
No ratings yet
DiffuseMix CVPR 24
18 pages
NeurIPS 2022 Data Efficient Augmentation For Training Neural Networks Paper Conference
No ratings yet
NeurIPS 2022 Data Efficient Augmentation For Training Neural Networks Paper Conference
13 pages
Chemistry Practicals Notes
No ratings yet
Chemistry Practicals Notes
30 pages
Topo Sheet Report
No ratings yet
Topo Sheet Report
15 pages
Octnetfusion: Learning Depth Fusion From Data: (Riegler, Bischof) @icg - Tugraz.At (Osman - Ulusoy, Andreas - Geiger) @tue - Mpg.De
No ratings yet
Octnetfusion: Learning Depth Fusion From Data: (Riegler, Bischof) @icg - Tugraz.At (Osman - Ulusoy, Andreas - Geiger) @tue - Mpg.De
10 pages
Data Virt QB Updated
No ratings yet
Data Virt QB Updated
12 pages
Madhubhan Rejou Spa Services Menu
No ratings yet
Madhubhan Rejou Spa Services Menu
10 pages
MS-Former: Memory-Supported Transformer For Weakly Supervised Change Detection With Patch-Level Annotations
No ratings yet
MS-Former: Memory-Supported Transformer For Weakly Supervised Change Detection With Patch-Level Annotations
11 pages
Simple Copy-Paste Is A Strong Data Augmentation Method For Instance Segmentation
No ratings yet
Simple Copy-Paste Is A Strong Data Augmentation Method For Instance Segmentation
13 pages
Scale-Aware Automatic Augmentation For Object Detection
No ratings yet
Scale-Aware Automatic Augmentation For Object Detection
12 pages
Advancements in Point Cloud Data Augmentation For Deep Learning: A Survey
No ratings yet
Advancements in Point Cloud Data Augmentation For Deep Learning: A Survey
18 pages
Preparation of Brief Papers For IEEE TRANSACTIONS and JOURNALS December 2023 1
No ratings yet
Preparation of Brief Papers For IEEE TRANSACTIONS and JOURNALS December 2023 1
13 pages
Preparation of Brief Papers For IEEE TRANSACTIONS and JOURNALS December 2023 1
No ratings yet
Preparation of Brief Papers For IEEE TRANSACTIONS and JOURNALS December 2023 1
13 pages
Zhang CycleMix A Holistic Strategy For Medical Image Segmentation From Scribble CVPR 2022 Paper
No ratings yet
Zhang CycleMix A Holistic Strategy For Medical Image Segmentation From Scribble CVPR 2022 Paper
10 pages
Equitable Leasing Corporation vs. Lucita Suyom, Marissa Enano, Myrnatamayo and Felix Oledan (G.R. No. 143360, 5 September 2002, 388 Scra 445)
No ratings yet
Equitable Leasing Corporation vs. Lucita Suyom, Marissa Enano, Myrnatamayo and Felix Oledan (G.R. No. 143360, 5 September 2002, 388 Scra 445)
10 pages
What Should Not Be Contrastive in Contrastive Learning
No ratings yet
What Should Not Be Contrastive in Contrastive Learning
13 pages
Randaugment Practical Automated Data Augmentation With A Reduced Search Space
No ratings yet
Randaugment Practical Automated Data Augmentation With A Reduced Search Space
10 pages
Notification NIT Calicut Various Vacancy Posts
No ratings yet
Notification NIT Calicut Various Vacancy Posts
6 pages
MICCAI21 Fewshot
No ratings yet
MICCAI21 Fewshot
12 pages
Improved Regularization of Convolutional Neural Networks With Cutout
No ratings yet
Improved Regularization of Convolutional Neural Networks With Cutout
8 pages
CutPaste Self-Supervised Learning For Anomaly Detection and Localization
No ratings yet
CutPaste Self-Supervised Learning For Anomaly Detection and Localization
11 pages
Gao2022 Article ATotalVariationGlobalOptimizat
No ratings yet
Gao2022 Article ATotalVariationGlobalOptimizat
9 pages
1 s2.0 S0031320320303976 Main
No ratings yet
1 s2.0 S0031320320303976 Main
10 pages
Transactions On Computational Biology and Bioinformatics
No ratings yet
Transactions On Computational Biology and Bioinformatics
10 pages
Transactions On Computational Biology and Bioinformatics
No ratings yet
Transactions On Computational Biology and Bioinformatics
10 pages
Active Driveline
No ratings yet
Active Driveline
17 pages
Fully-Connected Transformer For Multi-Source Image Fusion
No ratings yet
Fully-Connected Transformer For Multi-Source Image Fusion
18 pages
Ejercicios de Matematica Avanzada para Ingenieros
No ratings yet
Ejercicios de Matematica Avanzada para Ingenieros
6 pages
Applsci 15 06840
No ratings yet
Applsci 15 06840
28 pages
Bioengineering 12 00631
No ratings yet
Bioengineering 12 00631
25 pages
KSSRN 5095780
No ratings yet
KSSRN 5095780
42 pages
7000-Article Text-10229-1-10-20200525
No ratings yet
7000-Article Text-10229-1-10-20200525
8 pages
Scene Comprehension Through Image Analysis With An Extensive Array of Categories and Context at The Scene Level
No ratings yet
Scene Comprehension Through Image Analysis With An Extensive Array of Categories and Context at The Scene Level
8 pages
Yan Transcending Forgery Specificity With Latent Space Augmentation For Generalizable Deepfake CVPR 2024 Paper
No ratings yet
Yan Transcending Forgery Specificity With Latent Space Augmentation For Generalizable Deepfake CVPR 2024 Paper
11 pages
A Preliminary Study On Data Augmentation of Deep Learning For Image Classification
No ratings yet
A Preliminary Study On Data Augmentation of Deep Learning For Image Classification
4 pages
Domain Disentanglement and Fusion Based On Hyperbolic 2025 Information Proc
No ratings yet
Domain Disentanglement and Fusion Based On Hyperbolic 2025 Information Proc
19 pages
Fourier-Basis Functions To Bridge Augmentation Gap Rethinking Frequency Augmentation in CVPR 2024 Paper
No ratings yet
Fourier-Basis Functions To Bridge Augmentation Gap Rethinking Frequency Augmentation in CVPR 2024 Paper
10 pages
Albumentations: Fast and Flexible Image Augmentations
No ratings yet
Albumentations: Fast and Flexible Image Augmentations
4 pages
Cross-Field Road Markings Detection Based On Inverse Perspective Mapping
No ratings yet
Cross-Field Road Markings Detection Based On Inverse Perspective Mapping
21 pages
Data Augmentation For Improving Deep Learning in Image Classification Problem
No ratings yet
Data Augmentation For Improving Deep Learning in Image Classification Problem
7 pages
Copy Paste++ v1
No ratings yet
Copy Paste++ v1
3 pages
1 s2.0 S0952197623000192 Main
No ratings yet
1 s2.0 S0952197623000192 Main
10 pages
1 s2.0 S0925231220303386 Main
No ratings yet
1 s2.0 S0925231220303386 Main
9 pages
Staff Manual 06
No ratings yet
Staff Manual 06
3 pages
Situation Infancy Mortality
No ratings yet
Situation Infancy Mortality
2 pages
Staff Manual 03
No ratings yet
Staff Manual 03
3 pages
View Generated Docs
No ratings yet
View Generated Docs
2 pages
9709 s10 QP 32
No ratings yet
9709 s10 QP 32
4 pages
No Objection Certificate
No ratings yet
No Objection Certificate
1 page
Artificial Intelligence for Image Super Resolution
From Everand
Artificial Intelligence for Image Super Resolution
Debmitra Ghosh
No ratings yet
Contextual Image Classification: Understanding Visual Data for Effective Classification
From Everand
Contextual Image Classification: Understanding Visual Data for Effective Classification
Fouad Sabry
No ratings yet
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
Pyramid Image Processing: Exploring the Depths of Visual Analysis
From Everand
Pyramid Image Processing: Exploring the Depths of Visual Analysis
Fouad Sabry
No ratings yet

Information Fusion: D. Sun, F. Dornaika

Uploaded by

Information Fusion: D. Sun, F. Dornaika

Uploaded by

Information Fusion 107 (2024) 102308

Contents lists available at ScienceDirect

Full length article

Data augmentation for deep visual recognition using superpixel based

ARTICLE INFO ABSTRACT

propagation. Our method is object-part-aware and efficient. Further- 3. Proposed approach

3.4.1. Superpixel pooling We work on image-level classification, so global classification is

classification loss is multiplied by the normalized superpixel attention Table 1

where {𝑤̂ 1 , 𝑤̂ 2 , … , 𝑤̂ 𝑁𝑚 } is the normalized superpixel attention weights.

Table 7 data augmentation methods that rely on complete randomness, which

loss is to optimize superpixel embeddings, a large loss coefficient 𝛾2 Table 12

You might also like