D. Sun, F. Dornaika
Information Fusion 107 (2024) 102308
Abstract

Data augmentation is an important paradigm for boosting the generalization capability of deep learning in image classification tasks. Image augmentation using cut-and-paste strategies has shown very good performance improvement for deep learning. However, these existing methods often overlook the image's discriminative local context and rely on ad hoc regions consisting of square or rectangular local regions, leading to the loss of complete semantic object parts. In this work, we attempt to overcome these limitations and propose a superpixel-wise local-context-aware efficient image fusion approach for data augmentation. Our approach requires only one forward propagation using a superpixel attention-based label fusion with less computational complexity. The model is trained using a combination of a global classification loss on the fused (augmented) image, a superpixel-wise weighted local classification loss, and a superpixel-based weighted contrastive learning loss. The last two losses are based on the superpixel-aware attentive embeddings. Thus, the resulting deep encoder can learn both local and global features of the images while capturing object-part local context and information. Experiments on diverse benchmark image datasets indicate that the proposed method outperforms many region-based augmentation methods for visual recognition. We have demonstrated its effectiveness not only on CNN models but also on transformer models. The code is accessible at https://github.com/DanielaPlusPlus/SAFuse.

Keywords: Superpixel, Image fusion, Data augmentation, Weighted contrastive learning loss, Local context
1. Introduction

Deep learning has advanced image classification [1,2], image segmentation [3,4], and object detection [5,6] by extracting information from data effectively. As the quantity of data grows, deep learning gains more prominence, particularly with Vision Transformers [7,8]. However, the cost and impracticality of manual data annotation present ongoing challenges. Overfitting can occur in supervised deep learning when there is insufficient data, resulting in limited performance.

Data augmentation is frequently used to prevent overfitting [13]. In this paper, we investigate data augmentation from the perspective of image fusion. Traditional data augmentation methods operate on a single image, applying various transformations to the original data such as rotating, flipping, or cropping. CutOut [14] randomly masks a square region with zeros. Random Erasing [15] randomly masks a square region with a random value. However, the supplementary information provided by traditional data augmentation through operations within a single image remains restricted. Mixup [16] proposes pixel-by-pixel image fusion between two images for data augmentation but suffers from poor interpretability. CutMix [10] first proposes data augmentation with the cut-and-paste technique based on pairwise images, which can provide more information through the fusion of two images. Nevertheless, there are three drawbacks to existing data augmentation methods with the cutmix strategy. (I) Most methods only utilize the global semantics along with the image-level constraints and overlook the local context constraints. (II) Existing methods perform cutting and pasting with square patches, leading to incomplete object-part information. (III) Fused labels should be consistent with the fused images. Otherwise, a mismatch problem between the fused augmented image and its fused label occurs. Some existing methods address the mismatch problem by object centering with forward propagation twice, which is computation-consuming and may compromise the diversification of data augmentation.

To mitigate the above shortcomings, we propose SAFuse, an efficient Superpixel Attentive image Fusion approach for data augmentation and a framework for training a strong classifier. We aim to enhance feature representation through image fusion.
∗ Corresponding author at: University of the Basque Country UPV/EHU, San Sebastian, Spain.
E-mail address: [email protected] (F. Dornaika).
https://doi.org/10.1016/j.inffus.2024.102308
Received 31 July 2023; Received in revised form 25 January 2024; Accepted 15 February 2024
Available online 16 February 2024
1566-2535/© 2024 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Fig. 1. Comparison of label fusion methods and augmented images. (a) AdvMask [9] generates augmented samples by randomly removing square regions from sensitive points in
an image, leading to a loss of contour information. (b) CutMix [10] randomly fuses the images in square form with area-based label fusion, resulting in discrepancies between the
fused image and its label, as well as loss of contour. (c) Attentive CutMix [11] fuses the discriminative square regions with a pretrained external object centering network, which
is not efficient and loses contour information. (d) OcCaMix [12] preserves the contour information by fusing discriminative superpixel regions, but needs forward propagation
twice for object centering, which is inefficient. (e) SAFuse generates object-part-aware augmented images using superpixel attention-based label fusion after only one forward
propagation, ensuring both diversification and efficiency.
Our proposed SAFuse facilitates the extraction of more object-part-aware and context-aware information in an efficient way. First, we fuse the images randomly by cutting and pasting superpixels to generate augmented images with diversification, which is object-part-aware. Second, we fuse the labels with superpixel attention to keep the semantic consistency of the augmented images and labels with only a single forward propagation, which is efficient. Third, we fuse the high-level features and low-level features for better feature representation. We select discriminative superpixel features for weighted local classification and weighted contrastive learning, which are context-aware.

Fig. 1 visually compares various representative data augmentation methods. The target image is denoted by x1, and the source image by x2. Their corresponding labels are represented by y1 and y2, respectively. AdvMask [9], which removes random square regions at sensitive points within a single image, is demonstrated in Fig. 1(a). Fig. 1(b) displays an augmented image of CutMix [10], which fuses the target image and source image with a non-semantic square region. In Fig. 1(c), the augmented image of Attentive CutMix [11] shows the fusion of two images using attentional square patches guided by an additional pre-trained network. Fig. 1(d) corresponds to OcCaMix [12], which fuses the two images with discriminative superpixels. Fig. 1(b)(c)(d) depict augmentation methods that use area-based label fusion. Our approach, illustrated in Fig. 1(e), produces the augmented image by pairwise fusion of random superpixels and generates the fused label using superpixel attention. As a result, our approach requires only one single forward propagation, making it more efficient compared to the methods depicted in Fig. 1(a)(c)(d). Additionally, we are able to maintain more complete object-part information when compared to the methods depicted in Fig. 1(a)(b)(c).

Our main contributions are as follows:

• We discuss the potential shortcomings of existing cutmix-based data augmentation methods from the viewpoint of image fusion.
• We introduce a novel data augmentation method that employs superpixel fusion for the augmented image, and for the first time, we put forward superpixel attention-based label fusion, which is object-part-aware and efficient.
• We propose a pioneering training framework for a strong classifier, incorporating feature fusion and sparse superpixel feature constraints. To the best of our knowledge, it is the first time a weighted superpixel-wise contrastive loss and a weighted local superpixel classification loss are proposed, which are context-aware.
• We present extensive evaluations on various benchmarks and backbones, which provide evidence of SAFuse's superiority.

The rest of the paper is structured as follows: we give a brief overview of related work on data augmentation and the use of superpixels in deep learning in Section 2. We elaborate on the proposed SAFuse method in detail in Section 3. Section 4 compares the classification accuracy of various data augmentation methods on different models and image benchmark datasets. In Section 5, ablation studies are conducted on SAFuse. Finally, our work is concluded in Section 6.

2. Related work

2.1. Data augmentation

Traditional data augmentation is based on a single image and applies various transformations to the original data, such as rotating, flipping, or cropping. CutOut [14] randomly removes a square region of an image. AutoAugment [17], Fast AutoAugment [18], Random Augment [17], and Trivial Augment [19] are automatic augmentation methods that jointly explore the augmentation spaces of multiple augmentation strategies to achieve optimal performance. The automatic augmentation methods usually have to make trade-offs between complexity, cost, and performance. The aforementioned methods that augment samples based on a single image usually suffer from the insufficient information provided by the augmented images. Mixup [16] and CutMix [10] can provide more augmented data information from pairwise image fusion. However, Mixup [16] fuses two images pixel by pixel, making it difficult to interpret. CutMix [10] randomly fuses two images with a square region, and fuses the pairwise image labels in proportion to the area. CutMix can cause a mismatch between the augmented image and its fused label when the fusion region is the background instead of the object. Additionally, the real object-part information can be lost when the image fusion regions are square. OcCaMix [12], Attentive CutMix [11], PuzzleMix [21], SaliencyMix [22], and AutoMix [20] have proposed solutions to overcome the label and image mismatch problem by selecting regions guided by saliency or attention. But they either require double forward propagations or an extra network, leading to inefficiency. ResizeMix [23], PatchUp [24], GridMix [25] and Random SuperpixelGridMix [26] conduct data augmentation totally randomly with area-based label fusion, which also potentially leads to discrepancies between the augmented image and the fused labels. Saliency Grafting [27] generates mixed labels through saliency-based semantic label fusion. However, it grafts square regions, losing object-part information. Our proposed SAFuse fuses the images using randomly selected superpixels for the largest diversification. We fuse the labels using superpixel attention semantics with a single forward propagation
Fig. 2. The overall framework of SAFuse. Image Fusion (Section 3.3) produces the augmented object-part-aware sample and its corresponding superpixel map from two training
images. Feature Fusion (Section 3.3) concatenates the high-level and low-level feature vectors after GAP (Global Average Pooling). The superpixel pooling and self-attention module (Section 3.4) aggregates the feature map Ẑ into superpixel vectors and learns contextual information through self-attention. λ_att is used for attention-based label fusion
(Section 3.3). Then we perform global classification (Section 3.5) with the fused feature and the fused label, conduct weighted local superpixel classification (Section 3.6)
on the selected top discriminative superpixel vectors, and execute weighted superpixel contrastive learning (Section 3.7) on the discriminative superpixel vectors selected across
images.
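The feature-fusion step in Fig. 2 (lines 19-21 of Algorithm 1) simply global-average-pools the encoded and decoded feature maps and concatenates the two resulting vectors. Below is a minimal PyTorch sketch of this step; the function name and tensor layout are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn.functional as F

def fuse_features(z_high: torch.Tensor, z_low: torch.Tensor) -> torch.Tensor:
    """Concatenate GAP'd high-level (encoder) and low-level (decoder) features.

    z_high: (B, c, h, w)  encoded feature Z
    z_low:  (B, D, H, W)  decoded feature Z_hat
    returns e: (B, c + D) fused feature vector used for global classification
    """
    e_high = F.adaptive_avg_pool2d(z_high, 1).flatten(1)  # Global Average Pooling
    e_low = F.adaptive_avg_pool2d(z_low, 1).flatten(1)
    return torch.cat([e_high, e_low], dim=1)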
in Section 3.4. We separately explain the global classification in Section 3.5, weighted local superpixel classification in Section 3.6, and weighted superpixel contrastive learning in Section 3.7. Finally, we summarize the training and inference in Section 3.8.

Algorithm 1: SAFuse
Input: Batch of B images x = [x1, x2, ..., xB] with image size W × H; one-hot labels y corresponding to the batch of images; minimum and maximum number of superpixels q_min, q_max; superpixel selection probability p
Output: Global classification loss L_global, weighted local superpixel classification loss L_local, weighted superpixel contrastive loss L_contrast, total loss L_total
1  Idx1 = [1, 2, 3, ..., B − 1, B], Idx2 = shuffle(Idx1)
2  for m ← 1 to B do
3      x1 = x[m], y1 = y[m]
4      x2 = x[Idx2][m], y2 = y[Idx2][m]   /* Randomly choose the pairwise images and corresponding labels */
5      q1 ∼ U(q_min, q_max), q2 ∼ U(q_min, q_max)
6      Superpixel map S1 ← Superpixel algorithm(x1, q1)
7      Superpixel map S2 ← Superpixel algorithm(x2, q2)
8      P{X = k} = p^k (1 − p)^(1−k), X ∼ B(1, p), k = 0, 1
9      M ← Ind(Select(S2, X))   /* Select superpixels in S2 by Bernoulli distribution for M */
10     Generate S_mix and x_mix with Eq. (4) and M   /* Image fusion */
11     Encoded feature Z ∈ R^(w×h×c) ← θ_enc(x_mix)
12     Decoded feature Ẑ ∈ R^(W×H×D) ← θ_dec(Z)
13     Vector sequence F ← Average pooling(Ẑ) by S_mix
14     C ∈ R^(L×d) ← self-attention(F)
15     w_m = {w1, w2, ..., wL} ← Sigmoid(C.sum(dim = 1))
16     Calculate λ_att,m with Eq. (5) and w
17     Generate y_mix,m with Eq. (2) and λ_att   /* Label fusion */
18     c_s,m ∈ R^(N×d) ← top-N(C), N = int(L × t)
19     High-level feature vector e_high ← Global Average Pooling(Z)
20     Low-level feature vector e_low ← Global Average Pooling(Ẑ)
21     Fused feature vector e_m ← Concatenate(e_high, e_low)   /* Feature fusion */
22     w, λ_att, c_s, e ← Record w_m, λ_att,m, c_s,m, e_m in a batch
23 Update L_global with Eq. (9) and e, y, y[Idx2], λ_att
24 Update L_local with Eq. (10) and c_s, y, y[Idx2], w
25 Update L_contrast with Eq. (11) and c_s, y, y[Idx2], w
26 Update L_total with Eq. (12)

3.2. Background

In cutmix-based data augmentation, derived from CutMix [10], selected local regions are typically cut out of one image and pasted into another image, which is the image fusion in Eq. (1). The labels of the two images are fused with a certain proportion to form the label corresponding to the augmented image, which is the label fusion in Eq. (2).

x_mix = (1 − M) ⊙ x1 + M ⊙ x2    (1)

y_mix = (1 − λ) y1 + λ y2    (2)

where x ∈ R^(W×H×C) denotes any training sample and y is its annotated label. H, W and C are the height, width, and channel number of the image, respectively. A new augmented training image (x_mix, y_mix) is generated from two random distinct training images (x1, y1) and (x2, y2). M ∈ {0, 1}^(W×H) denotes a binary mask indicating which pixels are taken from x2, 1 denotes a mask of all ones, and ⊙ represents element-wise multiplication. The traditional cutmix-based method uses an area-based proportion for label fusion. Usually, the blending parameter λ is set to the proportion of the number of pixels from image x2 to the total number of pixels in image x1, which is area-based and described in Eq. (3).

λ_area = ( Σ_{i=1}^{W} Σ_{j=1}^{H} M_ij ) / (W × H)    (3)

It is worthy of note that the pixels in the background contribute less to the semantic label compared to those in the object regions. Consequently, conventional cutmix-based data augmentation with area-based label fusion often faces mismatch issues between the fused augmented image and its fused label, as illustrated in Fig. 1(b). Numerous existing methods have addressed the issue of mismatch between the fused augmented image and its fused label by first centering the discriminative regions before creating the augmented images (e.g. Fig. 1(c)(d)), which typically requires forward propagation twice, for object centering and training separately. Hence, existing methods utilizing object centering for the consistency between the fused label and the fused augmented image are inefficient due to increased computational complexity. Worse, object centering leads to a lack of augmentation diversification, which in turn will harm the performance.

3.3. Image fusion, feature fusion and label fusion

We start with two images from the current training batch in order to create an augmented image to replace the target image. As shown in Fig. 2 and Algorithm 1, the objective of Image Fusion is to generate the augmented sample x_mix and the associated superpixel grid map S_mix. Feature Fusion aims to combine the high-level feature vector e_high and the low-level feature vector e_low into a more comprehensive feature vector e for classification. We randomly select the numbers of superpixels q1, q2 separately from the uniform distribution U(q_min, q_max) to achieve greater diversification. For the target image x1 and the source image x2, we obtain the pre-computed associated superpixel maps S1 and S2. Then, superpixels from image x2 are randomly selected in S2 by a Bernoulli distribution parameterized by p = 0.5 for the largest diversification. We cut the selected superpixels from image x2 and paste them onto image x1 for image augmentation, as described in Eq. (4) and Line 10 of Algorithm 1. We generate the fused superpixel map S_mix for the augmented image x_mix simultaneously.

x_mix = (1 − M) ⊙ x1 + M ⊙ x2
S_mix = (1 − M) ⊙ S1 + M ⊙ S2    (4)

Note that fusion with superpixels for the augmented image x_mix may slightly cut off some superpixels in x1 due to potential overlap of superpixels from both images. Nevertheless, this is not troublesome because the superpixels of the object parts from image x2 are fully inserted, and the random occlusion and clipping of superpixels in image x1 increases the generalization.

To overcome the limitations of area-based label fusion mentioned in Section 3.2, we introduce superpixel attention-based label fusion. We mitigate the discrepancy between the fused image and its label by generating the fused label with superpixel attention in label space, rather than by generating the fused image with image centering in image space. Specifically, we augment images by randomly cutting and pasting superpixels. We do not require object centering; only one single forward propagation is necessary for training. To ensure consistency between image and label, we fuse labels with superpixel attention weights {w1, w2, ..., wL}, where L denotes the total number of superpixels of the fused image x_mix. The details of the superpixel attention weights are given in Section 3.4 and in Algorithm 1, Line 15. The superpixel attention-based proportion λ_att is then calculated as the ratio of the superpixel semantics from x2 to the total superpixel semantics in x_mix in Eq. (5).
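As a concrete illustration of the image fusion in Eq. (4) and the Bernoulli superpixel selection of Algorithm 1 (lines 5-10), the sketch below builds the binary mask M by drawing a Bernoulli(p) trial per superpixel of the source image. It uses skimage.segmentation.slic, which the paper points to for superpixel computation; the function name, channel-last layout, and the label-offset trick for S_mix are illustrative assumptions rather than the authors' exact implementation.

import numpy as np
from skimage.segmentation import slic

def superpixel_cutmix(x1, x2, q1, q2, p=0.5, rng=None):
    """Fuse two H x W x C images by pasting randomly chosen superpixels of x2 onto x1.

    Returns the augmented image x_mix, the fused superpixel map S_mix and the mask M.
    """
    rng = rng or np.random.default_rng()
    s1 = slic(x1, n_segments=q1, start_label=1)           # superpixel map S1
    s2 = slic(x2, n_segments=q2, start_label=1)           # superpixel map S2

    # Bernoulli(p) trial per superpixel of the source image -> binary mask M
    labels = np.unique(s2)
    keep = labels[rng.random(labels.size) < p]
    m = np.isin(s2, keep)                                  # M in {0, 1}^(W x H)

    x_mix = np.where(m[..., None], x2, x1)                 # Eq. (4), image fusion
    s_mix = np.where(m, s2 + s1.max(), s1)                 # fused map; offset keeps labels disjoint
    return x_mix, s_mix, m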
Fig. 3. (a) Our method boosts the capture of local context information and complete object-part information; (b) Outline of superpixel pooling and self-attention, with top selection.
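A minimal PyTorch sketch of the superpixel pooling, self-attention, attention-weight, and top-N selection steps outlined in Fig. 3(b) and lines 13-18 of Algorithm 1 is given below. The attention weights follow the description in Section 3.4.3 (sum of each superpixel vector passed through a sigmoid) and the fusion proportion follows Eq. (5) below; the use of torch.nn.MultiheadAttention as the self-attention block is an assumption, not necessarily the authors' exact module.

import torch
import torch.nn as nn

def superpixel_pool(z_hat, s_mix):
    """Average-pool decoded features (D, H, W) into one vector per superpixel of s_mix."""
    labels = torch.unique(s_mix)
    return torch.stack([z_hat[:, s_mix == l].mean(dim=1) for l in labels]), labels

class SuperpixelAttention(nn.Module):
    def __init__(self, d, heads=1):
        super().__init__()
        self.sa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, f):                    # f: (L, d) superpixel vectors
        f = f.unsqueeze(0)                   # (1, L, d)
        a, _ = self.sa(f, f, f)              # self-attention SA(Q, K, V)
        return self.norm(f + a).squeeze(0)   # Eq. (8): C = LayerNorm(F + SA(Q, K, V))

def attention_label_fusion(c, labels, s_mix, src_labels, t=0.764):
    """Compute lambda_att (Eq. (5)) and select the top-N most discriminative superpixels."""
    w = torch.sigmoid(c.sum(dim=1))                            # one attention weight per superpixel
    sizes = torch.stack([(s_mix == l).sum() for l in labels]).float()
    from_src = torch.tensor([l.item() in src_labels for l in labels])
    lam_att = (w * sizes)[from_src].sum() / (w * sizes).sum()  # ratio of source-superpixel semantics
    top_n = max(1, int(t * len(labels)))
    top_idx = torch.topk(w, top_n).indices                     # most discriminative superpixels
    return lam_att, w, c[top_idx]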
Multiplying the number of pixels by the associated attention weight of the superpixel gives the semantics of the superpixel.

λ_att = ( Σ_{i∈I_{x2}} w_i · |S_mix[i]| ) / ( Σ_{j=1}^{L} w_j · |S_mix[j]| )    (5)

where I_{x2} = [I_{x2}^1, I_{x2}^2, ..., I_{x2}^m] are the indices of the superpixels pasted from x2 into x1, and L is the total number of superpixels in the fused image x_mix.

Compared to traditional cutmix-based data augmentation methods, our approach fuses images based on superpixel grids to preserve complete object-part information. We fuse the high-level features and low-level features for better feature representation. Moreover, we fuse the labels with superpixel attention weights, which alleviates the mismatch issue between the fused image and its fused label with less computational complexity. All the fusion operations aim at training a strong classifier by data augmentation in Section 3.5.

3.4. Superpixel pooling, self-attention and selection

To further maintain the complete object-part information of superpixels and learn the discriminative contextual information, three steps are conducted: (i) superpixel pooling, (ii) self-attention, and (iii) selection, as shown in Fig. 3.

The final output feature vector C = {C_1, C_2, ..., C_L} ∈ R^(L×d) after layer normalization is given in Eq. (8).

C = LayerNorm(F + SA(Q, K, V))    (8)

We capture local contextual information as well as object-part information with this step, as indicated in Fig. 3(a).

3.4.3. Attention-based superpixel selection

After applying superpixel pooling and self-attention, an attention weight can be obtained for each superpixel based on the attentional superpixel vectors (detailed in Line 15 of Algorithm 1), forming the superpixel weight vector w. Each weight is the sum of the superpixel features followed by the application of the sigmoid function. First, we compute the proportion λ_att from the superpixel attention for superpixel attention-based label fusion (discussed in Section 3.3). At the same time, we select the top most discriminative superpixels (as described in Line 18 of Algorithm 1) for downstream weighted local superpixel classification and weighted superpixel contrastive learning. Superpixel attention-based selection drives the model to focus on the most discriminative and informative superpixels and reduces noise.

3.5. Global classification
superpixel vectors. The FC layer used for superpixel-level classification differs from the one used for image-level classification. The weighted local superpixel classification loss is formulated in Eq. (10).

L_local = ( 1 / Σ_{m=1}^{B} N_m ) Σ_{m=1}^{B} Σ_{i=1}^{N_m} ŵ_i · ℓ( f_local(c_i), y_s(i) )    (10)

where c_i is the unit-normalized feature of superpixel i in a batch, P_i and N_i are the positive set (intra-class superpixels) and the negative set (inter-class superpixels), {w̃_1, w̃_2, ..., w̃_{N_B}} are the normalized superpixel weights in a batch, and N_B denotes the number of selected superpixels across all images in a batch. We fix the temperature τ as 0.7.

3.8. Training and inference

For training, the objective of the global classification loss in Eq. (9) is to extract the global semantic features of the training images; the objective of the weighted local superpixel classification loss in Eq. (10) is to enhance the focus and sensitivity on the discriminative local superpixels; the objective of the weighted superpixel contrastive loss in Eq. (11) is to optimize an embedding representation with enhanced intra-class superpixel-wise compactness and inter-class superpixel-wise separation. The overall training loss is given in Eq. (12).

L_total = L_global + γ1 L_local + γ2 L_contrast    (12)

where γ1 > 0 and γ2 > 0 are the two loss coefficients. The training of the model is carried out using back-propagation. The inference is performed only with the model for global classification.

4. Performance evaluation

SAFuse is evaluated with top-1 classification accuracy. We first introduce the used datasets and models in Section 4.1 and the experimental details in Section 4.2, then present the results in Section 4.3. All of the experiments have been implemented in PyTorch. The source code can be found at https://github.com/DanielaPlusPlus/SAFuse.

1 https://scikit-image.org/docs/stable/api/skimage.segmentation.html#skimage.segmentation.slic

Table 1
Datasets        Number of classes   Input size   Standard split (Training set / Test set)
CIFAR100        100                 32 × 32      50,000 / 10,000
TinyImageNet    200                 64 × 64      100,000 / 10,000
CUB-200-2011    200                 224 × 224    5,994 / 5771⋆
Stanford Dogs   120                 224 × 224    12,000 / 8580
ImageNet1K      1000                224 × 224    1,281,167 / 50,000

Dogs, the batch size is set as 16 and the initial learning rate as 0.01. The base augmentations are random cropping and horizontal flipping. For ImageNet-1K, we follow the setting of Saliency Grafting [27]. The batch size is 256 and the initial learning rate is 0.1. The base augmentation is random cropping and random horizontal flipping for training images, and center cropping for the test images. We use the SGD optimizer with a momentum value of 0.9 and a weight decay value of 0.0005. The baseline results are trained only with the aforementioned base augmentation. Following CutMix [10], the proposed augmentation scheme and all the competing methods are combined with the base augmentation with a probability value of 0.5. We use bold and underlined to mark the best and second best results.

4.3. Experimental results

Tables 2–4 illustrate the top-1 classification accuracies with ResNet18 and ResNeXt50 as encoders on CIFAR100, TinyImageNet, and CUB-200-2011, respectively. Table 5 presents the top-1 classification accuracy with ResNet50 as the encoder on the Stanford Dogs dataset. Table 6 shows the top-1 classification accuracies on CUB-200-2011 with TinyViT and ViT as the encoders. Table 7 presents the top-1 classification accuracy with ResNet50 as the encoder on the ImageNet-1K dataset. These tables also illustrate the values of the hyperparameters used in each method. In our proposed method, we randomly select the number of superpixels q from a uniform distribution U(q_min, q_max).
Table 2
Performance on CIFAR100 with ResNet18, ResNeXt50 as encoders.
Method Hyperparameters Top-1 Acc.
R18 RX50
Baseline – 78.58% 80.67%
CutMix [10] 𝛼=1 79.69% 83.23%
Attentive CutMix [11] 𝑁 =3 79.29% 82.51%
SaliencyMix [22] – 79.57% 82.56%
ResizeMix [23] 𝛼 = 0.1, 𝛽 = 0.8 79.71% 82.34%
GridMix [25] 𝑔𝑟𝑖𝑑 = 4 × 4, 𝑝 = 0.8, 𝛾 = 0.15 79.45% 82.47%
Random SuperpixelGridMix [26] 𝑞 = 200, 𝑁 = 50 79.06% 82.22%
Random SuperpixelGridMix [26] 𝑞 = 16, 𝑁 = 3 80.30% 83.25%
OcCaMix† [12] 𝑞 ∼ 𝑈 (15, 50), 𝑁 = 3 81.42% 84.01%
PatchUp (input space) [24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 80.13% 83.46%
PatchUp (hidden space)[24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 80.91% 83.65%
Saliency Grafting [27] 𝛼 = 2, 𝑡𝑒𝑚𝑝𝑒𝑟𝑎𝑡𝑢𝑟𝑒 = 0.2 80.83% 83.10%
AutoMix† [20] 𝛼 = 2, 𝑙 = 3 82.04% 83.64%
SAFuse(Ours) 𝑞 ∼ 𝑈 (25, 30), 𝑡 = 76.4%, 𝛾1 = 0.8, 𝛾2 = 0.08 82.54% 84.33%
Table 3
Performance on TinyImageNet with ResNet18, ResNetXt50 as encoders.
Method Hyperparameters Top-1 Acc.
R18 RX50
Baseline – 61.66% 65.69%
CutMix [10] – 64.35% 66.97%
Attentive CutMix [11] 𝑁 =7 64.01% 66.84%
SaliencyMix [22] – 63.52% 66.52%
ResizeMix [23] 𝛼 = 0.1, 𝛽 = 0.8 64.63% 67.33%
GridMix [25] 𝑔𝑟𝑖𝑑 = 8 × 8, 𝑝 = 0.8, 𝛾 = 0.15 64.79% 67.43%
Random SuperpixelGridMix [26] 𝑞 = 200, 𝑁 = 50 65.59% 69.37%
Random SuperpixelGridMix [26] 𝑞 = 64, 𝑁 = 7 66.46% 71.53%
OcCaMix† [12] 𝑞 ∼ 𝑈 (30, 70), 𝑁 = 7 67.35% 72.23%
PatchUp (input space) [24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 66.14% 70.49%
PatchUp (hidden space) [24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 67.06% 71.51%
Saliency Grafting [27] 𝛼 = 2, 𝑡𝑒𝑚𝑝𝑒𝑟𝑎𝑡𝑢𝑟𝑒 = 0.2 64.96% † 67.83%
AutoMix † [20] 𝛼 = 2, 𝑙 = 3 67.33% 70.72%
SAFuse(Ours) 𝑞 ∼ 𝑈 (20, 35), 𝑡 = 76.4%, 𝛾1 = 0.8, 𝛾2 = 0.08 68.31% 73.12%
Table 4
Performance on CUB-200-2011 with ResNet18 and ResNeXt50 as encoders.
Method Hyperparameters Top-1 Acc.
R18 RX50
Baseline – 75.56% 81.41%
CutMix [10] 𝛼=1 76.90% 82.63%
Attentive CutMix [11] 𝑁 =9 76.73% 82.34%
SaliencyMix [22] – 76.88% 82.81%
ResizeMix [23] 𝛼 = 0.1, 𝛽 = 0.8 76.23% 81.94%
GridMix [25] 𝑔𝑟𝑖𝑑 = 14 × 14, 𝑝 = 0.8, 𝛾 = 0.15 77.13% 82.17%
Random SuperpixelGridMix [26] 𝑞 = 200, 𝑁 = 50 77.58% 83.03%
Random SuperpixelGridMix [26] 𝑞 = 196, 𝑁 = 9 76.98% 82.19%
OcCaMix† [12] 𝑞 ∼ 𝑈 (30, 100), 𝑁 = 9 78.40% 83.69%
PatchUp (Input space) [24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 77.05% 82.66%
PatchUp (Hidden space) [24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 77.96% 83.27%
Saliency Grafting [27] 𝛼 = 2, 𝑡𝑒𝑚𝑝𝑒𝑟𝑎𝑡𝑢𝑟𝑒 = 0.2 77.43% 82.93%
AutoMix [20] 𝛼 = 2, 𝑙 = 3 78.17% 83.52%
SAFuse(Ours) 𝑞 ∼ 𝑈 (30, 40), 𝑡 = 85.4%, 𝛾1 = 0.85, 𝛾2 = 0.15 79.24% 84.61%
t denotes the top percentage for the selection of local superpixels. γ1 and γ2 are the two loss coefficients weighting the different loss values in Eq. (12). We tuned Attentive CutMix [11] and Random SuperpixelGridMix [26] for better results. The hyperparameters of the other competing methods are set according to the suggestions in the corresponding papers. All the experiments on CUB-200-2011 load models pre-trained on ImageNet. † marks results that are published in the corresponding paper.

Our method outperforms the baseline by 3.96% with ResNet18 as the encoder, and by 3.66% with ResNeXt50 as the encoder on CIFAR100, as shown in Table 2. In Table 3 on TinyImageNet, our method outperforms the second best by 0.96% with ResNet18 as encoder, and by 0.89% with ResNeXt50 as encoder. According to Tables 4 and 5, our SAFuse still outperforms on the fine-grained datasets, such as Stanford Dogs and CUB-200-2011. Table 6 indicates that our SAFuse performs the best not only when the encoder is based on the CNN structure, but also when the encoder is based on the transformer structure. Our SAFuse outperforms the baseline by 1.08% with TinyViT as the encoder and by 2.03% with ViT-B/16 as the encoder for CUB-200-2011. As can be seen in Table 7, SAFuse consistently shows the best performance on the ImageNet-1K dataset with the ResNet50 encoder.

Table 8 displays our method's results compared to representative data augmentation methods based on a single image. On the same dataset and with the same encoder, our approach outperforms the second best method, AdvMask [9], by 5.17% when ResNet50 is used as the encoder and by 1.07% when WRN-28-10 is used as the encoder. These results confirm our hypothesis that data augmentation by image fusion, especially fusion with additional object-part information, can significantly improve performance.
Table 5
Performance on Stanford Dogs with encoder ResNet50.
Method Hyperparameters Top-1 Acc. with R50
Baseline – 61.46%
CutMix [10] 𝛼=1 63.92%
Attentive CutMix [11] 𝑁 = 12 62.87%
SaliencyMix [22] – 64.28%
ResizeMix [23] 𝛼 = 0.1, 𝛽 = 0.8 64.58%
GridMix [25] 𝑔𝑟𝑖𝑑 = 14 × 14, 𝑝 = 0.8, 𝛾 = 0.15 62.55%
Random SuperpixelGridMix [26] 𝑞 = 200, 𝑁 = 50 68.79%
Random SuperpixelGridMix [26] 𝑞 = 196, 𝑁 = 12 67.76%
OcCaMix† [12] 𝑞 ∼ 𝑈 (50, 95), 𝑁 = 12 69.34%
PatchUp (Input space) [24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 64.03%
PatchUp (Hidden space)[24] pr = 0.7, block = 7, 𝛼 = 2, 𝛾 = 0.5 65.19%
Saliency Grafting [27] 𝛼 = 2, 𝑡𝑒𝑚𝑝𝑒𝑟𝑎𝑡𝑢𝑟𝑒 = 0.2 66.32%
AutoMix [20] 𝛼 = 2, 𝑙 = 3 69.12%
SAFuse(Ours) 𝑞 ∼ 𝑈 (40, 60), 𝑡 = 76.4%, 𝛾1 = 0.8, 𝛾2 = 0.08 70.36%
Table 6
Performance on CUB-200-2011 with encoder ViT-B/16 and TinyViT-11m-224.
Method Hyperparameters Top-1 Acc.
ViT-B/16 TinyViT11m
Baseline – 80.45% 86.96%
Random SuperpixelGridMix [26] 𝑞 = 200, 𝑁 = 50 81.32% 87.19%
OcCaMix [12] 𝑞 ∼ 𝑈 (30, 100), 𝑁 = 9 81.70% 87.88%
SAFuse(Ours) 𝑞 ∼ 𝑈 (30, 40), 𝑡 = 76.4%, 𝛾1 = 0.85, 𝛾2 = 0.15 82.48% 88.04%
Table 9 illustrates the performance when using ResNet50 as the encoder. Note that for the various datasets, the input image sizes differ and the encoders have slight variations (detailed in Section 4.1). However, the datasets and encoders remain entirely consistent across the different methods. At the same time, we compare the model size, inference speed, and computational complexity on the different datasets. Our method outperforms OcCaMix [12] on all three datasets, which also uses superpixel grid-based image fusion. OcCaMix [12] performs second best most of the time, but needs one forward propagation for object centering and one for training. We only need one forward propagation by using superpixel attention-based label fusion. Therefore, our proposal is more efficient and has a lower computational cost, as shown in Table 9. We emphasize that our approach does not require any superpixel-based operation in the inference phase. As can be seen in Table 9, our method achieves better performance despite the larger model size and lower inference speed due to feature fusion during inference. Moreover, compared to many sophisticated augmentation models, our method has lower computational complexity, which makes it both effective and efficient.

In summary, our pairwise fusion method surpasses single-image data augmentation methods, such as CutOut [14] and AdvMask [9], by providing richer information. Our approach outperforms comparison

5. Ablation studies

5.1. Effect of superpixel grid-based fusion

Superpixel grid-based fusion involves generating augmented images using a superpixel grid map rather than a square grid map. We can see the performance improvement resulting from superpixel grid-based fusion in Table 10, with an increase from 80.49% to 81.33%. This is due to the ability to preserve object-part information when employing the superpixel grid. Superpixel grid-based fusion drives our model to become object-part-aware.

5.2. Effect of weighted local superpixel classification

Weighted local superpixel classification involves performing local classification on the superpixel-based local regions while considering the semantic attention weights of the superpixels. As shown in Table 10, weighted local superpixel classification increases the performance from 81.33% to 82.10%. Compared to local superpixel classification, which does not consider superpixel attention weighting, using weighted local superpixel classification improves performance from 81.72% to 82.10%. The improvements result from the fact that local superpixel classification drives the model to concentrate
Table 9
Comparison of model size (Param.), computational complexity (FLOPs), inference speed (FPS), and top-1 accuracy. Measured
with NVIDIA GeForce RTX 2070 Super. ↑ means the larger the better, ↓ means the smaller the better. Bold marks the best.
Dataset (Input size) Method Param.(M)↓ FLOPs(G)↓ FPS↑ Top-1 Acc.↑
Baseline 23.71 1.30 211 80.19%
CIFAR100 (32 × 32) OcCaMix [12] 23.71 2.62 211 83.69%
SAFuse(Ours) 39.73 1.95 178 84.16%
Baseline 23.89 5.24 206 64.83%
TinyImageNet (64 × 64) OcCaMix [12] 23.89 10.48 206 69.22%
SAFuse(Ours) 38.24 7.81 165 71.49%
Baseline 23.91 6.26 195 79.47%
CUB-200-2011 (224 × 224) OcCaMix [12] 23.91 12.54 195 82.94%
SAFuse(Ours) 31.82 9.51 130 83.71%
Table 10
Ablation study of the proposed SAFuse. CIFAR100 was evaluated with the model of ResNet18 encoder. ‘‘Square.’’ denotes square grid-based
fusion. ‘‘Superpixel.’’ denotes superpixel grid-based fusion. ‘‘Local-cls.’’ denotes local superpixel classification. ‘‘Weighted Local-cls.’’ denotes
weighted local superpixel classification. ‘‘Local-con.’’ denotes local superpixel contrastive learning. ‘‘Weighted Local-con.’’ denotes weighted
local superpixel contrastive learning.
Square. Superpixel. Local-cls. Weighted Local-cls. Local-con. Weighted Local-con. Acc.
✔ ✘ ✘ ✘ ✘ ✘ 80.49%
✘ ✔ ✘ ✘ ✘ ✘ 81.33%
✘ ✔ ✔ ✘ ✘ ✘ 81.72%
✘ ✔ ✘ ✔ ✘ ✘ 82.10%
✘ ✔ ✘ ✘ ✔ ✘ 81.49%
✘ ✔ ✘ ✘ ✘ ✔ 81.84%
✘ ✔ ✔ ✘ ✔ ✘ 82.37%
✘ ✔ ✘ ✔ ✘ ✔ 82.54%
Fig. 4. Visualization of the accuracy varying with the combination of loss coefficients 𝛾1 and 𝛾2 . (a) Acc. for CIFAR100 with ResNet18 as the encoder. Best result is obtained
when 𝛾1 = 0.8 and 𝛾2 = 0.08; (b) Acc. for CUB-200-2011 with ResNet18 as the encoder. Best result is achieved when 𝛾1 = 0.8 and 𝛾2 = 0.15.
more on local superpixel regions. Furthermore, weighted local superpixel classification penalizes the model to increase its sensitivity towards more semantically significant superpixels. In our approach, we conduct weighted local superpixel classification on the selected discriminative superpixels with the highest attention weights. This enables the model to be effectively penalized to focus on the discriminative local regions. Our method becomes locally context-aware through weighted local superpixel classification, which takes into account the semantic attention weights of superpixels during local classification.

5.3. Effect of weighted superpixel contrastive learning

Weighted superpixel contrastive learning means performing contrastive learning on superpixel-based local regions across the images in a batch with a weighted superpixel-wise contrastive loss. The results in Table 10 demonstrate that weighted superpixel contrastive learning improves the performance from 81.33% to 81.84% when not combined with weighted local superpixel classification, and from 82.10% to 82.54% when combined with the weighted local superpixel classification. By incorporating the superpixel attention weights with a supervised contrastive loss, the performance increases from 81.33% to 81.84%. Cross-image weighted superpixel contrastive learning can pull close the intra-class discriminative superpixel-level feature embeddings and push apart the inter-class discriminative superpixel-level feature embeddings, achieving better alignment of embeddings. In our method, we perform weighted contrastive learning only on the superpixel features with the highest attention weights for better global semantic consistency.

5.4. Sensitivity to loss coefficients

Here, we study the performance of the proposed method as a function of the loss coefficients γ1 and γ2 in Eq. (12). The results presented in Fig. 4 indicate that the best result is obtained when γ1 = 0.8 and γ2 = 0.08 for CIFAR100 with models using ResNet18 as encoders. For CUB-200-2011, with models also using ResNet18 as encoders, we achieve the best result when γ1 = 0.8 and γ2 = 0.15. It is worth noting that focusing too much on local regions with a large loss coefficient γ1 can lead to the model failing to capture global semantic information and to poor performance. Similarly, while the primary purpose of contrastive
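Eq. (11) itself is not reproduced in this excerpt, but Section 3.7 describes a supervised, superpixel-wise contrastive loss over unit-normalized superpixel features, with intra-class positives P_i, inter-class negatives N_i, batch-normalized attention weights, and temperature τ = 0.7. The sketch below is one SupCon-style [34] weighted instantiation consistent with that description; it should not be read as the authors' exact formula.

import torch
import torch.nn.functional as F

def weighted_superpixel_contrastive_loss(c_s, w_s, y_s, tau=0.7):
    """Weighted supervised contrastive loss over selected superpixel embeddings.

    c_s: (N_B, d) selected superpixel features, w_s: (N_B,) normalized attention
    weights, y_s: (N_B,) image-level labels of the superpixels' source images.
    """
    z = F.normalize(c_s, dim=1)                              # unit-normalized features
    sim = z @ z.t() / tau                                    # cosine similarities / temperature
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = (y_s.unsqueeze(0) == y_s.unsqueeze(1)) & ~self_mask  # intra-class positives P_i

    logits = sim.masked_fill(self_mask, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = pos.sum(dim=1).clamp(min=1)
    loss_i = -(log_prob * pos).sum(dim=1) / pos_count        # average over positives of anchor i
    return (w_s * loss_i).sum() / w_s.sum().clamp(min=1e-8)  # attention-weighted average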
Fig. 5. (a) (b) Pairwise images; (c) (d) (e) The fused (augmented) images obtained with different amounts of superpixels.
Table 11
Influence of number of superpixels 𝑞 on CIFAR100 and CUB-200-2011 accuracy with models using ResNet18
encoder.
𝑞 Acc. for CIFAR100 𝑞 Acc. for CUB-200-2011
𝐶𝑜𝑛𝑠𝑡𝑎𝑛𝑡 ∶ 𝑞 = 28 81.57% 𝐶𝑜𝑛𝑠𝑡𝑎𝑛𝑡 ∶ 𝑞 = 35 78.23%
𝑞 ∼ 𝑈 (10, 20) 81.92% 𝑞 ∼ 𝑈 (20, 30) 78.75%
𝑞 ∼ 𝑈 (20, 30) 82.31% 𝑞 ∼ 𝑈 (20, 40) 78.94%
𝑞 ∼ 𝑈 (25, 30) 82.54% 𝑞 ∼ 𝑈 (30, 40) 79.24%
𝑞 ∼ 𝑈 (25, 35) 82.27% 𝑞 ∼ 𝑈 (30, 50) 78.87%
𝑞 ∼ 𝑈 (35, 40) 81.85% 𝑞 ∼ 𝑈 (50, 60) 78.49%
Fig. 6. Visualization of golden-section search for top percentage 𝑡. (a) Accuracy for CIFAR100 using ResNet18 encoder. The best result is obtained when 𝑡 = 76.4%; (b) Accuracy
for CUB-200-2011 using ResNet18 encoder. The best result is obtained when 𝑡 = 85.4%.
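Fig. 6 reports a golden-section search [44] over the top-selection percentage t. The prose describing the procedure is not part of this excerpt, so the snippet below is only a generic golden-section search over a unimodal objective (for example, validation accuracy as a function of t); the search interval and the validate function are placeholders, not values or code from the paper.

def golden_section_search(f, lo=0.5, hi=1.0, tol=1e-2):
    """Maximize a unimodal function f on [lo, hi] with golden-section search [44]."""
    inv_phi = (5 ** 0.5 - 1) / 2            # 1/phi, about 0.618
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:                          # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                                # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2

# Example (hypothetical): t_best = golden_section_search(lambda t: validate(model, t), 0.5, 1.0)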
Fig. 7. Visual representation of the heatmaps from trained ResNeXt50. (a) Original images. (b) Heatmaps generated by ResNeXt50 trained using the baseline method. (c) Heatmaps
generated by ResNeXt50 trained using our SAFuse.
Fig. 8. Visualization of t-SNE on CUB-200-2011 (labels from 0 to 9) features extracted by ResNet50. (a) Features obtained from baseline; (b) Features obtained from OcCaMix [12]; (c) Features obtained from our SAFuse method.
CRediT authorship contribution statement

D. Sun: Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. F. Dornaika: Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Supervision, Validation, Writing – original draft, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work is supported in part by grant PID2021-126701OB-I00 funded by MCIN/AEI/10.13039/501100011033, Spain and by "ERDF A way of making Europe". It is also partially supported by grant GIU23/022 funded by the University of the Basque Country (UPV/EHU). Open Access funding is provided by the University of the Basque Country. All authors approved the version of the manuscript to be published.

References

[1] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet classification with deep convolutional neural networks, Commun. ACM 60 (6) (2017) 84–90.
[2] Sen Qiu, Hongkai Zhao, Nan Jiang, Zhelong Wang, Long Liu, Yi An, Hongyu Zhao, Xin Miao, Ruichen Liu, Giancarlo Fortino, Multi-sensor information fusion based on machine learning for real applications in human activity recognition: State-of-the-art and research challenges, Inf. Fusion 80 (2022) 241–265.
[3] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam, Encoder–decoder with atrous separable convolution for semantic image segmentation, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 801–818.
[4] Li Guo, Pengfei Shi, Long Chen, Chenglizhao Chen, Weiping Ding, Pixel and region level information fusion in membership regularized fuzzy clustering for image segmentation, Inf. Fusion 92 (2023) 479–497.
[5] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Adv. Neural Inf. Process. Syst. 28 (2015).
[6] Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, Jieping Ye, Object detection in 20 years: A survey, Proc. IEEE (2023).
[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations, 2021.
[8] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah, Transformers in vision: A survey, ACM Comput. Surv. (CSUR) 54 (10s) (2022) 1–41.
[9] Suorong Yang, Jinqiao Li, Tianyue Zhang, Jian Zhao, Furao Shen, AdvMask: A sparse adversarial attack-based data augmentation method for image classification, Pattern Recognit. 144 (2023) 109847.
[10] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, Youngjoon Yoo, CutMix: Regularization strategy to train strong classifiers with localizable features, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6023–6032.
[11] Devesh Walawalkar, Zhiqiang Shen, Zechun Liu, Marios Savvides, Attentive CutMix: An enhanced data augmentation approach for deep learning based image classification, in: ICASSP 2020 – IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, IEEE, 2020, pp. 3642–3646.
[12] F. Dornaika, D. Sun, K. Hammoudi, J. Charafeddine, A. Cabani, C. Zhang, Object-centric contour-aware data augmentation using superpixels of varying granularity, Pattern Recognit. (2023) 109481.
[13] Connor Shorten, Taghi M. Khoshgoftaar, A survey on image data augmentation for deep learning, J. Big Data 6 (1) (2019) 1–48.
[14] Terrance DeVries, Graham W. Taylor, Improved regularization of convolutional neural networks with cutout, 2017, arXiv preprint arXiv:1708.04552.
[15] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, Yi Yang, Random erasing data augmentation, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 13001–13008.
[16] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz, Mixup: Beyond empirical risk minimization, in: International Conference on Learning Representations, 2018.
[17] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, Quoc V. Le, AutoAugment: Learning augmentation strategies from data, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 113–123.
[18] Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, Sungwoong Kim, Fast AutoAugment, Adv. Neural Inf. Process. Syst. 32 (2019).
[19] Samuel G. Müller, Frank Hutter, TrivialAugment: Tuning-free yet state-of-the-art data augmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 774–782.
[20] Zicheng Liu, Siyuan Li, Di Wu, Zihan Liu, Zhiyuan Chen, Lirong Wu, Stan Z. Li, AutoMix: Unveiling the power of mixup for stronger classifiers, in: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, Springer, 2022, pp. 441–458.
[21] Jang-Hyun Kim, Wonho Choo, Hyun Oh Song, Puzzle Mix: Exploiting saliency and local statistics for optimal mixup, in: International Conference on Machine Learning, PMLR, 2020, pp. 5275–5285.
[22] A.F.M. Shahab Uddin, Mst Sirazam Monira, Wheemyung Shin, TaeChoong Chung, Sung-Ho Bae, SaliencyMix: A saliency guided data augmentation strategy for better regularization, in: International Conference on Learning Representations, 2020.
[23] Jie Qin, Jiemin Fang, Qian Zhang, Wenyu Liu, Xingang Wang, Xinggang Wang, ResizeMix: Mixing data with preserved object information and true labels, 2020, arXiv preprint arXiv:2012.11101.
[24] Mojtaba Faramarzi, Mohammad Amini, Akilesh Badrinaaraayanan, Vikas Verma, Sarath Chandar, PatchUp: A feature-space block-level regularization technique for convolutional neural networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 589–597.
[25] Kyungjune Baek, Duhyeon Bang, Hyunjung Shim, GridMix: Strong regularization through local context mapping, Pattern Recognit. 109 (2021) 107594.
[26] Karim Hammoudi, Adnane Cabani, Bouthaina Slika, Halim Benhabiles, Fadi Dornaika, Mahmoud Melkemi, SuperpixelGridMasks data augmentation: Application to precision health and other real-world data, J. Healthc. Inform. Res. 6 (4) (2022) 442–460.
[27] Joonhyung Park, June Yong Yang, Jinwoo Shin, Sung Ju Hwang, Eunho Yang, Saliency grafting: Innocuous attribution-guided mixup with calibrated label mixing, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 7957–7965.
[28] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, Sabine Süsstrunk, SLIC superpixels compared to state-of-the-art superpixel methods, IEEE Trans. Pattern Anal. Mach. Intell. 34 (11) (2012) 2274–2282.
[29] Shengfeng He, Rynson W.H. Lau, Wenxi Liu, Zhe Huang, Qingxiong Yang, SuperCNN: A superpixelwise convolutional neural network for salient object detection, Int. J. Comput. Vis. 115 (2015) 330–344.
[30] Suha Kwak, Seunghoon Hong, Bohyung Han, Weakly supervised semantic segmentation using superpixel pooling network, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, 2017.
[31] Teppei Suzuki, Shuichi Akizuki, Naoki Kato, Yoshimitsu Aoki, Superpixel convolution for segmentation, in: 2018 25th IEEE International Conference on Image Processing, ICIP, IEEE, 2018, pp. 3249–3253.
[32] Ting Lu, Shutao Li, Leyuan Fang, Xiuping Jia, Jón Atli Benediktsson, From subpixel to superpixel: A novel fusion framework for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 55 (8) (2017) 4398–4411.
[33] Maryam Imani, Hassan Ghassemian, An overview on spectral and spatial information fusion for hyperspectral image classification: Current trends and challenges, Inf. Fusion 59 (2020) 59–83.
[34] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, Dilip Krishnan, Supervised contrastive learning, Adv. Neural Inf. Process. Syst. 33 (2020) 18661–18673.
[35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211–252.
[36] Patryk Chrabaszcz, Ilya Loshchilov, Frank Hutter, A downsampled variant of ImageNet as an alternative to the CIFAR datasets, 2017, arXiv preprint arXiv:1707.08819.
[37] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, Serge Belongie, The Caltech-UCSD Birds-200-2011 dataset, 2011.
[38] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, Fei-Fei Li, Novel dataset for fine-grained image categorization: Stanford Dogs, in: Proc. CVPR Workshop on Fine-Grained Visual Categorization, FGVC, 2011.
[39] Pei Guo, Overlap between ImageNet and CUB, https://guopei.github.io/2016/Overlap-Between-Imagenet-And-CUB/.
[40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[41] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He, Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.
[42] Sergey Zagoruyko, Nikos Komodakis, Wide residual networks, in: British Machine Vision Conference 2016, British Machine Vision Association, 2016.
[43] Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, Lu Yuan, TinyViT: Fast pretraining distillation for small vision transformers, in: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXI, Springer, 2022, pp. 68–85.
[44] Jack Kiefer, Sequential minimax search for a maximum, Proc. Amer. Math. Soc. 4 (3) (1953) 502–506.