Fourier-Basis Functions To Bridge Augmentation Gap: Rethinking Frequency Augmentation (CVPR 2024 Paper)
Abstract
Common visual augmentations impact different frequency components of images simultaneously, which are difficult to explicitly control, and might not encompass all possible frequency variations present in unseen corruptions or variations happening in real-world scenarios [36]. We thus rethink image augmentation in the frequency domain, and complement visual augmentation strategies with explicit use of Fourier basis functions in an adversarial setting.

There has been exploration into frequency-based augmentations to discover capabilities beyond what visual augmentations can achieve. [6, 41, 48] swap or mix partial amplitude spectrum between images, aiming to induce more phase-reliance for classification. [43] augments images with shortcut features to reduce their specificity for classification. AugSVF [37] introduces frequency noise within the AugMix framework, and [24, 28] adversarially perturb the frequency components of images. These augmentations are computationally heavy, due to the complicated augmentation framework [37], computation of multiple Fourier transforms for training images and their augmented versions [6, 41, 48], identification of learned frequency shortcuts [43], or adversarial training [24, 28].

In this work, we propose Auxiliary Fourier-basis Augmentation (AFA). We use additive noise based on Fourier-basis functions to augment the frequency spectrum in a more efficient way than other methods that apply frequency manipulations [6, 37, 43]. The effect of additive Fourier-basis functions on image appearance is complementary to those of other augmentations (see Fig. 1). These images can be interpreted as samples representing an adversarial distribution, distinct from those augmented by common visual transformations. We thus expand upon the conventional idea of adversarial augmentation, moving beyond the generation of imperceptible noise through gradient back-propagation. We employ a training architecture and strategy with an auxiliary component to address the adversarial distribution, and a main component for the original distribution, similarly to AugMax [42]. However, the adversarial distribution that we construct using additive Fourier-basis functions is much less computationally expensive than that of AugMax (and other visual augmentation methods; see Tab. 1). It contributes to comparable or higher generalization results, while allowing for the training of larger models on larger datasets (e.g. ImageNet). Our contributions are:
• We propose a straightforward and computationally efficient augmentation technique called AFA. We show that it enhances robustness of models to common image corruptions, improves OOD generalization and consistency of prediction w.r.t. perturbations;
• We expand the augmentation space, complementary to that of visual augmentations, by exploiting amplitude- and phase-adjustable frequency noise, and use it in an adversarial setting. Our method reduces the augmentation gap of common visual augmentations.

[Table 1: FLOPs and memory multipliers, relative to standard training, of AFA variants (w/o aux., standalone, w/ AugMix, w/ PRIME) compared with APR-SP, AugMix†, PRIME, and AugMax†; reported values range from ×1 to ×8 FLOPs and ×1.02 to ×3.06 memory.]
Table 1. AFA adds minimal computational burden to existing methods and is more efficient compared to other adversarial methods. It requires only ×1.62 memory and just ×2 the FLOPs of standard augmentation [12] training, whereas AugMax uses ×2.35 the memory and ×8 the FLOPs when using 5 PGD steps. Methods with † denote the use of a loss with JSD.

2. Related works

Data augmentation includes a set of techniques to increase data variety, thus reducing the distribution gap between training and test data. Generalization and robustness performance of models normally benefits from the use of data augmentation for training [45] or at test-time [19].
Image-based augmentations. Common image augmentation techniques include transformations, e.g. cropping, flipping, rotation, among others [45]. Applying the transformations with a fixed configuration lacks flexibility when the models encounter more variations in the inputs at testing time. Thus, algorithms were designed to combine transformations randomly, e.g. AugMix [15], RandAug [3], TrivialAugment [34], MixUp [52], and CutMix [51]. However, random combinations might not be optimal. In [2], AutoAugment was proposed, based on using reinforcement learning to find the best policy on how to combine basic transformations for augmentation. AugMax [42] instead combines transformations adversarially, aiming at complementing augmentations based on diversity with others that favour hardness of training data. PRIME [33] samples transformations with maximum-entropy distributions. [40] augments images based on knowledge distilled by a teacher model. However, these approaches address variations limited by visually-plausible transformations only.
Frequency-based augmentations. In [49], it was discovered that models trained with visual transformations might be vulnerable to noise impacting certain parts of the frequency spectrum (e.g. high-frequency components), demonstrating that visual augmentations do not completely guarantee robustness. Complementary augmentation techniques are thus required to fill the augmentation gap left by visual augmentations. The straightforward approach is augmentation in the frequency domain. For example, [6] mixes the amplitude spectrum of images to reduce reliance on the amplitude part of the spectrum and induce phase-reliance for classification. [41, 48] swap or mix the amplitude spectrum of images. [43] augments images with shortcut features to reduce their specificity for classification, mitigating frequency shortcut learning. [37] introduces frequency noise in the AugMix framework. [24, 29] adversarially
perturb images in the frequency domain. While these techniques address what visual augmentations may overlook, they also have limitations. Most frequency augmentation methods are based on manipulation of the frequency components of images. They usually have high computational requirements to identify frequency shortcuts [43] (f.i. using [44, 46]), implement an adversarial training setup [24] or calculate multiple Fourier transforms of original and augmented images [6, 41, 43, 48].

We instead propose to use Fourier-basis functions as additive noise in the frequency domain. Our augmentation technique requires only one extra step during training, rather than multiple pre-processing and expensive computations during training time as in other methods [6, 41, 43, 48], and works to complement image-based augmentations. Furthermore, we simplify the adversarial training framework of AugMax [42], not requiring an optimization process to maximize the hardness of adversarial augmentation, and achieving comparable or higher robustness. This allows the use of adversarial augmentations at a larger scale. We account for the induced distribution shifts in the frequency domain via an auxiliary component. The benefit of AFA is complementary to visual augmentations, and we can incorporate them seamlessly to further boost model robustness.

3. Preliminary: Fourier-basis functions

We utilize Fourier-basis functions in our augmentation strategy as an additive perturbation to the images. They are sinusoidal wave functions used as basic components of the Fourier transform to represent signals and images. A real Fourier basis function has two parameters, namely a frequency f and direction ω, and is denoted as:

A_{f,\omega}(u, v) = R\sin(2\pi f(u\cos(\omega) + v\sin(\omega) - \pi/4)),   (1)

where A_{f,\omega}(u, v) represents the amplitude of the wave at position (u, v). The function involves the sine of a 2D spatial frequency 2\pi f to produce a planar wave with a specific frequency f, and an angle \omega that indicates the direction of propagation. R is chosen such that the planar wave has unit l2-norm. A particular Fourier basis function, characterized by a specific frequency (f) and direction (ω), can be associated with a Dirac delta function in the spectral domain. Therefore, when employed in an additive manner, as in our augmentation strategy, this Fourier-basis function facilitates the targeted modification of particular frequency components of images. Examples of Fourier-basis waves superimposed on images are shown in Fig. 2.

Figure 2. Example of Fourier-basis functions added to natural images. They appear as gratings that obscure spatial information.

4. Auxiliary Fourier-basis Augmentation

The Auxiliary Fourier-basis Augmentation (AFA) that we propose is based on two lines of augmentations, one considered in-distribution (using visual augmentations) and another considered out-of-distribution or adversarial (using frequency-based noise), as shown in Fig. 3. We generate the adversarial augmented images by sampling a Fourier basis and a strength parameter per colour channel, and adding them to the original images. Visually augmented and adversarially augmented training images are then processed using a main component and an auxiliary component, respectively. Joint optimisation of two cross-entropy functions encourages robust and consistent classification, as it promotes correctness under adversarially augmented images. Details of the different parts of the method are reported below.
Generation of adversarial augmented images. Randomly sampling augmentations and applying them to images with random strengths was shown to be sufficient to outperform more complex strategies [34]. We follow this design principle in our method to generate adversarial augmented images with Fourier basis functions, which allows us to avoid optimization steps to determine the worst-case combination of augmentations as in AugMax [42]. We produce adversarial augmented images by adding a different Fourier basis function A_{f,ω} per channel of the original RGB image. We generate the Fourier basis functions by sampling f and ω from uniform distributions as f ∼ U[1,M] and ω ∼ U[0,π], where M is the image size. The sampling space of all Fourier-basis functions is denoted as V. We add the generated Fourier basis functions per channel c with a weight factor sampled from an exponential distribution σc ∼ Exp(1/λ), with c ∈ {R, G, B}. The selection of the exponential distribution for sampling the augmentation magnitude is motivated by the concept of event rate, where perturbations with larger magnitudes become progressively less likely, albeit still possible. This is controlled by adjusting λ, ensuring a balance between maintaining diversity in sampled values and minimizing the occurrence of extremely large augmentation perturbations. In Sec. 5.3, we show how the parameter λ affects the augmentation results.
The proposed augmentation process results in a 3-channel image x^a = [x^a_R, x^a_G, x^a_B], where:

x^a_c = \text{Clamp}_{[0,1]}(x_c + \sigma_c A_{f_c,\omega_c}),   c ∈ {R, G, B}.   (2)

An example of an image x^a augmented with additive Fourier-basis functions is shown in our method schema in Fig. 3.
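A minimal NumPy sketch of Eqs. (1)-(2) is given below for illustration. The normalized (u, v) coordinate grid, the exponential scale parameterization, and the helper names fourier_basis and afa_augment are illustrative assumptions and not the exact implementation used in the experiments.

```python
import numpy as np

def fourier_basis(M, f, omega):
    """Planar sine wave (Eq. 1) of size MxM, normalized to unit l2-norm."""
    u, v = np.meshgrid(np.arange(M) / M, np.arange(M) / M, indexing="ij")
    wave = np.sin(2 * np.pi * f * (u * np.cos(omega) + v * np.sin(omega) - np.pi / 4))
    return wave / (np.linalg.norm(wave) + 1e-12)   # R chosen so that ||A||_2 = 1

def afa_augment(x, lam=0.5, rng=np.random.default_rng()):
    """AFA sample (Eq. 2): add an independently sampled basis per RGB channel."""
    M = x.shape[0]                                  # x: (M, M, 3) float image in [0, 1]
    x_a = np.empty_like(x)
    for c in range(3):
        f = rng.integers(1, M + 1)                  # f ~ U[1, M]
        omega = rng.uniform(0.0, np.pi)             # omega ~ U[0, pi]
        sigma = rng.exponential(scale=lam)          # strength sigma_c, exponentially distributed
        x_a[..., c] = np.clip(x[..., c] + sigma * fourier_basis(M, f, omega), 0.0, 1.0)
    return x_a
```

Sampling a different frequency, direction and strength per channel is what produces the coloured grating patterns visible in Fig. 2.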
Figure 3. Schema of the AFA augmentation pipeline. The image x is augmented using AFA, which adds a planar wave per channel c of the
image at a strength value σc sampled from an exponential distribution (eq.2). The AFA augmented image xa is used for training, processed
through the auxiliary component of the parallel batch normalisation layer (for models that use batch normalization to track batch statistics,
e.g. ResNet). Other visual augmentations are applied in parallel, and used for training via the main component of the normalization layer.
Finally, we train via optimizing two cross-entropy losses, one for the main and the other for the auxiliary component.
We demonstrate the adversarial nature of the augmented samples in the supplementary material.
Auxiliary component for distribution shifts. As shown in Figs. 2 and 3, the Fourier-basis augmentations result in images with an unnatural appearance due to substantial frequency perturbations. The presence of planar waves across the augmented images determines the unnaturalness of image appearance, which can be seen as adversarial attacks on the images. These augmentations disrupt the learned mean and variance in batch normalization layers, which are inconsistent with the distribution shifts induced by our augmentation and lead to inconsistent activations. This results in a negative impact on model convergence and generalization abilities.
We address these issues by deploying architectural components in the training, capable of handling distribution shifts explicitly by tracking statistics and adjusting the loss function accordingly. Namely, we incorporate auxiliary components into the model, such as Parallel Batch Normalization layers and an additional cross-entropy term in the loss function, to specifically account for these adversarial augmented images. These modifications to the model architecture and training enhance performance, particularly in the presence of distribution shifts, contributing to better generalization, robustness to common corruptions and consistency to time-dependent increasing perturbations. The introduction of parallel batch normalization layers is motivated by the need to account for distribution shifts induced by adversarial (Fourier-basis) augmentations, as observed in [42]. With the parallel batch normalisation, the affine parameters and statistics of the main and auxiliary distributions are recorded separately. This allows independent learning of the distributions of the visually and adversarially augmented images. Without these additional normalization layers, the model training assumes a single-modal sample distribution, limiting its ability to differentiate between the main and the adversarial distribution, thus negatively affecting overall performance. In Sec. 5.3, we show the result of not employing the auxiliary components.
It is worth noting that for models that do not employ batch normalization layers (e.g. CCT, which uses layer normalization and does not track statistics), the parallel normalization layers are not needed. However, the extra term in the loss function (see next paragraph) to generate consistent predictions across distribution shifts serves as a regularization mechanism, which is verified in the supplementary material.
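The parallel batch normalization can be realized by keeping two normalization layers side by side and routing each batch through the one matching its distribution. The PyTorch sketch below is a minimal illustration under our own assumptions (the module name DualBatchNorm2d, a boolean adversarial flag, and plain nn.BatchNorm2d statistics); the actual implementation may differ, e.g. using DuBIN-style layers as in AugMax [42].

```python
import torch.nn as nn

class DualBatchNorm2d(nn.Module):
    """Two parallel BN layers: 'main' for visually augmented batches and
    'aux' for AFA (adversarially) augmented batches. Affine parameters and
    running statistics are tracked separately per distribution."""

    def __init__(self, num_features):
        super().__init__()
        self.bn_main = nn.BatchNorm2d(num_features)
        self.bn_aux = nn.BatchNorm2d(num_features)

    def forward(self, x, adversarial: bool = False):
        # Route the batch through the normalization layer matching its
        # distribution; at test time only bn_main is used.
        return self.bn_aux(x) if adversarial else self.bn_main(x)
```

A backbone would swap its BatchNorm2d layers for such a module and forward the AFA batch with adversarial=True; at evaluation only the main statistics are used, consistent with the testing protocol described in Sec. 5.1.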
Loss function. We work in the supervised learning setting with a training dataset D consisting of clean images x with labels y. We train the model in the main architecture stream (see Fig. 3) using a cross-entropy loss LCE(ŷ, y), where y is the ground-truth label and ŷ is the predicted label for images augmented with a given visual augmentation strategy (e.g. standard, PRIME, etc.). Under the non-auxiliary setting, models thus optimise the standard cross-entropy loss.
In the auxiliary setting, we add an extra cross-entropy loss term LCE(y^a, y), which optimises the model to predict the correct label on adversarially augmented images, whose predicted label is denoted by y^a, contributing to robustness of the model w.r.t. aggressive distribution shifts. We refer to the combined loss function LACE, taking the average of the two cross-entropy terms, as the Auxiliary Cross Entropy (ACE) Loss:
\mathcal{L}_{\text{ACE}}(\hat{y}, y^a, y) = \frac{1}{2}\left[\mathcal{L}_{\text{CE}}(\hat{y}, y) + \mathcal{L}_{\text{CE}}(y^a, y)\right].   (3)

It contributes to achieving comparable performance, with lower training time and complexity, than using the Jensen-Shannon Divergence (JSD) loss [15, 42]. Our motivation to not employ the JSD loss is the reduced training time due to less computational complexity. In our experiments, for comparison purposes, we also use the JSD loss in the auxiliary setting, where training batches are augmented using AFA and go through the auxiliary components. We report results in Sec. 5.3 (Fig. 6).
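A minimal training-step sketch of the ACE loss in Eq. (3) is shown below. It assumes the hypothetical afa_augment function from the earlier sketch, a model whose forward accepts an adversarial flag to select the parallel normalization statistics, and a user-supplied visual_augment transform; none of these names come from the released code.

```python
import torch.nn.functional as F

def ace_training_step(model, x_clean, y, visual_augment, afa_augment, optimizer):
    """One optimization step with the Auxiliary Cross-Entropy (ACE) loss, Eq. (3)."""
    x_main = visual_augment(x_clean)        # in-distribution branch (e.g. AugMix, PRIME)
    x_aux = afa_augment(x_clean)            # adversarial branch: additive Fourier bases

    logits_main = model(x_main, adversarial=False)   # main (standard) BN statistics
    logits_aux = model(x_aux, adversarial=True)      # auxiliary BN statistics

    loss = 0.5 * (F.cross_entropy(logits_main, y) + F.cross_entropy(logits_aux, y))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Compared with the JSD objective, only two forward passes per image are needed, which is where most of the training-time saving comes from.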
5. Experiments and results

We compare AFA with other popular augmentation techniques, evaluating robustness to common corruptions, generalization abilities and consistency to time-dependent increasing perturbations, on benchmark datasets.

5.1. Experiment setup

Datasets. We trained models on the CIFAR-10 (C10) [20], CIFAR-100 (C100) [21], TinyImageNet (TIN) [22] and ImageNet (IN) [4] datasets and evaluate them on the corresponding robustness benchmark datasets, namely C10-C, C100-C, TIN-C, IN-C [14], IN-C̄ [32], and IN-3DCC [18]. For ImageNet-trained models, we further evaluate their generalisation performance on the IN-v2 [35] and IN-R datasets [13], and consistency of performance on time-dependent increasing perturbations on the IN-P dataset [14].
Architectures and training details. We train ResNet [12] and transformers (CCT [7], CVT [7] and ViT [5]). We train ResNet-18, CCT-7/3x1 (32 resolution), CVT and ViT-Lite on C10, C100, and only ResNet-18 on TIN. In the case of ImageNet, we train ResNet-18, ResNet-50 and CCT-14/7x2 (224 resolution). Under the auxiliary setting, we use the Du- [...].
We report the average accuracy over all corruptions in the robustness benchmarks as robustness accuracy (RA). This provides a direct comparison between model performance on the original and corruption benchmark datasets. We also compute the mean corruption error (mCE) [14] for TIN and IN (for CIFAR there are no baselines advised) to evaluate the normalized robustness of models against image corruptions, and the mean flip rate (mFR) and the mean top-5 distance (mT5D) to evaluate the consistency performance of models against increasing perturbations. For the evaluation of generalization performance, we compute the accuracy on the ImageNet-R and ImageNet-v2 test sets (note that ImageNet-v2 has 3 test sets, and we report the average accuracy on them). We only use the main BN layers during testing, similar to AugMax. More details about the metrics are in the supplementary material.
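The mCE metric normalizes a model's corruption errors by those of a reference model before averaging over corruption types. A compact sketch of the standard definition from [14] is below; the dictionary layout, the function name, and the illustrative numbers in the usage comment are our own assumptions, and the benchmarks' advised baselines are used for TIN and IN.

```python
import numpy as np

def mean_corruption_error(err, baseline_err):
    """Mean corruption error (mCE), following [14]: per-corruption errors,
    summed over severities 1..5, are normalized by a baseline model's errors
    and then averaged over corruption types.

    err, baseline_err: dicts mapping corruption name -> list of five error
    rates (in %), one per severity level.
    """
    ratios = [sum(err[c]) / sum(baseline_err[c]) for c in err]
    return 100.0 * float(np.mean(ratios))

# Hypothetical usage with two corruption types and five severities each:
# mce = mean_corruption_error(
#     {"gaussian_noise": [10, 15, 22, 30, 41], "fog": [8, 11, 14, 19, 26]},
#     {"gaussian_noise": [20, 28, 38, 50, 62], "fog": [15, 19, 25, 33, 44]})
```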
5.2. Results

Comparison with AugMax. We first report a direct comparison with AugMax [42] in Tab. 2, as AFA addresses the computational shortcomings of generating adversarial augmentations via PGD iterations, and of using a JSD loss for alignment of the distributions of original and (adversarially) augmented images. We use AugMix as the main augmentation, as in AugMax, and ablate on the use of the JSD and ACE losses.
We show that AFA achieves comparable (or better) performance than AugMax, despite being much less computationally intensive. We indeed demonstrate that we can generate adversarial augmentations by only adding (weighted) Fourier-basis waves per colour channel, not requiring PGD steps, and can train the models using an extra cross-entropy term instead of the expensive JSD loss. The improvements granted by our approach are particularly evident in the case of ImageNet (using ACE), where we gain 1.6% of standard accuracy and 4.1% of robust accuracy (5.6% mCE) performance w.r.t. AugMax. Considering the increased computational efficiency and the simplicity of the adversarial augmentation method, AFA is a more versatile and effective tool than AugMax. Hence, in the rest of the paper, we do not report further results of the AugMax framework, due to its high computational requirements, which complicate the training of larger models (e.g. ResNet-50 and CCT).

| Dataset | Main | Auxiliary | SA↑ | RA↑ | mCE↓ |
| C10 | AugMix† | ✗ | 95.47 | 86.48 | - |
| TIN | AugMix† | AugMax | 62.21 | 38.67 | 80.72 |
| TIN | AugMix† | AFA | 64.34 | 38.53 | 80.79 |
| TIN | AugMix | AFA | 62.51 | 38.67 | 80.83 |
| IN | AugMix† | ✗ | 65.2 | 31.5 | 87.1 |
| IN | AugMix† | AugMax | 66.5 | 36.5 | 80.6 |
| IN | AugMix† | AFA | 65.0 | 36.8 | 80.4 |
| IN | AugMix | AFA | 68.1 | 41.1 | 75.0 |

Table 2. Comparison of AFA and AugMax (with AugMix for visual augmentation [42]), with a ResNet18 backbone. The mark † indicates the use of the JSD loss, otherwise the ACE loss is used.
| Arch | Main | Aux | SA (↑) | IN-C RA (↑) | IN-C mCE (↓) | IN-C̄ RA (↑) | IN-C̄ mCE (↓) | IN-3DCC RA (↑) | IN-3DCC mCE (↓) | IN-R Acc. (↑) | IN-v2 Avg. Acc. (↑) | IN-P mFR (↓) | IN-P mT5D (↓) |
| ResNet18 | - | ✗ | 68.9 | 32.9 | 84.7 | 34.8 | 87.0 | 34.9 | 84.4 | 33.1 | 64.3 | 72.8 | 87.0 |
| ResNet18 | - | AFA | 68.2 | 35.9 | 81.0 | 41.7 | 78.3 | 37.1 | 81.7 | 32.8 | 63.7 | 64.2 | 76.8 |
| ResNet18 | AugMix† | ✗ | 65.2 | 31.5 | 87.1 | 34.6 | 87.3 | 32.1 | 88.3 | 28.2 | 59.5 | 80.2 | 86.2 |
| ResNet18 | AugMix† | AFA | 65.0 | 36.8 | 80.4 | 40.9 | 79.3 | 36.0 | 83.2 | 30.6 | 60.9 | 60.1 | 68.5 |
| ResNet18 | AugMix | AFA | 68.1 | 41.1 | 75.0 | 45.2 | 73.3 | 38.9 | 79.4 | 35.2 | 63.2 | 68.5 | 81.7 |
| ResNet18 | PRIME | ✗ | 66.0 | 43.6 | 72.0 | 42.0 | 78.1 | 42.4 | 75.2 | 36.9 | 61.4 | 54.7 | 65.3 |
| ResNet18 | PRIME | AFA | 67.2 | 47.2 | 67.8 | 47.3 | 71.1 | 43.8 | 73.5 | 37.8 | 63.0 | 52.3 | 63.7 |
| ResNet18 | TA+ | ✗ | 68.9 | 36.9 | 80.1 | 35.9 | 85.6 | 38.6 | 79.7 | 32.6 | 63.7 | 68.1 | 81.4 |
| ResNet18 | TA+ | AFA | 67.8 | 41.4 | 74.7 | 42.9 | 76.7 | 41.1 | 76.5 | 35.4 | 62.7 | 59.9 | 72.3 |
| ResNet50 | - | ✗ | 75.6 | 39.2 | 76.7 | 39.9 | 79.4 | 41.2 | 76.1 | 36.2 | 70.8 | 58.0 | 78.4 |
| ResNet50 | - | AFA | 76.5 | 46.2 | 68.0 | 47.6 | 69.4 | 46.2 | 69.8 | 38.1 | 72.0 | 48.0 | 67.2 |
| ResNet50 | APR-SP | ✗ | 71.9 | 42.9 | 72.7 | 45.9 | 72.5 | 39.8 | 78.4 | 34.9 | 67.2 | 60.2 | 75.4 |
| ResNet50 | APR-SP | AFA | 74.4 | 47.6 | 66.7 | 51.4 | 64.9 | 42.6 | 74.6 | 38.7 | 69.3 | 54.9 | 72.6 |
| ResNet50 | AugMix† | ✗ | 74.7 | 43.4 | 72.0 | 44.6 | 73.3 | 41.9 | 75.5 | 33.0 | 70.0 | 60.9 | 72.5 |
| ResNet50 | AugMix† | AFA | 75.6 | 50.6 | 62.9 | 51.8 | 64.0 | 47.6 | 68.3 | 36.3 | 71.2 | 44.5 | 56.1 |
| ResNet50 | AugMix | AFA | 76.6 | 49.1 | 64.7 | 52.5 | 62.9 | 46.3 | 69.6 | 41.0 | 71.8 | 52.2 | 72.2 |
| ResNet50 | PRIME | ✗ | 72.1 | 49.2 | 64.9 | 46.4 | 71.5 | 47.2 | 68.8 | 38.5 | 67.8 | 45.4 | 58.1 |
| ResNet50 | PRIME | AFA | 74.5 | 53.9 | 59.2 | 54.2 | 61.3 | 50.2 | 65.0 | 40.9 | 69.8 | 40.4 | 54.8 |
| ResNet50 | TA+ | ✗ | 75.9 | 43.4 | 71.7 | 41.8 | 77.1 | 44.7 | 71.6 | 37.1 | 70.3 | 51.9 | 70.4 |
| ResNet50 | TA+ | AFA | 76.6 | 50.3 | 63.1 | 49.7 | 66.7 | 49.6 | 65.4 | 40.0 | 72.2 | 45.1 | 64.5 |
| CCT | - | ✗ | 76.4 | 43.9 | 70.7 | 50.3 | 65.6 | 43.4 | 73.2 | 35.6 | 71.2 | 48.3 | 72.9 |
| CCT | - | AFA | 76.9 | 51.9 | 61.0 | 58.5 | 55.4 | 50.7 | 64.4 | 39.0 | 71.9 | 38.4 | 61.8 |
| CCT | AugMix | ✗ | 76.1 | 47.3 | 66.8 | 52.2 | 63.1 | 45.3 | 71.0 | 37.9 | 70.7 | 49.3 | 72.8 |
| CCT | AugMix | AFA | 77.4 | 56.5 | 55.6 | 60.8 | 52.2 | 51.8 | 62.8 | 41.0 | 72.5 | 37.9 | 59.9 |
| CCT | PRIME | ✗ | 73.6 | 54.1 | 58.6 | 54.5 | 60.8 | 50.7 | 64.4 | 39.2 | 68.7 | 36.1 | 53.0 |
| CCT | PRIME | AFA | 76.6 | 58.7 | 52.8 | 61.2 | 52.0 | 54.5 | 59.4 | 43.2 | 71.9 | 31.9 | 51.2 |
| CCT | TA+ | ✗ | 77.1 | 50.2 | 63.2 | 54.1 | 60.7 | 49.3 | 65.8 | 38.2 | 72.1 | 41.8 | 66.3 |
| CCT | TA+ | AFA | 76.9 | 56.0 | 56.0 | 59.1 | 54.6 | 53.1 | 61.1 | 41.1 | 72.1 | 36.4 | 58.5 |

Table 3. Robustness, generalization and consistency results on ImageNet-based benchmarks. Models with † use the JSD loss. TrivialAugment (TA) has overlapping augmentations with IN-C (marked with +), and no other overlaps with other datasets. The green colour indicates an improvement when the main augmentation is combined with AFA, while red indicates no improvement. Results marked in bold are the best for a particular architecture.
Robustness, generalization and consistency. In Tab. 3, we report results achieved by AFA combined with different visual augmentation methods, AugMix, PRIME, TrivialAugment (TA), to train different architectures (ResNet, CCT). We evaluate robustness to common corruptions on IN-C, IN-C̄ and IN-3DCC, OOD generalisation on IN-v2 and IN-R, and consistency w.r.t. increasing perturbations on IN-P.
AFA generally contributes to a boost of performance (green colored results in Tab. 3) when combined with different visual augmentation techniques, reducing the robustness and generalization gap for different model architectures. When compared to another Fourier-based augmentation technique, APR-SP [6], AFA outperforms it on all benchmarks when trained with only standard augmentation techniques. When models are trained with AugMix and AFA, we record better overall performance than those trained with AugMix alone. For the transformer architecture CCT, training with AFA contributes to an even stronger improvement in all tests. These results stay consistent for smaller resolution datasets (CIFAR and TIN), as we report at the end of this section.
Robustness to high-severity corruptions. AFA contributes to a consistent improvement of robustness of models at increasing corruption severity. We compute the relative corruption error, namely the difference between the corruption error of models trained with a visual augmentation technique only and that of models trained with both visual augmentations and AFA, and report it in Fig. 4 for different corruption severities. A positive value indicates that models trained with the addition of AFA have better robustness. For higher corruption severity, AFA contributes to stronger robustness, measured by an increase in the relative corruption error in Fig. 4. The improvements obtained by AFA on IN-3DCC are slightly less pronounced than those on IN-C and IN-C̄. This is attributable to the specific corruptions in IN-3DCC that concern 3D geometric information and are somewhat more complicated image transformations. However, AFA contributes to a substantial improvement w.r.t. models trained without it. We thus highlight that AFA is very beneficial for increasing robustness to aggressive corruptions of the test images. Details of the results at different [...]
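As a worked illustration of the relative corruption error plotted in Fig. 4, the short sketch below subtracts the per-severity errors of a model trained with a visual augmentation plus AFA from those of the corresponding model trained with the visual augmentation alone, so that positive values mean AFA helps; the array layout and function name are our own assumptions.

```python
import numpy as np

def relative_corruption_error(err_visual_only, err_visual_plus_afa):
    """Per-severity relative corruption error (positive = AFA improves robustness).

    Both inputs: arrays of shape (num_corruptions, 5) with classification
    errors (%) for severities 1..5; the result is averaged over corruptions.
    """
    err_visual_only = np.asarray(err_visual_only, dtype=float)
    err_visual_plus_afa = np.asarray(err_visual_plus_afa, dtype=float)
    return (err_visual_only - err_visual_plus_afa).mean(axis=0)
```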
[Figure 4: three panels (IN-C, IN-C̄, IN-3DCC) plotting the relative corruption error against corruption severity 1-5 for ResNet50 and CCT models trained with PRIME+AFA, TA+AFA, and AugMix+AFA.]
Figure 4. Relative error per corruption severity, computed as the difference between the classification error of models trained with PRIME, TrivialAugment, and AugMix and that of the corresponding models trained with PRIME+AFA, TrivialAugment+AFA, and AugMix+AFA.
Fourier heatmap: robustness in the frequency spectrum. We further evaluate the robustness of models to perturbations at specific frequencies, using test images perturbed with frequency noise according to [49]. We present the results in the form of Fourier heatmaps (see Fig. 5). The intensity of a pixel at location (u, v) in the heatmap indicates the classification error of a model tested on images perturbed by Fourier noise at frequency (u, v) in the frequency spectrum (implementation details are in the supplementary material). ResNet18 trained with the standard augmentation setting (baseline) is very sensitive to perturbations at low and middle-high frequencies (see Fig. 5), while models trained with visual augmentations like PRIME and TrivialAugment (TA) still show vulnerability to low and middle-high frequency noise. When training models with AFA, i.e. PRIME+AFA and TA+AFA, the models become more robust to frequency perturbations, especially at middle-high frequencies. AFA can provide extensive robustness to frequency perturbations and bridge the robustness gap that visual augmentation might not cover.
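A Fourier heatmap of this kind can be sketched as follows. The mapping from a heatmap position (u, v) to a planar wave, the l2 budget eps, and the evaluate_error placeholder are our assumptions about the protocol of [49], not the exact implementation used here (see the supplementary material for that).

```python
import numpy as np

def fourier_heatmap(images, labels, evaluate_error, eps=4 / 255):
    """Error heatmap under single-frequency perturbations, in the spirit of [49].

    images: (N, M, M, 3) array in [0, 1]; evaluate_error(x, y) -> error in [0, 1].
    """
    M = images.shape[1]
    grid = np.arange(M) / M
    uu, vv = np.meshgrid(grid, grid, indexing="ij")
    heatmap = np.zeros((M, M))
    for u in range(M):
        for v in range(M):
            f = max(1.0, np.hypot(u - M // 2, v - M // 2))      # radial frequency
            omega = np.arctan2(v - M // 2, u - M // 2) % np.pi   # direction
            wave = np.sin(2 * np.pi * f * (uu * np.cos(omega) + vv * np.sin(omega)))
            wave /= np.linalg.norm(wave) + 1e-12                 # unit l2-norm basis
            x = np.clip(images + eps * wave[None, :, :, None], 0.0, 1.0)
            heatmap[u, v] = evaluate_error(x, labels)
    return heatmap
```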
Results on CIFAR and TIN. In Tab. 4, we present the robustness results on smaller resolution datasets, C10 and C100. The results on TIN are in the supplementary material. These results are in line with those on IN in Tab. 3.

| Arch | Main | Aux | C10 SA (↑) | C10-C RA (↑) | C100 SA (↑) | C100-C RA (↑) |
| CVT | AugMix | ✗ | 95.10 | 85.42 | 75.79 | 60.83 |
| CVT | AugMix | AFA | 95.93 | 90.57 | 77.22 | 66.18 |
| CVT | PRIME | ✗ | 95.30 | 90.56 | 76.65 | 67.92 |
| CVT | PRIME | AFA | 95.49 | 91.40 | 76.50 | 67.89 |
| CVT | - | AFA | 94.58 | 86.71 | 75.13 | 58.25 |

Table 4. Results for C10-C and C100-C with ResNet18, CCT, CVT and ViT-Lite. Models with † use a loss with JSD.

5.3. Ablation

Auxiliary components. We investigate the contribution and importance of the auxiliary components in improving model robustness. We trained models with AFA-augmented images, passing through only the main components or the auxiliary components. The results in Tab. 5, i.e. lower RA and higher mCE of models trained with AFA applied only in the main components, highlight the importance of the AFA auxiliary components. The auxiliary components play a crucial role in mitigating the impact of aggressive adversarial distribution shifts induced by AFA. By doing so, they contribute to the model's ability to learn from the original distribution, while AFA facilitates learning robustness to distribution shifts.
| Dataset | Main | Auxiliary | SA↑ | RA↑ | mCE↓ |
| C10 | - | ✗ | 94.15 | 73.67 | - |
| C10 | AFA | ✗ | 92.36 | 83.25 | - |
| C10 | - | AFA | 94.69 | 88.22 | - |
| C100 | - | ✗ | 78.27 | 48.30 | - |
| IN | AFA | ✗ | 66.7 | 33.3 | 84.4 |
| IN | - | AFA | 68.2 | 35.9 | 81.0 |

Table 5. Ablation results for ResNet18 trained with and without the Auxiliary Components on C10, C100, TinyImageNet and ImageNet.

While model robustness improves under both settings, the performance gain for the auxiliary setting is three to five percentage points higher across all datasets.
ACE vs JSD. As part of our method, we replaced the use of JSD with ACE, which is less computationally burdening.

[Figure 6: comparison of SA (%), RA (%) and mCE (%) for ResNet-18 models trained with the ACE objective and with the JSD objective.]
Figure 6. Comparison of training objectives with and without the JSD term. All models are ResNet-18 trained with only AFA in the auxiliary component and no other augmentations. When used with JSD, two batches are passed through the auxiliary components and there is no main augmentation (3 batches in total: 1 clean and 2 AFA).
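For reference, the JSD objective that ACE replaces (as used in AugMix [15] and AugMax [42]) enforces consistency by averaging the KL divergences of the clean and the two augmented predictive distributions to their mixture, which is why it needs three forward passes per image. A compact PyTorch sketch is below; the function name and the probability clamping are our own choices.

```python
def jsd_consistency(p_clean, p_aug1, p_aug2, eps=1e-7):
    """Jensen-Shannon consistency across three predictive distributions
    (softmax outputs of shape (N, num_classes)), AugMix-style."""
    m = ((p_clean + p_aug1 + p_aug2) / 3.0).clamp_min(eps)

    def kl(p):
        p = p.clamp_min(eps)
        return (p * (p.log() - m.log())).sum(dim=1).mean()

    return (kl(p_clean) + kl(p_aug1) + kl(p_aug2)) / 3.0
```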
References

[1] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations, 2020. 1
[2] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation policies from data, 2019. 1, 2
[3] Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le. RandAugment: Practical automated data augmentation with a reduced search space. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 18613–18624. Curran Associates, Inc., 2020. 2
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. 5
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. 5
[6] Chen et al. Amplitude-phase recombination: Rethinking robustness of convolutional neural networks in frequency domain, 2021. 2, 3, 6
[7] Hassani et al. Escaping the big data paradigm with compact transformers, 2022. 5
[8] Fartash Faghri, Hadi Pouransari, Sachin Mehta, Mehrdad Farajtabar, Ali Farhadi, Mohammad Rastegari, and Oncel Tuzel. Reinforce data, multiply impact: Improved model accuracy and robustness with dataset reinforcement, 2023. 1
[9] Zhiqiang Gao, Kaizhu Huang, Rui Zhang, Dawei Liu, and Jieming Ma. Towards better robustness against common corruptions for unsupervised domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 18882–18893, October 2023. 1
[10] Antonio Greco, Nicola Strisciuglio, Mario Vento, and Vincenzo Vigilante. Benchmarking deep networks for facial emotion recognition in the wild. Multimedia Tools and Applications, 82(8):11189–11220, 2023. 1
[11] Xiaoshuai Hao, Yi Zhu, Srikar Appalaraju, Aston Zhang, Wanqian Zhang, Bo Li, and Mu Li. MixGen: A new multi-modal data augmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pages 379–389, January 2023. 1
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015. 2, 5
[13] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization, 2021. 1, 5
[14] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations, 2019. 1, 5
[15] Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. AugMix: A simple data processing method to improve robustness and uncertainty. arXiv, Dec. 2019. 1, 2, 5
[16] Ignacio Hounie, Luiz F. O. Chamon, and Alejandro Ribeiro. Automatic data augmentation via invariance-constrained learning. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 13410–13433. PMLR, 23–29 Jul 2023. 1
[17] Christoph Kamann and Carsten Rother. Benchmarking the robustness of semantic segmentation models with respect to common corruptions. International Journal of Computer Vision, 129(2):462–483, Feb 2021. 1
[18] Oğuzhan Fatih Kar, Teresa Yeo, Andrei Atanov, and Amir Zamir. 3D common corruptions and data augmentation, 2022. 1, 5
[19] Ildoo Kim, Younghoon Kim, and Sungwoong Kim. Learning loss for test-time augmentation. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4163–4174. Curran Associates, Inc., 2020. 2
[20] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). 5
[21] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-100 (Canadian Institute for Advanced Research). 5
[22] Ya Le and Xuan S. Yang. Tiny ImageNet visual recognition challenge. 2015. 5
[23] Xiu-Chuan Li, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. F-mixup: Attack CNNs from Fourier perspective. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 541–548, 2021. 1
[24] Chang Liu, Wenzhao Xiang, Yuan He, Hui Xue, Shibao Zheng, and Hang Su. Improving model generalization by on-manifold adversarial augmentation in the frequency domain, 2023. 2, 3
[25] Jiashuo Liu, Zheyan Shen, Yue He, Xingxuan Zhang, Renzhe Xu, Han Yu, and Peng Cui. Towards out-of-distribution generalization: A survey, 2023. 1
[26] Siao Liu, Zhaoyu Chen, Yang Liu, Yuzheng Wang, Dingkang Yang, Zhile Zhao, Ziqing Zhou, Xie Yi, Wei Li, Wenqiang Zhang, and Zhongxue Gan. Improving generalization in visual reinforcement learning via conflict-aware gradient agreement augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 23436–23446, October 2023. 1
[27] Yang Liu, Shen Yan, Laura Leal-Taixé, James Hays, and Deva Ramanan. Soft augmentation for image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16241–16250, June 2023. 1
[28] Yuyang Long, Qilong Zhang, Boheng Zeng, Lianli Gao, Xianglong Liu, Jian Zhang, and Jingkuan Song. Frequency domain model augmentation for adversarial attack. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 549–566, Cham, 2022. Springer Nature Switzerland. 2
[29] Yuyang Long, Qilong Zhang, Boheng Zeng, Lianli Gao, Xianglong Liu, Jian Zhang, and Jingkuan Song. Frequency domain model augmentation for adversarial attack. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 549–566, Cham, 2022. Springer Nature Switzerland. 2
[30] Guozheng Ma, Linrui Zhang, Haoyu Wang, Lu Li, Zilin Wang, Zhen Wang, Li Shen, Xueqian Wang, and Dacheng Tao. Learning better with less: Effective augmentation for sample-efficient visual reinforcement learning, 2023. 1
[31] Juliette Marrie, Michael Arbel, Diane Larlus, and Julien Mairal. SLACK: Stable learning of augmentations with cold-start and KL regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24306–24314, June 2023. 1
[32] Eric Mintun, Alexander Kirillov, and Saining Xie. On interaction between augmentations and corruptions in natural corruption robustness, 2021. 1, 5
[33] Apostolos Modas, Rahul Rade, Guillermo Ortiz-Jiménez, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. PRIME: A few primitives can boost robustness to common corruptions. arXiv, Dec. 2021. 1, 2
[34] Samuel G. Müller and Frank Hutter. TrivialAugment: Tuning-free yet state-of-the-art data augmentation. arXiv, Mar. 2021. 1, 2, 3
[35] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet?, 2019. 1, 5
[36] Tonmoy Saikia, Cordelia Schmid, and Thomas Brox. Improving robustness against common corruptions with frequency biased models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10211–10220, October 2021. 2
[37] Ryan Soklaski, Michael Yee, and Theodoros Tsiligkaridis. Fourier-based augmentations for improved robustness and uncertainty calibration. arXiv, Feb. 2022. 2
[38] Nicola Strisciuglio and George Azzopardi. Visual response inhibition for increased robustness of convolutional networks to distribution shifts. In NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications, 2022. 1
[39] Nicola Strisciuglio, Manuel Lopez-Antequera, and Nicolai Petkov. Enhanced robustness of convolutional networks with a push–pull inhibition layer. Neural Computing and Applications, 32(24):17957–17971, 2020. 1
[40] Teppei Suzuki. TeachAugment: Data augmentation optimization using teacher knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10904–10914, June 2022. 2
[41] An Wang, Mobarakol Islam, Mengya Xu, and Hongliang Ren. Curriculum-based augmented Fourier domain adaptation for robust medical image segmentation. IEEE Transactions on Automation Science and Engineering, pages 1–13, 2023. 2, 3
[42] Haotao Wang, Chaowei Xiao, Jean Kossaifi, Zhiding Yu, Anima Anandkumar, and Zhangyang Wang. AugMax: Adversarial composition of random augmentations for robust training. arXiv, Oct. 2021. 1, 2, 3, 4, 5
[43] Shunxin Wang, Christoph Brune, Raymond Veldhuis, and Nicola Strisciuglio. DFM-X: Augmentation by leveraging prior knowledge of shortcut learning. In 4th Visual Inductive Priors for Data-Efficient Deep Learning Workshop, 2023. 2, 3
[44] Shunxin Wang, Raymond Veldhuis, Christoph Brune, and Nicola Strisciuglio. Frequency shortcut learning in neural networks. In NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications, 2022. 3
[45] Shunxin Wang, Raymond Veldhuis, Christoph Brune, and Nicola Strisciuglio. A survey on the robustness of computer vision models against common corruptions. arXiv, May 2023. 1, 2
[46] Shunxin Wang, Raymond Veldhuis, Christoph Brune, and Nicola Strisciuglio. What do neural networks learn in image classification? A frequency shortcut perspective. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1433–1442, October 2023. 3
[47] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. 1
[48] Qinwei Xu, Ruipeng Zhang, Ziqing Fan, Yanfeng Wang, Yi-Yan Wu, and Ya Zhang. Fourier-based augmentation with applications to domain generalization. Pattern Recognition, 139:109474, 2023. 2, 3
[49] Dong Yin, Raphael Gontijo Lopes, Jonathon Shlens, Ekin D. Cubuk, and Justin Gilmer. A Fourier perspective on model robustness in computer vision. arXiv, June 2019. 1, 2, 7
[50] Mehmet Kerim Yucel, Ramazan Gokberk Cinbis, and Pinar Duygulu. HybridAugment++: Unified frequency spectra perturbations for model robustness, 2023. 1
[51] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features, 2019. 2
[52] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018. 2
[53] Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training. arXiv, Apr. 2016. 1