
Data-Efficient Augmentation for Training Neural Networks

Tian Yu Liu, Baharan Mirzasoleiman
Department of Computer Science, University of California, Los Angeles
[email protected], [email protected]

Abstract

Data augmentation is essential to achieve state-of-the-art performance in many deep learning applications. However, the most effective augmentation techniques
become computationally prohibitive for even medium-sized datasets. To address
this, we propose a rigorous technique to select subsets of data points that when
augmented, closely capture the training dynamics of full data augmentation. We
first show that data augmentation, modeled as additive perturbations, improves
learning and generalization by relatively enlarging and perturbing the smaller
singular values of the network Jacobian, while preserving its prominent directions.
This prevents overfitting and enhances learning of harder-to-learn information.
Then, we propose a framework to iteratively extract small subsets of training
data that when augmented, closely capture the alignment of the fully augmented
Jacobian with labels/residuals. We prove that stochastic gradient descent applied
to the augmented subsets found by our approach has similar training dynamics
to that of fully augmented data. Our experiments demonstrate that our method
achieves 6.3x speedup on CIFAR10 and 2.2x speedup on SVHN, and outperforms
the baselines by up to 10% across various subset sizes. Similarly, on TinyImageNet
and ImageNet, our method beats the baselines by up to 8%, while achieving up to
3.3x speedup across various subset sizes. Finally, training on and augmenting 50%
subsets using our method on a version of CIFAR10 corrupted with label noise even
outperforms using the full dataset.¹

1 Introduction
Standard (weak) data augmentation transforms the training examples with e.g. rotations or crops for
images, and trains on the transformed examples in place of the original training data. While weak
augmentation is effective and computationally inexpensive, strong data augmentation (in addition to
weak augmentation) is a key component in achieving nearly all state-of-the-art results in deep learning
applications [31]. However, strong data augmentation techniques often increase the training time by
orders of magnitude. First, they often have a very expensive pipeline to find or generate more complex
transformations that best improve generalization [4, 13, 19, 36]. Second, appending transformed
examples to the training data is often much more effective than training on the (strongly or weakly)
transformed examples in place of the original data. For example, appending one transformed example
to the training data is often much more effective than training on two transformed examples in place
of every original training example, while both strategies have the same computational cost (c.f. Appendix
D.6). Hence, to obtain state-of-the-art performance, multiple augmented examples are added
for every single data point at each training iteration [12, 36]. In this case, even if producing
transformations is cheap, such methods increase the size of the training data by orders of magnitude.
¹ Our code can be found at https://github.com/tianyu139/data-efficient-augmentation

36th Conference on Neural Information Processing Systems (NeurIPS 2022).


As a result, state-of-the-art data augmentation techniques become computationally prohibitive for
even medium-sized real-world problems. For example, the state-of-the-art augmentation of [36],
which appends every example with its highest-loss transformations, increases the training time of
ResNet20 on CIFAR10 by 13x on an Nvidia A40 GPU (c.f. Sec. 6).
To make state-of-the-art data augmentation more efficient and scalable, an effective approach is to
carefully select a small subset of the training data such that augmenting only the subset provides
similar training dynamics to that of full data augmentation. If such a subset can be quickly found,
it would directly lead to a significant reduction in storage and training costs. First, while standard
in-place augmentation can be applied to the entire data, the strong and expensive transformations
can be produced only for the examples in the subset. Besides, only the transformed elements of the
subset need to be appended to the training data. Finally, when the data is larger than the training budget,
one can train on random subsets (with standard in-place augmentation) and augment coresets (by
strong augmentation and/or appending transformations) to achieve superior performance.
Despite the efficiency and scalability this approach can provide, this direction has remained largely unexplored.
Existing studies are limited to fully training a network and subsampling data points based on
their loss or influence, for augmentation in subsequent training runs [17]. However, this method is
prohibitive for large datasets, provides only a marginal improvement over augmenting random subsets,
and does not provide any theoretical guarantee for the performance of the network trained on the
augmented subsets. Besides, when the data contains mislabeled examples, augmentation methods
that select examples with maximum loss and append their transformed versions to the data degrade
performance by selecting and appending many mislabeled examples.
A major challenge in finding the most effective data points for augmentation is to theoretically under-
stand how data augmentation affects the optimization and generalization of neural networks. Existing
theoretical results are mainly limited to simple linear classifiers and analyze data augmentation as
enlarging the span of the training data [36], providing a regularization effect [3, 8, 33, 36], enlarging
the margin of a linear classifier [28], or having a variance reduction effect [5]. However, such tools
do not provide insights on the effect of data augmentation on training deep neural networks.
Here, we study the effect of label-invariant data augmentation on the training dynamics of overparameterized
neural networks. Theoretically, we model data augmentation by bounded additive perturbations
[28], and analyze its effect on the neural network Jacobian matrix containing all its first-order partial
derivatives [1]. We show that label-invariant additive data augmentation proportionally enlarges, but
more importantly perturbs, the singular values of the Jacobian, particularly the smaller ones, while
maintaining the prominent directions of the Jacobian. In doing so, data augmentation regularizes training
by adding bounded but varying perturbations to the gradients. In addition, it speeds up learning
harder-to-learn information. Thus, it prevents overfitting and improves generalization. Empirically,
we show that the same effect can be observed for various strong augmentations, e.g., AutoAugment
[6], CutOut [9], and AugMix [12].²
Next, we develop a rigorous method to iteratively find small weighted subsets (coresets) that, when
augmented, closely capture the alignment of the Jacobian of the fully augmented data with the
label/residual vector. We show that the most effective subsets for data augmentation are the sets of
examples that, when the data is mapped to the gradient space, have the most centrally located gradients.
This problem can be formulated as maximizing a submodular function. The subsets can be efficiently
extracted using a fast greedy algorithm which operates on low-dimensional gradient proxies, with
only a small additional cost. We prove that augmenting the coresets guarantees similar training
dynamics to that of full data augmentation. We also show that augmenting our coresets achieves
superior accuracy in the presence of noisy-labeled examples.
We demonstrate the effectiveness of our approach applied to CIFAR10 (ResNet20, WideResNet-
28-10), CIFAR10-IB (ResNet32), SVHN (ResNet32), noisy-CIFAR10 (ResNet20), Caltech256
(ResNet18, ResNet50), TinyImageNet (ResNet50), and ImageNet (ResNet50) compared to random
and max-loss baselines [17]. We show the effectiveness of our approach (in the presence of standard
augmentation) in the following cases:
• When producing augmentations is expensive and/or they are appended to the training data: We show that for the state-of-the-art augmentation method of [36] applied to CIFAR10/ResNet20, it is 3.43x faster to train on the whole dataset and only augment our coresets of size 30%, compared to training on and augmenting the whole dataset. At the same time, we achieve 75% of the accuracy improvement of training on and augmenting the full data with the method of [36], outperforming both the max-loss and random baselines by up to 10%.
• When data is larger than the training budget: We show that we can achieve 71.99% test accuracy on ResNet50/ImageNet when training on and augmenting only 30% subsets for 90 epochs. Compared to AutoAugment [6], despite using only 30% subsets, we achieve 92.8% of the originally reported accuracy while obtaining a 5x speedup in training time. Similarly, on Caltech256/ResNet18, training on and augmenting 10% coresets with AutoAugment yields 65.4% accuracy, improving over random 10% subsets by 5.8% and over only weak augmentation by 17.4%.
• When data contains mislabeled examples: We show that training on and strongly augmenting 50% subsets selected by our method from CIFAR10 with 50% noisy labels achieves 76.20% test accuracy. Notably, this is superior to training on and strongly augmenting the full data.

² We note that our results are in line with those of [30], which, in parallel to our work, analyzed the effect of linear transformations on a two-layer convolutional network and showed that they can make hard-to-learn features more likely to be captured during training.

2 Additional Related Work


Strong data augmentation methods achieve state-of-the-art performance by finding the set of trans-
formations for every example that best improves the performance. Methods like AutoAugment [6],
RandAugment [7], and Faster RandAugment [7] search over a (possibly large) space of transfor-
mations to find sequences of transformations that best improve generalization [6, 7, 21, 36]. Other
techniques involve a very expensive pipeline for generating the transformations. For example, some
use Generative Adversarial Networks to directly learn new transformations [2, 21, 24, 29]. Strong
augmentations like Smart Augmentation [19], Neural Style Transfer-based [13], and GAN-based aug-
mentations [4] require an expensive forward pass through a deep network for input transformations.
For example, [13] increases training time by 2.8x for training ResNet18 on Caltech256. Similarly, [36]
generates multiple augmentations for each training example, and selects the ones with the highest loss.
Strong data augmentation methods either replace the original example with its transformed version,
or append the generated transformations to the training data. Crucially, appending transformations to the training data
is much more effective in improving the generalization performance. Hence,
the most effective data augmentation methods, such as that of [36] and AugMix [12], append the
transformed examples to the training data. In Appendix D.6, we show that even for cheaper strong
augmentation methods such as AutoAugment [6], while replacing the original training examples with
transformations may decrease the performance, appending the augmentations significantly improves
the performance. Appending augmentations to the training data, however, increases the training
time by orders of magnitude. For example, AugMix [12], which outperforms AutoAugment, increases
the training time by at least 3x by appending extra augmented examples, and [36] increases training
time by 13x due to appending and forwarding additional augmented examples through the model.

3 Problem Formulation
We begin by formally describing the problem of learning from augmented data. Consider a dataset $\mathcal{D}_{\text{train}} = (X_{\text{train}}, y_{\text{train}})$, where $X_{\text{train}} = (x_1, \cdots, x_n) \in \mathbb{R}^{d\times n}$ is the set of $n$ normalized data points $x_i \in [0,1]^d$, with index set $V$, and $y_{\text{train}} = (y_1, \cdots, y_n)$ with $y_i \in \{\nu_1, \nu_2, \cdots, \nu_C\}$ and $\{\nu_j\}_{j=1}^C \in [0,1]$.
The additive perturbation model. Following [28], we model data augmentation as an arbitrary bounded additive perturbation $\epsilon$, with $\|\epsilon\| \le \epsilon_0$. For a given $\epsilon_0$ and the set of all possible transformations $\mathcal{A}$, we study the transformations selected from $S \subseteq \mathcal{A}$ satisfying
$$S = \{T_i \in \mathcal{A} \mid \|T_i(x) - x\| \le \epsilon_0 \;\; \forall x \in X_{\text{train}}\}. \tag{1}$$
While the additive perturbation model cannot represent all augmentations, most real-world augmenta-
tions are bounded to preserve the regularities of natural images (e.g. AutoAugment [6] finds that a 6
degree rotation is optimal for CIFAR10). Thus, under local smoothness of images, additive pertur-
bation can model bounded transformations such as small rotations, crops, shearing, and pixel-wise
transformations like sharpening, blurring, color distortions, structured adversarial perturbation [21].
As such, we see that the effects of additive augmentation on the singular spectrum hold even under
real-world augmentation settings (c.f. Fig. 3 in the Appendix). However, this model is indeed limited
when applied to augmentations that cannot be reduced to perturbations, such as horizontal/vertical
flips and large translations. We extend our theoretical analysis to augmentations modeled as arbitrary
linear transforms (e.g. as mentioned, horizontal flips) in Appendix B.5.
The set of augmentations at iteration $t$ generating $r$ augmented examples per data point can be specified, with abuse of notation, as $\mathcal{D}_{\text{aug}}^t = \{\cup_{i=1}^r (T_i^t(X_{\text{train}}), y_{\text{train}})\}$, where $|\mathcal{D}_{\text{aug}}^t| = rn$ and $T_i^t(X_{\text{train}})$ transforms all the training data points with the set of transformations $T_i^t \subset S$ at iteration $t$. We denote $X_{\text{aug}}^t = \{\cup_{i=1}^r T_i^t(X_{\text{train}})\}$ and $y_{\text{aug}}^t = \{\cup_{i=1}^r y_{\text{train}}\}$.
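To make this construction concrete, the following NumPy sketch (our illustration under the additive perturbation model; it is not the paper's released code, and the helper names are hypothetical) samples perturbations satisfying Eq. (1) and builds $X_{\text{aug}}^t$ and $y_{\text{aug}}^t$ with $r$ transformed copies per training point.

```python
import numpy as np

def bounded_perturbation(x, eps0, rng):
    """Additive transformation T(x) = clip(x + eps) with ||eps|| <= eps0 (Eq. 1)."""
    eps = rng.normal(size=x.shape)
    eps *= eps0 / max(np.linalg.norm(eps), 1e-12)   # project onto the eps0-ball
    return np.clip(x + eps, 0.0, 1.0)               # clipping keeps ||T(x) - x|| <= eps0

def build_augmented_set(X_train, y_train, r, eps0, seed=0):
    """Return (X_aug, y_aug): r perturbed copies of every column of X_train (d x n)."""
    rng = np.random.default_rng(seed)
    cols = [bounded_perturbation(x, eps0, rng)
            for _ in range(r) for x in X_train.T]
    return np.stack(cols, axis=1), np.tile(y_train, r)

# Toy usage: n = 100 normalized points in d = 32 dimensions, r = 2, eps0 = 0.1.
rng = np.random.default_rng(1)
X, y = rng.random((32, 100)), rng.integers(0, 10, size=100) / 10.0
X_aug, y_aug = build_augmented_set(X, y, r=2, eps0=0.1)
assert X_aug.shape == (32, 200) and y_aug.shape == (200,)
```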
Training on the augmented data. Let $f(W, x)$ be an arbitrary neural network with $m$ vectorized (trainable) parameters $W \in \mathbb{R}^m$. We assume that the network is trained using (stochastic) gradient descent with learning rate $\eta$ to minimize the squared loss $\mathcal{L}$ over the original and augmented training examples $\mathcal{D}^t = \{\mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{aug}}^t\}$ with associated index set $V^t$, at every iteration $t$. I.e.,
$$\mathcal{L}(W^t, X) := \frac{1}{2}\sum_{i\in V^t} \mathcal{L}_i(W^t, x_i) := \frac{1}{2}\sum_{(x_i, y_i)\in\mathcal{D}^t} \|f(W^t, x_i) - y_i\|_2^2. \tag{2}$$

The gradient update at iteration $t$ is given by
$$W^{t+1} = W^t - \eta\nabla\mathcal{L}(W^t, X), \quad \text{s.t.} \quad \nabla\mathcal{L}(W^t, X) = \mathcal{J}^T(W^t, X)\big(f(W^t, X) - y\big), \tag{3}$$
where $X^t = \{X_{\text{train}} \cup X_{\text{aug}}^t\}$ and $y^t = \{y_{\text{train}} \cup y_{\text{aug}}^t\}$ are the set of original and augmented examples and their labels, $\mathcal{J}(W, X) \in \mathbb{R}^{n\times m}$ is the Jacobian matrix associated with $f$, and $r^t = f(W^t, X) - y$ is the residual.
We further assume that $\mathcal{J}$ is smooth with Lipschitz constant $L$, i.e., $\|\mathcal{J}(W, x_i) - \mathcal{J}(W, x_j)\| \le L\|x_i - x_j\|$ for all $x_i, x_j \in X$. Thus, for any transformation $T_j \in S$, we have $\|\mathcal{J}(W, x_i) - \mathcal{J}(W, T_j(x_i))\| \le L\epsilon_0$. Finally, denoting $\mathcal{J} = \mathcal{J}(W, X_{\text{train}})$ and $\tilde{\mathcal{J}} = \mathcal{J}(W, T_j(X_{\text{train}}))$, we get $\tilde{\mathcal{J}} = \mathcal{J} + E$, where $E$ is the perturbation matrix with $\|E\|_2 \le \|E\|_F \le \sqrt{n}\,L\epsilon_0$.

4 Data Augmentation Improves Learning


In this section, we analyze the effect of data augmentation on training dynamics of neural networks,
and show that data augmentation can provably prevent overfitting. To do so, we leverage the recent
results that characterize the training dynamics based on properties of neural network Jacobian and the
corresponding Neural Tangent Kernel (NTK) [14], defined as $\Theta = \mathcal{J}(W, X)\mathcal{J}(W, X)^T$. Formally:
$$r^t = \sum_{i=1}^n (1 - \eta\lambda_i)(u_i u_i^T)\, r^{t-1} = \sum_{i=1}^n (1 - \eta\lambda_i)^t (u_i u_i^T)\, r^0, \tag{4}$$
where $\Theta = U\Lambda U^T = \sum_{i=1}^n \lambda_i u_i u_i^T$ is the eigendecomposition of the NTK [1]. Although the
constant NTK assumption holds only in the infinite width limit, [18] found close empirical agreement
between the NTK dynamics and the true dynamics for wide but practical networks, such as wide
ResNet architectures [37]. Eq. (4) shows that the training dynamics depend on the alignment of the
NTK with the residual vector at every iteration $t$. Next, we prove that for small perturbations $\epsilon_0$,
data augmentation prevents overfitting and improves generalization by proportionally enlarging and
perturbing smaller eigenvalues of the NTK relatively more, while preserving its prominent directions.
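As a quick numerical illustration of Eq. (4) (a synthetic sketch with a random kernel, not an experiment from the paper), the residual components aligned with large-eigenvalue directions of $\Theta$ decay much faster than those aligned with small-eigenvalue directions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta, T = 50, 0.1, 200

# Synthetic PSD kernel: a few large eigenvalues, many small ones.
lam = np.concatenate([np.array([8.0, 6.0, 4.0]), 0.05 * rng.random(n - 3)])
U, _ = np.linalg.qr(rng.normal(size=(n, n)))     # random orthonormal eigenbasis
Theta = U @ np.diag(lam) @ U.T

r = rng.normal(size=n)                           # residual r^0
for _ in range(T):
    r = r - eta * (Theta @ r)                    # r^t = (I - eta * Theta) r^{t-1}

proj = np.abs(U.T @ r)                           # |u_i^T r^t| = |1 - eta*lam_i|**T * |u_i^T r^0|
print("residual along the 3 largest-eigenvalue directions:", np.round(proj[:3], 4))
print("residual along 3 small-eigenvalue directions      :", np.round(proj[-3:], 4))
```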

4.1 Effect of Augmentation on Eigenvalues of the NTK

We first investigate the effect of data augmentation on the singular values of the Jacobian, and use
this result to bound the change in the eigenvalues of the NTK. To characterize the effect of data
augmentation on singular values of the perturbed Jacobian $\tilde{\mathcal{J}}$, we rely on Weyl's theorem [35],
stating that under bounded perturbations $E$, no singular value can move more than the norm of the
perturbation. Formally, $|\tilde\sigma_i - \sigma_i| \le \|E\|_2$, where $\tilde\sigma_i$ and $\sigma_i$ are the singular values of the perturbed
and original Jacobian, respectively. Crucially, data augmentation affects larger and smaller singular
values differently. Let $P$ be the orthogonal projection onto the column space of $\mathcal{J}^T$, and $P_\perp = I - P$ the projection onto its orthogonal complement. Then, the singular values of the perturbed Jacobian $\tilde{\mathcal{J}}^T$ are $\tilde\sigma_i^2 = (\sigma_i + \mu_i)^2 + \zeta_i^2$, where $|\mu_i| \le \|PE\|_2$ and $\sigma_{\min}(P_\perp E) \le \zeta_i \le \|P_\perp E\|_2$, with $\sigma_{\min}$ the smallest singular value of $\mathcal{J}^T$ [32]. Since the eigenvalues of the projection matrix $P$ are either 0 or 1, as the number of dimensions $m$ grows, for bounded perturbations we get that on average $\mu_i^2 = O(1)$ and $\zeta_i^2 = O(m)$. Thus, the second term dominates and the increase of small singular values under perturbation is proportional to $\sqrt{m}$. However, for larger singular values the first term dominates and hence $\tilde\sigma_i - \sigma_i \cong \mu_i$. Thus, in general, small singular values can become proportionally larger, while larger singular values remain relatively unchanged. The following lemma characterizes the expected change to the eigenvalues of the NTK.

Figure 1: Effect of augmentations on the singular spectrum of the network Jacobian of ResNet20 trained on CIFAR10, and of an MLP on MNIST, trained until epoch 15. (a), (b) Difference in singular values and (c), (d) singular subspace angles between the original and augmented data with bounded perturbations with $\epsilon_0 = 8$ and $\epsilon_0 = 16$, for different ranges of singular values. Note that augmentations with a larger bound $\epsilon_0$ result in larger perturbations to the singular spectrum.
Lemma 4.1. Data augmentation as additive perturbations bounded by small $\epsilon_0$ results in the following expected change to the eigenvalues of the NTK:
$$\mathbb{E}[\tilde\lambda_i] = \mathbb{E}[\tilde\sigma_i^2] = \sigma_i^2 + \sigma_i(1 - 2p_i)\|E\| + \|E\|^2/3, \tag{5}$$
where $p_i := \mathbb{P}(\tilde\sigma_i - \sigma_i < 0)$ is the probability that $\sigma_i$ decreases as a result of data augmentation, and is smaller for smaller singular values.
The proof can be found in Appendix A.1.
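The following self-contained sketch (a synthetic low-rank matrix standing in for the Jacobian; not the paper's experiments) illustrates this behavior: under a row-wise bounded random perturbation, Weyl's bound holds for every singular value, yet the small singular values grow by a much larger relative amount than the large ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 40, 2000

# Low-rank-style "Jacobian": a few large singular values, many small ones.
sigma = np.concatenate([np.array([50.0, 30.0, 20.0]), 0.5 * rng.random(n - 3)])
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(m, n)))
J = U @ np.diag(sigma) @ V.T                        # n x m matrix with known spectrum

# Row-wise bounded perturbation, mimicking ||J(W, x_i) - J(W, T(x_i))|| <= L*eps0.
E = rng.normal(size=(n, m))
E *= 0.2 / np.linalg.norm(E, axis=1, keepdims=True)

s0 = np.linalg.svd(J, compute_uv=False)             # descending order
s1 = np.linalg.svd(J + E, compute_uv=False)

print("Weyl bound ||E||_2:", round(np.linalg.norm(E, 2), 3),
      " max |change|:", round(np.abs(s1 - s0).max(), 3))
rel = (s1 - s0) / s0
print("relative change, 3 largest singular values :", np.round(rel[:3], 4))
print("relative change, 3 smallest singular values:", np.round(rel[-3:], 4))
# The smallest singular values change by a far larger fraction than the largest ones.
```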
Next, we discuss the effect of data augmentation on singular vectors of the Jacobian and show that
it mainly affects the non-prominent directions of the Jacobian spectrum, but to a smaller extent
compared to the singular values.

4.2 Effect of Augmentation on Eigenvectors of the NTK

Here, we focus on characterizing the effect of data augmentation on the eigenspace of the NTK. Let the singular subspace decomposition of the Jacobian be $\mathcal{J} = U\Sigma V^T$. Then for the NTK, we have $\Theta = \mathcal{J}\mathcal{J}^T = U\Sigma V^T V \Sigma U^T = U\Sigma^2 U^T$ (since $V^T V = I$). Hence, the perturbation of the eigenspace of the NTK is the same as the perturbation of the left singular subspace of the Jacobian $\mathcal{J}$. Suppose $\sigma_i$ are the singular values of the Jacobian. Let the perturbed Jacobian be $\tilde{\mathcal{J}} = \mathcal{J} + E$, and denote the eigengap $\gamma_0 = \min\{\sigma_i - \sigma_{i+1} : i = 1, \cdots, r\}$ where $\sigma_{r+1} := 0$. Assuming $\gamma_0 \ge 2\|E\|_2$, a combination of Wedin's theorem [34] and Mirsky's inequality [23] implies
$$\|u_i - \tilde u_i\| \le 2\sqrt{2}\,\|E\|/\gamma_0. \tag{6}$$
This result provides an upper bound on the change of every left singular vector of the Jacobian. However, as we discuss below, data augmentation affects larger and smaller singular directions differently. To see the effect of data augmentation on every singular vector of the Jacobian, let the subspace decomposition of the Jacobian be $\mathcal{J} = U\Sigma V^T = U_s\Sigma_s V_s^T + U_n\Sigma_n V_n^T$, where $U_s$, associated with nonzero singular values, spans the column space of $\mathcal{J}$, also called the signal subspace, and $U_n$, associated with zero singular values ($\Sigma_n = 0$), spans the orthogonal space of $U_s$, also called the noise subspace. Similarly, let the subspace decomposition of the perturbed Jacobian be $\tilde{\mathcal{J}} = \tilde U\tilde\Sigma\tilde V^T = \tilde U_s\tilde\Sigma_s\tilde V_s^T + \tilde U_n\tilde\Sigma_n\tilde V_n^T$, with $\tilde U_s = U_s + \Delta U_s$, where $\Delta U_s$ is the perturbation of the singular vectors that span the signal subspace. Then the following general first-order expression for the perturbation of the orthogonal subspace due to perturbations of the Jacobian characterizes the change of the singular directions: $\Delta U_s = U_n U_n^T E V_s \Sigma_s^{-1}$ [20]. We see that singular vectors associated with larger singular values are more robust to data augmentation, compared to others. Note that in general singular vectors are more robust than singular values.
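A matching sketch for the singular directions (again synthetic and only illustrative) compares the principal angles between the top and bottom left singular subspaces before and after the same kind of bounded perturbation, using scipy.linalg.subspace_angles:

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(1)
n, m = 40, 2000
sigma = np.concatenate([np.array([50.0, 30.0, 20.0]), 0.5 * rng.random(n - 3)])
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(m, n)))
J = U @ np.diag(sigma) @ V.T

E = rng.normal(size=(n, m))
E *= 0.2 / np.linalg.norm(E, axis=1, keepdims=True)      # bounded row-wise perturbation

U0, _, _ = np.linalg.svd(J, full_matrices=False)          # left singular vectors of J
U1, _, _ = np.linalg.svd(J + E, full_matrices=False)      # ... of the perturbed Jacobian

top = np.degrees(subspace_angles(U0[:, :3], U1[:, :3]))       # prominent directions
tail = np.degrees(subspace_angles(U0[:, -3:], U1[:, -3:]))    # smallest-singular-value directions
print("max principal angle, top-3 left subspace : %.2f deg" % top.max())
print("max principal angle, last-3 left subspace: %.2f deg" % tail.max())
# The prominent singular subspace barely rotates, while the directions
# associated with the smallest singular values typically move much more.
```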

Fig. 1 shows the effect of perturbations with $\epsilon_0 = 8, 16$ on the singular values and singular vectors of the
Jacobian matrix for a one-hidden-layer MLP trained on MNIST, and ResNet20 trained on CIFAR10. As
calculating the entire Jacobian spectrum is computationally prohibitive, data is subsampled from 3
classes. We report the effect of other real-world augmentation techniques, such as random crops, flips,
rotations and Autoaugment [6] - which includes translations, contrast, and brightness transforms - in
Appendix C. We observe that data augmentation increases smaller singular values relatively more.
On the other hand, it affects prominent singular vectors of the Jacobian to a smaller extent.

4.3 Augmentation Improves Training & Generalization


Recent studies have revealed that the Jacobian matrix of common neural networks is low rank. That is,
there are a small number of large singular values, while the rest of the singular values are small. Based on this,
the Jacobian spectrum can be divided into information and nuisance spaces [27]. The information space
is a lower-dimensional space associated with the prominent singular values/vectors of the Jacobian.
The nuisance space is a high-dimensional space corresponding to the smaller singular values/vectors of the
Jacobian. While learning over the information space is fast and generalizes well, learning over the nuisance
space is slow and results in overfitting [27]. Importantly, recent theoretical studies connected the
generalization performance to the small singular values (of the information space) [1].
Our results show that label-preserving additive perturbations relatively enlarge the smaller singular
values of the Jacobian in a stochastic way and with high probability. This benefits generalization in
two ways. First, this stochastic behavior prevents overfitting along any particular singular direction in
the nuisance space, as the stochastic perturbation of the smallest singular values results in stochastic
noise being added to the gradient at every training iteration. This prevents overfitting (and thus yields a larger
training loss, as shown in Appendix D.5) and improves generalization [7, 8]. Theorem B.1 in
the Appendix characterizes the expected training dynamics resulting from data augmentation. Second,
additive perturbations improve generalization by enlarging the smaller (useful) singular values that
lie in the information space, while preserving the eigenvectors. Hence, they enhance learning along these
(harder-to-learn) components. The following lemma captures the improvement in the generalization
performance as a result of data augmentation.
Lemma 4.2. Assume gradient descent with learning rate $\eta$ is applied to train a neural network with constant NTK and Lipschitz constant $L$, on data points augmented with additive perturbations bounded by $\epsilon_0$ as defined in Sec. 3. Let $\sigma_{\min}$ be the minimum singular value of the Jacobian $\mathcal{J}$ associated with the training data $X_{\text{train}}$. With probability $1 - \delta$, the generalization error of the network trained with gradient descent on the augmented data $X_{\text{aug}}$ enjoys the following bound:
$$\sqrt{\frac{2}{\left(\sigma_{\min} + \sqrt{n}\,L\epsilon_0\right)^2}} + O\!\left(\sqrt{\log\frac{1}{\delta}}\right). \tag{7}$$

The proof can be found in Appendix A.2.

5 Effective Subsets for Data Augmentation

Here, we focus on identifying subsets of data that, when augmented, similarly improve generalization and prevent overfitting. To do so, our key idea is to find subsets of data points that, when augmented, closely capture the alignment of the NTK (or equivalently the Jacobian) corresponding to the full augmented data with the residual vector, $\mathcal{J}(W^t, X_{\text{aug}}^t)^T r_{\text{aug}}^t$. If such subsets can be found, augmenting only the subsets will change the NTK and its alignment with the residual in a similar way as full data augmentation, and will result in similarly improved training dynamics. However, generating the full set of transformations $X_{\text{aug}}^t$ is often very expensive, particularly for strong augmentations and large datasets. Hence, generating the transformations and then extracting the subsets may not provide a considerable overall speedup.
Algorithm 1 Coresets for Efficient Data Augmentation
Require: The dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$, number of iterations $T$.
Ensure: Output model parameters $W^T$.
1: for $t = 1, \cdots, T$ do
2:   $X_{\text{aug}}^t = \emptyset$
3:   for $c \in \{1, \cdots, C\}$ do
4:     $S_c^t = \emptyset$, $[G_{S_c^t}]_i = c_1 \;\; \forall i$
5:     while $\|G_{S_c^t}\|_F \ge \xi$ do   ▷ Extract a coreset from class $c$ by solving Eq. (9)
6:       $S_c^t = \{S_c^t \cup \operatorname{arg\,max}_{s \in V \setminus S_c^t} (\|G_{S_c^t}\|_F - \|G_{S_c^t \cup \{s\}}\|_F)\}$
7:     end while
8:     $\gamma_j = \sum_{i \in V_c} \mathbb{I}[j = \operatorname{arg\,min}_{j' \in S} \|\mathcal{J}^T(W^t, x_i)r_i - \mathcal{J}^T(W^t, x_{j'})r_{j'}\|]$   ▷ Coreset weights
9:     $X_{\text{aug}}^t = \{X_{\text{aug}}^t \cup \{\cup_{i=1}^r T_i^t(X_{S_c^t})\}\}$   ▷ Augment the coreset
10:    $\rho_j^t = \gamma_j^t / r$
11:  end for
12:  Update the parameters $W^t$ using weighted gradient descent on $X_{\text{aug}}^t$ or $\{X_{\text{train}} \cup X_{\text{aug}}^t\}$.
13: end for

In the following, we show that weighted subsets (coresets) $S$ that closely estimate the alignment of the Jacobian associated with the original data with the residual vector, $\mathcal{J}^T(W^t, X_{\text{train}})r_{\text{train}}$, can closely estimate the alignment of the Jacobian of the full augmented data with the corresponding residual, $\mathcal{J}^T(W^t, X_{\text{aug}}^t)r_{\text{aug}}^t$. Thus, the most effective subsets for augmentation can be found directly from the training data. Formally, subsets $S_*^t$ weighted by $\gamma_S^t$ that capture the alignment of the full Jacobian with the residual by an error of at most $\xi$ can be found by solving the following optimization problem:
$$S_*^t = \operatorname{arg\,min}_{S \subseteq V} |S| \quad \text{s.t.} \quad \left\|\mathcal{J}^T(W^t, X^t)r^t - \mathrm{diag}(\gamma_S^t)\,\mathcal{J}^T(W^t, X_S^t)r_S^t\right\| \le \xi. \tag{8}$$

Solving the above optimization problem is NP-hard. However, as we discuss in Appendix A.5, a near-optimal subset can be found by minimizing the Frobenius norm of a matrix $G_S$, in which the $i$-th row contains the Euclidean distance, in the gradient space, between data point $i$ and its closest element in the subset $S$. Formally, $[G_S]_i = \min_{j' \in S} \|\mathcal{J}^T(W^t, x_i)r_i - \mathcal{J}^T(W^t, x_{j'})r_{j'}\|$. When $S = \emptyset$, $[G_S]_i = c_1$, where $c_1$ is a large constant. Intuitively, such subsets contain the set of medoids of the dataset in the gradient space. Medoids of a dataset are defined as its most centrally located elements [16]. The weight of every element $j \in S$ is the number of data points closest to it in the gradient space, i.e., $\gamma_j = \sum_{i \in V} \mathbb{I}[j = \operatorname{arg\,min}_{j' \in S} \|\mathcal{J}^T(W^t, x_i)r_i - \mathcal{J}^T(W^t, x_{j'})r_{j'}\|]$. The set of medoids can be found by solving the following submodular³ cover problem:
$$S_*^t = \operatorname{arg\,min}_{S \subseteq V} |S| \quad \text{s.t.} \quad \|G_S\|_F \le \xi. \tag{9}$$
The classical greedy algorithm provides a logarithmic approximation guarantee for the above submodular cover problem, i.e., $|S| \le (1 + \ln n)\,|S_*|$. It starts with the empty set $S_0 = \emptyset$, and at each iteration $\tau$ it selects the training example $s \in V \setminus S_{\tau-1}$ that maximizes the marginal gain, i.e., $S_\tau = S_{\tau-1} \cup \{\operatorname{arg\,max}_{s \in V \setminus S_{\tau-1}} (\|G_{S_{\tau-1}}\|_F - \|G_{S_{\tau-1} \cup \{s\}}\|_F)\}$. The $O(nk)$ computational complexity of the greedy algorithm can be reduced to $O(n)$ using randomized methods [25], and further improved using lazy evaluation [22] and distributed implementations [26]. The rows of the matrix $G$ can be efficiently upper-bounded using the gradient of the loss w.r.t. the input to the last layer of the network, which has been shown to closely capture the variation of the gradient norms [15]. This upper bound is only marginally more expensive to compute than the value of the loss, so the subset can be found efficiently. Better approximations can be obtained by considering earlier layers in addition to the last two, at the expense of greater computational cost.
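A minimal sketch of this selection step is given below (NumPy; our own illustrative implementation, not the released code). Given per-example gradient proxies, i.e., rows standing in for $\mathcal{J}^T(W^t, x_i)r_i$, it runs the greedy algorithm for the cover objective of Eq. (9) and returns the selected medoids together with the weights $\gamma_j$.

```python
import numpy as np

def greedy_coreset(grads, budget, xi=0.0, c1=1e6):
    """Greedy selection for the submodular cover problem of Eq. (9), run on
    per-example gradient proxies (one row per example). Returns the selected
    indices S and weights gamma_j = number of examples whose nearest selected
    gradient is example j."""
    n = grads.shape[0]
    sq = (grads ** 2).sum(axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * grads @ grads.T, 0.0))

    min_dist = np.full(n, float(c1))        # [G_S]_i with S = {} set to a big constant
    S = []
    while len(S) < budget and np.linalg.norm(min_dist) > xi:
        new_min = np.minimum(min_dist[:, None], dist)    # column s holds G_{S ∪ {s}}
        norms = np.sqrt((new_min ** 2).sum(axis=0))      # ||G_{S ∪ {s}}||_F for every s
        best = int(np.argmin(norms))                     # element with maximum marginal gain
        S.append(best)
        min_dist = new_min[:, best]

    nearest = np.asarray(S)[np.argmin(dist[:, S], axis=1)]   # medoid assignment
    gamma = np.array([(nearest == j).sum() for j in S])
    return np.asarray(S), gamma

# Toy usage: random "gradients" for one class of 500 examples, 5% budget.
g = np.random.default_rng(0).normal(size=(500, 10))
S, gamma = greedy_coreset(g, budget=25)
assert gamma.sum() == 500        # every example is assigned to exactly one medoid
```

In practice, the rows of `grads` would be the low-dimensional last-layer gradient proxies discussed above, computed separately per class, and the loop can be accelerated with lazy or stochastic greedy variants [22, 25].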
At every iteration $t$ during training, we select a coreset from every class $c \in [C]$ separately, and apply the set of transformations $\{T_i^t\}_{i=1}^r$ only to the elements of the coresets, i.e., $X_{\text{aug}}^t = \{\cup_{i=1}^r T_i^t(X_{S^t})\}$. We divide the weight of every element $j$ in the coreset equally among its transformations, i.e., the final weight is $\rho_j^t = \gamma_j^t / r$ if $j \in S^t$. We apply the gradient descent updates in Eq. (3) to the weighted Jacobian matrix of $X^t = X_{\text{aug}}^t$ or $X^t = \{X_{\text{train}} \cup X_{\text{aug}}^t\}$ (viewing $\rho^t$ as $\rho^t \in \mathbb{R}^n$) as follows:
$$W^{t+1} = W^t - \eta\,\big(\mathrm{diag}(\rho^t)\,\mathcal{J}(W^t, X^t)\big)^T r^t. \tag{10}$$
The pseudocode is illustrated in Alg. 1.
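For completeness, a hedged PyTorch sketch of how the weights enter one training step is shown below; it mirrors the weighted update of Eq. (10) for the squared loss of Eq. (2) (assuming targets have the same shape as the model output) rather than reproducing the authors' implementation.

```python
import torch

def weighted_sgd_step(model, optimizer, x_batch, y_batch, rho_batch):
    """One weighted update as in Eq. (10): for the squared loss, weighting the
    per-example losses by rho yields the gradient J^T diag(rho) r."""
    optimizer.zero_grad()
    out = model(x_batch)                                    # f(W^t, x_i)
    per_example = 0.5 * ((out - y_batch) ** 2).sum(dim=1)   # 1/2 ||f(W, x_i) - y_i||^2
    loss = (rho_batch * per_example).sum()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: a linear model, a batch of 16 examples, and weights gamma_j / r.
model = torch.nn.Linear(8, 3)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y, rho = torch.randn(16, 8), torch.randn(16, 3), torch.full((16,), 0.5)
print(weighted_sgd_step(model, opt, x, y, rho))
```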
The following lemma upper bounds the difference between the alignment of the Jacobian with the residual for the augmented coreset vs. the fully augmented data.

³ A set function $F: 2^V \to \mathbb{R}^+$ is submodular if $F(S \cup \{e\}) - F(S) \ge F(T \cup \{e\}) - F(T)$ for any $S \subseteq T \subseteq V$ and $e \in V \setminus T$. $F$ is monotone if $F(e \mid S) \ge 0$ for any $e \in V \setminus S$ and $S \subseteq V$.
Table 1: Training ResNet20 (R20) and WideResNet-28-10 (W2810) on CIFAR10 (C10) using small subsets, and ResNet18 (R18) on Caltech256 (Cal). We compare accuracies of training on and strongly (and weakly) augmenting subsets. For CIFAR10, training on and augmenting subsets selected by max-loss performed poorly and did not converge. The average number of examples per class in each subset is shown in parentheses. Appendix D.4 shows baseline accuracies from only weak augmentations.
Model/Data   C10/R20                                      C10/W2810   Cal/R18
Subset       0.1% (5)   0.2% (10)   0.5% (25)   1% (50)   1% (50)     5% (3)   10% (6)   20% (12)   30% (18)   40% (24)   50% (30)
Max-loss     <15%       <15%        <15%        <15%      <15%        19.2     50.6      71.3       75.6       77.3       78.6
Random       33.5       42.7        58.7        74.4      57.7        41.5     61.8      72.5       75.7       77.6       78.5
Ours         37.8       45.1        63.9        74.7      62.1        52.7     65.4      73.1       76.3       77.7       78.9

Lemma 5.1. Let $S$ be a coreset that captures the alignment of the full data NTK with the residual with an error of at most $\xi$ as in Eq. (8). Augmenting the coreset with perturbations bounded by $\epsilon_0 \le \frac{1}{3\sqrt{n^2 L}}$ captures the alignment of the fully augmented data with the residual with an error of at most
$$\left\|\mathcal{J}^T(W^t, X_{\text{aug}})r - \mathrm{diag}(\rho^t)\,\mathcal{J}^T(W^t, X_{S,\text{aug}})r_S\right\| \le \xi + O\big(\sqrt{L}\big). \tag{11}$$

5.1 Coreset vs. Max-loss Data Augmentation


In the initial phase of training, the NTK goes through rapid changes. This phase determines the final basin
of convergence and the network's final performance [10]. Regularizing deep networks by weight decay
or data augmentation mainly affects this initial phase and matters little afterwards [11]. Crucially,
augmenting coresets that closely capture the alignment of the NTK with the residual during this
initial phase results in less overfitting and improved generalization performance. On the other hand,
augmenting points with maximum loss early in training decreases the alignment between the NTK
and the label vector and impedes learning and convergence. After this initial phase, when the network
has good prediction performance, the gradients of the majority of data points become small. At this point, the
alignment is mainly captured by the elements with the maximum loss. Thus, as training proceeds, the
intersection between the elements of the coresets and the examples with maximum loss increases. We
visualize this pattern in Appendix D.11. The following Theorem characterizes the training dynamics
of training on the full data and the augmented coresets, using our additive perturbation model.
Theorem 5.2. Let $\mathcal{L}_i$ be $\beta$-smooth, and let $\mathcal{L}$ be $\lambda$-smooth and satisfy the $\alpha$-PL condition, that is, $\|\nabla\mathcal{L}(W)\|^2 \ge \alpha\mathcal{L}(W)$ for all weights $W$ and some $\alpha > 0$. Let $f$ be Lipschitz in $X$ with constant $L'$, and let $\bar L = \max\{L, L'\}$. Let $G_0$ be the gradient at initialization and $\sigma_{\max}$ the maximum singular value of the coreset Jacobian at initialization. Choosing $\epsilon_0 \le \frac{1}{\sigma_{\max}\sqrt{\bar{L}n}}$ and running SGD on the full data with the augmented coreset using constant step size $\eta = \frac{\alpha}{\lambda\beta}$ results in the following bound:
$$\mathbb{E}\big[\|\nabla\mathcal{L}^{f+c_{\text{aug}}}(W^t)\|\big] \le \frac{1}{\sqrt{\alpha}}\left(\left(1 - \frac{\alpha\eta}{2}\right)^{\frac{t}{2}}\sqrt{2G_0} + \xi + O\!\left(\frac{\sqrt{\bar L}}{\sigma_{\max}}\right)\right).$$
The proof can be found in Appendix A.4.
Theorem 5.2 shows that training on full data and augmented coresets converges to a close neighbor-
hood of the optimal solution, with the same rate as that of training on the fully augmented data. The
size of the neighborhood depends on the error of the coreset ξ in Eq. (8), and the error in capturing
the alignment of the full augmented data with the residual derived in Lemma 5.1. The first term
decreases as the size of the coreset grows, and the second term depends on the network structure.
We also analyze convergence of training only on the augmented coresets, and augmentations modelled
as arbitrary linear transformations using a linear model [36] in Appendix B.5.

6 Experiments
Setup and baselines. We extensively evaluate the performance of our approach in three different
settings. Firstly, we consider training only on coresets and their augmentations. Secondly, we
investigate the effect of adding augmented coresets to the full training data. Finally, we consider
adding augmented coresets to random subsets. We compare our coresets with max-loss and random
subsets as baselines. For all methods, we select a new augmentation subset every R epochs. We note
that the original max-loss method [17] selects points using a fully trained model, hence it can only
select one subset throughout training.
Table 2: Caltech256/ResNet18 with the same settings as Tab. 1, with default weak augmentations but varying strong augmentations.
Augmentation   Random 30%   Random 40%   Random 50%   Ours 30%   Ours 40%   Ours 50%
CutOut         43.32        62.84        76.21        55.53      66.10      76.91
AugMix         40.77        61.81        72.17        52.72      64.91      73.01
Perturb        48.51        66.20        75.34        58.29      67.47      76.50

Table 3: Training on full data and strongly (and weakly) augmenting random subsets, max-loss subsets, and coresets on TinyImageNet/ResNet50, R = 15.
Subset      Random 20%   Random 30%   Random 50%   Max-loss 20%   Max-loss 30%   Max-loss 50%   Ours 20%   Ours 30%   Ours 50%
Accuracy    50.97        52.00        54.92        51.30          52.34          53.37          51.99      54.30      55.16

Table 4: Accuracy improvement by augmenting subsets found by our method vs. max-loss and
random, over improvement of full (weak and strong) data augmentation (F.A.) compared to weak
augmentation only (W.A.). The table shows the results for training on CIFAR10 (C10)/ResNet20
(R20), SVHN/ResNet32(R32), and CIFAR10-Imbalanced(C10-IB)/ResNet32, with R = 20.
Dataset W.A. F.A. Random Max-loss Ours
Acc Acc 5% 10% 30% 5% 10% 30% 5% 10% 30%
C10/R20 89.46 93.50 21.8% 39.9% 65.6% 32.9% 47.8% 73.5% 34.9% 51.5% 75.0%
C10-IB/R32 87.08 92.48 25.9% 45.2% 74.6% 31.3% 39.6% 74.6% 37.4% 49.4% 74.8%
SVHN/R32 95.68 97.07 5.8% 36.7% 64.1% 35.3% 49.7% 76.4% 31.7% 48.3% 80.0%

To maximize fairness, we modify our max-loss baseline to select
a new subset at every subset selection step. For all experiments, standard weak augmentations (random
crop and horizontal flips) are always performed on both the original and strongly augmented data.
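The overall schedule used across these settings (e.g., the full-data setting of Sec. 6.2) can be summarized by the following hedged sketch (plain Python; `select_coreset`, `strong_aug`, and `run_epoch` are placeholders for the components described in Sec. 5 and the training loop, not names from the released code).

```python
def train_with_augmented_coresets(model, train_set, epochs, R, frac,
                                  select_coreset, strong_aug, run_epoch):
    """Every R epochs, re-select a coreset and strongly augment only its
    elements; each epoch trains on the weakly augmented base data plus the
    appended, weighted augmentations of the coreset."""
    appended, weights = [], []
    for epoch in range(epochs):
        if epoch % R == 0:
            idx, gamma = select_coreset(model, train_set, frac)  # per-class greedy selection
            appended = [strong_aug(train_set[i]) for i in idx]   # strong aug on the coreset only
            weights = list(gamma)                                # gamma_j / r, with r = 1 here
        run_epoch(model, train_set, appended, weights)           # weak aug applied inside
    return model
```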

6.1 Training on Coresets and their Augmentations


First, we evaluate the effectiveness of our approach for training on the coresets and their augmenta-
tions. Our main goal here is to compare the performance of training on and augmenting coresets vs.
random and max-loss subsets. Tab. 1 shows the test accuracy for training ResNet20 and Wide-ResNet
on CIFAR10 when we only train on small augmented coresets of size 0.1% to 1% selected at every
epoch (R = 1), and training ResNet18 on Caltech256 using coresets of size 5% to 50% with R = 5.
We see that the augmented coresets outperform augmented random subsets by a large margin, particu-
larly when the size of the subset is small. On Caltech256/ResNet18, training on and augmenting 10%
coresets yields 65.4% accuracy, improving over random by 5.8%, and over only weak augmentation
by 17.4%. This clearly shows the effectiveness of augmenting the coresets. Note that for the CIFAR10
experiments, training on the augmented max-loss points did not even converge in the absence of the full data.
Generalization across augmentation techniques. We note that our coresets are not dependent on
the type of data augmentation. To confirm this, we show the superior generalization performance of
our method in Tab. 2 for training ResNet18 with R = 5 on coresets vs. random subsets of Caltech256,
augmented with CutOut [9], AugMix [12], and noise perturbations (color jitter, gaussian blur). For
example, on 30% subsets, we obtain 28.2%, 29.3%, 20.2% relative improvement over augmenting
random subsets when using CutOut, AugMix, and noise perturbation augmentations, respectively.

6.2 Training on Full Data and Augmented Coresets


Next, we study the effectiveness of our method for training on full data and augmented coresets.
Tab. 4 reports the percentage of the accuracy improvement resulting from augmenting subsets of
size 5%, 10%, and 30% selected from our method vs. max-loss and random subsets, over that of
full data augmentation. We observe that augmenting coresets effectively improves generalization,
and outperforms augmenting random and max-loss subsets across different models and datasets.
For example, on 30% subsets, we obtain 13.1% and 2.3% relative improvement over random and
max-loss on average. We also report results on TinyImageNet/ResNet50 (R = 15) in Tab. 3, where
we show that augmenting coresets outperforms max-loss and random baselines, e.g. by achieving
3.7% and 4.4% relative improvement over 30% max-loss and random subsets, respectively.
Training speedup. In Fig. 2, we measure the improvement in training time when training on the
full data and augmenting subsets of various sizes. While our method yields similar or slightly lower
speedup than the max-loss and random baselines, our resulting accuracy outperforms the baselines
on average. For example, for SVHN/ResNet32 using 30% coresets, we sacrifice 11% of the relative
speedup to obtain an additional 24.8% of the relative gain in accuracy from full data augmentation,
compared to the random baseline. Notably, we get a 3.43x speedup for training on the full data and augmenting
30% coresets, while obtaining 75% of the improvement of full data augmentation.
Table 5: Training ResNet20 on CIFAR10 with 50% label noise, R = 20. Accuracy without strong
augmentation is 70.72 ± 0.20 and the accuracy of full (weak and strong) data augmentation is
75.87 ± 0.77. Note that augmenting 50% subsets outperforms augmenting the full data (marked ∗∗ ).
Subset Random Max-loss Ours
10% 72.32 ± 0.14 71.83 ± 0.13 73.02 ± 1.06
30% 74.46 ± 0.27 72.45 ± 0.48 74.67 ± 0.15
50% 75.36 ± 0.05 73.23 ± 0.72 76.20 ± 0.75∗∗

We provide wall-clock times for finding coresets from Caltech256 and TinyImageNet in Appendix D.7.

Figure 2: Accuracy improvement and speedups by augmenting subsets found by our method vs. max-loss and random on (a), (b) ResNet20/CIFAR10 and (c), (d) ResNet32/SVHN.
Augmenting noisy labeled data. Next, we evaluate the robustness of our coresets to label noise.
Tab. 5 shows the result of augmenting coresets vs. max-loss and random subsets of different sizes
selected from CIFAR10 with 50% label noise on ResNet20. Notably, our method not only outperforms
max-loss and random baselines, but also achieves superior performance over full data augmentation.
6.3 Training on Random Data and Augmented Coresets
Finally, we evaluate the performance of our method for training on random subsets and augmenting
coresets, applicable when data is larger than the training budget. We report results on TinyImageNet
and ImageNet on ResNet50 (90 epochs, R = 15). Tab. 6 shows the results of training on random
subsets, and augmenting random subsets and coresets of the same size. We see that our results
hold for large-scale datasets, where we obtain 7.9%, 4.9%, and 5.3% relative improvement over
random baseline with 10%, 20%, 30% subset sizes respectively on TinyImageNet, and 7.6%, 2.3%,
and 1.3% relative improvement over random baseline with 10%, 30%, and 50% subset sizes on
ImageNet. Notably, compared to AutoAugment, despite using only 30% subsets, we achieve 71.99%
test accuracy, which is 92.8% of the original reported accuracy, while boasting 5x speedup in training.
Table 6: Training on random subsets and strongly (and weakly) augmenting random and max-loss subsets vs. coresets for TinyImageNet (left) and ImageNet (right) with ResNet50.
TinyImageNet   Random 10%   Random 20%   Random 30%   Max-loss 10%   Max-loss 20%   Max-loss 30%   Ours 10%   Ours 20%   Ours 30%
Accuracy       28.64        38.97        44.10        27.64          41.40          45.75          30.90      40.88      46.42
ImageNet       Random 10%   Random 30%   Random 50%   Max-loss 10%   Max-loss 30%   Max-loss 50%   Ours 10%   Ours 30%   Ours 50%
Accuracy       63.67        70.39        72.35        65.43          71.55          72.77          68.53      71.99      73.28

7 Conclusion
We showed that data augmentation improves training and generalization by relatively enlarging
and perturbing the smaller singular values of the neural network Jacobian while preserving its
prominent directions. Then, we proposed a framework to iteratively extract small coresets of training
data that when augmented, closely capture the alignment of the fully augmented Jacobian with the
label/residual vector. We showed the effectiveness of augmenting coresets in providing a superior
generalization performance when added to the full data or random subsets, in presence of noisy labels,
or as a standalone subset. Under local smoothness of images, our additive perturbation can be applied
to model many bounded transformations such as small rotations, crops, shearing, and pixel-wise
transformations like sharpening, blurring, color distortions, structured adversarial perturbation [21].
However, the additive perturbation model is indeed limited when applied to augmentations that cannot
be reduced to perturbations, such as horizontal/vertical flips and large translations. Further theoretical
analysis of complex data augmentations is indeed an interesting direction for future work.
8 Acknowledgements
This research was supported in part by the National Science Foundation CAREER Award 2146492,
and the UCLA-Amazon Science Hub for Humanity and AI.

References
[1] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of op-
timization and generalization for overparameterized two-layer neural networks. In International
Conference on Machine Learning, pages 322–332. PMLR, 2019.
[2] Shumeet Baluja and Ian Fischer. Adversarial transformation networks: Learning to generate
adversarial examples. arXiv preprint arXiv:1703.09387, 2017.
[3] Chris M Bishop. Training with noise is equivalent to tikhonov regularization. Neural computa-
tion, 7(1):108–116, 1995.
[4] Christopher Bowles, Roger Gunn, Alexander Hammers, and Daniel Rueckert. Gansfer learning:
Combining labelled and unlabelled data for gan based data augmentation. arXiv preprint
arXiv:1811.10669, 2018.
[5] Shuxiao Chen, Edgar Dobriban, and Jane H Lee. Invariance reduces variance: Understanding
data augmentation in deep learning and beyond. arXiv preprint arXiv:1907.10905, 2019.
[6] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment:
Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 113–123, 2019.
[7] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical
automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
[8] Tri Dao, Albert Gu, Alexander Ratner, Virginia Smith, Chris De Sa, and Christopher Ré. A
kernel theory of modern data augmentation. In International Conference on Machine Learning,
pages 1528–1537. PMLR, 2019.
[9] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural
networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[10] Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M Roy,
and Surya Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape
geometry and the time evolution of the neural tangent kernel. Advances in Neural Information
Processing Systems, 33, 2020.
[11] Aditya Sharad Golatkar, Alessandro Achille, and Stefano Soatto. Time matters in regularizing
deep networks: Weight decay and data augmentation affect early learning dynamics, matter
little near convergence. Advances in Neural Information Processing Systems, 32:10678–10688,
2019.
[12] Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshmi-
narayanan. Augmix: A simple data processing method to improve robustness and uncertainty.
arXiv preprint arXiv:1912.02781, 2019.
[13] Philip TG Jackson, Amir Atapour Abarghouei, Stephen Bonner, Toby P Breckon, and Boguslaw
Obara. Style augmentation: data augmentation via style randomization. In CVPR Workshops,
volume 6, pages 10–11, 2019.
[14] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and
generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
[15] Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning
with importance sampling. In International conference on machine learning, pages 2525–2534.
PMLR, 2018.
[16] Leonard Kaufman and Peter J. Rousseeuw. Clustering by means of medoids. In Y. Dodge, editor, Statistical Data Analysis Based on the L1-Norm and Related Methods. North-Holland, 1987.
[17] Michael Kuchnik and Virginia Smith. Efficient augmentation via data subsampling. In Interna-
tional Conference on Learning Representations, 2018.

[18] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Roman Novak, Jascha
Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear
models under gradient descent. In NeurIPS, 2019.
[19] Joseph Lemley, Shabab Bazrafkan, and Peter Corcoran. Smart augmentation learning an optimal
data augmentation strategy. Ieee Access, 5:5858–5869, 2017.
[20] Fu Li, Hui Liu, and Richard J Vaccaro. Performance analysis for doa estimation algorithms:
unification, simplification, and observations. IEEE Transactions on Aerospace and Electronic
Systems, 29(4):1170–1184, 1993.
[21] Calvin Luo, Hossein Mobahi, and Samy Bengio. Data augmentation via structured adversarial
perturbations. arXiv preprint arXiv:2011.03010, 2020.
[22] Michel Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In
Optimization techniques, pages 234–243. Springer, 1978.
[23] Leon Mirsky. Symmetric gauge functions and unitarily invariant norms. The quarterly journal
of mathematics, 11(1):50–59, 1960.
[24] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint
arXiv:1411.1784, 2014.
[25] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas
Krause. Lazier than lazy greedy. In Twenty-Ninth AAAI Conference on Artificial Intelligence,
2015.
[26] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodu-
lar maximization: Identifying representative elements in massive data. In Advances in Neural
Information Processing Systems, pages 2049–2057, 2013.
[27] Samet Oymak, Zalan Fabian, Mingchen Li, and Mahdi Soltanolkotabi. Generalization guaran-
tees for neural networks via harnessing the low-rank structure of the jacobian. arXiv preprint
arXiv:1906.05392, 2019.
[28] Shashank Rajput, Zhili Feng, Zachary Charles, Po-Ling Loh, and Dimitris Papailiopoulos. Does
data augmentation lead to positive margin? In International Conference on Machine Learning,
pages 5321–5330. PMLR, 2019.
[29] Alexander J Ratner, Henry R Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher
Ré. Learning to compose domain-specific transformations for data augmentation. Advances in
neural information processing systems, 30:3239, 2017.
[30] Ruoqi Shen, Sébastien Bubeck, and Suriya Gunasekar. Data augmentation as feature manipula-
tion: a story of desert cows and grass cows. arXiv preprint arXiv:2203.01572, 2022.
[31] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep
learning. Journal of Big Data, 6(1):1–48, 2019.
[32] GW Stewart. A note on the perturbation of singular values. Linear Algebra and Its Applications,
28:213–216, 1979.
[33] Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. arXiv
preprint arXiv:1307.1493, 2013.
[34] Per-Åke Wedin. Perturbation bounds in connection with singular value decomposition. BIT
Numerical Mathematics, 12(1):99–111, 1972.
[35] Hermann Weyl. The asymptotic distribution law of the eigenvalues of linear partial differential
equations (with an application to the theory of cavity radiation). Mathematische Annalen, 71(4):441–479, 1912.
[36] Sen Wu, Hongyang Zhang, Gregory Valiant, and Christopher Ré. On the generalization effects of
linear transformations in data augmentation. In International Conference on Machine Learning,
pages 10410–10420. PMLR, 2020.

[37] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision
Conference 2016. British Machine Vision Association, 2016.

