Data Efficient Augmentation For Training Neural Networks (NeurIPS 2022)
1 Introduction
Standard (weak) data augmentation transforms the training examples with, e.g., rotations or crops for images, and trains on the transformed examples in place of the original training data. While weak augmentation is effective and computationally inexpensive, strong data augmentation (in addition to weak augmentation) is a key component in achieving nearly all state-of-the-art results in deep learning applications [31]. However, strong data augmentation techniques often increase the training time by orders of magnitude. First, they often rely on a very expensive pipeline to find or generate the complex transformations that best improve generalization [4, 13, 19, 36]. Second, appending transformed examples to the training data is often much more effective than training on the (strongly or weakly) transformed examples in place of the original data. For example, appending one transformed example to the training data is often much more effective than training on two transformed examples in place of every original training example, even though both strategies have the same computational cost (c.f. Appendix D.6). Hence, to obtain state-of-the-art performance, multiple augmented examples are added for every single data point at each training iteration [12, 36]. In this case, even if producing transformations is cheap, such methods increase the size of the training data by orders of magnitude.
¹Our code can be found at https://github.com/tianyu139/data-efficient-augmentation
²We note that our results are in line with those of [30], which, in parallel to our work, analyzed the effect of linear transformations on a two-layer convolutional network and showed that they can make hard-to-learn features more likely to be captured during training.
We show that for the state-of-the-art augmentation method of [36] applied to CIFAR10/ResNet20 it
is 3.43x faster to train on the whole dataset and only augment our coresets of size 30%, compared
to training and augmenting the whole dataset. At the same time, we achieve 75% of the accuracy
improvement of training on and augmenting the full data with the method of [36], outperforming
both max-loss and random baselines by up to 10%.
• When data is larger than the training budget: We show that we can achieve 71.99% test
accuracy on ResNet50/ImageNet when training on and augmenting only 30% subsets for 90
epochs. Compared to AutoAugment [6], despite using only 30% subsets, we achieve 92.8%
of the original reported accuracy while boasting 5x speedup in the training time. Similarly, on
Caltech256/ResNet18, training on and augmenting 10% coresets with AutoAugment yields 65.4%
accuracy, improving over random 10% subsets by 5.8% and over only weak augmentation by 17.4%.
• When data contains mislabeled examples: We show that training on and strongly augmenting
50% subsets using our method on CIFAR10 with 50% noisy labels achieves 76.20% test accuracy.
Notably, this yields a superior performance to training on and strongly augmenting the full data.
3 Problem Formulation
We begin by formally describing the problem of learning from augmented data. Consider a dataset $\mathcal{D}_{\mathrm{train}} = (X_{\mathrm{train}}, y_{\mathrm{train}})$, where $X_{\mathrm{train}} = (x_1, \cdots, x_n) \in \mathbb{R}^{d \times n}$ is the set of $n$ normalized data points $x_i \in [0, 1]^d$ from the index set $V$, and $y_{\mathrm{train}} = (y_1, \cdots, y_n)$ with $y_i \in \{\nu_1, \nu_2, \cdots, \nu_C\}$ and $\{\nu_j\}_{j=1}^{C} \in [0, 1]$.
The additive perturbation model. Following [28], we model data augmentation as an arbitrary bounded additive perturbation $\epsilon$, with $\|\epsilon\| \le \epsilon_0$. For a given $\epsilon_0$ and the set of all possible transformations $\mathcal{A}$, we study the transformations selected from $\mathcal{S} \subseteq \mathcal{A}$ satisfying
$\mathcal{S} = \{ T_i \in \mathcal{A} \mid \| T_i(x) - x \| \le \epsilon_0 \;\; \forall x \in X_{\mathrm{train}} \}.$  (1)
While the additive perturbation model cannot represent all augmentations, most real-world augmentations are bounded to preserve the regularities of natural images (e.g., AutoAugment [6] finds that a 6-degree rotation is optimal for CIFAR10). Thus, under local smoothness of images, additive perturbations can model bounded transformations such as small rotations, crops, shearing, and pixel-wise transformations like sharpening, blurring, color distortions, and structured adversarial perturbations [21]. As such, we find that the effects of additive augmentation on the singular spectrum hold even under real-world augmentation settings (c.f. Fig. 3 in the Appendix). However, this model is indeed limited
when applied to augmentations that cannot be reduced to perturbations, such as horizontal/vertical
flips and large translations. We extend our theoretical analysis to augmentations modeled as arbitrary
linear transforms (e.g. as mentioned, horizontal flips) in Appendix B.5.
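To make Eq. (1) concrete, the following Python sketch (ours, not the authors' pipeline; the toy data, the candidate transformations, and the value of $\epsilon_0$ are illustrative assumptions) keeps only the candidate transformations whose worst-case displacement on the training set stays within the budget.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32 * 32 * 3
eps0 = 0.5                                      # assumed budget; the paper leaves eps0 generic
X_train = rng.uniform(0.0, 1.0, size=(16, d))   # n normalized points in [0, 1]^d

def brightness(x, delta=0.005):
    # small pixel-wise additive shift
    return np.clip(x + delta, 0.0, 1.0)

def noise_jitter(x, scale=0.02):
    # bounded random pixel jitter (a stand-in for mild pixel-wise distortions)
    return np.clip(x + rng.uniform(-scale, scale, size=x.shape), 0.0, 1.0)

candidates = {"brightness": brightness, "jitter": noise_jitter}

# Eq. (1): keep only transformations whose worst-case displacement is within eps0.
S = {}
for name, T in candidates.items():
    worst = max(np.linalg.norm(T(x) - x) for x in X_train)
    print(f"{name:10s} worst-case ||T(x) - x|| = {worst:.3f}")
    if worst <= eps0:
        S[name] = T
print("admissible set S:", sorted(S))
```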
The set of augmentations at iteration $t$ generating $r$ augmented examples per data point can be specified, with abuse of notation, as $\mathcal{D}^t_{\mathrm{aug}} = \{ \cup_{i=1}^{r} (T^t_i(X_{\mathrm{train}}), y_{\mathrm{train}}) \}$, where $|\mathcal{D}^t_{\mathrm{aug}}| = rn$ and $T^t_i(X_{\mathrm{train}})$ transforms all the training data points with the set of transformations $T^t_i \subset \mathcal{S}$ at iteration $t$. We denote $X^t_{\mathrm{aug}} = \{ \cup_{i=1}^{r} T^t_i(X_{\mathrm{train}}) \}$ and $y^t_{\mathrm{aug}} = \{ \cup_{i=1}^{r} y_{\mathrm{train}} \}$.
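The bookkeeping above can be sketched as follows (a toy illustration; the sizes and the placeholder transformations are assumptions), stacking $r$ transformed copies of $X_{\mathrm{train}}$ so that $|\mathcal{D}^t_{\mathrm{aug}}| = rn$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 8, 10, 3                       # n points, d features, r transforms per point
X_train = rng.uniform(0.0, 1.0, (n, d))
y_train = rng.integers(0, 2, n).astype(float)

# r admissible transformations T_i^t drawn for this iteration (placeholders).
transforms = [lambda X, s=s: np.clip(X + s, 0.0, 1.0) for s in (0.01, -0.01, 0.02)]

# X_aug^t = union_i T_i^t(X_train),  y_aug^t = r copies of y_train.
X_aug = np.concatenate([T(X_train) for T in transforms], axis=0)
y_aug = np.tile(y_train, r)

assert X_aug.shape == (r * n, d) and y_aug.shape == (r * n,)   # |D_aug^t| = r * n
D_t = (np.concatenate([X_train, X_aug]), np.concatenate([y_train, y_aug]))  # D_train ∪ D_aug^t
print("augmented pairs:", X_aug.shape[0], "| D^t size:", D_t[0].shape[0])
```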
Training on the augmented data. Let $f(W, x)$ be an arbitrary neural network with $m$ vectorized (trainable) parameters $W \in \mathbb{R}^m$. We assume that the network is trained using (stochastic) gradient descent with learning rate $\eta$ to minimize the squared loss $\mathcal{L}$ over the original and augmented training examples $\mathcal{D}^t = \{\mathcal{D}_{\mathrm{train}} \cup \mathcal{D}^t_{\mathrm{aug}}\}$ with associated index set $V^t$, at every iteration $t$. I.e.,
$\mathcal{L}(W^t, X) := \frac{1}{2} \sum_{i \in V^t} \mathcal{L}_i(W^t, x_i) := \frac{1}{2} \sum_{(x_i, y_i) \in \mathcal{D}^t} \| f(W^t, x_i) - y_i \|_2^2.$  (2)
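A short PyTorch sketch of the objective in Eq. (2) (the model, sizes, and synthetic data are assumptions, not the paper's setup): the squared loss is summed over the original and appended augmented examples and halved.

```python
import torch

torch.manual_seed(0)
n, d, r = 8, 10, 2
X_train = torch.rand(n, d)
y_train = torch.rand(n, 1)

# Augmented copies (r per point) appended to the original data: D^t = D_train ∪ D_aug^t.
X_aug = torch.cat([torch.clamp(X_train + 0.01 * (i + 1), 0, 1) for i in range(r)])
y_aug = y_train.repeat(r, 1)
X_t, y_t = torch.cat([X_train, X_aug]), torch.cat([y_train, y_aug])

f = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))

def squared_loss(model, X, y):
    # Eq. (2): L(W^t, X) = 1/2 * sum_i ||f(W^t, x_i) - y_i||_2^2
    return 0.5 * ((model(X) - y) ** 2).sum()

loss = squared_loss(f, X_t, y_t)
loss.backward()                      # gradient used by the (stochastic) gradient descent step
print("L(W^t, X^t) =", float(loss))
```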
where $\Theta = U \Lambda U^T = \sum_i \lambda_i u_i u_i^T$ is the eigendecomposition of the NTK [1]. Although the constant NTK assumption holds only in the infinite-width limit, [18] found close empirical agreement between the NTK dynamics and the true dynamics for wide but practical networks, such as wide ResNet architectures [37]. Eq. (4) shows that the training dynamics depend on the alignment of the NTK with the residual vector at every iteration $t$. Next, we prove that for small perturbations $\epsilon_0$, data augmentation prevents overfitting and improves generalization by proportionally enlarging and perturbing smaller eigenvalues of the NTK relatively more, while preserving its prominent directions.
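The NTK quantities above can be probed empirically on a toy model. The sketch below (our illustration, not the paper's code; the scalar-output MLP and sizes are assumptions) stacks per-example gradients into the Jacobian $J$, forms the finite-width kernel $\Theta = J J^T$, and reads off its eigendecomposition.

```python
import torch

torch.manual_seed(0)
n, d = 16, 10
X = torch.rand(n, d)
f = torch.nn.Sequential(torch.nn.Linear(d, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

def jacobian_rows(model, X):
    """Stack the per-example gradients df(W, x_i)/dW into an n x m matrix J."""
    params = [p for p in model.parameters() if p.requires_grad]
    rows = []
    for i in range(X.shape[0]):
        out = model(X[i : i + 1]).squeeze()          # scalar output f(W, x_i)
        grads = torch.autograd.grad(out, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    return torch.stack(rows)

J = jacobian_rows(f, X)                      # n x m
theta = J @ J.T                              # empirical (finite-width) NTK, Theta = J J^T
eigvals, eigvecs = torch.linalg.eigh(theta)  # Theta = U Lambda U^T
print("largest / smallest NTK eigenvalues:", eigvals[-1].item(), eigvals[0].item())
```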
We first investigate the effect of data augmentation on the singular values of the Jacobian, and use this result to bound the change in the eigenvalues of the NTK. To characterize the effect of data augmentation on the singular values of the perturbed Jacobian $\tilde{J}$, we rely on Weyl's theorem [35], which states that under bounded perturbations $E$, no singular value can move more than the norm of the perturbation. Formally, $|\tilde{\sigma}_i - \sigma_i| \le \|E\|_2$, where $\tilde{\sigma}_i$ and $\sigma_i$ are the singular values of the perturbed and original Jacobian, respectively. Crucially, data augmentation affects larger and smaller singular values differently. Let $P$ be the orthogonal projection onto the column space of $J^T$, and $P_\perp = I - P$ the projection onto its orthogonal complement subspace.
Figure 1: Effect of augmentations on the singular spectrum of the network Jacobian of ResNet20 trained on CIFAR10, and of an MLP trained on MNIST, trained until epoch 15. Panels (a), (b) show the difference in singular values ($\Delta\sigma_i$) and panels (c), (d) the singular subspace angles between the original and augmented data, for bounded perturbations with $\epsilon_0 = 8$ and $\epsilon_0 = 16$ and for different ranges of singular values. Augmentations with a larger bound $\epsilon_0$ result in larger perturbations to the singular spectrum.
Then, the singular values of the perturbed Jacobian $\tilde{J}^T$ are $\tilde{\sigma}_i^2 = (\sigma_i + \mu_i)^2 + \zeta_i^2$, where $|\mu_i| \le \|P E\|_2$ and $\sigma_{\min}(P_\perp E) \le \zeta_i \le \|P_\perp E\|_2$, with $\sigma_{\min}$ the smallest singular value of $J^T$ [32]. Since the eigenvalues of the projection matrix $P$ are either 0 or 1, as the number of dimensions $m$ grows, for bounded perturbations we get that on average $\mu_i^2 = O(1)$ and $\zeta_i^2 = O(m)$. Thus, the second term dominates, and the increase of small singular values under perturbation is proportional to $\sqrt{m}$. However, for larger singular values, the first term dominates and hence $\tilde{\sigma}_i - \sigma_i \cong \mu_i$. Thus, in general, small singular values can become proportionally larger, while larger singular values remain relatively unchanged. The following lemma characterizes the expected change to the eigenvalues of the NTK.
Lemma 4.1. Data augmentation as additive perturbations bounded by small $\epsilon_0$ results in the following expected change to the eigenvalues of the NTK:
$\mathbb{E}[\tilde{\lambda}_i] = \mathbb{E}[\tilde{\sigma}_i^2] = \sigma_i^2 + \sigma_i (1 - 2 p_i) \|E\| + \|E\|^2 / 3,$  (5)
where $p_i := \mathbb{P}(\tilde{\sigma}_i - \sigma_i < 0)$ is the probability that $\sigma_i$ decreases as a result of data augmentation, and is smaller for smaller singular values.
The proof can be found in Appendix A.1.
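A quick Monte Carlo sketch of this behavior (random matrices stand in for the Jacobian and the augmentation perturbation; this illustrates the trend rather than proving Lemma 4.1): under bounded random perturbations, the smallest singular values grow by a much larger relative amount than the largest ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 20, 2000, 50          # n examples, m >> n parameters
J = rng.normal(size=(n, m))
# Impose an anisotropic spectrum: a few large and many small singular values.
U, s, Vt = np.linalg.svd(J, full_matrices=False)
s = np.concatenate([np.linspace(200, 100, 4), np.linspace(2, 1, n - 4)])
J = (U * s) @ Vt

rel_change = np.zeros(n)
for _ in range(trials):
    E = rng.uniform(-1, 1, size=(n, m)) / np.sqrt(m)   # bounded perturbation, ||E|| = O(1)
    s_tilde = np.linalg.svd(J + E, compute_uv=False)
    rel_change += (s_tilde - s) / s
rel_change /= trials

print("avg relative change, top-4 singular values :", rel_change[:4].round(4))
print("avg relative change, smallest singular vals:", rel_change[-4:].round(4))
```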
Next, we discuss the effect of data augmentation on singular vectors of the Jacobian and show that
it mainly affects the non-prominent directions of the Jacobian spectrum, but to a smaller extent
compared to the singular values.
Here, we focus on characterizing the effect of data augmentation on the eigenspace of the NTK. Let the singular subspace decomposition of the Jacobian be $J = U \Sigma V^T$. Then, for the NTK we have $\Theta = J J^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T$ (since $V^T V = I$). Hence, the perturbation of the eigenspace of the NTK is the same as the perturbation of the left singular subspace of the Jacobian $J$. Suppose $\sigma_i$ are the singular values of the Jacobian. Let the perturbed Jacobian be $\tilde{J} = J + E$, and denote the eigengap $\gamma_0 = \min\{\sigma_i - \sigma_{i+1} : i = 1, \cdots, r\}$, where $\sigma_{r+1} := 0$. Assuming $\gamma_0 \ge 2 \|E\|_2$, a combination of Wedin's theorem [34] and Mirsky's inequality [23] implies
$\| u_i - \tilde{u}_i \| \le 2\sqrt{2}\, \|E\| / \gamma_0.$  (6)
This result provides an upper bound on the change of every left singular vector of the Jacobian. However, as we discuss below, data augmentation affects larger and smaller singular directions differently. To see the effect of data augmentation on each singular vector of the Jacobian, let the subspace decomposition of the Jacobian be $J = U \Sigma V^T = U_s \Sigma_s V_s^T + U_n \Sigma_n V_n^T$, where $U_s$, associated with nonzero singular values, spans the column space of $J$, also called the signal subspace, and $U_n$, associated with zero singular values ($\Sigma_n = 0$), spans the orthogonal space of $U_s$, also called the noise subspace. Similarly, let the subspace decomposition of the perturbed Jacobian be $\tilde{J} = \tilde{U} \tilde{\Sigma} \tilde{V}^T = \tilde{U}_s \tilde{\Sigma}_s \tilde{V}_s^T + \tilde{U}_n \tilde{\Sigma}_n \tilde{V}_n^T$, and $\tilde{U}_s = U_s + \Delta U_s$, where $\Delta U_s$ is the perturbation of the singular vectors that span the signal subspace. Then, the following general first-order expression for the perturbation of the orthogonal subspace due to perturbations of the Jacobian characterizes the change of the singular directions: $\Delta U_s = U_n U_n^T E V_s \Sigma_s^{-1}$ [20]. We see that singular vectors associated with larger singular values are more robust to data augmentation, compared to others. Note that, in general, singular vectors are more robust than singular values.
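The sketch below (a synthetic low-rank Jacobian with assumed sizes, not the paper's experiment) compares the exact change of the signal subspace under a small perturbation with the first-order expression $\Delta U_s = U_n U_n^T E V_s \Sigma_s^{-1}$ and with the bound of Eq. (6).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 30, 200, 5                          # Jacobian is n x m with rank k (signal dim)
A, B = rng.normal(size=(n, k)), rng.normal(size=(k, m))
J = A @ np.diag([50, 40, 30, 20, 10]) @ B / np.sqrt(m)   # low-rank stand-in for the Jacobian

U, s, Vt = np.linalg.svd(J, full_matrices=True)
Us, Un, Ss, Vs = U[:, :k], U[:, k:], s[:k], Vt[:k].T     # signal / noise subspaces

E = 0.05 * rng.normal(size=(n, m))                       # small bounded perturbation
Ut, st, Vtt = np.linalg.svd(J + E, full_matrices=True)
Us_t = Ut[:, :k].copy()
Us_t *= np.sign(np.sum(Us * Us_t, axis=0))               # fix SVD sign ambiguity

dUs_exact = Us_t - Us
dUs_first_order = Un @ Un.T @ E @ Vs @ np.diag(1.0 / Ss) # Delta U_s = U_n U_n^T E V_s S_s^-1

gaps = -np.diff(np.append(s[:k], 0.0))                   # sigma_i - sigma_{i+1}, sigma_{k+1} := 0
gamma0 = gaps.min()
bound = 2 * np.sqrt(2) * np.linalg.norm(E, 2) / gamma0   # Eq. (6) upper bound per vector

print("per-vector ||u_i - u~_i||      :", np.linalg.norm(dUs_exact, axis=0).round(4))
print("first-order prediction (norms) :", np.linalg.norm(dUs_first_order, axis=0).round(4))
print("Eq. (6) bound 2*sqrt(2)||E||/g0:", round(bound, 4))
```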
Fig. 1 shows the effect of perturbations with $\epsilon_0 = 8, 16$ on the singular values and singular vectors of the Jacobian matrix for a one-hidden-layer MLP trained on MNIST, and ResNet20 trained on CIFAR10. As calculating the entire Jacobian spectrum is computationally prohibitive, the data is subsampled from 3 classes. We report the effect of other real-world augmentation techniques, such as random crops, flips, rotations, and AutoAugment [6] (which includes translations, contrast, and brightness transforms), in Appendix C. We observe that data augmentation increases smaller singular values relatively more. On the other hand, it affects the prominent singular vectors of the Jacobian to a smaller extent.
Here, we focus on identifying subsets of data that, when augmented, similarly improve generalization and prevent overfitting. To do so, our key idea is to find subsets of data points that, when augmented, closely capture the alignment of the NTK (or, equivalently, the Jacobian) corresponding to the full augmented data with the residual vector, $J^T(W^t, X^t_{\mathrm{aug}})\, r^t_{\mathrm{aug}}$. If such subsets can be found, augmenting only the subsets will change the NTK and its alignment with the residual in a similar way as full data augmentation, and will result in similarly improved training dynamics. However, generating the full set of transformations $X^t_{\mathrm{aug}}$ is often very expensive, particularly for strong augmentations and large datasets. Hence, generating the transformations and then extracting the subsets may not provide a considerable overall speedup.
In the following, we show that weighted subsets (coresets) $S$ that closely estimate the alignment of the Jacobian associated with the original data with the residual vector, $J^T(W^t, X_{\mathrm{train}})\, r_{\mathrm{train}}$, can closely estimate the alignment of the Jacobian of the full augmented data with the corresponding residual, $J^T(W^t, X^t_{\mathrm{aug}})\, r^t_{\mathrm{aug}}$. Thus, the most effective subsets for augmentation can be directly found from the training data.
Algorithm 1 CORESETS FOR EFFICIENT DATA AUGMENTATION
Require: The dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, number of iterations $T$.
Ensure: Output model parameters $W^T$.
1: for $t = 1, \cdots, T$ do
2:     $X^t_{\mathrm{aug}} = \emptyset$.
3:     for $c \in \{1, \cdots, C\}$ do
4:         $S^t_c = \emptyset$, $[G_{S^t_c}]_i = c_1 \mathbf{1} \;\; \forall i$.
5:         while $\|G_{S^t_c}\|_F \ge \xi$ do    ▷ Extract a coreset from class $c$ by solving Eq. (9)
6:             $S^t_c = \{ S^t_c \cup \arg\max_{s \in V \setminus S^t_c} (\|G_{S^t_c}\|_F - \|G_{\{S^t_c \cup \{s\}\}}\|_F) \}$
7:         end while
8:         $\gamma_j = \sum_{i \in V_c} \mathbb{I}[\, j = \arg\min_{j' \in S^t_c} \| J^T(W^t, x_i) r_i - J^T(W^t, x_{j'}) r_{j'} \| \,]$    ▷ Coreset weights
9:         $X^t_{\mathrm{aug}} = \{ X^t_{\mathrm{aug}} \cup \{ \cup_{i=1}^{r} T^t_i(X_{S^t_c}) \} \}$    ▷ Augment the coreset
10:        $\rho^t_j = \gamma^t_j / r$
11:    end for
12:    Update $W^{t+1}$ with the weighted gradient step of Eq. (10) on $X^t = X^t_{\mathrm{aug}}$ or $X^t = \{X_{\mathrm{train}} \cup X^t_{\mathrm{aug}}\}$.
13: end for
Formally, subsets $S^t_*$ weighted by $\gamma^t_S$ that capture the alignment of the full Jacobian with the residual by an error of at most $\xi$ can be found by solving the following optimization problem:
$S^t_* = \arg\min_{S \subseteq V} |S| \;\; \text{s.t.} \;\; \| J^T(W^t, X^t) r^t - \mathrm{diag}(\gamma^t_S) J^T(W^t, X^t_S) r^t_S \| \le \xi.$  (8)
Solving the above optimization problem is NP-hard. However, as we discuss in Appendix A.5, a near-optimal subset can be found by minimizing the Frobenius norm of a matrix $G_S$, in which the $i$-th row contains the Euclidean distance between data point $i$ and its closest element in the subset $S$, in the gradient space. Formally, $[G_S]_i = \min_{j' \in S} \| J^T(W^t, x_i) r_i - J^T(W^t, x_{j'}) r_{j'} \|$. When $S = \emptyset$, $[G_S]_i = c_1 \mathbf{1}$, where $c_1$ is a big constant. Intuitively, such subsets contain the set of medoids of the dataset in the gradient space. Medoids of a dataset are defined as the most centrally located elements in the dataset [16]. The weight of every element $j \in S$ is the number of data points closest to it in the gradient space, i.e., $\gamma_j = \sum_{i \in V} \mathbb{I}[\, j = \arg\min_{j' \in S} \| J^T(W^t, x_i) r_i - J^T(W^t, x_{j'}) r_{j'} \| \,]$. The set of medoids can be found by solving the following submodular³ cover problem:
$S^t_* = \arg\min_{S \subseteq V} |S| \;\; \text{s.t.} \;\; \| G_S \|_F \le \xi.$  (9)
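A minimal NumPy sketch of these quantities (the per-example gradient proxies $J^T(W^t, x_i) r_i$ are random stand-ins, and $c_1$ is an assumed large constant): it evaluates the rows of $G_S$, the Frobenius norm used in Eq. (9), and the medoid weights $\gamma_j$ for an arbitrary candidate subset.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 16
# Stand-in for per-example gradients J^T(W^t, x_i) r_i (in practice approximated,
# e.g., by last-layer gradients [15]).
g = rng.normal(size=(n, p))
c1 = 1e6                                           # large constant used when S is empty

def G_rows(S, g):
    """[G_S]_i = min_{j' in S} ||g_i - g_{j'}||, or c1 if S is empty."""
    if not S:
        return np.full(g.shape[0], c1)
    d = np.linalg.norm(g[:, None, :] - g[None, list(S), :], axis=-1)  # n x |S|
    return d.min(axis=1)

def weights(S, g):
    """gamma_j = number of points whose closest coreset element (in gradient space) is j."""
    d = np.linalg.norm(g[:, None, :] - g[None, list(S), :], axis=-1)
    nearest = np.array(list(S))[d.argmin(axis=1)]
    return {j: int((nearest == j).sum()) for j in S}

S = [3, 17, 42]                                    # an arbitrary candidate subset
print("||G_S||_F =", round(float(np.linalg.norm(G_rows(S, g))), 3))
print("gamma     =", weights(S, g))
```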
The classical greedy algorithm provides a logarithmic approximation for the above submodular cover problem, i.e., $|S| \le (1 + \ln(n))$. It starts with the empty set $S_0 = \emptyset$, and at each iteration $\tau$ it selects the training example $s \in V \setminus S_{\tau-1}$ that maximizes the marginal gain, i.e., $S_\tau = S_{\tau-1} \cup \{ \arg\max_{s \in V \setminus S_{\tau-1}} (\| G_{S_{\tau-1}} \|_F - \| G_{\{S_{\tau-1} \cup \{s\}\}} \|_F) \}$. The $O(nk)$ computational complexity of the greedy algorithm can be reduced to $O(n)$ using randomized methods [25], and further improved using lazy evaluation [22] and distributed implementations [26]. The rows of the matrix $G$ can be efficiently upper-bounded using the gradient of the loss w.r.t. the input to the last layer of the network, which has been shown to capture the variation of the gradient norms closely [15]. This upper bound is only marginally more expensive to compute than the value of the loss; hence, the subset can be found efficiently. Better approximations can be obtained by considering earlier layers in addition to the last two, at the expense of greater computational cost.
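The greedy selection can be sketched as follows (a simplified, single-class illustration on the same kind of clustered gradient proxies; the full method described above additionally uses per-class selection, lazy/stochastic greedy, and last-layer gradient approximations, which this sketch omits):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, xi = 60, 16, 15.0
centers = 5.0 * rng.normal(size=(4, p))
g = centers[rng.integers(0, 4, n)] + 0.3 * rng.normal(size=(n, p))   # gradient proxies
D = np.linalg.norm(g[:, None, :] - g[None, :, :], axis=-1)           # pairwise distances
c1 = 1e6

def G_norm(S):
    # ||G_S||_F with [G_S]_i = min_{j in S} D[i, j] (c1 if S is empty)
    rows = D[:, S].min(axis=1) if S else np.full(n, c1)
    return np.linalg.norm(rows)

S = []
while G_norm(S) >= xi:                         # submodular cover: stop once ||G_S||_F <= xi
    cands = [s for s in range(n) if s not in S]
    gains = [G_norm(S) - G_norm(S + [s]) for s in cands]
    S.append(cands[int(np.argmax(gains))])     # greedy: largest marginal reduction

gamma = np.bincount(np.array(S)[D[:, S].argmin(axis=1)], minlength=n)[S]  # medoid weights
print("coreset:", S, "| ||G_S||_F =", round(float(G_norm(S)), 2), "| gamma:", gamma.tolist())
```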
At every iteration $t$ during training, we select a coreset from every class $c \in [C]$ separately, and apply the set of transformations $\{T^t_i\}_{i=1}^{r}$ only to the elements of the coresets, i.e., $X^t_{\mathrm{aug}} = \{ \cup_{i=1}^{r} T^t_i(X_{S^t}) \}$. We divide the weight of every element $j$ in the coreset equally among its transformations, i.e., the final weight is $\rho^t_j = \gamma^t_j / r$ if $j \in S^t$. We apply the gradient descent updates in Eq. (3) to the weighted Jacobian matrix of $X^t = X^t_{\mathrm{aug}}$ or $X^t = \{ X_{\mathrm{train}} \cup X^t_{\mathrm{aug}} \}$ (viewing $\rho^t$ as $\rho^t \in \mathbb{R}^n$) as follows:
$W^{t+1} = W^t - \eta \left( \mathrm{diag}(\rho^t)\, J(W^t, X^t) \right)^T r^t.$  (10)
The pseudocode is illustrated in Alg. 1.
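A short PyTorch sketch of the weighted update (toy model and data; the coreset points and weights $\gamma$ are assumed given, e.g., by Alg. 1). Weighting each example's squared loss by $\rho_j$ and taking a gradient step matches the form of Eq. (10) (up to the sign convention for the residual), since the gradient of the weighted loss is $(\mathrm{diag}(\rho) J)^T r$.

```python
import torch

torch.manual_seed(0)
n_sub, d, r, eta = 5, 10, 2, 0.05
f = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))

X_S = torch.rand(n_sub, d)                 # coreset points (assumed given)
y_S = torch.rand(n_sub, 1)
gamma = torch.tensor([12.0, 7.0, 30.0, 5.0, 6.0])   # coreset weights from Alg. 1 (assumed)

# Augment the coreset r times and split each weight across its transformations: rho_j = gamma_j / r.
X_aug = torch.cat([torch.clamp(X_S + 0.01 * (i + 1), 0, 1) for i in range(r)])
y_aug = y_S.repeat(r, 1)
rho = (gamma / r).repeat(r)

# Weighted squared loss; its gradient equals (diag(rho) J(W, X_aug))^T r_aug.
loss = 0.5 * (rho * ((f(X_aug) - y_aug) ** 2).squeeze(1)).sum()
loss.backward()
with torch.no_grad():
    for p in f.parameters():
        p -= eta * p.grad                 # W^{t+1} = W^t - eta * weighted gradient
print("weighted loss:", float(loss))
```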
The following Lemma upper bounds the difference between the alignment of the Jacobian and residual
for augmented coreset vs. full augmented data.
³A set function $F: 2^V \rightarrow \mathbb{R}^+$ is submodular if $F(S \cup \{e\}) - F(S) \ge F(T \cup \{e\}) - F(T)$ for any $S \subseteq T \subseteq V$ and $e \in V \setminus T$.
Table 1: Training ResNet20 (R20) and WideResNet-28-10 (W2810) on CIFAR10 (C10) using small subsets, and ResNet18 (R18) on Caltech256 (Cal). We compare accuracies of training on and strongly (and weakly) augmenting subsets. For CIFAR10, training and augmenting subsets selected by max-loss performed poorly and did not converge. The average number of examples per class in each subset is shown in parentheses. Appendix D.4 shows baseline accuracies from only weak augmentations.

| Subset | C10/R20 0.1% (5) | C10/R20 0.2% (10) | C10/R20 0.5% (25) | C10/R20 1% (50) | C10/W2810 1% (50) | Cal/R18 5% (3) | Cal/R18 10% (6) | Cal/R18 20% (12) | Cal/R18 30% (18) | Cal/R18 40% (24) | Cal/R18 50% (30) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Max-loss | < 15% | < 15% | < 15% | < 15% | < 15% | 19.2 | 50.6 | 71.3 | 75.6 | 77.3 | 78.6 |
| Random | 33.5 | 42.7 | 58.7 | 74.4 | 57.7 | 41.5 | 61.8 | 72.5 | 75.7 | 77.6 | 78.5 |
| Ours | 37.8 | 45.1 | 63.9 | 74.7 | 62.1 | 52.7 | 65.4 | 73.1 | 76.3 | 77.7 | 78.9 |
Lemma 5.1. Let $S$ be a coreset that captures the alignment of the full data NTK with the residual with an error of at most $\xi$, as in Eq. (8). Augmenting the coreset with perturbations bounded by $\epsilon_0 \le \frac{1}{3\sqrt{n^2 L}}$ captures the alignment of the fully augmented data with the residual with an error of at most
$\| J^T(W^t, X_{\mathrm{aug}})\, r - \mathrm{diag}(\rho^t)\, J^T(W^t, X_{S,\mathrm{aug}})\, r_S \| \le \xi + O(\sqrt{L}).$  (11)
6 Experiments
Setup and baselines. We extensively evaluate the performance of our approach in three different
settings. Firstly, we consider training only on coresets and their augmentations. Secondly, we
investigate the effect of adding augmented coresets to the full training data. Finally, we consider
adding augmented coresets to random subsets. We compare our coresets with max-loss and random
subsets as baselines. For all methods, we select a new augmentation subset every R epochs. We note
that the original max-loss method [17] selects points using a fully trained model; hence, it can only select one subset throughout training.
Table 2: Caltech256/ResNet18 with the same settings as Tab. 1, with default weak augmentations but varying strong augmentations.

| Augmentation | Random 30% | Random 40% | Random 50% | Ours 30% | Ours 40% | Ours 50% |
|---|---|---|---|---|---|---|
| CutOut | 43.32 | 62.84 | 76.21 | 55.53 | 66.10 | 76.91 |
| AugMix | 40.77 | 61.81 | 72.17 | 52.72 | 64.91 | 73.01 |
| Perturb | 48.51 | 66.20 | 75.34 | 58.29 | 67.47 | 76.50 |

Table 3: Training on full data and strongly (and weakly) augmenting random subsets, max-loss subsets, and coresets on TinyImageNet/ResNet50, R = 15.

| Subset | Random 20% | Random 30% | Random 50% | Max-loss 20% | Max-loss 30% | Max-loss 50% | Ours 20% | Ours 30% | Ours 50% |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 50.97 | 52.00 | 54.92 | 51.30 | 52.34 | 53.37 | 51.99 | 54.30 | 55.16 |
Table 4: Accuracy improvement by augmenting subsets found by our method vs. max-loss and random, over the improvement of full (weak and strong) data augmentation (F.A.) compared to weak augmentation only (W.A.). The table shows the results for training on CIFAR10 (C10)/ResNet20 (R20), SVHN/ResNet32 (R32), and CIFAR10-Imbalanced (C10-IB)/ResNet32, with R = 20.

| Dataset | W.A. Acc | F.A. Acc | Random 5% | Random 10% | Random 30% | Max-loss 5% | Max-loss 10% | Max-loss 30% | Ours 5% | Ours 10% | Ours 30% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| C10/R20 | 89.46 | 93.50 | 21.8% | 39.9% | 65.6% | 32.9% | 47.8% | 73.5% | 34.9% | 51.5% | 75.0% |
| C10-IB/R32 | 87.08 | 92.48 | 25.9% | 45.2% | 74.6% | 31.3% | 39.6% | 74.6% | 37.4% | 49.4% | 74.8% |
| SVHN/R32 | 95.68 | 97.07 | 5.8% | 36.7% | 64.1% | 35.3% | 49.7% | 76.4% | 31.7% | 48.3% | 80.0% |
To maximize fairness, we modify our max-loss baseline to select a new subset at every subset selection step. For all experiments, standard weak augmentations (random crop and horizontal flips) are always performed on both the original and strongly augmented data.
Table 5: Training ResNet20 on CIFAR10 with 50% label noise, R = 20. Accuracy without strong augmentation is 70.72 ± 0.20 and the accuracy of full (weak and strong) data augmentation is 75.87 ± 0.77. Note that augmenting 50% subsets outperforms augmenting the full data (marked **).

| Subset | Random | Max-loss | Ours |
|---|---|---|---|
| 10% | 72.32 ± 0.14 | 71.83 ± 0.13 | 73.02 ± 1.06 |
| 30% | 74.46 ± 0.27 | 72.45 ± 0.48 | 74.67 ± 0.15 |
| 50% | 75.36 ± 0.05 | 73.23 ± 0.72 | 76.20 ± 0.75** |
30% coresets, while obtaining 75% of the improvement of full data augmentation. We provide wall-
clock times for finding coresets from Caltech256 and TinyImageNet in Appendix D.7.
7 Conclusion
We showed that data augmentation improves training and generalization by relatively enlarging and perturbing the smaller singular values of the neural network Jacobian while preserving its prominent directions. We then proposed a framework to iteratively extract small coresets of training data that, when augmented, closely capture the alignment of the fully augmented Jacobian with the label/residual vector. We showed the effectiveness of augmenting coresets in providing superior generalization performance when added to the full data or to random subsets, in the presence of noisy labels, or as a standalone subset. Under local smoothness of images, our additive perturbation model can represent many bounded transformations such as small rotations, crops, shearing, and pixel-wise transformations like sharpening, blurring, color distortions, and structured adversarial perturbations [21]. However, the additive perturbation model is limited when applied to augmentations that cannot be reduced to perturbations, such as horizontal/vertical flips and large translations. Further theoretical analysis of complex data augmentations is an interesting direction for future work.
8 Acknowledgements
This research was supported in part by the National Science Foundation CAREER Award 2146492,
and the UCLA-Amazon Science Hub for Humanity and AI.
References
[1] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of op-
timization and generalization for overparameterized two-layer neural networks. In International
Conference on Machine Learning, pages 322–332. PMLR, 2019.
[2] Shumeet Baluja and Ian Fischer. Adversarial transformation networks: Learning to generate
adversarial examples. arXiv preprint arXiv:1703.09387, 2017.
[3] Chris M Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.
[4] Christopher Bowles, Roger Gunn, Alexander Hammers, and Daniel Rueckert. Gansfer learning:
Combining labelled and unlabelled data for gan based data augmentation. arXiv preprint
arXiv:1811.10669, 2018.
[5] Shuxiao Chen, Edgar Dobriban, and Jane H Lee. Invariance reduces variance: Understanding
data augmentation in deep learning and beyond. arXiv preprint arXiv:1907.10905, 2019.
[6] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment:
Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 113–123, 2019.
[7] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical
automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
[8] Tri Dao, Albert Gu, Alexander Ratner, Virginia Smith, Chris De Sa, and Christopher Ré. A
kernel theory of modern data augmentation. In International Conference on Machine Learning,
pages 1528–1537. PMLR, 2019.
[9] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural
networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[10] Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M Roy,
and Surya Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape
geometry and the time evolution of the neural tangent kernel. Advances in Neural Information
Processing Systems, 33, 2020.
[11] Aditya Sharad Golatkar, Alessandro Achille, and Stefano Soatto. Time matters in regularizing
deep networks: Weight decay and data augmentation affect early learning dynamics, matter
little near convergence. Advances in Neural Information Processing Systems, 32:10678–10688,
2019.
[12] Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshmi-
narayanan. Augmix: A simple data processing method to improve robustness and uncertainty.
arXiv preprint arXiv:1912.02781, 2019.
[13] Philip TG Jackson, Amir Atapour Abarghouei, Stephen Bonner, Toby P Breckon, and Boguslaw
Obara. Style augmentation: data augmentation via style randomization. In CVPR Workshops,
volume 6, pages 10–11, 2019.
[14] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and
generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
[15] Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning
with importance sampling. In International conference on machine learning, pages 2525–2534.
PMLR, 2018.
[16] L. Kaufman and P. J. Rousseeuw. Clustering by means of medoids. In Y. Dodge, editor, Statistical Data Analysis Based on the L1-Norm and Related Methods, 1987.
[17] Michael Kuchnik and Virginia Smith. Efficient augmentation via data subsampling. In Interna-
tional Conference on Learning Representations, 2018.
[18] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Roman Novak, Jascha
Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear
models under gradient descent. In NeurIPS, 2019.
[19] Joseph Lemley, Shabab Bazrafkan, and Peter Corcoran. Smart augmentation learning an optimal
data augmentation strategy. IEEE Access, 5:5858–5869, 2017.
[20] Fu Li, Hui Liu, and Richard J Vaccaro. Performance analysis for doa estimation algorithms:
unification, simplification, and observations. IEEE Transactions on Aerospace and Electronic
Systems, 29(4):1170–1184, 1993.
[21] Calvin Luo, Hossein Mobahi, and Samy Bengio. Data augmentation via structured adversarial
perturbations. arXiv preprint arXiv:2011.03010, 2020.
[22] Michel Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In
Optimization techniques, pages 234–243. Springer, 1978.
[23] Leon Mirsky. Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics, 11(1):50–59, 1960.
[24] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint
arXiv:1411.1784, 2014.
[25] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas
Krause. Lazier than lazy greedy. In Twenty-Ninth AAAI Conference on Artificial Intelligence,
2015.
[26] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodu-
lar maximization: Identifying representative elements in massive data. In Advances in Neural
Information Processing Systems, pages 2049–2057, 2013.
[27] Samet Oymak, Zalan Fabian, Mingchen Li, and Mahdi Soltanolkotabi. Generalization guaran-
tees for neural networks via harnessing the low-rank structure of the jacobian. arXiv preprint
arXiv:1906.05392, 2019.
[28] Shashank Rajput, Zhili Feng, Zachary Charles, Po-Ling Loh, and Dimitris Papailiopoulos. Does
data augmentation lead to positive margin? In International Conference on Machine Learning,
pages 5321–5330. PMLR, 2019.
[29] Alexander J Ratner, Henry R Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher
Ré. Learning to compose domain-specific transformations for data augmentation. Advances in
neural information processing systems, 30:3239, 2017.
[30] Ruoqi Shen, Sébastien Bubeck, and Suriya Gunasekar. Data augmentation as feature manipula-
tion: a story of desert cows and grass cows. arXiv preprint arXiv:2203.01572, 2022.
[31] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep
learning. Journal of Big Data, 6(1):1–48, 2019.
[32] GW Stewart. A note on the perturbation of singular values. Linear Algebra and Its Applications,
28:213–216, 1979.
[33] Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. arXiv
preprint arXiv:1307.1493, 2013.
[34] Per-Åke Wedin. Perturbation bounds in connection with singular value decomposition. BIT
Numerical Mathematics, 12(1):99–111, 1972.
[35] Hermann Weyl. The asymptotic distribution law of the eigenvalues of linear partial differential equations (with an application to the theory of cavity radiation). Mathematische Annalen, 71(4):441–479, 1912.
[36] Sen Wu, Hongyang Zhang, Gregory Valiant, and Christopher Ré. On the generalization effects of
linear transformations in data augmentation. In International Conference on Machine Learning,
pages 10410–10420. PMLR, 2020.
[37] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision
Conference 2016. British Machine Vision Association, 2016.