Just How Flexible Are Neural Networks in Practice
Abstract
It is widely believed that a neural network can fit a training set containing at least
as many samples as it has parameters, underpinning notions of overparameterized
and underparameterized models. In practice, however, we only find solutions
accessible via our training procedure, including the optimizer and regularizers,
limiting flexibility. Moreover, the exact parameterization of the function class,
built into an architecture, shapes its loss surface and impacts the minima we find.
In this work, we examine the ability of neural networks to fit data in practice.
Our findings indicate that: (1) standard optimizers find minima where the model
can only fit training sets with significantly fewer samples than it has parameters;
(2) convolutional networks are more parameter-efficient than MLPs and ViTs,
even on randomly labeled data; (3) while stochastic training is thought to have
a regularizing effect, SGD actually finds minima that fit more training data than
full-batch gradient descent; (4) the difference in capacity to fit correctly labeled and
incorrectly labeled samples can be predictive of generalization; (5) ReLU activation
functions result in finding minima that fit more data despite being designed to avoid
vanishing and exploding gradients in deep architectures.
1 Introduction
Neural networks are often assumed to be capable of fitting about as many samples as they have
parameters [1, 2, 3]. This intuition can be most easily understood through linear regression, where a
regressor with more coefficients than training samples forms an underdetermined linear system of
equations and can therefore precisely fit any function of the training points. For example, consider
that for any training set $\{(x_i, y_i)\}_{i=0}^{n}$ with $n \le d$, there exist parameters $\{a_j\}_{j=0}^{d}$ such that $f(x) = \sum_{j=0}^{d} a_j x^j$ satisfies $f(x_i) = y_i$ for all $i$, as long as no two training points share the same input but have different labels.
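As a concrete illustration of this linear-regression intuition, the short NumPy sketch below (illustrative only, not code from our experiments) interpolates arbitrary labels with a polynomial whose coefficient count is at least the number of training points.

```python
import numpy as np

# Illustrative sketch: a degree-d polynomial with d + 1 >= n coefficients can
# interpolate any labeling of n distinct inputs, since the Vandermonde system
# V a = y is then (under)determined.
rng = np.random.default_rng(0)
n, d = 5, 7                                   # n training points, degree d >= n - 1
x = rng.uniform(-1.0, 1.0, size=n)            # distinct inputs
y = rng.normal(size=n)                        # arbitrary ("random") labels

V = np.vander(x, N=d + 1, increasing=True)    # V[i, j] = x_i ** j
a, *_ = np.linalg.lstsq(V, y, rcond=None)     # minimum-norm solution of V a = y

assert np.allclose(V @ a, y)                  # every training point is fit exactly
```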
The theory underlying neural networks is significantly more complicated. A variety of approximation
theories bound the number of parameters or hidden units required by a neural network architecture
to approximate a certain function class on its domain, which is typically infinite [4, 5, 6]. On finite
domains, namely a training set, overparameterized neural networks with many more parameters than
training samples can easily fit randomly labeled data, raising questions regarding how such flexible
models can still generalize to new unseen test data [1].
In this work, we step back and ask just how flexible neural networks really are in practice. Although
neural networks are theoretically capable of universal function approximation [4], in practice we train
models with limited capacity and only find optima during training that are accessible via our training
procedure, often leading to significantly reduced flexibility as suboptimal local minima exist [7]. How
much data we can fit depends on factors like the nature of the data itself, model architecture, size,
optimizer, and regularizers. In this work, we measure the capacity of models to fit data under realistic
training loops, and we examine the effects of various features like architectures and optimizers on the
number of training samples a model can fit in practice. Our findings are summarized as follows:
• The optimizers typically used for training neural networks often find minima where the
model can only perfectly fit training sets with far fewer samples than model parameters.
This observation calls into question whether we actually find overfitting local minima in
practice, contrary to conventional wisdom.
• Convolutional architectures (CNNs) are known to generalize better than multi-layer percep-
trons on computer vision problems due to their strong inductive bias for spatial relationships
and locality. However, we find that CNNs are actually more parameter efficient on randomly
labeled data as well, indicating that their superior capacity to fit data does not result from
superior generalization alone.
• The ability of a neural network to fit many more correctly labeled samples than incorrectly
labeled samples is predictive of generalization.
• ReLU activation functions enable fitting more training samples than sigmoidal activations
after successfully finding minima using models with each activation function, even though
ReLU nonlinearities were introduced to neural networks to prevent vanishing and exploding
gradients in deep neural networks with many layers.
• SGD is thought to have a regularizing effect that improves generalization, yet we find that
SGD actually enables fitting more training samples than full-batch gradient descent.
2 Related Work
Approximation theory. A primary area of early deep learning theory focused on upper bounding the
number of parameters or neurons required to well-approximate functions in a particular class, for
example uniform approximation of continuous functions on a compact domain [4]. Such approxima-
tion theories typically focus on arbitrary compact sets or data on a well-behaved manifold [4, 8]. The
resulting upper bounds are often proved constructively, and the constructions may be specific to a
particular neural network architecture, often very shallow networks with only a few layers, limiting
their generality. We focus on neural network flexibility on the training set, empirically measuring
the parameters needed to fit real data in practice, rather than theoretical bounds. This methodology
allows us to try any architecture or to inspect the influence of optimizers, and it measures quantities
that actually impact neural networks.
Overparameterized neural networks and generalization. Early generalization theories predicted
that highly constrained models which fit their training data yet fail to fit randomly labeled data (i.e.
have low Rademacher complexity or VC-dimension) can generalize to new unseen test data [9, 10].
However, these theories fail to account for the exceptional generalization behavior of neural networks
since they are often highly flexible and overparameterized, leading to vacuous error bounds [1].
Recent work on PAC-Bayes generalization theory explains that highly flexible and overparameterized
models can generalize well as long as they assign disproportionate prior mass to parameter vectors
which fit the training data [3, 11, 12]. Related empirical works explain why neural network inductive
bias and consequently generalization can actually benefit from overparametrization [13, 14, 15]. We
will see in our own experiments that whereas the Rademacher complexity of neural networks is
extremely high, they can fit many more correctly labeled samples than randomly labeled ones in
practice, and this gap predicts generalization. Nakkiran et al. [16] use the data-fitting capacity of
neural networks to understand the double-descent phenomenon. They train networks with many
fewer or many more parameters than the number of samples they can fit, and they study the impact
of such over- and underparameterization on generalization. In contrast, we are interested in what
influences that capacity to fit data itself.
3 Preliminaries
Quantifying capacity. While it is straightforward to determine the number of samples a linear
regression model can fit by counting its parameters, neural networks present a more complicated story.
Our goal is to measure the neural network’s capacity to fit real data using realistic training routines.
This metric should satisfy three essential criteria: (1) it must measure the real-world capacity to fit
data, enabling us to evaluate the effects of optimizers and regularizers; (2) it should be sensitive to
the training dataset, meaning it should reflect the capacity to fit different types of data or data with
specific labeling characteristics; and (3) it must be feasible to compute.
To that end, we adopt the Effective Model Complexity (EMC) metric [16], which estimates the largest
sample size that a model can perfectly fit. We apply this metric across various data types, including
those with random or semantic labels or even random inputs.
Calculating EMC involves an iterative approach for each network size. Initially, we train the model on
a small number of samples. If it achieves 100% training accuracy after training, we re-initialize and
train on a larger set of randomly chosen samples. We iteratively perform this process, incrementally
increasing the sample size each time until the model no longer fits all training samples perfectly. The
largest sample size where the model still achieves perfect fitting is taken as the network’s EMC. It is
important to note that the initialization and data subsets we sample on each iteration are independent
of those from previous iterations, ensuring that our capacity evaluation remains unbiased. In cases where the network does not reach 100% training accuracy, we re-run training three more times with different random seeds to ensure that the inability to fit all samples is not a fluke. We also repeated all analyses with a relaxed requirement that the network fit only 98% of its training data, which did not significantly change the results.
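A minimal sketch of this search procedure is shown below; the helper arguments (`make_model`, `train_to_convergence`) are placeholders standing in for our training pipeline rather than the actual code.

```python
import random

def estimate_emc(make_model, train_to_convergence, dataset, sizes, n_retries=3):
    """Sketch of the iterative EMC search described above. `make_model` builds a
    freshly initialized model; `train_to_convergence(model, subset)` should return
    True iff the model reaches 100% training accuracy on `subset`. Both are
    placeholders for the user's own training pipeline."""
    emc = 0
    for n in sizes:                              # increasing candidate sample sizes
        def attempt():
            idx = random.sample(range(len(dataset)), n)       # fresh random subset
            return train_to_convergence(make_model(), [dataset[i] for i in idx])
        # one attempt, plus extra retries (new seeds and subsets) if it fails
        if not any(attempt() for _ in range(1 + n_retries)):
            break                                # first size the model cannot fit
        emc = n                                  # largest size fit perfectly so far
    return emc
```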
While it is possible to artificially prevent models from fitting their training set by under-training,
confounding any study of capacity to fit data, we ensure that all training runs reach a minimum of the
loss function by imposing three conditions: first, the norm of the gradients across all samples must
fall below a pre-defined threshold; second, the training loss should stabilize; third, we check for the
absence of negative eigenvalues in the loss Hessian to confirm that the model has indeed reached a
minimum rather than a saddle point. In Appendix A.3, we detail our method for computing the EMC
as well as how we enforce the above three conditions.
In contrast to Nakkiran et al. [16], we validate that each model reaches a minimum during optimization
by ensuring the absence of negative eigenvalues of the loss Hessian. This step is important given
that we train models of various architectures and sizes, so we want to prevent under-training from
being a confounding variable.
Underparameterization and overparameterization. Linear models are described as underparame-
terized when they have fewer parameters than training samples and overparameterized when they
have more parameters than training samples. This threshold determines when a linear regression
model can fit any labeling of its data, and it often coincides with the transition to strict convexity
when a linear model has a unique optimal parameter vector. Neural networks behave differently than
linear regression models; their loss function is non-convex and can have multiple minima even when
training sets are large. Moreover, it is unclear exactly how many parameters a neural network needs
to fit its training set in practice. We will use EMC to investigate the latter quantity.
The differences between capacity, flexibility, expressiveness, and complexity. These terms are used
in numerous ways, sometimes interchangeably and sometimes distinctly. For example, Rademacher
complexity and VC-dimension are notions of complexity typically associated with flexibility, whereas
the PAC-Bayes notion of complexity is information-theoretic and instead measures compression.
Expressiveness can be used to describe the breadth of an entire hypothesis class, that is, all the
functions that a model can express across all possible parameter settings. Approximation theories
measure the expressiveness of a hypothesis class by the existence of elements of this class which well-
approximate functions of a specified type. We will abstain from using the terms “expressiveness” and
“complexity” when describing EMC to avoid confusion, and we will use “capacity” and “flexibility”
when referring to a model’s ability to fit data in practice.
Factors influencing the EMC. Unlike VC-dimension or expressiveness concepts in approximation
theories, EMC depends not only on the hypothesis class but on every aspect of neural network
training, from optimizers and regularizers to the specific parameterization induced by the model’s
architecture. Choices in architectural design and training algorithms influence the loss surface
geometry, thereby affecting the accessibility of certain solutions.
4 Experimental Setup
We conduct a comprehensive dissection of the factors influencing neural network flexibility. To this
end, we consider a variety of datasets, architectures, and optimizers.
4.1 Datasets
We conduct experiments on a variety of datasets, including vision datasets like MNIST [17], CIFAR-
10, CIFAR-100 [18], and ImageNet [19], as well as tabular datasets like Forest Cover Type [20],
Adult Income [21], and the Credit dataset [22]. Due to the small size of these datasets, we also use
larger synthetic datasets. These are generated using the Efficient Diffusion Training via Min-SNR
Weighting Strategy [23], yielding diverse ImageNet-quality samples at a resolution of 128 × 128.
Specifically, we create ImageNet-20MS, containing 20 million samples across ten classes. Unless
otherwise specified, the main text describes results on ImageNet-20MS, while the appendix contains
results on additional datasets. We omit data augmentations to avoid confounding effects.
4.2 Models
We evaluate the flexibility of diverse architectures, including Multi-Layer Perceptrons (MLPs), CNNs
like ResNet [24] and EfficientNet [25], and Vision Transformers (ViTs) [26]. We systematically
adjust the width and depth of these architectures. For MLPs, we either increase the width by adding
neurons per layer while keeping the number of layers constant or increase the depth by adding
more layers while keeping the number of neurons per layer constant. For naive CNNs, we employ
multiple convolutional layers followed by a constant-sized fully connected layer, varying either the
number of filters per layer or the total number of layers. For ResNets, we scale either the number of
filters or the number of blocks (depth). In ViTs, we scale the number of encoder blocks (depth), the
dimensionality of patch embeddings, and self-attention (width). By default, we scale the width to
control the parameter count unless stated otherwise.
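As an example of the width and depth scaling described above, the PyTorch sketch below builds MLPs of configurable width and depth; it is a simplified stand-in rather than the exact architectures we train.

```python
import torch.nn as nn

def make_mlp(in_dim, n_classes, width, depth):
    """Simplified sketch of width/depth scaling: `depth` hidden layers with
    `width` neurons each (not the exact architecture used in our experiments)."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, n_classes))
    return nn.Sequential(*layers)

# Scale width at fixed depth ...
wide_mlp = make_mlp(3 * 128 * 128, 10, width=4096, depth=2)
# ... or depth at fixed width.
deep_mlp = make_mlp(3 * 128 * 128, 10, width=512, depth=8)
```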
4.3 Optimizers
We employ several optimizers, including Stochastic Gradient Descent (SGD), Adam [27], AdamW
[28], full-batch Gradient Descent (GD), and the second-order Shampoo optimizer [29]. These choices
let us examine how features like stochasticity and preconditioning influence the minima. To ensure
effective optimization across datasets and model sizes, we carefully tune the learning rate and batch
size for each setup, omitting weight decay in all cases. Further details about our hyperparameter
tuning are provided in Appendix A.2. By default, we use SGD.
In this section, we dissect how data properties shape neural network flexibility and how this behavior
can predict generalization.
Analysis of diverse datasets. We initiate our analysis by measuring the EMC of neural networks
across various datasets and modalities. We scale a 2-layer MLP by modifying the width of the hidden
layers and a CNN by modifying the number of layers and channels, and we train models on a range
of image classification (MNIST, CIFAR-10, CIFAR-100, ImageNet) and tabular (CoverType, Income,
and Credit) datasets. The results reveal significant disparities in the EMC of networks trained on
different data types (see Figure 1 (Left)). For instance, networks trained on tabular datasets exhibit
higher capacity. Among image classification datasets, we observe a strong correlation between test
accuracies and capacity. Notably, MNIST (where models achieve more than 99% test accuracy)
yields the highest EMC, whereas ImageNet shows the lowest, pointing to the relationship between
generalization and the data-fitting capability.
Considering the variety of datasets and network architectures and the myriad differences in their
EMC, the subsequent sections will explore the underlying causes of these variations. Our goal is to
identify the distinct factors in the data and architectures that contribute to these observed differences
in network flexibility.
Figure 1: Left: easier tasks tend to have higher EMC. EMC across datasets and data modalities.
The tabular datasets (Forest, Income, CoverType), which are easier to learn, have the highest EMC
compared to vision datasets. The dashed black line is the diagonal. ImageNet is the hardest dataset to
learn. Right: the difference in EMC on the original and random labels predicts generalization.
EMC improvement as a function of the parameter count for CIFAR-100.
We next analyze the inductive biases of different architectures and how factors like spatial structure
influence the ability of a model to fit its training data. To this end, we alter inputs and labels and measure
the resulting effects. We adjust the width of MLPs and 2-layer CNNs by varying the number of
neurons (MLPs) or filters (CNNs) in each layer, and we train them on ImageNet-20MS. We evaluate
EMC as a function of the model’s parameter count in four scenarios: semantic labels, random labels,
random inputs, and inputs under a fixed random permutation. In the case of random labels, we
maintain the input but sample the class labels randomly. For random inputs, we replace the original
inputs with Gaussian noise, while for the permuted input, we use the same fixed permutation for all
the images, breaking the spatial structure in the data.
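The sketch below illustrates how such dataset variants can be constructed (illustrative code, not our data pipeline); the function and argument names are placeholders.

```python
import torch

def make_variant(images, labels, mode, n_classes, seed=0):
    """Illustrative sketch of the four data variants described above. `images` is
    an (N, C, H, W) tensor and `labels` an (N,) tensor of class indices."""
    g = torch.Generator().manual_seed(seed)
    if mode == "random_labels":          # keep inputs, resample labels uniformly
        labels = torch.randint(0, n_classes, labels.shape, generator=g)
    elif mode == "random_inputs":        # replace inputs with Gaussian noise
        images = torch.randn(images.shape, generator=g)
    elif mode == "permuted_inputs":      # one fixed pixel permutation for all images
        perm = torch.randperm(images[0].numel(), generator=g)
        images = images.flatten(1)[:, perm].reshape(images.shape)
    return images, labels                # mode == "semantic" leaves data unchanged
```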
Figure 2: CNNs fit more semantically labeled samples than they have parameters due to their
superior image classification inductive bias, whereas MLPs cannot. EMC as a function of the
number of parameters for semantic labels vs. random input and labels for MLPs (a) and CNNs (b).
Experiments performed on ImageNet-20MS. Error bars represent one standard error over 5 trials.
(a) More classes make fitting data harder with semantic labels but easier with random ones. (b) SGD
and Shampoo are better for fitting with the original labels but not with random ones.
Figure 3: The effect of the number of labels and optimizers on capacity. Average logarithm of
EMC across different model sizes of CNNs on CIFAR-100 for original and random labels, varying
numbers of classes (a) and different optimizers (b). Error bars are standard error over 5 trials.
A linear model with $d$ parameters can fit any labeling of at most $d$ samples, yet if the labels are a
linear function of the inputs, then the model can fit infinitely many samples. In Figure 2, assigning random labels
instead of real ones allows us to explore an analogous notion of the boundary between over- and
under-parameterization, but in the context of neural networks. We see here that the networks fit
significantly fewer samples when assigned random labels compared to the original labels, indicating
that neural networks are less parameter efficient than linear models in this setting. Like linear models,
the amount of data they can fit appears to scale linearly in their parameter count.
The effect of high-dimensional data. Linear models exhibit increased capacity when adding more
features, primarily because their parameter count directly scales with the feature count. However,
the dynamics shift when examining CNNs. In our setup, we avoid adding parameters as the data
dimensionality increases by employing average pooling prior to the classification head, a standard
technique for CNNs. We investigate the EMC using ImageNet-20MS, systematically resizing
input images to vary their spatial dimensions from 16 × 16 to 256 × 256.
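The sketch below illustrates why the parameter count stays fixed across input resolutions when global average pooling precedes the classification head (a simplified toy CNN, not our experimental architecture).

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Toy sketch: with global average pooling before the classifier, the
    parameter count is independent of the input resolution."""
    def __init__(self, n_classes=10, width=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)     # (B, width, H, W) -> (B, width, 1, 1)
        self.head = nn.Linear(width, n_classes)

    def forward(self, x):
        return self.head(self.pool(self.features(x)).flatten(1))

model = TinyCNN()
print(sum(p.numel() for p in model.parameters()))   # same count for any resolution
for res in (16, 64, 256):
    assert model(torch.zeros(1, 3, res, res)).shape == (1, 10)
```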
In contrast to linear models, we find in Appendix Figure 17 that CNNs, which do not benefit from
additional parameters as the input dimensionality increases, can actually fit more semantically labeled
data in lower spatial dimensions. This trend underscores a broader narrative in neural networks:
CNNs, despite their intricate architectures and capacity for complex pattern recognition, tend to align
better with data of lower intrinsic dimension. This observation resonates with the findings of Pope et
al. [30], who find that CNNs generally showcase enhanced generalization capabilities with data of
lower intrinsic dimensionality.
The effect of the number of classes. In order to probe the influence of the number of classes on the
EMC, we randomly merge CIFAR-100 classes to artificially decrease the number of classes while still
preserving the size of the original dataset. We again consider a 2-layer CNN with various numbers of
filters, and consequently, parameters. In Figure 3a, we plot the average of the logarithm of the EMC
across different model sizes for various numbers of classes. We see that data with semantic labels
becomes harder and harder to fit as the number of classes increases, and generalization becomes
more challenging as the model has to encode more information about each sample in its weights. In
contrast, randomly labeled data is easier to fit as the number of classes increases because the model is
no longer forced to assign as many semantically different samples the same class label, which would
be at odds with the model’s inductive bias that prefers correct labels over random ones.
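A sketch of the merging procedure is given below (illustrative; the exact grouping scheme we use may differ): the original classes are shuffled and assigned round-robin to the reduced set of labels, so every sample is kept.

```python
import torch

def merge_classes(labels, n_original=100, n_merged=20, seed=0):
    """Illustrative sketch: randomly group the original classes into `n_merged`
    roughly equal-sized super-classes and relabel every sample accordingly."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(n_original, generator=g)         # shuffle class indices
    group_of = torch.empty(n_original, dtype=torch.long)
    group_of[perm] = torch.arange(n_original) % n_merged   # round-robin assignment
    return group_of[labels]                                # map each label to its group

# Example: collapse CIFAR-100 labels to 20 merged classes.
orig = torch.randint(0, 100, (8,))
print(orig, merge_classes(orig))
```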
To compare different datasets while controlling for properties like number of classes, we convert
several datasets into binary classification problems. This modification enables us to assess the impact
of the number of classes on EMC and isolate the effects of input distribution. Our results (Appendix
Figure 21) show that even though the EMC of image datasets increases in the binary classification
setting relative to the original labels, tabular datasets consistently demonstrate higher EMC.
Furthermore, significant differences persist among the different tabular datasets. These outcomes
suggest that additional factors, perhaps intrinsic to the datasets themselves, contribute to EMC beyond
the number of classes.
5.2 Predicting generalization
Neural networks exhibit a marked preference for fitting semantically coherent labels over random
ones, a tendency reflecting their inductive biases. This propensity, as depicted in Figure 1 (right),
underscores a broader principle: a network’s adeptness at fitting semantic labels compared to
random ones often correlates with its generalization. Interestingly, this generalization enables certain
architectures, like CNNs, to fit more samples than their parameter count might suggest, blurring the
boundaries of over- and under-parameterization.
This observation bridges two seminal perspectives on model generalization. Traditional machine
learning wisdom posits that high-capacity models tend to overfit, compromising their generalization
on new data—a notion reflected in early generalization bounds, which are vacuous for neural networks
[9, 10]. In contrast, PAC-Bayes theory proposes that a model’s flexibility doesn’t inherently impede
generalization, provided its prior assigns disproportionate mass to the true labels compared to random
ones, or in other words the model prefers correct labelings of the data to incorrect labelings [3]. Our
empirical findings relate these two theories, revealing an empirical relationship between a model’s
increased ability to fit correct labels over random ones and its generalization.
Specifically, we compute the EMC for various CNN and MLP configurations on both correctly and
randomly labeled data. We measure the percent increase in EMC when models encounter semantic
labels versus random ones, effectively gauging their practical capacity to fit data that aligns with
natural label distributions.
The notable inverse correlation between this metric and the generalization gap (Pearson correlation
coefficient of −0.9281 for CNNs and −0.869 for MLPs), as illustrated in Figure 1 (Right), not only
confirms the theoretical underpinnings of generalization but also illuminates the practical implications
of these theories.
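The metric itself reduces to a few lines; the sketch below uses placeholder numbers (not our measurements) purely to show how the percent increase and its correlation with the generalization gap are computed.

```python
import numpy as np

# Placeholder values, one entry per model configuration (not our measurements).
emc_semantic = np.array([1.2e5, 3.0e5, 8.5e5, 2.1e6])   # EMC on original labels
emc_random   = np.array([4.0e4, 1.1e5, 3.2e5, 9.0e5])   # EMC on random labels
gen_gap      = np.array([1.1, 0.8, 0.5, 0.3])           # test minus train loss

pct_increase = 100.0 * (emc_semantic - emc_random) / emc_random
r = np.corrcoef(pct_increase, gen_gap)[0, 1]             # expected to be strongly negative
print(f"Pearson r = {r:.3f}")
```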
(a) ResNet-RS is the most efficient among scaling strategies we test. (b) CNNs are far more
parameter-efficient, even on randomly labeled data.
Figure 4: The effect of the scaling strategy and the architecture on the EMC. (a) Scaling laws
for the EMC as a function of parameter count for CNNs. (b) Average logarithm of EMC across
parameter counts for different architectures using original and random labels. On ImageNet-20MS.
Error bars represent one standard error over 5 trials.
Scaling strategies. We measure the EMC of ResNets and ViTs under various scaling configurations. For ResNets, these include increasing width (number of filters),
increasing depth, or increasing both width and depth according to two scaling laws: EfficientNet [25]
and ResNet-RS [36]. EfficientNet uses a balanced approach, scaling depth, width, and resolution
simultaneously with fixed coefficients. ResNet-RS adapts scaling based on model size, training
duration, and dataset size. For scaling ViTs, we use the SViT approach [37], SoViT [38], and also
try scaling the number of encoder blocks (depth) and the dimensionality of patch embeddings and
self-attention (width) separately.
Our analysis reveals that, although not initially crafted for optimizing capacity, specially designed
scaling laws perform well in this respect. Furthermore, consistent with earlier theoretical analyses
[39], our findings affirm that scaling depth is more parameter-efficient than scaling width. These
parameter-efficiency comparisons also hold on randomly labeled data, indicating that they are not an
artifact of generalization.
Activation functions. Nonlinear activation functions are crucial for neural network capacity because
without them, neural networks are just large factorized linear models. In this subsection, we examine
the effect of the activation functions on capacity, contrasting them with linear models.
Detailed in Appendix Figure 16, our findings show that ReLU functions significantly enhance capacity.
Though initially integrated to mitigate vanishing and exploding gradients, ReLU also boosts the
network’s data-fitting ability, likely by improving generalization. In contrast, tanh and identity
activations do not achieve similar effects, even though we are able to find minima with these activation
functions as well. We note the latter fact to rule out the possibility that ReLUs boost capacity merely by
making minima easier to find.
The choice of optimization technique and regularization strategy is crucial in neural network training.
This choice affects not only training convergence but also the nature of the solutions found. This
section explores the role different optimization and regularization techniques play in a network’s
flexibility.
Comparing optimizers. We explore the influence of various optimizers, including SGD, full-batch
Gradient Descent, Adam [27], AdamW [28], and Shampoo [40].
Whereas previous works suggest that SGD has a strong flatness-seeking regularization effect [41], we
find in Figure 3b that SGD also enables fitting more data than full-batch (non-stochastic) training,
fitting a volume of data comparable to that of the second-order Shampoo optimizer. This experiment, namely the
variety of EMC measurements across optimizers, demonstrates that optimizers differ not only in the
rate at which they converge but also in the types of minima they find. Repeating this experiment with
random labels shows that the higher EMC of SGD and Shampoo evaporates, indicating that their
greater ability to fit data may be related to their superior generalization.
Regularizers. Classical machine learning systems employed regularizers designed to reduce capacity.
For example, ridge regression applies a penalty on the parameter norm, improving performance of
overparameterized linear models [42]. Similarly, XGBoost penalizes the sum of squared leaf weights
to prevent overfitting [43]. Modern deep learning pipelines use various regularization techniques to
improve generalization. We now examine if these regularizers also reduce the model’s capacity to
fit data. We previously found that stochastic training, which enhances generalization and provides
implicit regularization, actually increases EMC.
In Appendix Figure 15, we compute the EMC of a CNN trained on ImageNet-20MS using Sharpness-
Aware Minimization (SAM) [44], weight decay, and label smoothing [45]. Weight decay and label
smoothing limit capacity, but SAM improves generalization without reducing capacity, even on
randomly labeled data. Label smoothing modifies the loss function, so a model trained with the
smoothed objective may not find minima of the original non-smoothed loss. In contrast, SAM does
not change the loss function itself but finds different types of minima than SGD, which generalize
better at no capacity cost.
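For reference, a simplified single-batch SAM update is sketched below (a bare-bones version of [44], without per-layer scaling; not the implementation we use): the weights are first perturbed toward the locally worst-case direction, and the base optimizer then steps using the gradient computed at the perturbed point.

```python
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    """Bare-bones SAM step (simplified from [44]; illustrative only)."""
    base_opt.zero_grad()
    loss = loss_fn(model(x), y)              # first pass: gradient at current weights
    loss.backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12

    with torch.no_grad():                    # ascend to w + rho * g / ||g||
        for p, g in zip(params, grads):
            p.add_(rho * g / norm)

    base_opt.zero_grad()
    loss_fn(model(x), y).backward()          # second pass: gradient at perturbed weights

    with torch.no_grad():                    # undo the perturbation, then update
        for p, g in zip(params, grads):
            p.sub_(rho * g / norm)
    base_opt.step()
    return loss.item()
```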
9 Discussion
Our findings show that parameter counting alone is not a useful tool for determining the number of
samples a neural network can fit, or the boundary between underparameterization and overparam-
eterization. Instead, many factors contribute to the effective model complexity, including virtually
all components of a training routine as well as the data itself. Moreover, we must re-evaluate our
understanding of why these components work. We saw that architectural components like ReLU
activation functions may solve additional problems that they weren’t designed for, and stochastic
optimization, for example, actually finds minima where we fit more training samples, contrasting
with conventional views of implicit regularization. Finally, our results suggest neural networks are
often parameter-wasteful, and new parameterizations might improve efficiency.
Acknowledgements
This work is supported by NSF CAREER IIS-2145492, NSF CDS&E-MSS 2134216, NSF HDR-
2118310, BigHat Biosciences, Capital One, and an Amazon Research Award.
References
[1] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding
deep learning requires rethinking generalization. In International Conference on Learning
Representations, 2016.
[2] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio,
Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A
closer look at memorization in deep networks. In International conference on machine learning,
pages 233–242. PMLR, 2017.
[3] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds
for deep (stochastic) neural networks with many more parameters than training data. arXiv
preprint arXiv:1703.11008, 2017.
[4] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are
universal approximators. Neural networks, 2(5):359–366, 1989.
[5] Andrew R Barron. Neural net approximation. In Proc. 7th Yale workshop on adaptive and
learning systems, volume 1, pages 69–72, 1992.
[6] Hrushikesh N Mhaskar and Tomaso Poggio. Deep vs. shallow networks: An approximation
theory perspective. Analysis and Applications, 14(06):829–848, 2016.
[7] Micah Goldblum, Jonas Geiping, Avi Schwarzschild, Michael Moeller, and Tom Goldstein.
Truth or backpropaganda? an empirical investigation of deep learning theory. In International
Conference on Learning Representations, 2020.
[8] Uri Shaham, Alexander Cloninger, and Ronald R Coifman. Provable approximation properties
for deep neural networks. Applied and Computational Harmonic Analysis, 44(3):537–557,
2018.
[9] Vladimir Vapnik. Principles of risk minimization for learning theory. Advances in neural
information processing systems, 4, 1991.
[10] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds
and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[11] Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, and An-
drew G Wilson. Pac-bayes compression bounds so tight that they can explain generalization.
Advances in Neural Information Processing Systems, 35:31459–31473, 2022.
[12] Sanae Lotfi, Marc Finzi, Yilun Kuang, Tim GJ Rudner, Micah Goldblum, and Andrew Gor-
don Wilson. Non-vacuous generalization bounds for large language models. arXiv preprint
arXiv:2312.17173, 2023.
[13] W Ronny Huang, Zeyad Emam, Micah Goldblum, Liam Fowl, JK Terry, Furong Huang,
and Tom Goldstein. Understanding generalization through visualizations. arXiv preprint
arXiv:1906.03291, 2019.
[14] Ping-yeh Chiang, Renkun Ni, David Yu Miller, Arpit Bansal, Jonas Geiping, Micah Goldblum,
and Tom Goldstein. Loss landscapes are all you need: Neural network generalization can
be explained without the implicit bias of gradient descent. In The Eleventh International
Conference on Learning Representations, 2022.
[15] Wesley J Maddox, Gregory Benton, and Andrew Gordon Wilson. Rethinking parameter counting
in deep models: Effective dimensionality revisited. arXiv preprint arXiv:2003.02139, 2020.
[16] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever.
Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics:
Theory and Experiment, 2021(12):124003, 2021.
[17] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE
Signal Processing Magazine, 29(6):141–142, 2012.
[18] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.
2009.
[19] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-
scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern
recognition, pages 248–255. Ieee, 2009.
[20] Jock A. Blackard and Denis J. Dean. Comparative accuracies of artificial neural networks and
discriminant analysis in predicting forest cover types from cartographic variables. Computers
and Electronics in Agriculture, 24(3):131–151, 1999.
[21] Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI:
https://doi.org/10.24432/C5XW20.
[22] Kaggle. Credit card dataset. 2021. Kaggle dataset.
[23] Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining
Guo. Efficient diffusion training via min-snr weighting strategy. 2023.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 770–778, 2016.
[25] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural
networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
[26] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.
An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929, 2020.
[27] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International
Conference on Learning Representations, 2015.
[28] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International
Conference on Learning Representations, 2018.
[29] Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. Towards practical
second order optimization for deep learning, 2021.
[30] Phil Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic
dimension of images and its impact on learning. In International Conference on Learning
Representations, 2020.
[31] Stéphane d’Ascoli, Hugo Touvron, Matthew L Leavitt, Ari S Morcos, Giulio Biroli, and Levent
Sagun. Convit: Improving vision transformers with soft convolutional inductive biases. In
International Conference on Machine Learning, pages 2286–2296. PMLR, 2021.
[32] Badri N Patro and Vijay Agneeswaran. Efficiency 360: Efficient vision transformers. arXiv
preprint arXiv:2302.08374, 2023.
[33] José Maurício, Inês Domingues, and Jorge Bernardino. Comparing vision transformers and
convolutional neural networks for image classification: A literature review. Applied Sciences,
13(9), 2023.
[34] Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Uday Prabhu, Gowthami
Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, et al. Battle
of the backbones: A large-scale comparison of pretrained models across computer vision
tasks. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and
Benchmarks Track, 2023.
[35] Chenglong Bao, Qianxiao Li, Zuowei Shen, Cheng Tai, Lei Wu, and Xueshuang Xiang. Ap-
proximation analysis of convolutional neural networks. work, 65, 2014.
[36] Irwan Bello, William Fedus, Xianzhi Du, Ekin Dogus Cubuk, Aravind Srinivas, Tsung-Yi Lin,
Jonathon Shlens, and Barret Zoph. Revisiting resnets: Improved training and scaling strategies.
In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural
Information Processing Systems, 2021.
[37] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transform-
ers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 12104–12113, 2022.
[38] Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, and Lucas Beyer. Getting vit
in shape: Scaling laws for compute-optimal model design. arXiv preprint arXiv:2305.13035,
2023.
[39] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In
Conference on learning theory, pages 907–940. PMLR, 2016.
[40] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor
optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR,
2018.
[41] Jonas Geiping, Micah Goldblum, Phil Pope, Michael Moeller, and Tom Goldstein. Stochastic
training is not necessary for generalization. In International Conference on Learning Represen-
tations, 2021.
[42] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics, 12(1):55–67, 1970.
[43] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of
the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages
785–794, 2016.
[44] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware min-
imization for efficiently improving generalization. In International Conference on Learning
Representations, 2020.
[45] Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help?
Advances in neural information processing systems, 32, 2019.
[46] Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W Mahoney. Pyhessian: Neural
networks through the lens of the hessian. In 2020 IEEE international conference on big data
(Big data), pages 581–590. IEEE, 2020.
A Appendix
Here, we present figures that include additional datasets and labelings, as well as detailed results
across all parameter counts, rather than just the aggregated averages shown in the main body. In the
main paper, for the ViT scaling laws, we followed the scaling approach proposed by [37] (SViT),
which advocates for simultaneously and uniformly scaling all aspects—depth, width, MLP width, and
patch size. Additionally, we employed both SoViT, as per [38], and approaches where the number
of encoder blocks (depth) and the dimensionality of patch embeddings and self-attention (width)
in the ViT are scaled separately. Figure 5 in the Appendix demonstrates that scaling each dimension
independently can lead to suboptimal results, aligning with our observations from the EfficientNet
experiments. Furthermore, it shows that SoViT yields results that are slightly different from those
obtained using the laws from [37].
Figure 5: Scaling laws - EMC as a function of the number of parameters for randomly labeled
ImageNet-20MS for VIT
Figure 6: Scaling laws - EMC as a function of the number of parameters for randomly labeled
ImageNet-20MS.
Figure 7: Scaling laws - EMC as a function of the number of parameters for a CNN on ImageNet-
20MS with original labels.
Figure 8: Scaling laws - EMC as a function of the number of parameters for a CNN on CIFAR-10
with original labels.
Figure 9: EMC as a function of the number of parameters across different activation functions
using CNNs on ImageNet-20MS with original labels.
Figure 10: EMC as a function of the number of parameters across different activation functions
using CNNs and ImageNet-20MS with random labels.
Figure 11: SGD and Shampoo fit more training data - EMC across different optimizers using
CNNs on CIFAR-10.
Figure 12: EMC as a function of the number of parameters across different optimizers with
CNNs on ImageNet-20MS with original labels.
Figure 13: EMC as a function of the number of parameters across different optimizers with
CNNs on ImageNet-20MS with random labels.
Figure 14: EMC as a function of the number of parameters across different regularizers on
ImageNet-20MS with random labels.
Figure 15: SAM has better generalization at no capacity cost - Average logarithm of EMC over
different model sizes for SAM, weight decay, and label smoothing using CNNs on ImageNet-20MS.
Figure 16: ReLU networks exhibit higher flexibility. EMC as a function of the number of
parameters across different activation functions for original labels (left) and for random ones (right)
on ImageNet-20MS.
Figure 17: High-dimensional data is harder to fit. Average logarithm of EMC across different
model sizes for original and random labels varying input sizes for CNN architectures on CIFAR-100.
Figure 18: EMC as a function of the number of parameters for semantic labels, random labels,
random inputs, and permuted inputs (two panels).
Figure 19: Compression improves network efficiency - Average logarithm of EMC over different
model sizes and compression methods. CNNs on ImageNet-20MS.
Figure 20: EMC as a function of the number of parameters for the original training, subspace
training, and quantization (two panels).
Figure 21: EMC as a function of the number of parameters for datasets converted to binary
classification.
A.2 Hyperparameter Tuning
Unless otherwise mentioned, hyperparameter tuning was conducted over the following hyperparameters:
batch size, with values in [32, 64, 128, 256]. For the Stochastic Gradient Descent (SGD) optimizer,
we used an initial learning rate selected by grid search between 0.001 and 0.01 with cosine annealing.
For the Adam and AdamW optimizers, the learning rate was chosen by grid search between 1e-5 and 1e-2.
For other hyperparameters, we adhere to the standard PyTorch recipes.
A.3 Computing the EMC
To compute the Effective Model Complexity (EMC), we adopt an iterative approach for each network
size. Initially, we start with a small number of samples and train the model. Post-training, we verify
if the model has perfectly fit all the samples by achieving 100% training accuracy. If this criterion is
met, we re-initialize the model with a random initialization and train it again on a larger number of
samples, randomly drawn from the full dataset. This process is iteratively performed, increasing the
number of samples in each iteration, until the model fails to perfectly fit all the training samples. The
largest sample size where the model achieves a perfect fit is taken as the Effective Model Complexity
for that particular network size. It is important to note that data is sampled independently on each
iteration.
While it is possible to artificially prevent models from fitting their training set by under-training, thus
confounding any study of capacity to fit data, we ensure that all training runs reach a minimum of the
loss function by imposing three conditions:
First, the norm of the gradients across all samples must fall below a pre-defined threshold. We
observed that there is a high variance in the norms of the gradients between different networks;
therefore, we set this threshold manually after checking the norms for each network type when
training with a small number of samples, where it’s clear that the networks fit perfectly and converge
to a minimum.
Second, the training loss should stabilize. To ensure this, we stipulate that the average loss should not
decrease for 10 consecutive epochs.
Third, we check for the absence of negative eigenvalues in the loss Hessian to confirm that the
model has indeed reached a minimum rather than a saddle point. To do this, we calculate the
eigenvalues using the PyHessian Python package [46] and validate that after training converges, there
are no eigenvalues smaller than -1e-2. This threshold was chosen after examining the eigenvalue
distributions of different networks that fit perfectly.
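A sketch of how these three checks can be combined is shown below; the thresholds and the restriction to the top few Hessian eigenvalues are simplifications of our actual protocol, and `loader`/`loss_history` are placeholders for the training loop's state.

```python
import torch
from pyhessian import hessian   # PyHessian [46]

def converged(model, loss_fn, loader, loss_history,
              grad_tol=1e-3, plateau_epochs=10, eig_tol=-1e-2):
    """Simplified sketch of the three convergence checks described above."""
    # (1) gradient norm over the whole training set below a threshold
    model.zero_grad()
    for x, y in loader:
        loss_fn(model(x), y).backward()      # gradients accumulate across batches
    grad_norm = torch.sqrt(sum((p.grad ** 2).sum()
                               for p in model.parameters() if p.grad is not None))
    # (2) training loss has stabilized: no decrease over the last `plateau_epochs`
    plateaued = (len(loss_history) > plateau_epochs and
                 min(loss_history[-plateau_epochs:]) >= loss_history[-plateau_epochs - 1])
    # (3) no sufficiently negative Hessian eigenvalues (estimated with PyHessian)
    x, y = next(iter(loader))
    eigs, _ = hessian(model, loss_fn, data=(x, y), cuda=False).eigenvalues(top_n=5)
    return bool(grad_norm < grad_tol) and plateaued and min(eigs) > eig_tol
```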
Our experiments were conducted using NVIDIA Tesla V100 GPUs with 32GB memory each for
model training and evaluation. The total compute time for the entire set of experiments was ap-
proximately 3000 GPU hours. All experiments were run on NYU's cluster managed with SLURM,
ensuring efficient resource allocation and job scheduling. This setup allowed us to handle the exten-
sive computational demands of training large neural network models and conducting comprehensive
evaluations.
B Broader Impacts
Our research on the capacity of neural networks to fit data more efficiently has several important
implications. Positively, our findings could lead to more efficient AI models, which would benefit
various applications by making these technologies more accessible and effective. By understanding
how neural networks can be more efficient, we can also reduce the environmental impact associated
with training large models.
However, there are potential negative impacts as well. Improved neural network capabilities might
be used in ways that invade privacy, such as through enhanced surveillance or unauthorized data
analysis. Additionally, as AI technologies become more powerful, it is essential to consider ethical
implications, fairness, and potential biases in their development and use.
To address these concerns, our paper emphasizes the importance of responsible AI practices. We
encourage transparency, ethical considerations, and ongoing research into the societal impacts of
advanced machine learning technologies to ensure they are used for the greater good.
C Limitations
Our study has several limitations that should be considered when interpreting the results. First, the
datasets used in our experiments, while diverse, may not fully represent the wide variety of data
encountered in practical applications. This could introduce biases and limit the generalizability of our
findings. Second, our experiments are constrained by the available computational resources. While
we used NVIDIA Tesla V100 GPUs with 32GB memory, the total compute time was approximately
3000 GPU hours. This limitation restricted the scale and number of experiments we could perform,
potentially affecting the robustness of our conclusions.
Furthermore, our analysis primarily focuses on certain types of neural network architectures, such as
CNNs, MLPs, and ViTs. While these are common and widely used, there are many other architectures
that we did not explore. The impact of different training procedures, regularization techniques, and
hyperparameter choices on the EMC might vary with other architectures.
Additionally, we decided to test a wide range of factors affecting neural network flexibility but only
explored a limited number of settings for each factor, rather than delving deeply into any single factor.
This broad but shallow exploration might miss deeper insights that a more focused study could reveal.
Lastly, our method of measuring EMC, while rigorous, relies on specific criteria for determining
when a model has perfectly fit its training data. These criteria include achieving 100% training
accuracy and the absence of negative eigenvalues in the loss Hessian. Different criteria might yield
slightly different EMC values, and this should be taken into account when applying our findings to
other contexts.
Despite these limitations, we believe our study provides valuable insights into the factors influencing
neural network flexibility and highlights areas for further research.