0% found this document useful (0 votes)

13 views22 pages

Just How Flexible Are Neural Networks in Practice

Uploaded by

Anto Diaz-Cano

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

13 views22 pages

Just How Flexible Are Neural Networks in Practice

Uploaded by

Anto Diaz-Cano

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

Just How Flexible are Neural Networks in Practice?

Ravid Shwartz-Ziv∗ Micah Goldblum∗ Arpit Bansal

New York University New York University University of Maryland
[email protected]

C. Bayan Bruss Yann LeCun Andrew Gordon Wilson

arXiv:2406.11463v1 [cs.LG] 17 Jun 2024

Capital One New York University New York University

Meta AI, FAIR

Abstract

It is widely believed that a neural network can fit a training set containing at least
as many samples as it has parameters, underpinning notions of overparameterized
and underparameterized models. In practice, however, we only find solutions
accessible via our training procedure, including the optimizer and regularizers,
limiting flexibility. Moreover, the exact parameterization of the function class,
built into an architecture, shapes its loss surface and impacts the minima we find.
In this work, we examine the ability of neural networks to fit data in practice.
Our findings indicate that: (1) standard optimizers find minima where the model
can only fit training sets with significantly fewer samples than it has parameters;
(2) convolutional networks are more parameter-efficient than MLPs and ViTs,
even on randomly labeled data; (3) while stochastic training is thought to have
a regularizing effect, SGD actually finds minima that fit more training data than
full-batch gradient descent; (4) the difference in capacity to fit correctly labeled and
incorrectly labeled samples can be predictive of generalization; (5) ReLU activation
functions result in finding minima that fit more data despite being designed to avoid
vanishing and exploding gradients in deep architectures.

1 Introduction
Neural networks are often assumed to be capable of fitting about as many samples as they have
parameters [1, 2, 3]. This intuition can be most easily understood through linear regression, where a
regressor with more coefficients than training samples forms an underdetermined linear system of
equations and can therefore precisely fit any function of the training points. For example, consider
that for any training set {(xi , yi )}ni=0 with n ≤ d, there exist parameters {ai } such that f (x) =
Pd j
j=0 aj x has f (xi ) = yi for all i as long as no two training points contain the same input but
assigned different labels.
The theory underlying neural networks is significantly more complicated. A variety of approximation
theories bound the number of parameters or hidden units required by a neural network architecture
to approximate a certain function class on its domain, which is typically infinite [4, 5, 6]. On finite
domains, namely a training set, overparameterized neural networks with many more parameters than
training samples can easily fit randomly labeled data, raising questions regarding how such flexible
models can still generalize to new unseen test data [1].
In this work, we step back and ask just how flexible neural networks really are in practice. Although
neural networks are theoretically capable of universal function approximation [4], in practice we train
models with limited capacity and only find optima during training that are accessible via our training
∗
Authors contributed equally.
procedure, often leading to significantly reduced flexibility as suboptimal local minima exist [7]. How
much data we can fit depends on factors like the nature of the data itself, model architecture, size,
optimizer, and regularizers. In this work, we measure the capacity of models to fit data under realistic
training loops, and we examine the effects of various features like architectures and optimizers on the
number of training samples a model can fit in practice. Our findings are summarized as follows:
• The optimizers typically used for training neural networks often find minima where the
model can only perfectly fit training sets with far fewer samples than model parameters.
This observation calls into question whether we actually find overfitting local minima in
practice, contrary to conventional wisdom.
• Convolutional architectures (CNNs) are known to generalize better than multi-layer percep-
trons on computer vision problems due to their strong inductive bias for spatial relationships
and locality. However, we find that CNNs are actually more parameter efficient on randomly
labeled data as well, indicating that their superior capacity to fit data does not result from
superior generalization alone.
• The ability of a neural network to fit many more correctly labeled samples than incorrectly
labeled samples is predictive of generalization.
• ReLU activation functions enable fitting more training samples than sigmoidal activations
after successfully finding minima using models with each activation function, even though
ReLU nonlinearities were introduced to neural networks to prevent vanishing and exploding
gradients in deep neural networks with many layers.
• SGD is thought to have a regularizing effect that improves generalization, yet we find that
SGD actually enables fitting more training samples than full-batch gradient descent.

2 Related Work
Approximation theory. A primary area of early deep learning theory focused on upper bounding the
number of parameters or neurons required to well-approximate functions in a particular class, for
example uniform approximation of continuous functions on a compact domain [4]. Such approxima-
tion theories typically focus on arbitrary compact sets or data on a well-behaved manifold [4, 8]. The
resulting upper bounds are often proved constructively, and the constructions may be specific to a
particular neural network architecture, often very shallow networks with only a few layers, limiting
their generality. We focus on neural network flexibility on the training set, empirically measuring
the parameters needed to fit real data in practice, rather than theoretical bounds. This methodology
allows us to try any architecture or to inspect the influence of optimizers, and it measures quantities
that actually impact neural networks.
Overparameterized neural networks and generalization. Early generalization theories predicted
that highly constrained models which fit their training data yet fail to fit randomly labeled data (i.e.
have low Rademacher complexity or VC-dimension) can generalize to new unseen test data [9, 10].
However, these theories fail to account for the exceptional generalization behavior of neural networks
since they are often highly flexible and overparameterized, leading to vacuous error bounds [1].
Recent work on PAC-Bayes generalization theory explains that highly flexible and overparameterized
models can generalize well as long as they assign disproportionate prior mass to parameter vectors
which fit the training data [3, 11, 12]. Related empirical works explain why neural network inductive
bias and consequently generalization can actually benefit from overparametrization [13, 14, 15]. We
will see in our own experiments that whereas the Rademacher complexity of neural networks is
extremely high, they can fit many more correctly labeled samples than randomly labeled ones in
practice, and this gap predicts generalization. Nakkiran et. al [16] use the data-fitting capacity of
neural networks to understand the double-descent phenomenon. They train networks with many
fewer or many more parameters than the number of samples they can fit, and they study the impact
of such over- and underparameterization on generalization. In contrast, we are interested in what
influences that capacity to fit data itself.

3 Preliminaries
Quantifying capacity. While it is straightforward to determine the number of samples a linear
regression model can fit by counting its parameters, neural networks present a more complicated story.

2
Our goal is to measure the neural network’s capacity to fit real data using realistic training routines.
This metric should satisfy three essential criteria: (1) it must measure the real-world capacity to fit
data, enabling us to evaluate the effects of optimizers and regularizers; (2) it should be sensitive to
the training dataset, meaning it should reflect the capacity to fit different types of data or data with
specific labeling characteristics; and (3) it must be feasible to compute.
To that end, we adopt the Effective Model Complexity (EMC) metric [16], which estimates the largest
sample size that a model can perfectly fit. We apply this metric across various data types, including
those with random or semantic labels or even random inputs.
Calculating EMC involves an iterative approach for each network size. Initially, we train the model on
a small number of samples. If it achieves 100% training accuracy after training, we re-initialize and
train on a larger set of randomly chosen samples. We iteratively perform this process, incrementally
increasing the sample size each time until the model no longer fits all training samples perfectly. The
largest sample size where the model still achieves perfect fitting is taken as the network’s EMC. It is
important to note that the initialization and data subsets we sample on each iteration are independent
of those from previous iterations, ensuring that our capacity evaluation remains unbiased. Where the
network did not successfully reach 100% training accuracy, we re-run training three more times with
different random seeds to ensure that inability to fit all samples was not a fluke. Furthermore, we
also tried performing all analysis instead with a relaxed requirement that the network fit 98% of its
training data, which did not significantly affect results.
While it is possible to artificially prevent models from fitting their training set by under-training,
confounding any study of capacity to fit data, we ensure that all training runs reach a minimum of the
loss function by imposing three conditions: first, the norm of the gradients across all samples must
fall below a pre-defined threshold; second, the training loss should stabilize; third, we check for the
absence of negative eigenvalues in the loss Hessian to confirm that the model has indeed reached a
minimum rather than a saddle point. In Appendix A.3, we detail our method for computing the EMC
as well as how we enforce the above three conditions.
In contrast to Nakkiran et al. [16], we validate that each model reaches a minimum during optimization
by ensuring the absence of negative singular values of the loss Hessian. This step is important given
that we train models of various architectures and sizes, so we want to prevent under-training from
being a confounding variable.
Underparameterization and overparameterization. Linear models are described as underparame-
terized when they have fewer parameters than training samples and overparameterized when they
have more parameters than training samples. This threshold determines when a linear regression
model can fit any labeling of its data, and it often coincides with the transition to strict convexity
when a linear model has a unique optimal parameter vector. Neural networks behave differently than
linear regression models; their loss function is non-convex and can have multiple minima even when
training sets are large. Moreover, it is unclear exactly how many parameters a neural network needs
to fit its training set in practice. We will use EMC to investigate the latter quantity.
The differences between capacity, flexibility, expressiveness, and complexity. These terms are used
in numerous ways, sometimes interchangeably and sometimes distinctly. For example, Rademacher
complexity and VC-dimension are notions of complexity typically associated with flexibility, whereas
the PAC-Bayes notion of complexity is information-theoretic and instead measures compression.
Expressiveness can be used to described the breadth of an entire hypothesis class, that is all the
functions that a model can express across all possible parameter settings. Approximation theories
measure the expressiveness of a hypothesis class by the existence of elements of this class which well-
approximate functions of a specified type. We will abstain from using the terms “expressiveness” and
“complexity” when describing EMC to avoid confusion, and we will use “capacity” and “flexibility”
when referring to a model’s ability to fit data in practice.
Factors influencing the EMC. Unlike VC-dimension or expressiveness concepts in approximation
theories, EMC depends not only on the hypothesis class but on every aspect of neural network
training, from optimizers and regularizers to the specific parameterization induced by the model’s
architecture. Choices in architectural design and training algorithms influence the loss of surface
geometry, thereby affecting the accessibility of certain solutions.

3
4 Experimental Setup

We conduct a comprehensive dissection of the factors influencing neural network flexibility. To this
end, we consider a variety of datasets, architectures, and optimizers.

4.1 Datasets

We conduct experiments on a variety of datasets, including vision datasets like MNIST [17], CIFAR-
10, CIFAR-100 [18], and ImageNet [19], as well as tabular datasets like Forest Cover Type [20],
Adult Income [21], and the Credit dataset [22]. Due to the small size of these datasets, we also use
larger synthetic datasets. These are generated using the Efficient Diffusion Training via Min-SNR
Weighting Strategy [23], yielding diverse ImageNet-quality samples at a resolution of 128 × 128.
Specifically, we create ImageNet-20MS, containing 20 million samples across ten classes. Unless
otherwise specified, the main text describes results on ImageNet-20MS, while the appendix contains
results on additional datasets. We omit data augmentations to avoid confounding effects.

4.2 Models

We evaluate the flexibility of diverse architectures, including Multi-Layer Perceptrons (MLPs), CNNs
like ResNet [24] and EfficientNet [25], and Vision Transformers (ViTs) [26]. We systematically
adjust the width and depth of these architectures. For MLPs, we either increase the width by adding
neurons per layer while keeping the number of layers constant or increase the depth by adding
more layers while keeping the number of neurons per layer constant. For naive CNNs, we employ
multiple convolutional layers followed by a constant-sized fully connected layer, varying either the
number of filters per layer or the total number of layers. For ResNets, we scale either the number of
filters or the number of blocks (depth). In ViTs, we scale the number of encoder blocks (depth), the
dimensionality of patch embeddings, and self-attention (width). By default, we scale the width to
control the parameter count unless stated otherwise.

4.3 Optimizers

We employ several optimizers, including Stochastic Gradient Descent (SGD), Adam [27], AdamW
[28], full-batch Gradient Descent (GD), and the second-order Shampoo optimizer [29]. These choices
let us examine how features like stochasticity and preconditioning influence the minima. To ensure
effective optimization across datasets and model sizes, we carefully tune the learning rate and batch
size for each setup, omitting weight decay in all cases. Further details about our hyperparameter
tuning are provided in Appendix A.2. By default, we use SGD.

5 The Effect of the Data on EMC

In this section, we dissect how data properties shape neural networks flexibility and how this behavior
can predict generalization.
Analysis of diverse datasets. We initiate our analysis by measuring the EMC of neural networks
across various datasets and modalities. We scale a 2-layer MLP by modifying the width of the hidden
layers and a CNN by modifying the number of layers and channels, and we train models on a range
of image classification (MNIST, CIFAR-10, CIFAR-100, ImageNet) and tabular (CoverType, Income,
and Credit) datasets. The results reveal significant disparities in the EMC of networks trained on
different data types (see Figure 1 (Left)). For instance, networks trained on tabular datasets exhibit
higher capacity. Among image classification datasets, we observe a strong correlation between test
accuracies and capacity. Notably, MNIST (where models achieve more than 99% test accuracy)
yields the highest EMC, whereas ImageNet shows the lowest, pointing to the relationship between
generalization and the data-fitting capability.
Considering the variety of datasets and network architectures and the myriad differences in their
EMC, the subsequent sections will explore the underlying causes of these variations. Our goal is to
identify the distinct factors in the data and architectures that contribute to these observed differences
in network flexibility.

4
16 CNN
106 MLP

EMC Improvement [%]

12
105

EMC 8
104

103 4

2 3 4 5 6 7
10 10 10 10 10 10
Parameter Count
0.0 0.4 0.8 1.2
Generalization Gap in Loss
MNIST-MLP CIFAR-10-CNN ImageNet-CNN Covertype
MNIST-CNN CIFAR-100-MLP Income Forest
CIFAR-10-MLP CIFAR-100-CNN

Figure 1: Left: easier tasks tend to have higher EMC. EMC across datasets and data modalities.
The tabular data sets (Forest, Income, CoverType), which are easier to learn, have the highest EMC
compared to vision datasets. The dashed black line is the diagonal. ImageNet is the hardest dataset to
learn. Right: the difference in EDC on the original and random labels predicts generalization.
EMC improvement as a function of the parameter count for CIFAR-100.

5.1 The role of inputs and labels

We next analyze the inductive biases of different architectures and how factors like spatial structure
influence the ability of a model to fit its training data. To this end, we altering inputs and labels,
measuring resulting effects. We adjust the width of MLPs and 2-layer CNNs by varying the number of
neurons (MLPs) or filters (CNNs) in each layer, and we train them on ImageNet-20MS. We evaluate
EMC as a function of the model’s parameter count in four scenarios: semantic labels, random labels,
random inputs, and inputs under a fixed random permutation. In the case of random labels, we
maintain the input but sample the class labels randomly. For random inputs, we replace the original
inputs with Gaussian noise, while for the permuted input, we use the same fixed permutation for all
the images, breaking the spatial structure in the data.

7
10

6
10
EMC

EMC

5
10

4
10

3
10
4 5 6 7 4 5 6 7
10 10 10 10 10 10 10 10
Parameter Count Parameter Count

Dataset Size Random Input Random Label Semantic Label

(a) CNN (b) MLP

Figure 2: CNNs fit more semantically labeled samples than they have parameters due to their
superior image classification inductive bias, whereas MLPs cannot. EMC as a function of the
number of parameters for semantic labels vs. random input and labels for MLPs (a) and CNNs (b).
Experiments performed on ImageNet-20MS. Error bars represent one standard error over 5 trials.

The boundary between overparameterization and underparameterization. Linear regression

models can fit at least as many samples as they have parameters, regardless of whether the labels are
naturally occurring or random. The boundary between where a model has too few parameters to fit
its data and where it has extra degrees of freedom is clear for linear regression. Naturally occurring
labels present a more complicated scenario; for instance, if the data’s labels are a linear function

5
Original Labels Original Labels
Random Labels Random Labels
4.6 4.6

Log (EMC)

Log (EDC)
4.4

4.2
4.5
4.0

20 40 60 80 100
Number of Classes SGD Shampoo AdamW Adam Full Batch
Optimizers

(a) More classes makes fitting data harder with se- (b) SGD and Shampoo are better for fitting with the
mantic labels but easier with random ones. original labels but not with random ones
Figure 3: The effect of the number of labels and optimizers on capacity. Average logarithm of
EMC across different model sizes of CNNs on CIFAR-100 for original and random labels varying
numbers of classes (a) and for different optimizers (b). Error bars are standard error over 5 trials.

of the inputs, then the model can fit infinitely many samples. In Figure 2, assigning random labels
instead of real ones allows us to explore an analogous notion of the boundary between over- and
under-parameterization, but in the context of neural networks. We see here that the networks fit
significantly fewer samples when assigned random labels compare to the original labels, indicating
that neural networks are less parameter efficient than linear models in this setting. Like linear models,
the amount of data they can fit appears to scale linearly in their parameter count.
The effect of high-dimensional data. Linear models exhibit increased capacity when adding more
features, primarily because their parameter count directly scales with the feature count. However,
the dynamics shift when examining CNNs. In our setup, we avoid adding parameters as the data
dimensionality increases by employing average pooling prior to the classification head, a standard
technique for CNNs. We investigate the EMC using the ImageNet-20MS, systematically resizing
input images to vary their spatial dimensions from (16 × 16) to (256 × 256).
In contrast to linear models, we find in Appendix Figure 17 that CNNs, which do not benefit from
additional parameters as the input dimensionality increases, can actually fit more semantically labeled
data in lower spatial dimensions. This trend underscores a broader narrative in neural networks:
CNNs, despite their intricate architectures and capacity for complex pattern recognition, tend to align
better with data of lower intrinsic dimension. This observation resonates with the findings of Pope et
al. [30], who find that CNNs generally showcase enhanced generalization capabilities with data of
lower intrinsic dimensionality.
The effect of the number of classes. In order to probe the influence of the number of classes on the
EMC, we randomly merge CIFAR-100 classes to artificially decrease the number of classes while still
preserving the size of the original dataset. We again consider a 2-layer CNN with various numbers of
filters, and consequently, parameters. In Figure 3a, we plot the average of the logarithm of the EMC
across different model sizes for various numbers of classes. We see that data with semantic labels
becomes harder and harder to fit as the number of classes increases, and generalization becomes
more challenging as the model has to encode more information about each sample in its weights. In
contrast, randomly labeled data is easier to fit as the number of classes increases because the model is
no longer forced to assign as many semantically different samples the same class label, which would
be at odds with the model’s inductive bias that prefers correct labels over random ones.
To compare different datasets while controlling for properties like number of classes, we convert
several datasets into binary classification problems. This modification enables us to assess the impact
of the number of classes on EMC and isolate the effects of input distribution. Our results (Appendix
Figure 21) show that even though the EMC among image datasets increases in the binary classification
setting over the original classification labels, tabular datasets consistently demonstrated higher EMC.
Furthermore, significant differences persist among the different tabular datasets. These outcomes
suggest that additional factors, perhaps intrinsic to the datasets themselves, contribute to EMC beyond
the number of classes.

6
5.2 Predicting generalization

Neural networks exhibit a marked preference for fitting semantically coherent labels over random
ones, a tendency reflecting their inductive biases. This propensity, as depicted in Figure 1 (right),
underscores a broader principle: a network’s adeptness at fitting semantic labels compared to
random ones often correlates with its generalization. Interestingly, this generalization enables certain
architectures, like CNNs, to fit more samples than their parameter count might suggest, blurring the
boundaries of over- and under-parameterization.
This observation bridges two seminal perspectives on model generalization. Traditional machine
learning wisdom posits that high-capacity models tend to overfit, compromising their generalization
on new data—a notion reflected in early generalization bounds, which are vacuous for neural networks
[9, 10]. In contrast, PAC-Bayes theory proposes that a model’s flexibility doesn’t inherently impede
generalization, provided its prior assigns disproportionate mass to the true labels compared to random
ones, or in other words the model prefers correct labelings of the data to incorrect labelings [3]. Our
empirical findings relate these two theories, revealing an empirical relationship between a model’s
increased ability to fit correct labels over random ones and its generalization.
Specifically, we compute the EMC for various CNN and MLP configurations on both correctly and
randomly labeled data. We measure the percent increase in EMC when models encounter semantic
labels versus random ones, effectively gauging their practical capacity to fit data that aligns with
natural label distributions.
The notable inverse correlation between this metric and the generalization gap (Pearson correlation
coefficient of −0.9281 for CNNs and −0.869 for MLPs), as illustrated in Figure 1 (Right), not only
confirms the theoretical underpinnings of generalization but also illuminates the practical implications
of these theories.

6 The Effect of Model Architecture on EMC

After dissecting the influence of the data on flexibility, we shift focus to the impact of architecture.
In this section, we examine how various architectural properties, including MLPs vs. CNNs vs.
transformers, activation functions, and scaling strategies, contribute to flexibility.
Architectural style and parameter efficiency. There is an ongoing debate regarding the efficiency
and generalization of CNNs and Vision Transformers (ViTs) [31, 32, 33, 34]. In light of this debate,
we put three neural network architecture paradigms to the test: (1) MLPs, (2) CNNs, and (3) ViTs.
Our findings, summarized in Appendix Figure 4b, reveal a consistent pattern: CNNs, characterized
by hard-coded inductive biases like locality and translation equivariance, outperform both ViTs and
MLPs in EMC (see Figure 18 for detailed EMC scaling laws across various parameter counts). This
superiority persists across all model sizes when evaluated on semantically labeled data. As analyzed
in the previous section, this trend could be misconstrued as purely a result of better generalization
capabilities of CNNs over ViTs and, subsequently, MLPs.
To test this hypothesis, we examine the networks’ flexibility on randomized data (Figure 18). CNNs,
which benefit strongly from data with a spatial structure, fit fewer samples when the spatial structure
is broken via permutation. On the other hand, MLPs lack this preference for spatial structure, so their
ability to fit data is unchanged. Replacing inputs with Gaussian noise increases both architectures’
capacity. This trend might be explained by the fact that in high dimensions, noisy data lies far apart
and is, therefore, easier to separate. Notably, CNNs can fit far more samples with semantic labels
than with random inputs. In contrast, this trend is reversed for MLPs, again highlighting the superior
generalization of CNNs on image classification.
However, even though random data impacts architectures differently, their hierarchy in terms of
parameter efficiency remains unchanged. This intriguing consistency underscores that the enhanced
parameter efficiency of CNNs is not just a byproduct of their generalization ability but is inher-
ently rooted in their architectural design. This observation aligns with theoretical perspectives in
approximation theory, which posit CNNs as more parameter-efficient compared to MLPs [35].
Strategies for scaling network size. The debate on how to best scale width and depth in neural
networks has been ongoing. We now focus on how different scaling strategies affect a network’s
ability to fit data. Our examination, shown in Figure 4a and Appendix Figure 5, explores EMC

7
7
10
5 Original Labels
Random Labels
6

Log (EMC)
10
4.6
EMC

5
10 Dataset Size 4.2
Depth
Width
ResNet-RS
4 EfficientNet
10
4 5 6 7 3.8
10 10 10 10 CNN VIT MLP
Parameter Count
Architecture
(a) ResNet-RS is the most efficient among scaling (b) CNNs are far more parameter-efficient, even on ran-
strategies we test. domly labeled data.
Figure 4: The effect of the scaling strategy and the architecture on the EMC . (a) Scaling laws
for the EMC as a function of parameters counts for CNN. (b) Average logarithm of EMC across
parameter counts for different architectures using original and random labels. On ImageNet-20MS.
Error bars represent one standard error over 5 trials.

under various scaling configurations. For ResNets, these include increasing width (number of filters),
increasing depth, or increasing both width and depth according to two scaling laws: EfficientNet [25]
and ResNet-RS [36]. EfficientNet uses a balanced approach, scaling depth, width, and resolution
simultaneously with fixed coefficients. ResNet-RS adapts scaling based on model size, training
duration, and dataset size. For scaling ViTs, we use the SViT approach [37], SoViT [38], and also
try scaling the number of encoder blocks (depth) and the dimensionality of patch embeddings and
self-attention (width) separately.
Our analysis reveals that, although not initially crafted for optimizing capacity, specially designed
scaling laws perform well in this respect. Furthermore, consistent with earlier theoretical analyses
[39], our findings affirm that scaling depth is more parameter-efficient than scaling width. These
parameter-efficiency comparisons also hold on randomly labeled data, indicating that they are not an
artifact of generalization.
Activation functions. Nonlinear activation functions are crucial for neural network capacity because
without them, neural networks are just large factorized linear models. In this subsection, we examine
the effect of the activation functions on capacity, contrasting them with linear models.
Detailed in Appendix Figure 16, our findings show that ReLU functions significantly enhance capacity.
Though initially integrated to mitigate vanishing and exploding gradients, ReLU also boosts the
network’s data-fitting ability, likely by improving generalization. In contrast, tanh and identity
functions, while nonlinear, do not achieve similar effects, even though we are able to find minima
using these activations functions too. We note the latter fact to ensure that ReLUs are not only
boosting network capacity by making it easier to find minima.

7 The Role of Optimization in Fitting Data

The choice of optimization technique and regularization strategy is crucial in neural network training.
This choice affects not only training convergence but also the nature of the solutions found. This
section explores the role different optimization and regularization techniques play in a network’s
flexibility.
Comparing optimizers. We explore the influence of various optimizers, including SGD, full-batch
Gradient Descent, Adam [27], AdamW [28], and Shampoo [40].
Whereas previous works suggest that SGD has a strong flatness-seeking regularization effect [41], we
find in Figure 3b that SGD also enables fitting more data than full-batch (non-stochastic) training,
fitting a comparable volume of data as the high-powered Shampoo. This experiment, namely the
variety of EMC measurements across optimizers, demonstrates that optimizers differ not only in the
rate at which they converge but also in the types of minima they find. Repeating this experiment with

8
random labels shows that the higher EMC of SGD and Shampoo evaporates, indicating that their
greater ability to fit data may be related to their superior generalization.
Regularizers. Classical machine learning systems employed regularizers designed to reduce capacity.
For example, ridge regression applies a penalty on the parameter norm, improving performance of
overparameterized linear models [42]. Similarly XGBoost penalizes the sum of squared leaf weights
to prevent overfitting [43]. Modern deep learning pipelines use various regularization techniques to
improve generalization. We now examine if these regularizers also reduce the model’s capacity to
fit data. We previously found that stochastic training, which enhances generalization and provides
implicit regularization, actually increases EMC.
In Appendix Figure 15, we compute the EMC of a CNN trained on ImageNet-20MS using Sharpness-
Aware Minimization (SAM) [44], weight decay, and label smoothing [45]. Weight decay and label
smoothing limit capacity, but SAM improves generalization without reducing capacity, even on
randomly labeled data. Label smoothing modifies the loss function, so a model trained with the
smoothed objective may not find minima of the original non-smoothed loss. In contrast, SAM does
not change the loss function itself but finds different types of minima than SGD, which generalize
better at no capacity cost.

8 Reparameterization for Increased Parameter Efficiency

In previous sections, we examined how different pipeline components affect parameter efficiency.
We found that neural networks often fit fewer samples than they have parameters. To close this
gap, we adopt two reparameterization methods - subspace training [11], which was designed for
compression, whereby we take a CNN’s parameter vector and randomly project it into a lower-
dimensional subspace, training in the lower dimensional space. We also try a quantization experiment
where we train the CNN in 8-bit precision instead of the standard 32-bit precision but with four times
as many parameters. This quantization deviates from our other parameter-count studies. Here, we
count a model with 4 × n 8-bit parameters with a model containing n 32-bit parameters as they are
specified by the same number of bits.
Our empirical observations, detailed in Appendix Figure 19, underscore the effectiveness of these
strategies. Subspace training, in particular, significantly improves parameter efficiency for both
semantic and random labels. This highlights that neural networks, in their standard form, are wasteful
of parameters. The quantization experiment further supports this finding, revealing that even with
reduced precision, networks can achieve comparable levels of flexibility. Interestingly, we also see
here that an 8-bit quantized model can fit a quarter times as many randomly labeled samples as it has
parameters, closing the same gap on a per-bit basis.

9 Discussion
Our findings show that parameter counting alone is not a useful tool for determining the number of
samples a neural network can fit, or the boundary between underparameterization and overparam-
eterization. Instead, many factors contribute to the effective model complexity, including virtually
all components of a training routine as well as the data itself. Moreover, we must re-evaluate our
understanding of why these components work. We saw that architectural components like ReLU
activation functions may solve additional problems that they weren’t designed for, and stochastic
optimization, for example, actually finds minima where we fit more training samples, contrasting
with conventional views of implicit regularization. Finally, our results suggest neural networks are
often parameter-wasteful, and new parameterizations might improve efficiency.

Acknowledgements

This work is supported by NSF CAREER IIS-2145492, NSF CDS&E-MSS 2134216, NSF HDR-
2118310, BigHat Biosciences, Capital One, and an Amazon Research Award.

9
References
[1] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding
deep learning requires rethinking generalization. In International Conference on Learning
Representations, 2016.
[2] Devansh Arpit, Stanisław Jastrz˛ebski, Nicolas Ballas, David Krueger, Emmanuel Bengio,
Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A
closer look at memorization in deep networks. In International conference on machine learning,
pages 233–242. PMLR, 2017.
[3] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds
for deep (stochastic) neural networks with many more parameters than training data. arXiv
preprint arXiv:1703.11008, 2017.
[4] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are
universal approximators. Neural networks, 2(5):359–366, 1989.
[5] Andrew R Barron. Neural net approximation. In Proc. 7th Yale workshop on adaptive and
learning systems, volume 1, pages 69–72, 1992.
[6] Hrushikesh N Mhaskar and Tomaso Poggio. Deep vs. shallow networks: An approximation
theory perspective. Analysis and Applications, 14(06):829–848, 2016.
[7] Micah Goldblum, Jonas Geiping, Avi Schwarzschild, Michael Moeller, and Tom Goldstein.
Truth or backpropaganda? an empirical investigation of deep learning theory. In International
Conference on Learning Representations, 2020.
[8] Uri Shaham, Alexander Cloninger, and Ronald R Coifman. Provable approximation properties
for deep neural networks. Applied and Computational Harmonic Analysis, 44(3):537–557,
2018.
[9] Vladimir Vapnik. Principles of risk minimization for learning theory. Advances in neural
information processing systems, 4, 1991.
[10] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds
and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[11] Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, and An-
drew G Wilson. Pac-bayes compression bounds so tight that they can explain generalization.
Advances in Neural Information Processing Systems, 35:31459–31473, 2022.
[12] Sanae Lotfi, Marc Finzi, Yilun Kuang, Tim GJ Rudner, Micah Goldblum, and Andrew Gor-
don Wilson. Non-vacuous generalization bounds for large language models. arXiv preprint
arXiv:2312.17173, 2023.
[13] W Ronny Huang, Zeyad Emam, Micah Goldblum, Liam Fowl, JK Terry, Furong Huang,
and Tom Goldstein. Understanding generalization through visualizations. arXiv preprint
arXiv:1906.03291, 2019.
[14] Ping-yeh Chiang, Renkun Ni, David Yu Miller, Arpit Bansal, Jonas Geiping, Micah Goldblum,
and Tom Goldstein. Loss landscapes are all you need: Neural network generalization can
be explained without the implicit bias of gradient descent. In The Eleventh International
Conference on Learning Representations, 2022.
[15] Wesley J Maddox, Gregory Benton, and Andrew Gordon Wilson. Rethinking parameter counting
in deep models: Effective dimensionality revisited. arXiv preprint arXiv:2003.02139, 2020.
[16] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever.
Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics:
Theory and Experiment, 2021(12):124003, 2021.
[17] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE
Signal Processing Magazine, 29(6):141–142, 2012.

10
[18] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.
2009.
[19] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-
scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern
recognition, pages 248–255. Ieee, 2009.
[20] Jock A. Blackard and Denis J. Dean. Comparative accuracies of artificial neural networks and
discriminant analysis in predicting forest cover types from cartographic variables. Computers
and Electronics in Agriculture, 24(3):131–151, 1999.
[21] Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI:
https://doi.org/10.24432/C5XW20.
[22] Kaggle. Credit card dataset. 2021. Kaggle dataset.
[23] Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining
Guo. Efficient diffusion training via min-snr weighting strategy. 2023.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 770–778, 2016.
[25] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural
networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
[26] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.
An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929, 2020.
[27] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International
Conference on Learning Representations, 2015.
[28] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International
Conference on Learning Representations, 2018.
[29] Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. Towards practical
second order optimization for deep learning, 2021.
[30] Phil Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic
dimension of images and its impact on learning. In International Conference on Learning
Representations, 2020.
[31] Stéphane d’Ascoli, Hugo Touvron, Matthew L Leavitt, Ari S Morcos, Giulio Biroli, and Levent
Sagun. Convit: Improving vision transformers with soft convolutional inductive biases. In
International Conference on Machine Learning, pages 2286–2296. PMLR, 2021.
[32] Badri N Patro and Vijay Agneeswaran. Efficiency 360: Efficient vision transformers. arXiv
preprint arXiv:2302.08374, 2023.
[33] José Maurício, Inês Domingues, and Jorge Bernardino. Comparing vision transformers and
convolutional neural networks for image classification: A literature review. Applied Sciences,
13(9), 2023.
[34] Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Uday Prabhu, Gowthami
Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, et al. Battle
of the backbones: A large-scale comparison of pretrained models across computer vision
tasks. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and
Benchmarks Track, 2023.
[35] Chenglong Bao, Qianxiao Li, Zuowei Shen, Cheng Tai, Lei Wu, and Xueshuang Xiang. Ap-
proximation analysis of convolutional neural networks. work, 65, 2014.

11
[36] Irwan Bello, William Fedus, Xianzhi Du, Ekin Dogus Cubuk, Aravind Srinivas, Tsung-Yi Lin,
Jonathon Shlens, and Barret Zoph. Revisiting resnets: Improved training and scaling strategies.
In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural
Information Processing Systems, 2021.
[37] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transform-
ers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 12104–12113, 2022.
[38] Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, and Lucas Beyer. Getting vit
in shape: Scaling laws for compute-optimal model design. arXiv preprint arXiv:2305.13035,
2023.
[39] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In
Conference on learning theory, pages 907–940. PMLR, 2016.
[40] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor
optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR,
2018.
[41] Jonas Geiping, Micah Goldblum, Phil Pope, Michael Moeller, and Tom Goldstein. Stochastic
training is not necessary for generalization. In International Conference on Learning Represen-
tations, 2021.
[42] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics, 12(1):55–67, 1970.
[43] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of
the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages
785–794, 2016.
[44] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware min-
imization for efficiently improving generalization. In International Conference on Learning
Representations, 2020.
[45] Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help?
Advances in neural information processing systems, 32, 2019.
[46] Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W Mahoney. Pyhessian: Neural
networks through the lens of the hessian. In 2020 IEEE international conference on big data
(Big data), pages 581–590. IEEE, 2020.

12
A Appendix

A.1 Additional Results

Here, we present figures that include additional datasets and labelings, as well as detailed results
across all parameter counts, rather than just the aggregated averages shown in the main body. In the
main paper, for the ViT scaling laws, we followed the scaling approach proposed by [37] (SVIT),
which advocates for simultaneously and uniformly scaling all aspects—depth, width, MLP width, and
patch size. Additionally, we employed both SoViT, as per [38], and approaches where the number
of encoder blocks (depth) and the dimensionality of patch embeddings and self-attention (width)
in the ViT are scaled separately. fig. 5 in the Appendix demonstrates that scaling each dimension
independently can lead to suboptimal results, aligning with our observations from the EfficientNet
experiments. Furthermore, it shows that SoViT yields results that are slightly different from those
obtained using the laws from [37].

EfficientNet - CNN
7
10 SVIT
SoViT
VIT - Width
6 VIT - Depth
10
MLP
EMC

5
10

4
10

3
10

4 5 6 7
10 10 10 10
Parameter Count

Figure 5: Scaling laws - EMC as a function of the number of parameters for randomly labeled
ImageNet-20MS for VIT

13
7
10

6
10
EDC

5
10
Depth
Width
EfficientNet
4
ResNet-RS
10 dataset size
4 5 6 7
10 10 10 10
Parameter Count

Figure 6: Scaling laws - EMC as a function of the number of parameters for randomly labeled
ImageNet-20MS.

7
10

6
10
EMC

5
10 Dataset Size
Depth
Width
ResNet-RS
4 EfficientNet
10
4 5 6 7
10 10 10 10
Parameter Count

Figure 7: Scaling laws - EMC as a function of the number of parameters for a CNN on ImageNet-
20MS with original labels.

14
4
4 × 10

4
3 × 10
EMC

4
2 × 10

Width
Depth
4
10 EfficientNet
ResNet
dataset size
4
10
Parameter Count

Figure 8: Scaling laws - EMC as a function of the number of parameters for a CNN on CIFAR-10
with original labels.

7
10

6
10
EMC

5
10

Linear
ReLU
4 Tanh
10
4 5 6 7
10 10 10 10
Parameter Count

Figure 9: EMC as a function of the number of parameters across different activation functions
using CNNs on ImageNet-20MS with original labels.

15
7
10

6
10
EMC

5
10

4
Linear
10 ReLU
Tanh
4 5 6 7
10 10 10 10
Parameter Count

Figure 10: EMC as a function of the number of parameters across different activation functions
using CNNs and ImageNet-20MS with random labels.

Depth
7
10 Width
ResNet-RS
EfficientNet
dataset size

6
Flexability

5
10

4
10
4 5 6 7
10 10 10 10
Parameter Count

Figure 11: SGD and Shampoo fit more training data - EMC across different optimizers using
CNNs on CIFAR-10.

16
7
10

6
10
EMC

5
10
SGD
AdamW
Adam
4 Shampoo
10 dataset size
4 5 6 7
10 10 10 10
Parameter Count

Figure 12: EMC as a function of the number of parameters across different optimizers with
CNNs on ImageNet-20MS with original labels.

7
10

6
10
EMC

5
10

SGD
4
AdamW
10 Adam
Shampoo
dataset size
4 5 6 7
10 10 10 10
Parameter Count

Figure 13: EMC as a function of the number of parameters across different optimizers with
CNNs on ImageNet-20MS with random labels.

17
SAM
7
10 Weight Decay
Label Smothing
Original

6
10
EMC

5
10

4
10

4 5 6 7
10 10 10 10
Parameter Count

Figure 14: EMC as a function of the number of parameters across different regularizers on
ImageNet-20MS with random labels.

Original Labels
5.1 Random Labels
Log (EMC)

4.7

4.3

3.9
M inal ing ca
y
SA rig th e
O Smo htD
l ig
La
b e We
Regularizers
Figure 15: SAM has better generalization at no capacity cost - Average logarithm of EMC over
different model sizes for SAM, weight decay, and label smoothing using CNNs on ImageNet-20MS.

18
7
10

6
10
EMC

EMC
5
10

4
10
4 5 6 7 4 5 6 7
10 10 10 10 10 10 10 10
Parameter Count Parameter Count
Linear ReLU Tanh Dataset Size

(a) Original Labels (b) Random Labels

Figure 16: ReLU networks exhibit higher flexibility. EMC as a function of the number of
parameters across different activation functions for original labels (left) and for random ones (right)
on ImageNet-20MS.

Figure 17: High-dimensional data is harder to fit. Average logarithm of EMC across different
model sizes for original and random labels varying input sizes for CNN architectures on CIFAR-100.

19
7
10 7
10

6
10
6
10
EMC

EMC
5
10

5
10

4
Random Input Random Input
10
Permuted Input Permuted Input
Random Label 4 Random Label
10

3
Semantic Label Semantic Label
10
dataset size dataset size
4 5 6 7 4 5 6 7
10 10 10 10 10 10 10 10
Parameter Count Parameter Count

(a) MLP (b) CNN

Figure 18: Generalization boosts EMC - EMC as a function of the number of parameters for
semantic labels vs. random input and labels using MLP and CNN architectures on ImageNet-20MS.

5.7 Original Labels

Random Labels
Log (EMC)

5.2

4.7

4.2

ation ing l
Quantiz c e Train Origina
Subspa
Regularizers
Figure 19: Compression improves network efficiency - Average logarithm of EMC over different
model sizes and compression methods. CNNs on ImageNet-20MS.

Original Original
Subspace Training 7 Subspace Training
7 10
10
Quantization Quantization

6
10
6
10
EMC

EMC

5
10

4
10

4 5 6 7 4 5 6 7
10 10 10 10 10 10 10 10
Parameter Count Parameter Count

(a) Original Labels (b) Random Labels

Figure 20: Compression improves Network efficiency. The EMC across different model sizes for
original and random labels. CNN architectures on ImageNet-20MS.

20
6
10

5
10
EMC

4
10

3
10

2 3 4 5 6
10 10 10 10 10
Parameter Count

CIFAR-100-MLP ImageNet-CNN Credit Covertype

CIFAR-100-CNN Income

Figure 21: EMC as function of the number of parameters for datasets that converted to binary
classification.

A.2 Implementation Details

Unless otherwise mentioned, our hyperparameter tuning was conducted over the following hyperpa-
rameters: batch size - with the values [32, 64, 128, 256]. For the Stochastic Gradient Descent (SGD)
optimizer, we used an initial learning rate selected by grid search between 0.001 and 0.01 with Cosine
annealing. For Adam and AdamW optimizers, the learning rate was chosen by grid search between
1e − 5 and 1e − 2.
For other hyperparameters, we adhere to the standard PyTorch recipes.

A.3 Empirical Model Complexity

To compute the Empirical Model Complexity (EMC), we adopt an iterative approach for each network
size. Initially, we start with a small number of samples and train the model. Post-training, we verify
if the model has perfectly fit all the samples by achieving 100% training accuracy. If this criterion is
met, we re-initialize the model with a random initialization and train it again on a larger number of
samples, randomly drawn from the full dataset. This process is iteratively performed, increasing the
number of samples in each iteration, until the model fails to perfectly fit all the training samples. The
largest sample size where the model achieves a perfect fit is taken as the Empirical Model Complexity
for that particular network size. It is important to note that data is sampled independently on each
iteration.
While it is possible to artificially prevent models from fitting their training set by under-training, thus
confounding any study of capacity to fit data, we ensure that all training runs reach a minimum of the
loss function by imposing three conditions:
First, the norm of the gradients across all samples must fall below a pre-defined threshold. We
observed that there is a high variance in the norms of the gradients between different networks;
therefore, we set this threshold manually after checking the norms for each network type when
training with a small number of samples, where it’s clear that the networks fit perfectly and converge
to a minimum.
Second, the training loss should stabilize. To ensure this, we stipulate that the average loss should not
decrease for 10 consecutive epochs.

21
Third, we check for the absence of negative eigenvalues in the loss Hessian to confirm that the
model has indeed reached a minimum rather than a saddle point. To do this, we calculate the
eigenvalues using the PyHessian Python package [46] and validate that after training converges, there
are no eigenvalues smaller than −1e − 2. This threshold was chosen after examining the eigenvalue
distributions of different networks that fit perfectly.

A.4 Compute Resources

Our experiments were conducted using NVIDIA Tesla V100 GPUs with 32GB memory each for
model training and evaluation. The total compute time for the entire set of experiments was ap-
proximately 3000 GPU hours. All experiments were run on NUY’s cluster managed with SLURM,
ensuring efficient resource allocation and job scheduling. This setup allowed us to handle the exten-
sive computational demands of training large neural network models and conducting comprehensive
evaluations.

B Broader Impacts
Our research on the capacity of neural networks to fit data more efficiently has several important
implications. Positively, our findings could lead to more efficient AI models, which would benefit
various applications by making these technologies more accessible and effective. By understanding
how neural networks can be more efficient, we can also reduce the environmental impact associated
with training large models.
However, there are potential negative impacts as well. Improved neural network capabilities might
be used in ways that invade privacy, such as through enhanced surveillance or unauthorized data
analysis. Additionally, as AI technologies become more powerful, it is essential to consider ethical
implications, fairness, and potential biases in their development and use.
To address these concerns, our paper emphasizes the importance of responsible AI practices. We
encourage transparency, ethical considerations, and ongoing research into the societal impacts of
advanced machine learning technologies to ensure they are used for the greater good.

C Limitations
Our study has several limitations that should be considered when interpreting the results. First, the
datasets used in our experiments, while diverse, may not fully represent the wide variety of data
encountered in practical applications. This could introduce biases and limit the generalizability of our
findings. Second, our experiments are constrained by the available computational resources. While
we used NVIDIA Tesla V100 GPUs with 32GB memory, the total compute time was approximately
3000 GPU hours. This limitation restricted the scale and number of experiments we could perform,
potentially affecting the robustness of our conclusions.
Furthermore, our analysis primarily focuses on certain types of neural network architectures, such as
CNNs, MLPs, and ViTs. While these are common and widely used, there are many other architectures
that we did not explore. The impact of different training procedures, regularization techniques, and
hyperparameter choices on the EMC might vary with other architectures.
Additionally, we decided to test a wide range of factors affecting neural network flexibility but only
explored a limited number of settings for each factor, rather than delving deeply into any single factor.
This broad but shallow exploration might miss deeper insights that a more focused study could reveal.
Lastly, our method of measuring EMC, while rigorous, relies on specific criteria for determining
when a model has perfectly fit its training data. These criteria include achieving 100% training
accuracy and the absence of negative eigenvalues in the loss Hessian. Different criteria might yield
slightly different EMC values, and this should be taken into account when applying our findings to
other contexts.
Despite these limitations, we believe our study provides valuable insights into the factors influencing
neural network flexibility and highlights areas for further research.

Module 1 - S8 CSE NOTES - KTU DEEP LEARNING NOTES - CST414
100% (1)
Module 1 - S8 CSE NOTES - KTU DEEP LEARNING NOTES - CST414
18 pages
Chapter 08
100% (2)
Chapter 08
202 pages
Practical Issues in Neural Network Training
No ratings yet
Practical Issues in Neural Network Training
15 pages
Unit-Ii DLL
No ratings yet
Unit-Ii DLL
19 pages
PLAB 2 Notes Part 1
No ratings yet
PLAB 2 Notes Part 1
483 pages
Tutorial Math Deep Learning 2018 PDF
No ratings yet
Tutorial Math Deep Learning 2018 PDF
103 pages
Towards A Mathematical Understanding of Neural Network-Based Machine Learning: What We Know and What We Don't
No ratings yet
Towards A Mathematical Understanding of Neural Network-Based Machine Learning: What We Know and What We Don't
56 pages
DSA5105 Lecture5
No ratings yet
DSA5105 Lecture5
52 pages
Neural Networks
No ratings yet
Neural Networks
37 pages
On The Power and Limitations of Random Features For Understanding Neural Networks
No ratings yet
On The Power and Limitations of Random Features For Understanding Neural Networks
30 pages
BV Capacity Neural Networks
No ratings yet
BV Capacity Neural Networks
49 pages
The Impact of Neural Network Overparameterization On Gradient Confusion and Stochastic Gradient Descent
No ratings yet
The Impact of Neural Network Overparameterization On Gradient Confusion and Stochastic Gradient Descent
46 pages
Lec 3 Learning The Network I
No ratings yet
Lec 3 Learning The Network I
139 pages
CSD411-Week 5-Generalization 6
No ratings yet
CSD411-Week 5-Generalization 6
43 pages
On The Expressive Power of Deep Neural Networks 1606.05336v4
No ratings yet
On The Expressive Power of Deep Neural Networks 1606.05336v4
20 pages
Lecture 2
No ratings yet
Lecture 2
67 pages
A Mathematical Theory of Generalization: Part I: David H. Wolpert
No ratings yet
A Mathematical Theory of Generalization: Part I: David H. Wolpert
50 pages
Deep Neural Network Approximation Theory
No ratings yet
Deep Neural Network Approximation Theory
80 pages
Reliable Neural Network Activation
No ratings yet
Reliable Neural Network Activation
18 pages
Operator Learning Algorithms and Analysis
No ratings yet
Operator Learning Algorithms and Analysis
36 pages
Overparameterization Affecting Features
No ratings yet
Overparameterization Affecting Features
19 pages
Neural Networks Learning and Memorization With (Almost) No Over-Parameterization
No ratings yet
Neural Networks Learning and Memorization With (Almost) No Over-Parameterization
10 pages
Transformer Turing
No ratings yet
Transformer Turing
13 pages
1611 03530 PDF
No ratings yet
1611 03530 PDF
15 pages
Richi's Neural Nets Summary
No ratings yet
Richi's Neural Nets Summary
114 pages
Dynamic Neural Diversification Path To Computation
No ratings yet
Dynamic Neural Diversification Path To Computation
9 pages
Lab Lec 1a - Laboratory Rules and Safety Precautions
No ratings yet
Lab Lec 1a - Laboratory Rules and Safety Precautions
52 pages
Deep Learning As A Building Block in Probabilistic Models: Pierre-Alexandre Mattei
No ratings yet
Deep Learning As A Building Block in Probabilistic Models: Pierre-Alexandre Mattei
57 pages
A Practical Approach To Sizing Neural Networks
No ratings yet
A Practical Approach To Sizing Neural Networks
13 pages
Survey of Neural Transfer Functions
No ratings yet
Survey of Neural Transfer Functions
50 pages
Transfer Functions: Hidden Possibilities For Better Neural Networks
No ratings yet
Transfer Functions: Hidden Possibilities For Better Neural Networks
14 pages
Deep Neural Networks Have An Inbuilt Occam S Razor: Article
No ratings yet
Deep Neural Networks Have An Inbuilt Occam S Razor: Article
9 pages
Convex Formulation of Overparameterized Deep Neural Networks
No ratings yet
Convex Formulation of Overparameterized Deep Neural Networks
13 pages
Neural Networks Five
No ratings yet
Neural Networks Five
65 pages
Apicella Et Al. 2019 - A Simple and Efficient Architecture For Trainable Activation Functions
No ratings yet
Apicella Et Al. 2019 - A Simple and Efficient Architecture For Trainable Activation Functions
15 pages
Unit Online 1.4
No ratings yet
Unit Online 1.4
132 pages
The Mathematics of Artificial Intelligence: 1 Supervised Learning
No ratings yet
The Mathematics of Artificial Intelligence: 1 Supervised Learning
10 pages
Reed 1993
No ratings yet
Reed 1993
8 pages
Memory Capacity of Neural Networks With Threshold and Relu Activations
No ratings yet
Memory Capacity of Neural Networks With Threshold and Relu Activations
26 pages
Creating Artificial Neural Networks That Generalize
No ratings yet
Creating Artificial Neural Networks That Generalize
13 pages
Are All Layers Created Equal
No ratings yet
Are All Layers Created Equal
28 pages
The Computational Limits of Deep Learning: Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, Gabriel F. Manso
No ratings yet
The Computational Limits of Deep Learning: Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, Gabriel F. Manso
46 pages
LBDL
No ratings yet
LBDL
143 pages
Statistical Neurodynamics of Deep Networks: Geometry of Signal Spaces
No ratings yet
Statistical Neurodynamics of Deep Networks: Geometry of Signal Spaces
24 pages
Lecture W15ab
No ratings yet
Lecture W15ab
44 pages
Guilhoto Math
No ratings yet
Guilhoto Math
25 pages
Neural Network As Universal Approximates
No ratings yet
Neural Network As Universal Approximates
5 pages
1810 01075 PDF
No ratings yet
1810 01075 PDF
59 pages
Neural Network Theory22
No ratings yet
Neural Network Theory22
60 pages
The Little Book of Deep Learning
No ratings yet
The Little Book of Deep Learning
143 pages
Deep Learning
No ratings yet
Deep Learning
78 pages
NeurIPS 2021 Towards Lower Bounds On The Depth of Relu Neural Networks Paper
No ratings yet
NeurIPS 2021 Towards Lower Bounds On The Depth of Relu Neural Networks Paper
13 pages
DNN Hyperparameter Tuning
No ratings yet
DNN Hyperparameter Tuning
105 pages
Artificial Neural Networks - Lect - 4
No ratings yet
Artificial Neural Networks - Lect - 4
17 pages
Deep Neural Network Module 4 Regularization
No ratings yet
Deep Neural Network Module 4 Regularization
53 pages
Machine Learning-Lecture 16 (Student)
No ratings yet
Machine Learning-Lecture 16 (Student)
10 pages
The Deep Neural Network-A Review
No ratings yet
The Deep Neural Network-A Review
5 pages
Breaking The Curse of Dimensionality With Convex Neural Networks
No ratings yet
Breaking The Curse of Dimensionality With Convex Neural Networks
53 pages
On The Expressive Power of Deep Neural Networks
No ratings yet
On The Expressive Power of Deep Neural Networks
8 pages
ECE/CS 559 - Neural Networks Lecture Notes #3 Some Example Neural Networks
No ratings yet
ECE/CS 559 - Neural Networks Lecture Notes #3 Some Example Neural Networks
7 pages
Inns: Civil War: Tithe Causes
No ratings yet
Inns: Civil War: Tithe Causes
262 pages
Control System Configuration PDF
100% (1)
Control System Configuration PDF
2 pages
LogicEditor enUS
No ratings yet
LogicEditor enUS
254 pages
Levinson - Music and Negative Emotion
100% (1)
Levinson - Music and Negative Emotion
30 pages
Upsc Cms Guru Answerkey2022p1
No ratings yet
Upsc Cms Guru Answerkey2022p1
45 pages
Up Jumped Spring by Freddie Hubbard - Voice - Digital Sheet Music Sheet Music Plus
No ratings yet
Up Jumped Spring by Freddie Hubbard - Voice - Digital Sheet Music Sheet Music Plus
1 page
BBE Fiitjee
No ratings yet
BBE Fiitjee
46 pages
VH Cbse-Gr-8 Mathematics Sample QP Half-Yearly
No ratings yet
VH Cbse-Gr-8 Mathematics Sample QP Half-Yearly
10 pages
215 PDF
No ratings yet
215 PDF
7 pages
DLL Mapeh-5 Q2
No ratings yet
DLL Mapeh-5 Q2
99 pages
Lumbar Herniation Case Study
No ratings yet
Lumbar Herniation Case Study
1 page
Attitude Is Everything
No ratings yet
Attitude Is Everything
27 pages
2.multiple Currencies in Purchase Order Release Strategy
No ratings yet
2.multiple Currencies in Purchase Order Release Strategy
4 pages
Chapter 4 - Kanban Agile Method
No ratings yet
Chapter 4 - Kanban Agile Method
5 pages
MAD - PRACTICAL EXAM Slips - 23 - 24
No ratings yet
MAD - PRACTICAL EXAM Slips - 23 - 24
9 pages
Manual Bomba Horizontal Clase D PDF
No ratings yet
Manual Bomba Horizontal Clase D PDF
24 pages
Practice Problem Set#2
No ratings yet
Practice Problem Set#2
2 pages
1 Text For Reading Comprehension
100% (1)
1 Text For Reading Comprehension
3 pages
Enterprise Structure
No ratings yet
Enterprise Structure
4 pages
SSC CPO 2023 Answer Key in English GS2
No ratings yet
SSC CPO 2023 Answer Key in English GS2
7 pages
3dsro CompanyPresentation
No ratings yet
3dsro CompanyPresentation
10 pages
Sistem Reproduksi Wanita
No ratings yet
Sistem Reproduksi Wanita
24 pages
Prediction of Compressive Strength of Concrete With Agricultural Waste and Natural Fibre 2024
No ratings yet
Prediction of Compressive Strength of Concrete With Agricultural Waste and Natural Fibre 2024
5 pages
Alluvial Soil Black Soil
No ratings yet
Alluvial Soil Black Soil
1 page
Bhumika Kasar
No ratings yet
Bhumika Kasar
1 page
TX - L-Band - LC12 2150A
No ratings yet
TX - L-Band - LC12 2150A
1 page
System Monitoring With Sar and Ksar
No ratings yet
System Monitoring With Sar and Ksar
9 pages
Kayleigh O'Keeffe: Ph. D. in Biology
No ratings yet
Kayleigh O'Keeffe: Ph. D. in Biology
4 pages
Feedforward Neural Networks: Fundamentals and Applications for The Architecture of Thinking Machines and Neural Webs
From Everand
Feedforward Neural Networks: Fundamentals and Applications for The Architecture of Thinking Machines and Neural Webs
Fouad Sabry
No ratings yet
Hybrid Neural Networks: Fundamentals and Applications for Interacting Biological Neural Networks with Artificial Neuronal Models
From Everand
Hybrid Neural Networks: Fundamentals and Applications for Interacting Biological Neural Networks with Artificial Neuronal Models
Fouad Sabry
No ratings yet

Just How Flexible Are Neural Networks in Practice

Uploaded by

Just How Flexible Are Neural Networks in Practice

Uploaded by

Just How Flexible are Neural Networks in Practice?

Ravid Shwartz-Ziv∗ Micah Goldblum∗ Arpit Bansal

C. Bayan Bruss Yann LeCun Andrew Gordon Wilson

Capital One New York University New York University

5 The Effect of the Data on EMC

EMC Improvement [%]

5.1 The role of inputs and labels

Dataset Size Random Input Random Label Semantic Label

(a) CNN (b) MLP

The boundary between overparameterization and underparameterization. Linear regression

6 The Effect of Model Architecture on EMC

7 The Role of Optimization in Fitting Data

8 Reparameterization for Increased Parameter Efficiency

A.1 Additional Results

(a) Original Labels (b) Random Labels

(a) MLP (b) CNN

5.7 Original Labels

(a) Original Labels (b) Random Labels

CIFAR-100-MLP ImageNet-CNN Credit Covertype

A.2 Implementation Details

A.3 Empirical Model Complexity

A.4 Compute Resources

You might also like