but a necessity. It is of utmost importance that researchers, software, and hardware engineers compare performance and cost effectiveness of such architectures in reproducible and interpretable ways.

When designing systems, software, or algorithms for AI we must consider the intersection between the general fields of data science and systems engineering. In this paper, we discuss a set of fallacies that emerge from this intersection and we propose mitigations as well as a general methodology to improve reproducibility and interpretability in the quickly growing field of "Systems for AI".

Interpretability has long been a sore topic in data science—going back to Huff's now 67-year-old book "How to Lie with Statistics". While this book falls into the general category of data literacy (how to interpret data), it contains many foundational lessons that remain important for modern data science. The fundamentally hard problem of reproducibility and interpretability remains a hot topic today. Rarely a year passes without one or two keynotes at top-class machine learning conferences focusing on these topics. Ali Rahimi's test of time award talk at NeurIPS'17 provoked a significant debate by comparing the state of the art in machine learning with alchemy in the medieval ages.

Ensuring reproducibility in data science is hard in part because the algorithms are usually based on statistics, where we cannot (and do not want to) expect bit-wise reproducibility. The fundamentally stochastic nature of the computations requires a more complex model of reproducibility: instead of reproducing a specific result, we need to reproduce a distribution [1]. Thus, the measure of success is often defined as achieving a specific accuracy on a test set, e.g., classifying 95% of the examples correctly. Here, it does not matter which 95%—implying that many different models can be considered to reproduce a specific result. In fact, due to the complex hyperparameters that control the often nondeterministic learning algorithms, finding the exact same model in different training runs is a rarity, if not practically impossible.

Compute performance is similarly hard to reproduce because it also follows a stochastic process in complex systems. Even system availability (e.g., can I find a computer system from 10 years ago?) makes reproducing results hard and forces scientists and engineers towards interpretability [2]. In both cases—data science and compute performance—rigorous statistics are necessary to model either accuracy or compute performance. Similarly, in both cases, the fields are largely driven by empirical results. Combining rigorous benchmarking with rigorous data science is necessary to design, build, and understand next-generation artificial intelligence systems. For example, cost forces many to employ the most efficient (low-bitwidth) datatype for learning systems, but one must pay attention that the resulting lossy compression does not invalidate the results.
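To make the notion of reproducing a distribution (rather than a single number) concrete, here is a minimal Python sketch that trains the same toy model under several random seeds and reports the mean test accuracy with a confidence interval; the synthetic data, seed count, and model are illustrative assumptions, not values from this article.

```python
import numpy as np

def train_and_evaluate(seed: int) -> float:
    """Train a tiny logistic-regression model on synthetic data and return
    its accuracy on a held-out split (a stand-in for a real, nondeterministic
    training pipeline)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(1000, 20))
    true_w = rng.normal(size=20)
    y = (X @ true_w + 0.5 * rng.normal(size=1000) > 0).astype(float)
    X_train, y_train, X_test, y_test = X[:800], y[:800], X[800:], y[800:]

    w = np.zeros(20)
    for _ in range(200):  # plain gradient descent on the logistic loss
        p = 1.0 / (1.0 + np.exp(-(X_train @ w)))
        w -= 0.1 * X_train.T @ (p - y_train) / len(y_train)
    return float(((X_test @ w > 0) == y_test).mean())

# Reproduce a *distribution*, not a single number: repeat under many seeds.
accs = np.array([train_and_evaluate(s) for s in range(10)])
ci = 1.96 * accs.std(ddof=1) / np.sqrt(len(accs))
print(f"test accuracy: {accs.mean():.3f} +/- {ci:.3f} (95% CI, n={len(accs)})")
```

A result reported this way can be checked by rerunning with fresh seeds and testing whether the new mean falls within the reported interval, instead of demanding a bit-identical number.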
The rush for good results
Pressure and competition in the field are fierce—numerous startups, vendors, and large cloud providers compete for the expected trillion-dollar business. Efficiency is key because many tasks, especially grand challenges such as more general intelligence, benefit from larger models. Thus, specialized parallel high-performance architectures are most important to train such large models. At the same time, models must be usable in practice, which limits the cost of training as well as inference.

Startups are fighting for their place and often take extreme positions, such as systems without large main memories or wafer-scale computing, to claim their 2-30x benefit over all other solutions. Those approaches push the system balance into extreme regimes, which may be applicable to the workload but need to be carefully analyzed. One big danger is to focus only on a specific aspect or metric of the system. For example, in many High-Performance Computing (HPC) environments, the floating point rate (flop/s) is seen as the measure of performance. This leads to many studies focusing on «floptimization»¹ and not sustained performance. Many researchers are in a similar position because competition to disseminate new ideas is fierce. With dozens of papers appearing each week on arXiv alone, fast publication of new results is imperative. In such high-pressure environments, it is most important to avoid pitfalls in performance optimization and analysis of data science workloads.

¹ We use «text» to indicate irony in the quoted text.

Life at the intersection
Performance optimization, measurement, and reproducibility have been topics in the HPC community for decades. Sets of rules and best/worst practices exist for performance measurements [3], [2], and major HPC conferences have launched serious reproducibility initiatives.

Similarly, the data science community has established a set of rules for ensuring result reproducibility [4] that have been endorsed by top ML conferences. Yet, the two fields are fundamentally different—the HPC field focuses on both results and performance, while today's machine learning field predominantly targets results. Arguably, the result of HPC simulations can often be defined in a bit-reproducible manner [5], [6], while ML results are usually of a stochastic nature, requiring exact experimental setups to be shared [4].

Both communities have focused on their specific fields in isolation—however, the fields are starting to mix in productive ways. This gives rise to machine learning tracks at traditional HPC conferences and systems tracks at traditional ML conferences, or even new conferences aimed at the intersection, such as MLSys.

Our paper aims to bridge the silos between ML and HPC by identifying a set of fallacies in performance analysis and system design for data science workloads. These fallacies and the resulting guidelines are useful for practitioners, system vendors, scientists, and end-users alike. We hope to establish some form of benchmarking etiquette for data science workloads. Vendors and scientists can check their own messages for violations of the etiquette while users can quickly identify the right questions to ask.

Twelve ways to fool the masses
We now identify twelve different fallacies that we continue to observe regularly in the wild, such as in vendor white papers, demos, and product specifications, but also in many scientific papers, talks, and even prestigious award presentations. The methodology and style are inspired by Bailey's classical "twelve ways to fool the masses" [3]; we summarize each humorously, explain it in detail, and then follow with a recommendation. We believe that each of our recommendations contributes to establishing a good performance benchmarking etiquette in data science.

#1 Scale computations at all cost
«You should adjust your system to yield the highest possible performance. Forget about incidentals such as convergence or accuracy!»

The biggest peculiarity of data science workloads is that they enable a trade-off between accuracy and performance (implementation) aspects of the execution. While it is necessary to consider this trade-off, it can be very tempting to navigate it to one extreme and focus on performance at all cost. We have observed that accuracy was sacrificed for performance in HPC settings—what value does a fast but incorrect calculation have? While this trade-off and its misuse manifest themselves in many data science workloads, let us consider Stochastic Gradient Descent (SGD) as an example in the following.

SGD and its variants form the basis of much of deep learning training today. Supervised SGD training works on examples that are sampled from the input and output sets of the true function to be learned. The training process adapts weight terms in a fixed computational structure (e.g., a neural network) to approximate the true function represented by those examples. SGD proceeds in an iterative process, where a subset of examples is used to calculate an averaged update for the weights in each iteration. This set is usually called the "minibatch", and its size determines the quality of the overall algorithm—intuitively, if it is too small, the updates can be very noisy for complex functions or inexact examples, and if the set is too large, nuances of the function to be learned could be lost in the averaging. Thus, the size of the minibatch, a simple algorithmic hyperparameter, is crucial for the accuracy of the resulting model.
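To illustrate the role of the minibatch size, here is a minimal, framework-free sketch of minibatch SGD for a linear model; the toy data, learning rate, and batch size are assumptions chosen for brevity, not values from the article.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.01, epochs=10, seed=0):
    """Plain minibatch SGD for a linear model y ~ X @ w.
    Each iteration averages the gradient over `batch_size` examples; the batch
    size trades noise in the update against averaging away detail."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # averaged update
            w -= lr * grad
    return w

# Toy data: 1,000 examples, 5 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)
print(minibatch_sgd(X, y, batch_size=32))
```

Increasing batch_size makes each update cheaper to parallelize, but, as discussed next, scaling it too far changes the statistics of the optimization.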
The simplest way to parallelize training in deep learning is to replicate the full model and its weights and train on different parallel processors. Then, each processor computes the weight updates for a set of examples in parallel. This is possible because weights are only updated after processing a minibatch, so if there are E examples in a minibatch, one can employ up to E parallel processors with this technique. However, for efficiency reasons, one would need more than a single example per processor to re-use the weights, or in HPC language, turn a set of Matrix-Vector (BLAS Level 2) operations into Matrix-Matrix (BLAS Level 3) operations. The per-processor set of examples is often called the "microbatch". If we now have P processors and a microbatch size of M, we would need E ≥ M · P for our minibatch size.

As mentioned before, E is limited by statistical properties of the data, and choosing E too large may negatively affect convergence [7]. Yet, if you want to set a speed record on a large-scale parallel computer, you would scale E to tens of thousands or more! This has been a typical issue in the early days and can still be seen regularly in practice.

Similarly, "weak scaling", i.e., keeping the microbatch size per processor constant while increasing the number of processors, will change the statistics. If one keeps the number of epochs constant, it will reduce the number of update steps, which may be most relevant for training [8]. In the end, most learning workloads are strong-scaling problems, as the model size is constant and the set of examples is typically constant as well. Another very related technique for floptimization that is even more common is running hundreds of ensembles (that may later be averaged) to show computational performance and (in this case trivial) scaling. However, those additional ensembles may not improve the quality enough to rationalize the investment.

In general, one needs to carefully study admissible batch sizes [9] or use specialized methods to tune the optimization algorithm to support larger batch sizes for data parallelism [10]. One could also achieve higher parallelism with synchronous methods that do not change the statistics [11], [12].

When applying a technique that changes the calculation statistics, carefully consider its impact on the quality of the result.

#2 Trade convergence for performance
«Do not worry if your performance optimization slows convergence! Simply report the time per iteration!»

Machine learning workloads can be surprisingly robust and can work with substantial inaccuracies during the calculation as long as the overall statistics are maintained. This means that one can discard much of the calculation as long as, in expectation (eventually), the statistical properties of the calculation are maintained. Examples include stochastic rounding for low-bit datatypes and top-k gradient methods where we send only the largest gradient values while accumulating the discarded gradients locally in data-parallel training [13], [14]. However, such methods usually slow down convergence of the model optimization and thus often require many more iterations to reach the same accuracy as exact calculations. If you save 50% of the compute time per iteration but need 4x more iterations, then you are 2x slower in the end. Thus, one needs to be very careful when influencing convergence rates through approximation techniques!

Many works consider per-iteration times and rarely analyze convergence. Even if they analyze it, results are often presented separately, such as "we sped up iterations by 2x" in the case above.

For any optimization that may slow convergence, analyze and measure the resulting convergence rate for representative examples. Or simply always report the total runtime to find the final model.
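The 50%-per-iteration versus 4x-more-iterations example above is easy to misread when only one of the two numbers is reported; a few lines of bookkeeping make the end-to-end comparison explicit (the numbers are the illustrative ones from the text, not measurements).

```python
# Compare end-to-end training time, not per-iteration time.
baseline_time_per_iter = 1.0   # arbitrary time unit
baseline_iters = 10_000        # iterations to reach the target accuracy

optimized_time_per_iter = 0.5 * baseline_time_per_iter  # "50% faster per iteration"
optimized_iters = 4 * baseline_iters                    # but 4x more iterations to converge

baseline_total = baseline_time_per_iter * baseline_iters
optimized_total = optimized_time_per_iter * optimized_iters
print(f"speedup per iteration: {baseline_time_per_iter / optimized_time_per_iter:.1f}x")
print(f"end-to-end 'speedup':  {baseline_total / optimized_total:.2f}x")  # 0.50x, i.e., 2x slower
```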
#3 Do not consider generalization accuracy
«Train the biggest model to minimize training loss for highest performance.»

Large models often allow more efficient use of accelerators and parallelism. But model size is not always a guarantee for quality, as large models can simply store all presented examples but still fail to approximate well the true function they are drawn from. This phenomenon is called "overfitting" and is a well-understood danger in data science. Yet, when presenting performance numbers, it is regularly overlooked.

For example, larger batch sizes required for highly-parallel training can reduce the generalization accuracy while achieving low training error, leading to a "generalization gap" [7]. Generalization can be improved by careful tuning of the training dynamics through hyperparameters such as adapting the learning rate during training or increasing the number of iterations [8].

Thus, even when mainly comparing performance, we need to test and report generalization (sometimes called test) accuracy. In practice, this is done by evaluating the model on unseen examples from the same distribution. This may require careful tuning to achieve good generalization and shows again that data science and performance engineering must be combined. Since hyperparameters are important, we must make sure to document and share them for reproducibility; ideally, share the whole code and training setup.

Always measure and report generalization accuracy after performance optimizations.
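As a minimal sketch of what "measure generalization" means in practice, the following holds out examples the model never sees during training and reports accuracy on both splits; the 1-nearest-neighbor model (which literally memorizes every training example), the synthetic data, and the 80/20 split are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.8 * rng.normal(size=500) > 0).astype(int)   # noisy labels
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]  # hold out 20%

def predict_1nn(X_query):
    """A 1-nearest-neighbor 'model' that stores every training example,
    the extreme case of memorizing instead of generalizing."""
    d = ((X_query[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    return y_tr[d.argmin(axis=1)]

print("training accuracy:", (predict_1nn(X_tr) == y_tr).mean())  # 1.0 by construction
print("test accuracy:    ", (predict_1nn(X_te) == y_te).mean())  # noticeably lower
```

Reporting only the first number would make the memorizing model look perfect; only the held-out number reveals how well it approximates the underlying function.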
#4 Do not report hyperparameter tuning cost
«Hyperparameter tuning can be expensive, so do not talk about it when reporting costs or runtimes!»

It may not be practical if one has to train a model 20 times in order to find the parameters that make it 10% faster on some hardware. Also, why would we need to train the same model 20 times? However, this is quite often done in practice when reporting performance in both science and industry, specifically when tuning for errors and error patterns coming from the computer itself, such as inherent inaccuracies in analog devices, shortcuts in rounding modes, or quantization errors in low-bitwidth datatypes. Some of these errors may even be characteristic of each individual device.

#5 Report highest (exa)op/s rates
«Floptimization is about reporting the highest flop/s rates! Thus use the smallest datatypes—after all, your laptop can do 10^18 bit-flip (exa-)ops per second! Maybe you can even get away with reporting flop/s that you never do?»

Many machine learning workloads allow aggressive optimizations for low-bitwidth data representations. Reducing the number of bits in the number representation can lead to substantial speedups because energy and silicon area for integer multiplication shrink quadratically with the bitwidth. Furthermore, the required memory bandwidth and storage shrink linearly with the bitwidth. Low-width 4-8 bit integers are commonplace in inference and 4-16 bit floating point numbers in training. While low-precision datatypes are very effective in deep learning, they often form a trade-off between accuracy and performance: very low precision quantization loses accuracy, which should always be analyzed and reported. Some company brochures even name datatypes as 32 bits but in fact some bits are simply discarded during the calculation.

Sparse computations omit zero values during the computation or data loading—gaining performance benefits for compute or bandwidth. However, some vendors count operations that are never performed. Sometimes, but not always, such practice can be identified by the term "effective operations". While managing sparsity often requires significant additional resources (storage and compute), those can hardly be counted as arithmetic operations.

Clearly specify the bitwidth of the operations performed and only count the actual operations that are computed. Most often, a mix of different precisions (bitwidths) is used for machine learning. In such cases, we can specify the relative proportions, e.g., "we achieved 1 exaflop/s with 85% 8-bit, 10% 16-bit, and 5% 32-bit floating point operations".
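A small sketch of the bookkeeping this recommendation asks for: count only operations that are actually executed and report the precision mix explicitly. All operation counts, the runtime, and the sparsity value below are invented placeholders, not measurements.

```python
# Count only operations that are actually executed, broken down by bitwidth.
runtime_s = 2.0
ops_performed = {            # actually executed operations per precision
    "fp8":  8.5e11,
    "fp16": 1.0e11,
    "fp32": 0.5e11,
}

density = 0.25               # fraction of nonzeros in a sparse layer
dense_equivalent_ops = 4.0e11
sparse_ops_executed = density * dense_equivalent_ops  # only these may be counted
ops_performed["fp16"] += sparse_ops_executed           # not the 4.0e11 "effective ops"

total = sum(ops_performed.values())
print(f"sustained rate: {total / runtime_s:.3e} op/s over {runtime_s} s")
for dtype, ops in ops_performed.items():
    print(f"  {dtype}: {100 * ops / total:.1f}% of operations")
```

Counting the dense-equivalent 4.0e11 "effective operations" instead of the 1.0e11 that were actually executed would inflate the reported rate fourfold without any additional arithmetic being done.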
A general danger when accelerating computations is to focus too much on one specific part of the problem. In early deep learning accelerators, computations were clearly limited by matrix multiplication performance. However, reducing the datatype bitwidth made the basic multiplications quadratically cheaper [15] while the memory bandwidth cost was only reduced linearly. Specialized acceleration units such as Tensor Cores lowered the multiplication overhead further. After accelerating those workloads by more than 10x, the bottleneck shifts to other aspects such as data movement. For transformer networks on modern hardware, 99.8% of the floating point operations take only 61% of the time, while the remaining 0.2% are data-movement bound and take 39% of the time [16]. This demonstrates that it can be very misleading to only show the best matrix multiply unit or to report only operation counts. The same idea is of course true for any subset kernel selection!

Always consider a complete problem when benchmarking and showing performance results. When analyzing specific kernels, put them into proportion to the overall workload.

#7 Optimize only for one network
«If you carefully tuned your experiment, code, and architecture to a specific problem, then make sure to only talk about this!»

This is a standard issue similar to #6 but one abstraction level higher. Any compute system is developed with a (set of) specific use case(s) in mind. However, the workloads usually have some variability, and the compute system will be used for a variety of tasks. Thus, it needs to perform reasonably well for many scenarios, and an honest analysis should test a variety of such scenarios. The closest comparison in the HPC space is with systems that were solely designed for a good top-500 score (solving a single large system of linear equations with Gaussian elimination). This is not necessarily representative of modern HPC workloads, but high top-500 rankings still make a good selling argument. In machine learning, specific networks such as ResNet or BERT begin to play a similar role: how fast can I train a ResNet-50 on ImageNet or BERT on a specific language corpus to state-of-the-art accuracy? Such benchmarks are extremely useful and foster reproducibility and interpretability, but we must be careful not to overfit the design to them.

Many of those examples are instances of Goodhart's law, which roughly states that when a benchmark becomes an optimization target, it loses its value as a benchmark!

Always analyze and present a carefully selected set of workloads covering the full workload space of interest.

#8 Compare outdated/general purpose HW to specialized HW
«Modern accelerators show highest speedups against old hardware, so make sure to compare to the oldest machine you can find!»

This is another classic problem: many works compare aged General Purpose Graphics Processing Units (GPGPUs) with their shiny new ML accelerator, or even a Central Processing Unit (CPU) that was never meant for specialized computation. The resulting huge speedups have very little meaning. Similar problems arise when comparing completely different architectures, for example Field Programmable Gate Arrays (FPGAs) and GPGPUs.

For comparing different accelerator types, one should ensure that they are manufactured in a similar silicon process with a similar die size, design power, and cost. If accelerators are made in fundamentally different processes and/or die sizes, then one can scale the performance numbers by the difference in silicon efficiency. In any case, the exact comparison points need to be documented carefully.

Ensure a fair comparison for different hardware by selecting devices of equal cost and age or scaling accordingly.
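As a sketch of the kind of normalization #8 asks for, the following scales measured throughput by die area and power before comparing two devices; the device names and every number are invented placeholders, and a real comparison should also document the process node, cost, and software versions.

```python
# Normalize measured throughput before comparing devices built very differently.
# Every number here is an invented placeholder, not a benchmark result.
devices = {
    "accelerator_A": {"images_per_s": 12000, "die_mm2": 800, "tdp_w": 400},
    "gpu_B":         {"images_per_s":  9000, "die_mm2": 600, "tdp_w": 300},
}

for name, d in devices.items():
    per_area = d["images_per_s"] / d["die_mm2"]   # throughput per mm^2 of silicon
    per_watt = d["images_per_s"] / d["tdp_w"]     # throughput per watt
    print(f"{name:14s} raw: {d['images_per_s']:6d} img/s | "
          f"per mm^2: {per_area:5.1f} | per W: {per_watt:5.1f}")
```

In this made-up example, the 1.3x raw advantage disappears once silicon area and power are taken into account.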
#9 Don't worry about inference costs
«If you want to show quickest time to some accuracy, use a large model!»

OpenAI showed that increasing the model size and stopping training before convergence achieves the same accuracy as more expensive training of smaller models [17]. While this sounds great from a training performance perspective, the main goal of learning is to later use the model in an inference setting. It may make sense to train large models and then distill them [18], but those additional costs must be included in the overall analysis.

For practical ML workloads, the center of attention should lie on inference efficiency because inference computations will dominate during the lifetime of most models.

#10 Do not consider reading the data
«When measuring performance numbers, make sure all needed data is already loaded into main memory!»

ML models are usually trained on large amounts of data, and reading this data can be a substantial bottleneck. After all, the deep learning revolution is fueled by algorithms, data, and compute. Thus, for each iteration, data needs to be loaded from storage, converted, augmented, and computed by the model. Many toolchains exist for the data input pipeline, but Input/Output (I/O) is often ignored in performance experiments.

The data is not always coming from storage—examples for reinforcement learning environments often come from a simulation process executed on CPUs. Running those simulations and transmitting the data to the training accelerators can quickly become a bottleneck in highly optimized learning processes. One general systems design issue is to have enough external bandwidth for I/O into the training system.

Consider the whole system pipeline when analyzing performance of machine learning workloads, including storage access and other data sources.
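A minimal sketch of timing a training step end to end so that the input pipeline is not silently excluded; the batch shape and the stand-in load and compute functions are assumptions for illustration only.

```python
import time
import numpy as np

def load_batch(batch=256):
    """Stand-in for the input pipeline: read, decode, and augment a batch.
    Simulated here with random data; a real pipeline touches storage or a simulator."""
    return np.random.default_rng().normal(size=(batch, 224, 224, 3)).astype(np.float32)

def train_step(x):
    """Stand-in for the forward/backward pass of the model."""
    return float((x * 1.0001 + 0.1).sum())

t0 = time.perf_counter()
x = load_batch()
t1 = time.perf_counter()
train_step(x)
t2 = time.perf_counter()

print(f"data pipeline: {t1 - t0:.3f} s, compute: {t2 - t1:.3f} s "
      f"({100 * (t1 - t0) / (t2 - t0):.0f}% of the step spent on input)")
```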
#11 Train on unreasonably large examples
«Larger computations are simpler to parallelize and achieve higher flop rates, so make sure to choose the largest inputs for highest performance!»

Sensors often return high-resolution data. For example, modern commodity camera sensors record tens of megapixels, but those are rarely needed for typical object detection or classification tasks. Following this observation, the ImageNet benchmark scales input images down by orders of magnitude for computational efficiency. This input feature compression and selection is important; otherwise, the network would need to learn the compression, wasting valuable compute cycles and weight storage. Thus, one needs to carefully analyze how to format the input data to the machine learning model. This can be as important as the design of the model and the optimization algorithm itself.

Always define and consider input sizes, format, and transformations carefully in performance analysis of learning systems.

#12 Choose your comparison points well
«Make sure to only compare performance or accuracy if your model is unlikely to win in both categories!»

As machine learning is often about an accuracy-performance trade-off, it is necessary to look at both in tandem. Pareto optimality is a useful measure for comparing methods. Any method that is not dominated by the Pareto front may be useful in practice.

Another issue is how to present relative accuracy. Since approximation methods usually reduce accuracy, it is natural to report closeness to the state-of-the-art result. Often, this is reported with phrases like "we achieve 99% of the baseline performance". However, what is meant by this phrase depends on the exact interpretation. Let us assume a 95% state-of-the-art accuracy for some task. This means that the model function outputs the same as the true function in 9,500 out of 10,000 examples. Some interpret 99% of baseline as 9,405 correct examples, while others (falsely?) assume they can lose an absolute 1%, going down to 9,400 correct examples. A second, more stringent, interpretation could consider the incorrect results: to be within 99% with respect to the incorrect results means to increase them by no more than 1%. In this case, we only allow 5 more incorrect examples, i.e., the model would need to classify 9,495 examples correctly. It is thus quite important to be precise when defining relative accuracy to state-of-the-art results.
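The three readings of "99% of the baseline" described above differ by almost a hundred examples on this 10,000-example task; a few lines make the arithmetic explicit (the 95% baseline and the test-set size are the values from the example in the text).

```python
n_examples = 10_000
baseline_correct = 9_500          # 95% state-of-the-art accuracy

# Interpretation 1: keep 99% of the baseline's *correct* examples.
relative_correct = round(0.99 * baseline_correct)             # 9,405

# Interpretation 2: give up an *absolute* 1% of accuracy.
absolute_drop = baseline_correct - round(0.01 * n_examples)   # 9,400

# Interpretation 3: increase the *incorrect* examples by at most 1%.
baseline_incorrect = n_examples - baseline_correct
strict = n_examples - round(1.01 * baseline_incorrect)        # 9,495

print(relative_correct, absolute_drop, strict)   # 9405 9400 9495
```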
Always present both accuracy and performance and be precise when defining relative accuracies.

Discussion
We present twelve ways to sugarcoat performance results of data science workloads. Many of those adjust the trade-off between accuracy and performance in order to shed the best possible light on the performance of specific machine learning systems. Each of these ways points at a potential problem with setups for analyzing the performance of machine learning workloads. While we propose simple mitigations for each, we find that general principles emerge that can help to make performance analysis in this field transparent and reproducible. The most important principle here is documentation and transparency to enable interpretability of the results. This can ideally be achieved by sharing the whole experimental setup. We hope to spark a discussion and present a blueprint for sanity-checking results, reports, and presentations.

While we outline the interaction between data science and performance, we note that both fields have their own standards for scientific reproducibility (e.g., [4], [2]) that need to be considered! In addition, we recommend following general good scientific practice [19].

All in all, we aim for this work to contribute to the definition of a good benchmarking etiquette in the quickly growing field of "Systems for AI".

Acknowledgments
The core of this article was sparked at the IPAM workshop "HPC for Computationally and Data-Intensive Problems" organized by Joachim Buhmann, Jennifer Chayes, Vipin Kumar, Yann LeCun, and Tandy Warnow. There, I presented a preliminary version of the twelve rules during a spontaneous evening talk and thank all attendees, especially Yann, Vipin, and Joachim, for great comments and suggestions. The later refinement of the list was influenced by all my data science collaborators at SPCL, ETH Zurich, and Microsoft.

REFERENCES
1. V. Stodden, "Reproducing statistical results," Annual Review of Statistics and Its Application, vol. 2, no. 1, pp. 1–19, 2015.
2. T. Hoefler and R. Belli, "Scientific Benchmarking of Parallel Computing Systems," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), pp. 73:1–73:12, ACM, Nov. 2015.
3. D. H. Bailey, "Twelve ways to fool the masses when giving performance results on parallel computers," Supercomputing Review, pp. 54–55, Aug. 1991.
4. J. Pineau, P. Vincent-Lamarre, K. Sinha, V. Larivière, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and H. Larochelle, "Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program)," 2020.
5. A. Arteaga, O. Fuhrer, and T. Hoefler, "Designing Bit-Reproducible Portable High-Performance Applications," in Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS), IEEE Computer Society, Apr. 2014.
6. J. Demmel and H. D. Nguyen, "Parallel reproducible summation," IEEE Transactions on Computers, vol. 64, no. 7, pp. 2060–2070, 2015.
7. M. Li, T. Zhang, Y. Chen, and A. J. Smola, "Efficient mini-batch training for stochastic optimization," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pp. 661–670, Association for Computing Machinery, New York, NY, USA, 2014.
8. E. Hoffer, I. Hubara, and D. Soudry, "Train longer, generalize better: closing the generalization gap in large batch training of neural networks," 2018.
9. S. McCandlish, J. Kaplan, D. Amodei, and OpenAI Dota Team, "An empirical model of large-batch training," 2018.
10. Y. You, I. Gitman, and B. Ginsburg, "Large batch training of convolutional networks," 2017.
11. T. Ben-Nun and T. Hoefler, "Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis," ACM Computing Surveys, vol. 52, pp. 65:1–65:43, Aug. 2019.
12. S. Li and T. Hoefler, "Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC21), ACM, Nov. 2021.
13. D. Alistarh, T. Hoefler, M. Johansson, S. Khirirat, N. Konstantinov, and C. Renggli, "The Convergence of Sparsified Gradient Methods," in Advances in Neural Information Processing Systems 31, Curran Associates, Inc., Dec. 2018.
14. C. Renggli, D. Alistarh, M. Aghagolzadeh, and T. Hoefler, "SparCML: High-Performance Sparse Communication for Machine Learning," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC19), Nov. 2019.
15. A. Karatsuba, "The complexity of computations," Proceedings of the Steklov Institute of Mathematics, vol. 211, pp. 169–, 1995.
16. A. Ivanov, N. Dryden, T. Ben-Nun, S. Li, and T. Hoefler, "Data Movement Is All You Need: A Case Study on Optimizing Transformers," in Proceedings of Machine Learning and Systems 3 (MLSys 2021), Apr. 2021.
17. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," 2020.
18. Z. Li, E. Wallace, S. Shen, K. Lin, K. Keutzer, D. Klein, and J. E. Gonzalez, "Train large, then compress: Rethinking model size for efficient training and inference of transformers," 2020.
19. National Academies of Sciences, Engineering, and Medicine, Reproducibility and Replicability in Science. National Academies Press, 2019.

Torsten Hoefler is a professor of computer science at ETH Zurich and received his PhD from Indiana University. Torsten was elected into the first steering committee of ACM's SIGHPC in 2013 and was re-elected in 2016 and 2019. He was the first European to receive many of those honors; he also received both an ERC Starting and an ERC Consolidator grant. His research interests revolve around the central topic of "Performance-centric System Design" and include scalable networks, parallel programming techniques, and performance modeling. Additional information about Torsten can be found on his homepage at http://htor.inf.ethz.ch.