but a necessity. It is of utmost importance that researchers, software, and hardware engineers compare performance and cost effectiveness of such architectures in reproducible and interpretable ways.

When designing systems, software, or algorithms for AI we must consider the intersection between the general fields of data science and systems engineering. In this paper, we discuss a set of fallacies that emerge from this intersection and we propose mitigations as well as a general methodology to improve reproducibility and interpretability in the quickly growing field of "Systems for AI".

Interpretability has long been a sore topic in data science—going back to Huff's now 67-year-old book "How to Lie with Statistics". While this book falls into the general category of data literacy (how to interpret data), it contains many foundational lessons that remain important for modern data science. The fundamentally hard problem of reproducibility and interpretability remains a hot topic today. Rarely a year passes without one or two keynotes at top-class machine learning conferences focusing on these topics. Ali Rahimi's test of time award talk at NeurIPS'17 provoked a significant debate by comparing the state of the art in machine learning with alchemy in the medieval ages.

Ensuring reproducibility in data science is hard in part because the algorithms are usually based on statistics, where we cannot (and do not want to) expect bit-wise reproducibility. The fundamentally stochastic nature of the computations requires a more complex model of reproducibility: instead of reproducing a specific result, we need to reproduce a distribution [1]. Thus, the measure of success is often defined as achieving a specific accuracy on a test set, e.g., classifying 95% of the examples correctly. Here, it does not matter which 95%—implying that many different models can be considered to reproduce a specific result. In fact, due to the complex hyperparameters that control the often nondeterministic learning algorithms, finding the exact same model in different training runs is a rarity, if not practically impossible.

Compute performance is similarly hard to reproduce because it also follows a stochastic process in complex systems. Even system availability (e.g., can I find a computer system from 10 years ago?) makes reproducing results hard and forces scientists and engineers towards interpretability [2]. In both cases—data science and compute performance—rigorous statistics are necessary to model either accuracy or compute performance. Similarly, in both cases, the fields are largely driven by empirical results. Combining rigorous benchmarking with rigorous data science is necessary to design, build, and understand next-generation artificial intelligence systems. For example, cost forces many to employ the most efficient (low-bitwidth) datatype for learning systems, but one must pay attention that the resulting lossy compression does not invalidate the results.
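To make the notion of reproducing a distribution (rather than a single number) concrete, here is a minimal Python sketch that trains the same toy model under several random seeds and reports the mean test accuracy with a confidence interval; the synthetic data, seed count, and model are illustrative assumptions, not values from this article.

```python
import numpy as np

def train_and_evaluate(seed: int) -> float:
    """Train a tiny logistic-regression model on synthetic data and return
    its accuracy on a held-out split (a stand-in for a real, nondeterministic
    training pipeline)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(1000, 20))
    true_w = rng.normal(size=20)
    y = (X @ true_w + 0.5 * rng.normal(size=1000) > 0).astype(float)
    X_train, y_train, X_test, y_test = X[:800], y[:800], X[800:], y[800:]

    w = np.zeros(20)
    for _ in range(200):  # plain gradient descent on the logistic loss
        p = 1.0 / (1.0 + np.exp(-(X_train @ w)))
        w -= 0.1 * X_train.T @ (p - y_train) / len(y_train)
    return float(((X_test @ w > 0) == y_test).mean())

# Reproduce a *distribution*, not a single number: repeat under many seeds.
accs = np.array([train_and_evaluate(s) for s in range(10)])
ci = 1.96 * accs.std(ddof=1) / np.sqrt(len(accs))
print(f"test accuracy: {accs.mean():.3f} +/- {ci:.3f} (95% CI, n={len(accs)})")
```

A result reported this way can be checked by rerunning with fresh seeds and testing whether the new mean falls within the reported interval, instead of demanding a bit-identical number.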
The rush for good results
Pressure and competition in the field are fierce—numerous startups, vendors, and large cloud providers compete for the expected trillion-dollar business. Efficiency is key because many tasks, especially grand challenges such as more general intelligence, benefit from larger models. Thus, specialized parallel high-performance architectures are most important to train such large models. At the same time, models must be usable in practice, which limits the cost of training as well as inference.

Startups are fighting for their place and often take extreme positions, such as systems without large main memories or wafer-scale computing, to claim their 2-30x benefit over all other solutions. Those approaches push the system balance into extreme regimes, which may be applicable to the workload but need to be carefully analyzed. One big danger is to focus only on a specific aspect or metric of the system. For example, in many High-Performance Computing (HPC) environments, the floating point rate (flop/s) is seen as the measure of performance. This leads to many studies focusing on «floptimization»¹ and not sustained performance. Many researchers are in a similar position because competition to disseminate new ideas is fierce. With dozens of papers appearing each week on arXiv alone, fast publication of new results is imperative. In such high-pressure environments, it is most important to avoid pitfalls in performance optimization and analysis of data science workloads.

¹ We use «text» to indicate irony in the quoted text.

Life at the intersection
Performance optimization, measurement, and reproducibility have been topics in the HPC community for decades. Sets of rules and best/worst practices exist for performance measurements [3], [2], and major HPC conferences have launched serious reproducibility initiatives.

Similarly, the data science community has established a set of rules for ensuring result reproducibility [4] that have been endorsed by top ML conferences. Yet, the two fields are fundamentally different—the HPC field focuses on both results and performance, while today's machine learning field predominantly targets results. Arguably, the result of HPC simulations can often be defined in a bit-reproducible manner [5], [6], while ML results are usually of a stochastic nature, requiring exact experimental setups to be shared [4].

Both communities have focused on their specific fields in isolation—however, the fields are starting to mix in productive ways. This gives rise to machine learning tracks at traditional HPC conferences and systems tracks at traditional ML conferences, or even new conferences aimed at the intersection, such as MLSys.

Our paper aims to bridge the silos between ML and HPC by identifying a set of fallacies in performance analysis and system design for data science workloads. These fallacies and the resulting guidelines are useful for practitioners, system vendors, scientists, and end-users alike. We hope to establish some form of benchmarking etiquette for data science workloads. Vendors and scientists can check their own messages for violations of the etiquette while users can quickly identify the right questions to ask.

Twelve ways to fool the masses
We now identify twelve different fallacies that we continue to observe regularly in the wild, such as in vendor white papers, demos, and product specifications, but also in many scientific papers, talks, and even prestigious award presentations. The methodology and style are inspired by Bailey's classical "twelve ways to fool the masses" [3]; we summarize each humorously, explain it in detail, and then follow with a recommendation. We believe that each of our recommendations contributes to establishing a good performance benchmarking etiquette in data science.

#1 Scale computations at all cost
«You should adjust your system to yield the highest possible performance. Forget about incidentals such as convergence or accuracy!»

The biggest peculiarity of data science workloads is that they enable a trade-off between accuracy and performance (implementation) aspects of the execution. While it is necessary to consider this trade-off, it can be very tempting to navigate it to one extreme and focus on performance at all cost. We have observed that accuracy was sacrificed for performance in HPC settings—what value does a fast but incorrect calculation have? While this trade-off and its misuse manifest themselves in many data science workloads, let us consider Stochastic Gradient Descent (SGD) as an example in the following.

SGD and its variants form the basis of much of deep learning training today. Supervised SGD training works on examples that are sampled from the input and output sets of the true function to be learned. The training process adapts weight terms in a fixed computational structure (e.g., a neural network) to approximate the true function represented by those examples. SGD proceeds in an iterative process, where a subset of examples is used to calculate an averaged update for the weights in each iteration. This set is usually called the "minibatch", and its size determines the quality of the overall algorithm—intuitively, if it is too small, the updates can be very noisy for complex functions or inexact examples, and if the set is too large, nuances of the function to be learned could be lost in the averaging. Thus, the size of the minibatch, a simple algorithmic hyperparameter, is crucial for the accuracy of the resulting model.
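To illustrate the role of the minibatch size, here is a minimal, framework-free sketch of minibatch SGD for a linear model; the toy data, learning rate, and batch size are assumptions chosen for brevity, not values from the article.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.01, epochs=10, seed=0):
    """Plain minibatch SGD for a linear model y ~ X @ w.
    Each iteration averages the gradient over `batch_size` examples; the batch
    size trades noise in the update against averaging away detail."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # averaged update
            w -= lr * grad
    return w

# Toy data: 1,000 examples, 5 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)
print(minibatch_sgd(X, y, batch_size=32))
```

Increasing batch_size makes each update cheaper to parallelize, but, as discussed next, scaling it too far changes the statistics of the optimization.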
The simplest way to parallelize training in deep learning is to replicate the full model and its weights and train on different parallel processors. Then, each processor computes the weight updates for a set of examples in parallel. This is possible because weights are only updated after processing a minibatch, so if there are E examples in a minibatch, one can employ up to E parallel processors with this technique. However, for efficiency reasons, one would need more than a single example per processor to re-use the weights, or in HPC language, turn a set of Matrix-Vector (BLAS Level 2) operations into Matrix-Matrix (BLAS Level 3) operations. The per-processor set of examples is often called the "microbatch". If we now have P processors and a microbatch size of M, we would need E ≥ M · P for our minibatch size.

As mentioned before, E is limited by statistical properties of the data, and choosing E too large may negatively affect convergence [7]. Yet, if you want to set a speed record on a large-scale parallel computer, you would scale E to tens of thousands or more! This has been a typical issue in the early days and can still be seen regularly in practice.

Similarly, "weak scaling", i.e., keeping the microbatch size per processor constant while increasing the number of processors, will change the statistics. If one keeps the number of epochs constant, it will reduce the number of update steps, which may be most relevant for training [8]. In the end, most learning workloads are strong-scaling problems, as the model size is constant and the set of examples is typically constant as well. Another very related technique for floptimization that is even more common is running hundreds of ensembles (that may later be averaged) to show computational performance and (in this case trivial) scaling. However, those additional ensembles may not improve the quality enough to rationalize the investment.

In general, one needs to carefully study admissible batch sizes [9] or use specialized methods to tune the optimization algorithm to support larger batch sizes for data parallelism [10]. One could also achieve higher parallelism with synchronous methods that do not change the statistics [11], [12].

When applying a technique that changes the calculation statistics, carefully consider its impact on the quality of the result.

#2 Trade convergence for performance
«Do not worry if your performance optimization slows convergence! Simply report the time per iteration!»

Machine learning workloads can be surprisingly robust and can work with substantial inaccuracies during the calculation as long as the overall statistics are maintained. This means that one can discard much of the calculation as long as, in expectation (eventually), the statistical properties of the calculation are maintained. Examples include stochastic rounding for low-bit datatypes and top-k gradient methods where we send only the largest gradient values while accumulating the discarded gradients locally in data-parallel training [13], [14]. However, such methods usually slow down convergence of the model optimization and thus often require many more iterations to reach the same accuracy as exact calculations. If you save 50% of the compute time per iteration but need 4x more iterations, then you are 2x slower in the end. Thus, one needs to be very careful when influencing convergence rates through approximation techniques!

Many works consider per-iteration times and rarely analyze convergence. Even if they analyze it, results are often presented separately, such as "we sped up iterations by 2x" in the case above.

For any optimization that may slow convergence, analyze and measure the resulting convergence rate for representative examples. Or simply always report the total runtime to find the final model.
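The 50%-per-iteration versus 4x-more-iterations example above is easy to misread when only one of the two numbers is reported; a few lines of bookkeeping make the end-to-end comparison explicit (the numbers are the illustrative ones from the text, not measurements).

```python
# Compare end-to-end training time, not per-iteration time.
baseline_time_per_iter = 1.0   # arbitrary time unit
baseline_iters = 10_000        # iterations to reach the target accuracy

optimized_time_per_iter = 0.5 * baseline_time_per_iter  # "50% faster per iteration"
optimized_iters = 4 * baseline_iters                    # but 4x more iterations to converge

baseline_total = baseline_time_per_iter * baseline_iters
optimized_total = optimized_time_per_iter * optimized_iters
print(f"speedup per iteration: {baseline_time_per_iter / optimized_time_per_iter:.1f}x")
print(f"end-to-end 'speedup':  {baseline_total / optimized_total:.2f}x")  # 0.50x, i.e., 2x slower
```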
#3 Do not consider generalization accuracy
«Train the biggest model to minimize training loss for highest performance.»

Large models often allow more efficient use of accelerators and parallelism. But model size is not always a guarantee for quality, as large models can simply store all presented examples but still fail to approximate well the true function they are drawn from. This phenomenon is called "overfitting" and is a well-understood danger in data science. Yet, when presenting performance numbers, it is regularly overlooked.

For example, larger batch sizes required for highly-parallel training can reduce the generalization accuracy while achieving low training error, leading to a "generalization gap" [7]. Generalization can be improved by careful tuning of the training dynamics through hyperparameters such as adapting the learning rate during training or increasing the number of iterations [8].

Thus, even when mainly comparing performance, we need to test and report generalization (sometimes called test) accuracy. In practice, this is done by evaluating the model on unseen examples from the same distribution. This may require careful tuning to achieve good generalization and shows again that data science and performance engineering must be combined. Since hyperparameters are important, we must make sure to document and share them for reproducibility; ideally, share the whole code and training setup.

Always measure and report generalization accuracy after performance optimizations.
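As a minimal sketch of what "measure generalization" means in practice, the following holds out examples the model never sees during training and reports accuracy on both splits; the 1-nearest-neighbor model (which literally memorizes every training example), the synthetic data, and the 80/20 split are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.8 * rng.normal(size=500) > 0).astype(int)   # noisy labels
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]  # hold out 20%

def predict_1nn(X_query):
    """A 1-nearest-neighbor 'model' that stores every training example,
    the extreme case of memorizing instead of generalizing."""
    d = ((X_query[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    return y_tr[d.argmin(axis=1)]

print("training accuracy:", (predict_1nn(X_tr) == y_tr).mean())  # 1.0 by construction
print("test accuracy:    ", (predict_1nn(X_te) == y_te).mean())  # noticeably lower
```

Reporting only the first number would make the memorizing model look perfect; only the held-out number reveals how well it approximates the underlying function.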
#4 Do not report hyperparameter tuning cost
«Hyperparameter tuning can be expensive, so do not talk about it when reporting costs or runtimes!»

It may not be practical if one has to train a model 20 times in order to find the parameters that make it 10% faster on some hardware. Also, why would we need to train the same model 20 times? However, this is quite often done in practice when reporting performance in both science and industry, specifically when tuning for errors and error patterns coming from the computer itself, such as inherent inaccuracies in analog devices, shortcuts in rounding modes, or quantization errors in low-bitwidth datatypes. Some of these errors may even be characteristic of each individual device.

#5 Report highest (exa)op/s rates
«Floptimization is about reporting the highest flop/s rates! Thus use the smallest datatypes—after all, your laptop can do 10^18 bit-flip (exa-)ops per second! Maybe you can even get away with reporting flop/s that you never do?»

Many machine learning workloads allow aggressive optimizations for low-bitwidth data representations. Reducing the number of bits in the number representation can lead to substantial speedups because energy and silicon area for integer multiplication shrink quadratically with the bitwidth. Furthermore, the required memory bandwidth and storage shrink linearly with the bitwidth. Low-width 4-8 bit integers are commonplace in inference and 4-16 bit floating point numbers in training. While low-precision datatypes are very effective in deep learning, they often form a trade-off between accuracy and performance: very low precision quantization loses accuracy, which should always be analyzed and reported. Some company brochures even name datatypes as 32 bits but in fact some bits are simply discarded during the calculation.

Sparse computations omit zero values during the computation or data loading—gaining performance benefits for compute or bandwidth. However, some vendors count operations that are never performed. Sometimes, but not always, such practice can be identified by the term "effective operations". While managing sparsity often requires significant additional resources (storage and compute), those can hardly be counted as arithmetic operations.

Clearly specify the bitwidth of the operations performed and only count the actual operations that are computed. Most often, a mix of different precisions (bitwidths) is used for machine learning. In such cases, we can specify the relative proportions, e.g., "we achieved 1 exaflop/s with 85% 8-bit, 10% 16-bit, and 5% 32-bit floating point operations".
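A small sketch of the bookkeeping this recommendation asks for: count only operations that are actually executed and report the precision mix explicitly. All operation counts, the runtime, and the sparsity value below are invented placeholders, not measurements.

```python
# Count only operations that are actually executed, broken down by bitwidth.
runtime_s = 2.0
ops_performed = {            # actually executed operations per precision
    "fp8":  8.5e11,
    "fp16": 1.0e11,
    "fp32": 0.5e11,
}

density = 0.25               # fraction of nonzeros in a sparse layer
dense_equivalent_ops = 4.0e11
sparse_ops_executed = density * dense_equivalent_ops  # only these may be counted
ops_performed["fp16"] += sparse_ops_executed           # not the 4.0e11 "effective ops"

total = sum(ops_performed.values())
print(f"sustained rate: {total / runtime_s:.3e} op/s over {runtime_s} s")
for dtype, ops in ops_performed.items():
    print(f"  {dtype}: {100 * ops / total:.1f}% of operations")
```

Counting the dense-equivalent 4.0e11 "effective operations" instead of the 1.0e11 that were actually executed would inflate the reported rate fourfold without any additional arithmetic being done.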
A general danger when accelerating computations is to focus too much on one specific part of the problem. In early deep learning accelerators, computations were clearly limited by matrix multiplication performance. However, reducing the datatype bitwidth made the basic multiplications quadratically cheaper [15] while the memory bandwidth cost was only reduced linearly. Specialized acceleration units such as Tensor Cores lowered the multiplication overhead further. After accelerating those workloads by more than 10x, the bottleneck shifts to other aspects such as data movement. For transformer networks on modern hardware, 99.8% of the floating point operations take only 61% of the time, while the remaining 0.2% are data-movement bound and take 39% of the time [16]. This demonstrates that it can be very misleading to only show the best matrix multiply unit or to report only operation counts. The same idea is of course true for any subset kernel selection!

Always consider a complete problem when benchmarking and showing performance results. When analyzing specific kernels, put them into proportion to the overall workload.

#7 Optimize only for one network
«If you carefully tuned your experiment, code, and architecture to a specific problem, then make sure to only talk about this!»

This is a standard issue similar to #6 but one abstraction level higher. Any compute system is developed with a (set of) specific use case(s) in mind. However, the workloads usually have some variability, and the compute system will be used for a variety of tasks. Thus, it needs to perform reasonably well for many scenarios, and an honest analysis should test a variety of such scenarios. The closest comparison in the HPC space is with systems that were solely designed for a good top-500 score (solving a single large system of linear equations with Gaussian elimination). This is not necessarily representative of modern HPC workloads, but high top-500 rankings still make a good selling argument. In machine learning, specific networks such as ResNet or BERT begin to play a similar role: how fast can I train a ResNet-50 on ImageNet or BERT on a specific language corpus to state-of-the-art accuracy? Such benchmarks are extremely useful and foster reproducibility and interpretability, but we must be careful not to overfit the design to them.

Many of those examples are instances of Goodhart's law, which roughly states that when a benchmark becomes an optimization target, it loses its value as a benchmark!

Always analyze and present a carefully selected set of workloads covering the full workload space of interest.

#8 Compare outdated/general purpose HW to specialized HW
«Modern accelerators show highest speedups against old hardware, so make sure to compare to the oldest machine you can find!»

This is another classic problem: many works compare aged General Purpose Graphics Processing Units (GPGPUs) with their shiny new ML accelerator, or even a Central Processing Unit (CPU) that was never meant for specialized computation. The resulting huge speedups have very little meaning. Similar problems arise when comparing completely different architectures, for example Field Programmable Gate Arrays (FPGAs) and GPGPUs.

For comparing different accelerator types, one should ensure that they are manufactured in a similar silicon process with a similar die size, design power, and cost. If accelerators are made in fundamentally different processes and/or die sizes, then one can scale the performance numbers by the difference in silicon efficiency. In any case, the exact comparison points need to be documented carefully.

Ensure a fair comparison for different hardware by selecting devices of equal cost and age or scaling accordingly.
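As a sketch of the kind of normalization #8 asks for, the following scales measured throughput by die area and power before comparing two devices; the device names and every number are invented placeholders, and a real comparison should also document the process node, cost, and software versions.

```python
# Normalize measured throughput before comparing devices built very differently.
# Every number here is an invented placeholder, not a benchmark result.
devices = {
    "accelerator_A": {"images_per_s": 12000, "die_mm2": 800, "tdp_w": 400},
    "gpu_B":         {"images_per_s":  9000, "die_mm2": 600, "tdp_w": 300},
}

for name, d in devices.items():
    per_area = d["images_per_s"] / d["die_mm2"]   # throughput per mm^2 of silicon
    per_watt = d["images_per_s"] / d["tdp_w"]     # throughput per watt
    print(f"{name:14s} raw: {d['images_per_s']:6d} img/s | "
          f"per mm^2: {per_area:5.1f} | per W: {per_watt:5.1f}")
```

In this made-up example, the 1.3x raw advantage disappears once silicon area and power are taken into account.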
#9 Don't worry about inference costs
«If you want to show quickest time to some accuracy, use a large model!»

OpenAI showed that increasing the model size and stopping training before convergence achieves the same accuracy as more expensive training of smaller models [17]. While this sounds great from a training performance perspective, the main goal of learning is to later use the model in an inference setting. It may make sense to train large models and then distill them [18], but those additional costs must be included in the overall analysis.

For practical ML workloads, the center of attention should lie on inference efficiency because inference computations will dominate during the lifetime of most models.

#10 Do not consider reading the data
«When measuring performance numbers, make sure all needed data is already loaded into main memory!»

ML models are usually trained on large amounts of data, and reading this data can be a substantial bottleneck. After all, the deep learning revolution is fueled by algorithms, data, and compute. Thus, for each iteration, data needs to be loaded from storage, converted, augmented, and computed by the model. Many toolchains exist for the data input pipeline, but Input/Output (I/O) is often ignored in performance experiments.

The data is not always coming from storage—examples for reinforcement learning environments often come from a simulation process executed on CPUs. Running those simulations and transmitting the data to the training accelerators can quickly become a bottleneck in highly optimized learning processes. One general systems design issue is to have enough external bandwidth for I/O into the training system.

Consider the whole system pipeline when analyzing performance of machine learning workloads, including storage access and other data sources.
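A minimal sketch of timing a training step end to end so that the input pipeline is not silently excluded; the batch shape and the stand-in load and compute functions are assumptions for illustration only.

```python
import time
import numpy as np

def load_batch(batch=256):
    """Stand-in for the input pipeline: read, decode, and augment a batch.
    Simulated here with random data; a real pipeline touches storage or a simulator."""
    return np.random.default_rng().normal(size=(batch, 224, 224, 3)).astype(np.float32)

def train_step(x):
    """Stand-in for the forward/backward pass of the model."""
    return float((x * 1.0001 + 0.1).sum())

t0 = time.perf_counter()
x = load_batch()
t1 = time.perf_counter()
train_step(x)
t2 = time.perf_counter()

print(f"data pipeline: {t1 - t0:.3f} s, compute: {t2 - t1:.3f} s "
      f"({100 * (t1 - t0) / (t2 - t0):.0f}% of the step spent on input)")
```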
#11 Train on unreasonably large examples
«Larger computations are simpler to parallelize and achieve higher flop rates, so make sure to choose the largest inputs for highest performance!»

Sensors often return high-resolution data. For example, modern commodity camera sensors record tens of megapixels, but those are rarely needed for typical object detection or classification tasks. Following this observation, the ImageNet benchmark scales input images down by orders of magnitude for computational efficiency. This input feature compression and selection is important; otherwise, the network would need to learn the compression, wasting valuable compute cycles and weight storage. Thus, one needs to carefully analyze how to format the input data to the machine learning model. This can be as important as the design of the model and the optimization algorithm itself.

Always define and consider input sizes, format, and transformations carefully in performance analysis of learning systems.

#12 Choose your comparison points well
«Make sure to only compare performance or accuracy if your model is unlikely to win in both categories!»

As machine learning is often about an accuracy-performance trade-off, it is necessary to look at both in tandem. Pareto optimality is a useful measure for comparing methods. Any method that is not dominated by the Pareto front may be useful in practice.

Another issue is how to present relative accuracy. Since approximation methods usually reduce accuracy, it is natural to report closeness to the state-of-the-art result. Often, this is reported with phrases like "we achieve 99% of the baseline performance". However, what is meant by this phrase depends on the exact interpretation. Let us assume a 95% state-of-the-art accuracy for some task. This means that the model function outputs the same as the true function in 9,500 out of 10,000 examples. Some interpret 99% of baseline as 9,405 correct examples, while others (falsely?) assume they can lose an absolute 1%, going down to 9,400 correct examples. A second, more stringent, interpretation could consider the incorrect results: to be within 99% with respect to the incorrect results means to increase them by no more than 1%. In this case, we only allow 5 more incorrect examples, i.e., the model would need to classify 9,495 examples correctly. It is thus quite important to be precise when defining relative accuracy to state-of-the-art results.
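The three readings of "99% of the baseline" described above differ by almost a hundred examples on this 10,000-example task; a few lines make the arithmetic explicit (the 95% baseline and the test-set size are the values from the example in the text).

```python
n_examples = 10_000
baseline_correct = 9_500          # 95% state-of-the-art accuracy

# Interpretation 1: keep 99% of the baseline's *correct* examples.
relative_correct = round(0.99 * baseline_correct)             # 9,405

# Interpretation 2: give up an *absolute* 1% of accuracy.
absolute_drop = baseline_correct - round(0.01 * n_examples)   # 9,400

# Interpretation 3: increase the *incorrect* examples by at most 1%.
baseline_incorrect = n_examples - baseline_correct
strict = n_examples - round(1.01 * baseline_incorrect)        # 9,495

print(relative_correct, absolute_drop, strict)   # 9405 9400 9495
```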
Always present both accuracy and performance and be precise when defining relative accuracies.

Discussion
We present twelve ways to sugarcoat performance results of data science workloads. Many of those adjust the trade-off between accuracy and performance in order to shed the best possible light on the performance of specific machine learning systems. Each of these ways points at a potential problem with setups for analyzing the performance of machine learning workloads. While we propose simple mitigations for each, we find that general principles emerge that can help to make performance analysis in this field transparent and reproducible. The most important principle here is documentation and transparency to enable interpretability of the results. This can ideally be achieved by sharing the whole experimental setup. We hope to spark a discussion and present a blueprint for sanity-checking results, reports, and presentations.

While we outline the interaction between data science and performance, we note that both fields have their own standards for scientific reproducibility (e.g., [4], [2]) that need to be considered! In addition, we recommend following general good scientific practice [19].

All in all, we aim for this work to contribute to the definition of a good benchmarking etiquette in the quickly growing field of "Systems for AI".

Acknowledgments
The core of this article was sparked at the IPAM workshop "HPC for Computationally and Data-Intensive Problems" organized by Joachim Buhmann, Jennifer Chayes, Vipin Kumar, Yann LeCun, and Tandy Warnow. There, I presented a preliminary version of the twelve rules during a spontaneous evening talk and thank all attendees, especially Yann, Vipin, and Joachim, for great comments and suggestions. The later refinement of the list was influenced by all my data science collaborators at SPCL, ETH Zurich, and Microsoft.

REFERENCES
1. V. Stodden, "Reproducing statistical results," Annual Review of Statistics and Its Application, vol. 2, no. 1, pp. 1–19, 2015.
2. T. Hoefler and R. Belli, "Scientific Benchmarking of Parallel Computing Systems," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), pp. 73:1–73:12, ACM, Nov. 2015.
3. D. H. Bailey, "Twelve ways to fool the masses when giving performance results on parallel computers," Supercomputing Review, pp. 54–55, Aug. 1991.
4. J. Pineau, P. Vincent-Lamarre, K. Sinha, V. Larivière, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and H. Larochelle, "Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program)," 2020.
5. A. Arteaga, O. Fuhrer, and T. Hoefler, "Designing Bit-Reproducible Portable High-Performance Applications," in Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS), IEEE Computer Society, Apr. 2014.
6. J. Demmel and H. D. Nguyen, "Parallel reproducible summation," IEEE Transactions on Computers, vol. 64, no. 7, pp. 2060–2070, 2015.
7. M. Li, T. Zhang, Y. Chen, and A. J. Smola, "Efficient mini-batch training for stochastic optimization," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pp. 661–670, Association for Computing Machinery, New York, NY, USA, 2014.
8. E. Hoffer, I. Hubara, and D. Soudry, "Train longer, generalize better: closing the generalization gap in large batch training of neural networks," 2018.
9. S. McCandlish, J. Kaplan, D. Amodei, and OpenAI Dota Team, "An empirical model of large-batch training," 2018.
10. Y. You, I. Gitman, and B. Ginsburg, "Large batch training of convolutional networks," 2017.
11. T. Ben-Nun and T. Hoefler, "Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis," ACM Computing Surveys, vol. 52, pp. 65:1–65:43, Aug. 2019.
12. S. Li and T. Hoefler, "Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC21), ACM, Nov. 2021.
13. D. Alistarh, T. Hoefler, M. Johansson, S. Khirirat, N. Konstantinov, and C. Renggli, "The Convergence of Sparsified Gradient Methods," in Advances in Neural Information Processing Systems 31, Curran Associates, Inc., Dec. 2018.
14. C. Renggli, D. Alistarh, M. Aghagolzadeh, and T. Hoefler, "SparCML: High-Performance Sparse Communication for Machine Learning," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC19), Nov. 2019.
15. A. Karatsuba, "The complexity of computations," Proceedings of the Steklov Institute of Mathematics, vol. 211, pp. 169–, 1995.
16. A. Ivanov, N. Dryden, T. Ben-Nun, S. Li, and T. Hoefler, "Data Movement Is All You Need: A Case Study on Optimizing Transformers," in Proceedings of Machine Learning and Systems 3 (MLSys 2021), Apr. 2021.
17. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," 2020.
18. Z. Li, E. Wallace, S. Shen, K. Lin, K. Keutzer, D. Klein, and J. E. Gonzalez, "Train large, then compress: Rethinking model size for efficient training and inference of transformers," 2020.
19. National Academies of Sciences, Engineering, and Medicine, Reproducibility and Replicability in Science. National Academies Press, 2019.

Torsten Hoefler is a professor of computer science at ETH Zurich and received his PhD from Indiana University. Torsten was elected into the first steering committee of ACM's SIGHPC in 2013 and was re-elected in 2016 and 2019. He was the first European to receive many of those honors; he also received both an ERC Starting and an ERC Consolidator grant. His research interests revolve around the central topic of "Performance-centric System Design" and include scalable networks, parallel programming techniques, and performance modeling. Additional information about Torsten can be found on his homepage at http://htor.inf.ethz.ch.