Abstract
The past decade has seen a remarkable series of advances in machine learning, and in particular deep
learning approaches based on artificial neural networks, to improve our abilities to build more accurate
systems across a broad range of areas, including computer vision, speech recognition, language
translation, and natural language understanding tasks. This paper is a companion paper to a keynote talk
at the 2020 International Solid-State Circuits Conference (ISSCC) discussing some of the advances in
machine learning, and their implications on the kinds of computational devices we need to build,
especially in the post-Moore’s Law era. It also discusses some of the ways that machine learning may
be able to help with some aspects of the circuit design process. Finally, it provides a sketch of at
least one interesting direction towards much larger-scale multi-task models that are sparsely activated
and employ much more dynamic, example- and task-based routing than the machine learning models of
today.
Introduction
The past decade has seen a remarkable series of advances in machine learning (ML), and in particular
deep learning approaches based on artificial neural networks, to improve our abilities to build more
accurate systems across a broad range of areas [LeCun et al. 2015]. Major areas of significant advances
include computer vision [Krizhevsky et al. 2012, Szegedy et al. 2015, He et al. 2016, Real et al. 2017, Tan
and Le 2019], speech recognition [Hinton et al. 2012, Chan et al. 2016], language translation [Wu et al.
2016] and other natural language tasks [Collobert et al. 2011, Mikolov et al. 2013, Sutskever et al. 2014,
Shazeer et al. 2017, Vaswani et al. 2017, Devlin et al. 2018]. The machine learning research community
has also been able to train systems to accomplish some challenging tasks by learning from interacting
with environments, often using reinforcement learning, showing success and promising advances in areas
such as playing the game of Go [Silver et al. 2017], playing video games such as Atari games [Mnih et al.
2013, Mnih et al. 2015] and Starcraft [Vinyals et al. 2019], accomplishing robotics tasks such as
substantially improved grasping for unseen objects [Levine et al. 2016, Kalashnikov et al. 2018],
emulating observed human behavior [Sermanet et al. 2018], and navigating complex urban environments
using autonomous vehicles [Angelova et al. 2015, Bansal et al. 2018].
As an illustration of the dramatic progress in the field of computer vision, Figure 1 shows a graph of the
improvement over time for the Imagenet challenge, an annual contest run by Stanford University [Deng et
al. 2009] where contestants are given a training set of one million color images across 1000 categories,
and then use this data to train a model to generalize to an evaluation set of images across the same
categories. In 2010 and 2011, prior to the use of deep learning approaches in this contest, the winning
entrants used hand-engineered computer vision features and the top-5 error rate was above 25%. In
2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton used a deep neural network, commonly
referred to as “AlexNet”, to take first place in the contest with a major reduction in the top-5 error rate to
16% [Krizhevsky et al. 2012]. Their team was the only team that used a neural network in 2012. The
next year, the deep learning computer vision revolution was in full force, with the vast majority of entries
from teams using deep neural networks, and the winning error rate again dropped substantially, to 11.7%.
We know from a careful study that Andrej Karpathy performed that human error on this task is just above
5% if the human practices for ~20 hours, or 12% if a different person practices for just a few hours
[Karpathy 2014]. Over the course of the years 2011 to 2017, the winning Imagenet error rate dropped
sharply from 26% in 2011 to 2.3% in 2017.
These advances in fundamental areas like computer vision, speech recognition, language understanding,
and large-scale reinforcement learning have dramatic implications for many fields. We have seen a
steady series of results in many different fields of science and medicine by applying the basic research
results that have been generated over the past decade to these problem areas. Examples include
promising areas of medical imaging diagnostic tasks including for diabetic retinopathy [Gulshan et al.
2016, Krause et al. 2018], breast cancer pathology [Liu et al. 2017], lung cancer CT scan interpretation
[Ardila et al. 2019], and dermatology [Esteva et al. 2017]. Sequential prediction methods that are useful
for language translation also turn out to be useful for making accurate predictions for a variety of different
medically-relevant tasks from electronic medical records [Rajkomar et al. 2018]. These early signs point
the way for machine learning to have a significant impact across many areas of health and medical care
[Rajkomar et al. 2019, Esteva et al. 2019].
Other fields that have been improved by the use of deep learning-based approaches include quantum
chemistry [Gilmer et al. 2017], earthquake prediction [DeVries et al. 2018], flood forecasting [Nevo 2019],
genomics [Poplin et al. 2018], protein folding [Evans et al. 2018], high energy physics [Baldi et al. 2014],
and agriculture [Ramcharan et al. 2017].
With these significant advances, it is clear that the potential for ML to change many different fields of
endeavor is substantial.
It is perhaps unfortunate that, just as we have started to have enough computational performance to
tackle interesting real-world problems and the increased scale and applicability of machine learning has
created a dramatic thirst for additional computational resources to tackle larger problems, the computing
industry as a whole has experienced a dramatic slowdown in the year-over-year improvement of
general-purpose CPU performance. Figure 2 illustrates this slowdown: we have gone from doubling
general-purpose CPU performance every 1.5 years (1985 through 2003) or 2 years (2003 to 2010) to now
being in an era where general purpose CPU performance is expected to double only every 20 years
[Hennessy and Patterson 2017]. Figure 3 shows the dramatic surge in computational demands for some
important recent machine learning advances (note the logarithmic Y-axis, with the best-fit line showing a
doubling time in computational demand of 3.43 months for this select set of important ML research
results) [OpenAI 2018]. Figure 4 shows the dramatic surge in research output in the field of machine
learning and its applications, measured via the number of papers posted to the machine-learning-related
categories of Arxiv, a popular paper preprint hosting service, with more than 32 times as many papers
posted in 2018 as in 2009 (a growth rate of more than doubling every 2 years). There are now more than
100 research papers per day posted to Arxiv in the machine-learning-related subtopic areas, and this
growth shows no signs of slowing down.
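To put these growth rates in perspective, the figures above imply (a simple back-of-the-envelope
calculation) that the computational demand of the largest training runs grows by roughly an order of
magnitude per year, and that the growth in Arxiv machine learning papers corresponds to a doubling time
of under two years:

\[
2^{12/3.43} \approx 11.3\times \ \text{per year (compute)}, \qquad
\frac{9\ \text{years}}{\log_2 32} = 1.8\ \text{years per doubling (papers, 2009--2018)}.
\]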
Figure 2: Computing Performance in the Moore’s Law and the Post-Moore’s Law Periods
Machine-Learning-Specialized Hardware
In 2011 and 2012, a small team of researchers and system engineers at Google built an early distributed
system called DistBelief to enable parallel, distributed training of very large scale neural networks, using a
combination of model and data parallel training and asynchronous updates to the parameters of the
model by many different computational replicas [Dean et al. 2012]. This enabled us to train much larger
neural networks on substantially larger data sets and, by mid-2012, using DistBelief as an underlying
framework, we were seeing dramatically better accuracy for speech recognition [Hinton et al. 2012] and
image classification models [Le et al. 2012]. The serving of these models in demanding settings of
systems with hundreds of millions of users, though, was another matter, as the computational demands
were very large. One back-of-the-envelope calculation showed that deploying the deep neural network
system that was yielding significant word error rate improvements for our main speech recognition
system on CPU-based computational devices would require doubling the number of computers in Google
datacenters (with some bold-but-still-plausible assumptions about significantly increased usage due to the
improved accuracy). Even if this were economically reasonable, it would still take
significant time, as it would involve pouring concrete, striking arrangements for windmill farm contracts,
ordering and installing lots of computers, etc., and the speech system was just the tip of the iceberg in
terms of what we saw as the potential applications of neural networks to many of our core problems and
products. This thought exercise started to get us thinking about building specialized
hardware for neural networks, first for inference, and then later systems for both training and inference.
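To give a flavor of the asynchronous, data-parallel style of training described above, the sketch below (in
Python, with NumPy) has several worker replicas independently fetching a shared parameter vector,
computing gradients on their own data shards, and pushing updates back without waiting for one another.
This is only a minimal illustration of the general idea; the names (ParameterServer, worker) and the toy
linear-regression loss are invented for the example, and the actual DistBelief system described in [Dean et
al. 2012] is far more elaborate.

import threading
import numpy as np

class ParameterServer:
    """Holds the shared parameters; replicas push gradients asynchronously."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return self.w.copy()

    def push_gradient(self, grad):
        # Apply each (possibly stale) gradient as soon as it arrives.
        with self.lock:
            self.w -= self.lr * grad

def worker(ps, data_shard, steps=100):
    x, y = data_shard
    for _ in range(steps):
        w = ps.fetch()                          # read current (possibly stale) parameters
        grad = 2 * x.T @ (x @ w - y) / len(y)   # gradient of a least-squares loss
        ps.push_gradient(grad)                  # update without synchronizing with other replicas

# Toy data: four replicas, each with its own shard of a shared regression problem.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
shards = []
for _ in range(4):
    x = rng.normal(size=(256, 3))
    shards.append((x, x @ true_w + 0.01 * rng.normal(size=256)))

ps = ParameterServer(dim=3)
threads = [threading.Thread(target=worker, args=(ps, s)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(ps.fetch())   # converges to something close to true_w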
Why Does Specialized Hardware Make Sense for Deep Learning Models?
Deep learning models have three properties that make them different from many other kinds of more
general purpose computations. First, they are very tolerant of reduced-precision computations. Second,
the computations performed by most models are simply different compositions of a relatively small
handful of operations like matrix multiplies, vector operations, application of convolutional kernels, and
other dense linear algebra calculations [Vanhoucke et al. 2011]. Third, many of the mechanisms
developed over the past 40 years to enable general-purpose programs to run with high performance on
modern CPUs, such as branch predictors, speculative execution, hyperthreaded-execution processing
cores, and deep cache memory hierarchies and TLB subsystems are unnecessary for machine learning
computations. So, the opportunity exists to build computational hardware that is specialized for dense,
low-precision linear algebra, and not much else, but is still programmable at the level of specifying
programs as different compositions of mostly linear algebra-style operations. This confluence of
characteristics is not dissimilar from the observations that led to the development of specialized digital
signal processors (DSPs) for telecom applications starting in the 1980s
[en.wikipedia.org/wiki/Digital_signal_processor]. A key difference, though, is that because of the broad
applicability of deep learning to huge swaths of computational problems across many domains and fields
of endeavor, this hardware, despite its narrow set of supported operations, can be used for a wide variety
of important computations, rather than the more narrowly tailored uses of DSPs. Based on our thought
experiment about the dramatically increased computational demands of deep neural networks for some of
our high volume inference applications like speech recognition and image classification, we decided to
start an effort to design a series of accelerators called Tensor Processing Units for accelerating deep
learning inference and training. The first such system, called TPUv1, was a single-chip design targeting
inference acceleration [Jouppi et al. 2017].
For inference (after a model has been trained, and we want to apply the already-trained model to new
inputs in order to make predictions), 8-bit integer-only calculations have been shown to be sufficient for
many important models [Jouppi et al. 2017], with further widespread work going on in the research
community to push this boundary further using techniques such as even lower-precision weights and
methods that encourage sparsity of weights and/or activations.
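As a concrete, much-simplified illustration of 8-bit inference, the sketch below quantizes a weight matrix
and an activation vector with simple symmetric per-tensor scales, performs the matrix multiply in integer
arithmetic (accumulating in 32 bits, as 8-bit MAC hardware does), and rescales the result. Production
quantization schemes, including those used for TPUv1 workloads [Jouppi et al. 2017], are considerably
more sophisticated; this is only a sketch of the basic idea.

import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Map float values to signed integers using a single per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1                  # 127 for int8
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128)).astype(np.float32)   # stand-in for trained weights
a = rng.normal(size=128).astype(np.float32)         # stand-in for an input activation

qw, sw = quantize_symmetric(w)
qa, sa = quantize_symmetric(a)

# Integer matrix multiply with 32-bit accumulation, then rescale back to float.
acc = qw.astype(np.int32) @ qa.astype(np.int32)
y_int8 = acc * (sw * sa)

y_fp32 = w @ a
print(np.max(np.abs(y_int8 - y_fp32)) / np.max(np.abs(y_fp32)))  # small relative error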
The heart of TPUv1 is a matrix multiply unit built from 65,536 8-bit multiply-accumulate elements, offering
a peak throughput of 92 TeraOps/second (TOPS). TPUv1 is on average about 15X to 30X faster than its
contemporary GPU or CPU, with TOPS/Watt about 30X to 80X higher, and was able to run production
neural net applications representing about 95% of Google datacenters' neural network inference demand
at the time with significant cost and power advantages [Jouppi et al. 2017].
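The quoted peak follows from the size of the matrix unit and the chip’s 700 MHz clock rate reported in
[Jouppi et al. 2017], counting each multiply-accumulate as two operations:

\[
65{,}536\ \text{MACs} \times 2\ \frac{\text{ops}}{\text{MAC}} \times 700\ \text{MHz}
\approx 92 \times 10^{12}\ \text{ops/second}.
\]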
Inference on low-power mobile devices is also incredibly important for many uses of machine learning.
Being able to run machine learning models on-device, where the devices themselves are often the source
of the raw data inputs used for models in areas like speech or vision, can have substantial latency as well
as privacy benefits. It is possible to take the same design principles used for TPUv1 (a simple design
targeting low precision linear algebra computations at high performance/Watt) and apply these principles
to much lower power environments, such as mobile phones. Google’s Edge TPU is one example of such
a system, offering 4 TOPS in a 2W power envelope [cloud.google.com/edge-tpu/,
coral.withgoogle.com/products/]. On-device computation is already critical to many interesting use cases
of deep learning, where we want computer vision, speech and other kinds of models that can run directly
on sensory inputs without requiring connectivity. One such example is on-device agriculture applications,
like identification of diseases in plants such as cassava, in the middle of cassava fields which may not
have reliable network connectivity [Ramcharan et al. 2017].
With the widespread adoption of machine learning and its growing importance as a key type of
computation in the world, a Cambrian-style explosion of new and interesting accelerators for machine
learning computations is underway. There are more than XX venture-backed startup companies, as well
as a variety of large, established companies, that are each producing various new chips and systems for
machine learning. Some, such as Cerebras [www.cerebras.net/], Graphcore [www.graphcore.ai/], and
Nervana (acquired by Intel) [www.intel.ai/nervana-nnp/] are focused on a variety of designs for ML
training. Others, such as Alibaba
[www.alibabacloud.com/blog/alibaba-unveils-ai-chip-to-enhance-cloud-computing-power_595409] are
designing chips focused on inference. Some of the designs eschew larger memory-capacity DRAM or
HBM to focus on very high performance designs for models that are small enough that their entire set of
parameters and intermediate values fit in SRAM. Others focus on designs that include DRAM or HBM
that make them suitable for larger-scale models. Some, like Cerebras, are exploring full wafer-scale
integration. Others, such as Google’s Edge TPUs [cloud.google.com/edge-tpu/] are building very low
power chips for inference in environments such as mobile phones and distributed sensing devices.
Designing customized machine learning hardware for training (rather than just inference) is a more
complex endeavor than single chip inference accelerators. The reason is that single-chip systems for
training are unable to solve many problems that we want to solve in reasonable periods of time (e.g.
hours or days, rather than weeks or months), because a single-chip system cannot deliver sufficient
computational power. Furthermore, the desire to train larger models on larger data sets is such that, even
if a single chip could deliver enough computation to solve a given problem in a reasonable amount of
time, that would just mean that we would often want to solve even larger problems (necessitating the use
of multiple chips in a parallel or distributed system anyway). Therefore, designing training systems is
really about designing larger-scale, holistic computer systems, and requires thinking about individual
accelerator chip design, as well as high performance interconnects to form tightly coupled machine
learning supercomputers. Google’s second- and third-generation TPUs, TPUv2 and TPUv3
[cloud.google.com/tpu/], are designed to support both training and inference, and the basic individual
devices, each consisting of four chips, were designed to be connected together into larger configurations
called pods. Figure 5 shows the block diagram of a single Google TPUv2 chip, which has two cores; the
main computational capacity in each core is provided by a large matrix multiply unit that can yield the
result of multiplying a pair of 128x128 matrices each cycle. Each chip has 16 GB (TPUv2) or 32 GB (TPUv3) of
attached high-bandwidth memory (HBM). Figure 6 shows the deployment form of a Google TPUv3 pod
of 1024 accelerator chips, consisting of eight racks of chips and accompanying servers, with the chips
connected together in a 32x32 toroidal mesh, providing a peak system performance of more than 100
petaflop/s.
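Dividing the pod-level figure by the number of chips gives a sense of the per-chip capability these systems
are built around:

\[
\frac{100\ \text{petaflop/s}}{1024\ \text{chips}} \approx 100\ \text{teraflop/s per TPUv3 chip at peak}.
\]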
Figure 7 below shows the split between sign, exponent, and mantissa bits for the IEEE fp32
single-precision floating point format, the IEEE fp16 half-precision floating point format, and the bfloat16
format used by TPUv2 and TPUv3 [Wang & Kanwar 2019].
As it turns out, the machine learning computations used in deep learning models care more about
dynamic range than they do about precision. Furthermore, one major area and power cost of the
multiplier circuits for a floating point format with M mantissa bits is the (M+1) ✕ (M+1) array of full adders
needed for multiplying together the mantissa portions of the two input numbers. The IEEE fp32, IEEE
fp16, and bfloat16 formats need 576, 121, and 64 full adders, respectively. Because multipliers for the
bfloat16 format require so much less circuitry, it is possible to put more multipliers in the same chip area
and power budget, which means that ML accelerators employing this format can have higher flops/sec
and flops/Watt, all other things being equal. Reduced precision representations also
reduce the bandwidth and energy required to move data to and from memory or to send it across
interconnect fabrics, giving further efficiency gains.
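The full-adder counts above follow directly from the mantissa widths of the three formats (23, 10, and 7
explicitly stored mantissa bits for fp32, fp16, and bfloat16, respectively, plus the implicit leading bit):

\[
(23+1)^2 = 576, \qquad (10+1)^2 = 121, \qquad (7+1)^2 = 64.
\]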
One challenge for building machine learning accelerator hardware is that the ML research field is moving
extremely fast (as witnessed by the growth and absolute number of research papers published per year
shown in Figure 4). Chip design projects that are started today often take 18 months to 24 months to
finish the design, fabricate the semiconductor parts and get them back and install them into a production
datacenter environment. For these parts to be economically viable, they typically must have lifetimes of
at least three years. So, the challenge for computer architects building ML hardware is to predict where
the fast moving field of machine learning will be in the 2 to 5 year time frame. Our experience is that
bringing together computer architects, higher-level software system builders and machine learning
researchers to discuss co-design-related topics like “what might be possible in the hardware in that time
frame?” and “what interesting research trends are starting to appear and what would be their implications
for ML hardware?” is a useful way to try to ensure that we design and build useful hardware to accelerate
ML research and production uses of ML.
The potential exists to use machine-learned heuristics to replace hand-coded heuristics. These ML
heuristics can take into account much more contextual information than is practical in hand-written
heuristics, allowing them to adapt more readily to the actual usage patterns of a system rather than being
constructed for the average case. Other uses of ML can replace traditional data structures like B-trees,
hash tables, and Bloom filters with learned index structures that can take advantage of the actual
distribution of data being processed by a system to produce indices that are higher performance while
being 20X to 100X smaller [Kraska et al. 2018].
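As a small, deliberately simplified illustration of the learned-index idea (not the actual recursive model
index of [Kraska et al. 2018]), the sketch below fits a linear model to the position of each key in a sorted
array, records the model’s worst-case prediction error, and answers lookups by searching only within that
error bound around the predicted position; the class name and structure are invented for this example.

import bisect
import numpy as np

class LearnedIndex:
    """Predict a key's position with a linear model, then correct the guess
    with a search bounded by the model's worst observed error."""
    def __init__(self, keys):
        self.keys = np.asarray(keys)                    # assumed sorted
        positions = np.arange(len(self.keys))
        # Fit position ~ a * key + b, a crude model of the key distribution's CDF.
        self.a, self.b = np.polyfit(self.keys, positions, deg=1)
        preds = np.round(self.a * self.keys + self.b).astype(int)
        self.max_err = int(np.max(np.abs(preds - positions)))

    def lookup(self, key):
        guess = int(round(self.a * key + self.b))
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        # Search only the small window the error bound guarantees contains the key.
        i = lo + bisect.bisect_left(self.keys[lo:hi].tolist(), key)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None

rng = np.random.default_rng(0)
keys = np.sort(rng.integers(0, 1_000_000, size=10_000))
index = LearnedIndex(keys)
assert index.lookup(int(keys[1234])) == int(np.searchsorted(keys, keys[1234]))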
A first trend is work on sparsely-activated models, such as the sparsely-gated mixture-of-experts model
[Shazeer et al. 2017], which shows how to build very large capacity models in which just a small portion of
the model is activated for any given example. A second trend is work on automated machine learning
(AutoML), in which techniques such as neural architecture search [Zoph and Le 2016, Pham et al. 2018]
or evolutionary architecture search [Real et al. 2017, Gaier and Ha 2019] automatically learn effective
structures and other aspects of machine learning models or components in order to optimize accuracy for
a given task. These approaches often involve running many automated experiments, each of which may
involve significant amounts of computation.
A third trend is that multi-task training at modest scales of a few to a few dozen related tasks, or transfer
learning from a model trained on a large amount of data for a related task and then fine-tuned on a small
amount of data for a new task, has been shown to be very effective in a wide variety of problems [Devlin et
al. 2018]. So far, most uses of multi-task machine learning have been in the context of a single modality
(e.g. all visual tasks, or all textual tasks) [Doersch and Zisserman 2017], although a few authors have
considered multi-modality settings as well [Ruder 2017].
A particularly interesting research direction puts these three trends together, with a system running on
large-scale ML accelerator hardware, with a goal of being able to train a model that can perform
thousands or millions of tasks in a single model. Such a model might be made up of many different
components of different structures, with the flow of data between components being relatively dynamic on
an example-by-example basis. The model might use techniques like the sparsely-gated mixture of
experts and learned routing in order to have a very large capacity model [Shazeer et al. 2017], but where
a given task or example only sparsely activates a small fraction of the total components in the system
(and therefore keeps computational cost and power usage per training example or inference much lower).
An interesting direction to explore would be to use dynamic and adaptive amounts of computation for
different examples, so that “easy” examples use much less computation than “hard” examples (a
relatively unusual property in the machine learning models of today). Figure 8 depicts such a system.
Figure 8: A diagram depicting a design for a large, sparsely activated, multi-task model. Each
box in the model represents a component. Models for tasks develop by stitching together
components, either using human-specified connection patterns or automatically learned
connectivity. Each component might be running a small architectural search to adapt to the kinds
of data being routed to it, and routing decisions determine which downstream components are
best suited for a particular task or example, based on observed behavior.
Each component might itself be running some AutoML-like architecture search [Pham et al. 2018], in
order to adapt its structure to the kinds of data being routed to it. New tasks can leverage components
trained on other tasks when that is useful. The hope is
that through very large scale multi-task learning, shared components, and learned routing, the model can
very quickly learn to accomplish new tasks to a high level of accuracy, with relatively few examples for
each new task (because the model is able to leverage the expertise and internal representations it has
already developed in accomplishing other, related tasks).
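To make the sparsely-gated routing idea in the preceding paragraphs concrete, here is a minimal sketch
loosely in the spirit of the mixture-of-experts layer of [Shazeer et al. 2017]; all of the sizes, names, and the
use of a single linear layer per expert are arbitrary choices for illustration. A learned gating network scores
every expert for each example, but only the top-k experts are actually evaluated, so the per-example
computation stays small even when the total number of experts (and hence the model’s capacity) is very
large.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, num_experts, top_k = 32, 16, 64, 2

# Each expert is a small independent network (here, just one linear layer).
experts = [rng.normal(scale=0.1, size=(d_in, d_out)) for _ in range(num_experts)]
# The gating network produces one relevance score per expert for each example.
gate_w = rng.normal(scale=0.1, size=(d_in, num_experts))

def moe_forward(x):
    """Route one example through only the top_k highest-scoring experts."""
    logits = x @ gate_w                         # one score per expert
    chosen = np.argsort(logits)[-top_k:]        # indices of the k best experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                    # softmax over the chosen experts only
    # Only top_k of the num_experts experts are evaluated for this example.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

x = rng.normal(size=d_in)
y = moe_forward(x)
print(y.shape)   # (16,), produced by evaluating 2 of the 64 experts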
Building a single machine learning system that can handle millions of tasks, and that can learn to
successfully accomplish new tasks automatically, is a true grand challenge in the field of artificial
intelligence and computer systems engineering: it will require expertise and advances in many areas,
spanning solid-state circuit design, computer networking, ML-focused compilers, distributed systems, and
machine learning algorithms in order to push the field of artificial intelligence forward by building a system
that can generalize to solve new tasks independently across the full range of application areas of machine
learning.
Conclusion
The advances in machine learning over the past decade are already affecting a huge number of fields of
science, engineering, and other forms of human endeavor, and this influence is only going to increase.
The specialized computational needs of machine learning combined with the slowdown of
general-purpose CPU performance improvements in the post-Moore’s Law era represent an exciting time
for the computing hardware industry [Hennessy and Patterson 2019]: we now have a set of techniques
that seem to be applicable to a vast array of problems across a huge number of domains, where we want
to dramatically increase the scale of the models and datasets on which we can train these models, and
where the impact of this work will touch a vast fraction of humanity. As we push the boundaries of what is
possible with large-scale, massively multi-task learning systems that can generalize to new tasks, we will
create tools to enable us to collectively accomplish more as societies and to advance humanity. We truly
live in exciting times.
Acknowledgements
Anand Babu, Alison Carroll, Satrajit Chatterjee, Jason Freidenfelds, Anna Goldie, Norm Jouppi, Azalia
Mirhoseini, David Patterson, and Cliff Young, as well as this year’s ISSCC chairs and anonymous ISSCC
representatives all provided helpful feedback on the content of this article that was much appreciated.
References
[Abadi et al. 2016] Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp,
Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh
Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon
Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan,
Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang
Zheng. "Tensorflow: Large-scale machine learning on heterogeneous distributed systems."
arxiv.org/abs/1603.04467 (2016).
[Aho et al. 1986] Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman. "Compilers, principles, techniques." Addison
Wesley (1986).
[Angelova et al. 2015] Angelova, Anelia, Alex Krizhevsky, Vincent Vanhoucke, Abhijit Ogale, and Dave Ferguson.
"Real-time pedestrian detection with deep network cascades." In Proceedings of BMVC 2015,
ai.google/research/pubs/pub43850 (2015).
[Ardila et al. 2019] Ardila, Diego, Atilla P. Kiraly, Sujeeth Bharadwaj, Bokyung Choi, Joshua J. Reicher, Lily Peng,
Daniel Tse, Mozziyar Etemadi, Wenxing Ye, Greg Corrado, David P. Naidich and Shravya Shetty.
"End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed
tomography." Nature Medicine 25, no. 6 (2019): 954.
[Baldi et al. 2014] Baldi, Pierre, Peter Sadowski, and Daniel Whiteson. "Searching for exotic particles in high-energy
physics with deep learning." Nature Communications 5 (2014): 4308.
www.nature.com/articles/ncomms5308
[Bansal et al. 2018] Bansal, Mayank, Alex Krizhevsky, and Abhijit Ogale. "ChauffeurNet: Learning to drive by imitating
the best and synthesizing the worst." arxiv.org/abs/1812.03079 (2018).
[Chan et al. 2016] Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural
network for large vocabulary conversational speech recognition." In 2016 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 4960-4964. IEEE, 2016. arxiv.org/abs/1508.01211
[Collobert et al. 2011] Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel
Kuksa. "Natural language processing (almost) from scratch." Journal of Machine Learning Research 1 2, no.
Aug (2011): 2493-2537. arxiv.org/abs/1103.0398
[Dean 1990] Dean, Jeffrey. "Parallel Implementations of neural network training: two back-propagation approaches”.
Undergraduate honors thesis, University of Minnesota, 1990.
drive.google.com/file/d/1I1fs4sczbCaACzA9XwxR3DiuXVtqmejL/view
[Dean et al. 2012] Dean, Jeffrey, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio
Ranzato et al. "Large scale distributed deep networks." In Advances in Neural Information Processing
Systems, pp. 1223-1231. 2012. papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf
[Dean et al. 2018] Dean, Jeff, David Patterson, and Cliff Young. "A new golden age in computer architecture:
Empowering the machine-learning revolution." IEEE Micro 38, no. 2 (2018): 21-29.
ieeexplore.ieee.org/document/8259424
[Deng et al. 2009] Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. "Imagenet: A large-scale
hierarchical image database." In 2009 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 248-255. IEEE, 2009. http://www.image-net.org/papers/imagenet_cvpr09.pdf
[Devlin et al. 2018] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "Bert: Pre-training of deep
bidirectional transformers for language understanding." arxiv.org/abs/1810.04805 (2018).
[DeVries et al. 2018] DeVries, Phoebe M. R., Fernanda Viégas, Martin Wattenberg, and Brendan J. Meade. "Deep
learning of aftershock patterns following large earthquakes." Nature 560, no. 7720 (2018): 632.
www.nature.com/articles/s41586-018-0438-y
[Doersch and Zisserman 2017] Doersch, Carl, and Andrew Zisserman. "Multi-task self-supervised visual learning." In
Proceedings of the IEEE International Conference on Computer Vision, pp. 2051-2060. 2017.
arxiv.org/abs/1708.07860
[Esteva et al. 2017] Esteva, Andre, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, and
Sebastian Thrun. "Dermatologist-level classification of skin cancer with deep neural networks." Nature 542,
no. 7639 (2017): 115. www.nature.com/articles/nature21056
[Esteva et al. 2019] Esteva, Andre, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo,
Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. "A guide to deep learning in
healthcare." Nature Medicine 25, no. 1 (2019): 24. www.nature.com/articles/s41591-018-0316-z
[Evans et al. 2018] Evans, R., J. Jumper, J. Kirkpatrick, L. Sifre, T. F. G. Green, C. Qin, A. Zidek et al. "De novo
structure prediction with deep-learning based scoring." Annual Review of Biochemistry 77 (2018): 363-382.
[Gaier and Ha 2019] Gaier, Adam, and David Ha. "Weight Agnostic Neural Networks." arxiv.org/abs/1906.04358
(2019).
[Gilmer et al. 2017] Gilmer, Justin, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl.
"Neural message passing for quantum chemistry." In Proceedings of the 34th International Conference on
Machine Learning (ICML)- Volume 70, pp. 1263-1272. JMLR. org, 2017. arxiv.org/abs/1704.01212
[Gulshan et al. 2016] Gulshan, Varun, Lily Peng, Marc Coram, Martin C. Stumpe, Derek Wu, Arunachalam
Narayanaswamy, Subhashini Venugopalan et al. "Development and validation of a deep learning algorithm
for detection of diabetic retinopathy in retinal fundus photographs." Journal of the American Medical
Association (JAMA), vol. 316, no. 22 (2016): 2402-2410. jamanetwork.com/journals/jama/fullarticle/2588763
[He et al. 2016] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image
recognition." In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp.
770-778. 2016. arxiv.org/abs/1512.03385
[Hennessy and Patterson 2017] Hennessy, John L., and David A. Patterson. Computer architecture: a quantitative
approach, sixth edition. Morgan Kaufmann, 2017.
[Hennessy and Patterson 2019] Hennessy, John L., and David A. Patterson. "A new golden age for computer
architecture." Commun. ACM 62, no. 2 (2019): 48-60.
cacm.acm.org/magazines/2019/2/234352-a-new-golden-age-for-computer-architecture/fulltext
[Hinton et al. 2012] Hinton, Geoffrey, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly,
Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. "Deep neural
networks for acoustic modeling in speech recognition." IEEE Signal Processing Magazine 29 (2012).
www.cs.toronto.edu/~hinton/absps/DNN-2012-proof.pdf
[Huang and Pan 2015] Huang, Szu-Hao, and Ying-Cheng Pan. "Automated visual inspection in the semiconductor
industry: A survey." Computers in industry 66 (2015): 1-10.
www.sciencedirect.com/science/article/abs/pii/S0166361514001845
[Jouppi et al. 2017] Jouppi, Norman P., Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder
Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao,
Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami,
Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert
Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch,
Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle
Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan,
Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy
Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew
Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma,
Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon.
"In-datacenter performance analysis of a tensor processing unit." In 2017 ACM/IEEE 44th Annual
International Symposium on Computer Architecture (ISCA), pp. 1-12. IEEE, 2017. arxiv.org/abs/1704.04760
[Kalashnikov et al. 2018] Kalashnikov, Dmitry, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang,
Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. "Qt-opt: Scalable
deep reinforcement learning for vision-based robotic manipulation." arxiv.org/abs/1806.10293 (2018).
[Karpathy 2014] Karpathy, Andrej. “What I learned from competing against a ConvNet on ImageNet”,
karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/, 2014.
[Kraska et al. 2018] Kraska, Tim, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. "The case for learned
index structures." In Proceedings of the 2018 International Conference on Management of Data (SIGMOD),
pp. 489-504. ACM, 2018. arxiv.org/abs/1712.01208
[Krause et al. 2018] Krause, Jonathan, Varun Gulshan, Ehsan Rahimy, Peter Karth, Kasumi Widner, Greg S.
Corrado, Lily Peng, and Dale R. Webster. "Grader variability and the importance of reference standards for
evaluating machine learning models for diabetic retinopathy." Ophthalmology 125, no. 8 (2018): 1264-1272.
arxiv.org/abs/1710.01711
[Krizhevsky et al. 2009] Krizhevsky, Alex, Vinod Nair, and Geoffrey Hinton. "The CIFAR-10 dataset."
www.cs.toronto.edu/~kriz/cifar.html (2009).
[Krizhevsky et al. 2012] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep
convolutional neural networks." In Advances in Neural Information Processing Systems (NIPS), pp.
1097-1105. 2012.
papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[LeCun et al. 2000] LeCun, Y., C. Cortes, and C. J. Burges. "MNIST handwritten digits dataset." (2000).
http://yann.lecun.com/exdb/mnist/
[LeCun et al. 2015] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521, no. 7553
(2015): 436. www.nature.com/articles/nature14539
[Le et al. 2012] Le, Quoc V., Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff
Dean, and Andrew Y. Ng. "Building high-level features using large scale unsupervised learning." In
Proceedings of the 29th International Conference on Machine Learning (ICML), pp. 507-514, 2012.
arxiv.org/abs/1112.6209
[Levine et al. 2016] Levine, Sergey, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. "Learning hand-eye
coordination for robotic grasping with large-scale data collection." In International Symposium on
Experimental Robotics, pp. 173-184. Springer, Cham, 2016. arxiv.org/abs/1603.02199
[Liu et al. 2017] Liu, Yun, Krishna Gadepalli, Mohammad Norouzi, George E. Dahl, Timo Kohlberger, Aleksey Boyko,
Subhashini Venugopalan, Aleksei Timofeev, Philip Q. Nelson, Greg S. Corrado, Jason D. Hipp, Lily Peng,
and Martin C. Stumpe. "Detecting cancer metastases on gigapixel pathology images."
arxiv.org/abs/1703.02442 (2017).
[Luebke et al. 2006] Luebke, David, Mark Harris, Naga Govindaraju, Aaron Lefohn, Mike Houston, John Owens, Mark
Segal, Matthew Papakipos, and Ian Buck. "GPGPU: general-purpose computation on graphics hardware." In
Proceedings of the 2006 ACM/IEEE conference on Supercomputing, p. 208. ACM, 2006.
dl.acm.org/citation.cfm?id=1103933
[Lu et al. 2019] Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "Vilbert: Pretraining task-agnostic visiolinguistic
representations for vision-and-language tasks." arxiv.org/abs/1908.02265 (2019).
[Mikolov et al. 2013] Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed
representations of words and phrases and their compositionality." In Advances in Neural Information
Processing Systems, pp. 3111-3119. 2013. arxiv.org/abs/1310.4546
[Minsky and Papert 1969] Minsky, Marvin and Seymour Papert. Perceptrons. MIT Press, 1969, Cambridge.
[Mnih et al. 2013] Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan
Wierstra, and Martin Riedmiller. "Playing atari with deep reinforcement learning." arxiv.org/abs/1312.5602
(2013).
[Mnih et al. 2015] Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G.
Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles
Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg and
Demis Hassabis. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015):
529. www.nature.com/articles/nature14236
[Morgan 2018] Morgan, Timothy Prickett. “Intel Unfolds Roadmaps for Future CPUs and GPUs”. NextPlatform, Dec.
16, 2018. www.nextplatform.com/2018/12/16/intel-unfolds-roadmaps-for-future-cpus-and-gpus/
[Nevo 2019] Sella Nevo, “An Inside Look at Flood Forecasting”, Google AI Blog, Sep 2019,
ai.googleblog.com/2019/09/an-inside-look-at-flood-forecasting.html.
[OpenAI 2018] OpenAI, “AI and Compute”, OpenAI Blog, May 2018, openai.com/blog/ai-and-compute/
[Pham et al. 2018] Pham, Hieu, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. "Efficient neural
architecture search via parameter sharing." arxiv.org/abs/1802.03268 (2018).
[Poplin et al. 2018] Poplin, Ryan, Pi-Chuan Chang, David Alexander, Scott Schwartz, Thomas Colthurst, Alexander
Ku, Dan Newburger, Jojo Dijamco, Nam Nguyen, Pegah T Afshar, Sam S Gross, Lizzie Dorfman, Cory Y
McLean & Mark A DePristo. "A universal SNP and small-indel variant caller using deep neural networks."
Nature Biotechnology 36, no. 10 (2018): 983. www.nature.com/articles/nbt.4235
[Rajkomar et al. 2018] Rajkomar, Alvin, Eyal Oren, Kai Chen, Andrew M. Dai, Nissan Hajaj, Michaela Hardt, Peter J.
Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, Patrik Sundberg, Hector Yee, Kun Zhang, Yi Zhang, Gerardo
Flores, Gavin E. Duggan, Jamie Irvine, Quoc Le, Kurt Litsch, Alexander Mossin, Justin Tansuwan, De
Wang, James Wexler, Jimbo Wilson, Dana Ludwig, Samuel L. Volchenboum, Katherine Chou, Michael
Pearson, Srinivasan Madabushi, Nigam H. Shah, Atul J. Butte, Michael D. Howell, Claire Cui, Greg S.
Corrado & Jeffrey Dean. "Scalable and accurate deep learning with electronic health records." Nature
Digital Medicine 1, no. 1 (2018): 18. www.nature.com/articles/s41746-018-0029-1
[Rajkomar et al. 2019] Rajkomar, Alvin, Jeffrey Dean, and Isaac Kohane. "Machine learning in medicine." New
England Journal of Medicine 380, no. 14 (2019): 1347-1358.
www.nejm.org/doi/full/10.1056/NEJMra1814259
[Ramcharan et al. 2017] Ramcharan, Amanda, Kelsee Baranowski, Peter McCloskey, Babuali Ahmed, James Legg,
and David P. Hughes. "Deep learning for image-based cassava disease detection." Frontiers in plant
science 8 (2017): 1852. www.frontiersin.org/articles/10.3389/fpls.2017.01852/full
[Real et al. 2017] Real, Esteban, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan,
Quoc V. Le, and Alexey Kurakin. "Large-scale evolution of image classifiers." In Proceedings of the 34th
International Conference on Machine Learning (ICML) -Volume 70, pp. 2902-2911. JMLR. org, 2017.
arxiv.org/abs/1703.01041
[Ruder 2017] Ruder, Sebastian. "An overview of multi-task learning in deep neural networks."
arxiv.org/abs/1706.05098 (2017).
[Rumelhart et al. 1988] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations
by back-propagating errors." Cognitive modeling 5, no. 3 (1988): 1.
[Sermanet et al. 2018] Sermanet, Pierre, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal,
Sergey Levine, and Google Brain. "Time-contrastive networks: Self-supervised learning from video." In 2018
IEEE International Conference on Robotics and Automation (ICRA), pp. 1134-1141. IEEE, 2018.
arxiv.org/abs/1704.06888
[Shaw 1981] Shaw, David Elliot. "NON-VON: A parallel machine architecture for knowledge-based information
processing." (1981). pdfs.semanticscholar.org/19d8/bed84c8827025c5adad43f6ec0d4eb9ed114.pdf
[Shazeer et al. 2017] Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton,
and Jeff Dean. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." In
International Conference on Learning Representations (ICLR), 2017. arxiv.org/abs/1701.06538
[Silberschatz et al. 1997] Silberschatz, Abraham, Henry F. Korth, and Shashank Sudarshan. Database system
concepts. Vol. 4. New York: McGraw-Hill, 1997.
[Silver et al. 2017] Silver, David, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,
Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent
Sifre, George van den Driessche, Thore Graepel and Demis Hassabis. "Mastering the game of go without
human knowledge." Nature 550, no. 7676 (2017): 354. www.nature.com/articles/nature24270
[Sutskever et al. 2014] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural
networks." In Advances in Neural Information Processing Systems, pp. 3104-3112. 2014.
arxiv.org/abs/1409.3215
[Szegedy et al. 2015] Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In
Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 1-9. 2015.
arxiv.org/abs/1409.4842
[Tan and Le 2019] Tan, Mingxing, and Quoc V. Le. "EfficientNet: Rethinking Model Scaling for Convolutional Neural
Networks." arxiv.org/abs/1905.11946 (2019).
[Tanenbaum and Woodhull 1997] Tanenbaum, Andrew S., and Albert S. Woodhull. Operating systems: design and
implementation. Vol. 68. Englewood Cliffs: Prentice Hall, 1997.
[Tesauro 1994] Tesauro, Gerald. "TD-Gammon, a self-teaching backgammon program, achieves master-level play."
Neural Computation 6, no. 2 (1994): 215-219.
www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf
[Vanhoucke et al. 2011] Vanhoucke, Vincent, Andrew Senior, and Mark Z. Mao. "Improving the speed of neural
networks on CPUs." In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011.
ai.google/research/pubs/pub37631
[Vaswani et al. 2017] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in Neural Information
Processing Systems, pp. 5998-6008. 2017. arxiv.org/abs/1706.03762
[Vinyals et al. 2019] Vinyals, Oriol, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojciech M.
Czarnecki, Andrew Dudzik et al. "AlphaStar: Mastering the real-time strategy game StarCraft II." DeepMind
Blog (2019). deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii
[Wang & Kanwar 2019] Wang, Shibo and Pankaj Kanwar. “BFloat16: The secret to high performance on Cloud
TPUs”, Google Cloud Blog, August 2019,
cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus
[Wu et al. 2016] Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun et al. "Google's neural machine translation system: Bridging the gap between human and
machine translation." arxiv.org/abs/1609.08144 (2016).
[Zoph and Le 2016] Zoph, Barret, and Quoc V. Le. "Neural architecture search with reinforcement learning."
arxiv.org/abs/1611.01578 (2016).