2023-01 How Nvidia's CUDA Monopoly in Machine Learning Is Breaking - OpenAI Triton and PyTorch 2.0
Over the last decade, the landscape of machine learning software development has
undergone significant changes. Many frameworks have come and gone, but most
have relied heavily on leveraging Nvidia's CUDA and performed best on Nvidia
GPUs. However, with the arrival of PyTorch 2.0 and OpenAI's Triton, Nvidia's
dominant position in this field, mainly due to its software moat, is being disrupted.
This report will touch on topics such as why Google’s TensorFlow lost out to
PyTorch, why Google hasn’t been able to capitalize publicly on its early leadership of
AI, the major components of machine learning model training time, the memory
capacity/bandwidth/cost wall, model optimization, why other AI hardware companies
haven’t been able to make a dent in Nvidia’s dominance so far, why hardware will
start to matter more, how Nvidia's competitive advantage in CUDA is being wiped away,
and a major win that one of Nvidia's competitors has secured at a large cloud for training silicon.
https://thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates-research-tensorflow-dominates-industry/
Instead, PyTorch won. Google failed to convert its first mover’s advantage into
dominance of the nascent ML industry. Nowadays, Google is somewhat isolated
within the machine learning community because of its lack of use of PyTorch and
GPUs in favor of its own software stack and hardware. In typical Google fashion, they even have a 2nd framework called Jax that directly competes with TensorFlow.
There’s even endless talk of Google’s dominance in search and natural language
processing waning due to large language models, particularly those from OpenAI
and the various startups that utilize OpenAI APIs or are building similar
foundational models. While we believe this doom and gloom is overblown, that story
is for another day. Despite these challenges, Google is still at the forefront of the
most advanced machine learning models. They invented transformers and remain
state-of-the-art in many areas (PaLM, LaMDA, Chinchilla, MUM, TPU).
Back to why PyTorch won. While there was an element of wresting control away from Google, it was primarily due to PyTorch's increased flexibility and usability versus TensorFlow. If we boil it down to a first-principles level, PyTorch differed from TensorFlow in using "Eager mode" rather than "Graph mode."
Eager mode can be thought of as a standard scripting execution method. The deep
learning framework executes each operation immediately, as it is called, line by line,
like any other piece of Python code. This makes debugging and understanding your code easier, as you can see the results of intermediate operations and observe how your model behaves.
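For example, a minimal eager-mode PyTorch snippet, where each operation runs the moment its line executes:

```python
import torch

x = torch.randn(4, 8)
w = torch.randn(8, 2)

h = x @ w            # dispatched immediately, like any other Python expression
print(h.shape)       # the intermediate result is right there for debugging
y = torch.relu(h)    # the next op only runs when this line is reached
```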
In contrast, graph mode has two phases. The first phase is the definition of a
computation graph representing the operations to perform. A computation graph is a
series of interconnected nodes representing operations or variables, and the edges
between nodes represent the data flow between them. The second phase is the
deferred execution of an optimized version of the computation graph.
This two-phase approach makes it more challenging to understand and debug your
code, as you cannot see what is happening until the end of the graph execution. This
is analogous to "interpreted" vs. "compiled" languages, like Python vs. C++. It's easier to debug Python, largely because it's interpreted.
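A rough sketch of the two-phase idea, using torch.fx graph capture to stand in for a full graph-mode framework (the tiny module here is only an illustration):

```python
import torch
import torch.fx

class TwoLayer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(8, 2)

    def forward(self, x):
        return torch.relu(self.lin(x))

# Phase 1: build a computation graph of nodes and edges; no real math runs yet.
graph_module = torch.fx.symbolic_trace(TwoLayer())
print(graph_module.graph)

# Phase 2: deferred execution of the (potentially optimized) captured graph.
out = graph_module(torch.randn(4, 8))
```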
While TensorFlow now has Eager mode by default, the research community and most large tech firms have settled around PyTorch. This is exemplified by the fact that nearly every generative AI model that made the news is based on PyTorch. The Google generative AI models are based on Jax, not TensorFlow.
Of course, there is a long tail of image nets using other frameworks like TensorFlow and Keras, but the compute budgets for new model development are all flowing to PyTorch models. For a deeper explanation of why PyTorch won, see here. In general, if you walk around the halls of NeurIPS (the main AI conference), essentially all of the generative AI and non-Google work is done with PyTorch.
The major components of machine learning model training time are:
1. Compute (FLOPS): Running dense matrix multiplies within each layer.
2. Memory (Bandwidth): Waiting for data or layer weights to get to the compute resources. Common examples of bandwidth-constrained operations are various normalizations, pointwise operations, SoftMax, and ReLU.
In the past, the dominant factor in machine learning training time was compute
time, waiting for matrix multiplies. As Nvidia’s GPUs continued to develop, this
quickly faded away from being the primary concern.
https://arxiv.org/pdf/2007.00072.pdf
Memory follows a hierarchy from close and fast to slow and cheap. The nearest
shared memory pool is on the same chip and is generally made of SRAM. Some
machine-learning ASICs attempt to utilize huge pools of SRAM to hold model weights, but there are issues with this approach. Even Cerebras' ~$2,500,000 wafer-scale chips only have 40GB of SRAM on the chip. There isn't enough memory capacity to hold the weights of a 100B+ parameter model.
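A quick back-of-the-envelope check (assuming FP16 weights, i.e. 2 bytes per parameter) shows the gap:

```python
params = 100e9              # a 100B-parameter model
bytes_per_param = 2         # FP16 weights
weight_gb = params * bytes_per_param / 1e9
print(weight_gb)            # ~200 GB of weights alone vs ~40 GB of on-chip SRAM
```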
Nvidia’s architecture has always used a much smaller amount of memory on the die.
The current generation A100 has 40MB, and the next generation H100 has 50MB.
1GB of SRAM on TSMC’s 5nm process node would require ~200mm^2 of silicon.
Once the associated control logic/fabric are implemented, that would require over
400mm^2 of silicon, or about 50% of the total logic area of an Nvidia datacenter GPU.
Given that an A100 GPU costs $10k+ and the H100 is more like $20k+, economically,
this is infeasible. Even when you ignore Nvidia’s ~75% gross margin on datacenter
GPUs (~4x markup), the cost per GB of SRAM memory would still be in the $100s for
a fully yielded product.
Furthermore, the cost of on-chip SRAM memory will not decrease much through
conventional Moore’s Law process technology shrinks. The same 1GB of memory
actually costs more with the next-generation TSMC 3nm process technology. While
3D SRAM will help with SRAM costs to some degree, that is only a temporary bend
of the curve.
DRAM followed the path of Moore’s Law for many decades. When Gordon Moore
coined the term, Intel’s primary business was DRAM. His economic prediction about
density and cost of transistors generally held true until ~2009 for DRAM. Since ~2012
though, the cost of DRAM has barely improved.
The demands for memory have only increased. DRAM now comprises 50% of the
total server’s cost. This is the memory wall, and it has shown up in products.
Comparing Nvidia’s 2016 P100 GPU to their 2022 H100 GPU that is just starting to
ship, there is a 5x increase in memory capacity (16GB -> 80GB) but a 46x increase in
FP16 performance (21.2 TFLOPS -> 989.5 TFLOPS).
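Put as a ratio, using the figures above, compute grew roughly an order of magnitude faster than memory capacity over that period:

```python
flops_growth = 989.5 / 21.2   # FP16 TFLOPS, P100 -> H100
memory_growth = 80 / 16       # GB of HBM, P100 -> H100
print(flops_growth, memory_growth, flops_growth / memory_growth)  # ~46.7x vs 5x, a ~9x gap
```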
Even with heavy optimizations from leading researchers, 60% FLOPS utilization is
considered a very high utilization rate for large language model training. The rest of
the time is overhead, idle time spent waiting for data from another
calculation/memory, or recomputing results just in time to reduce memory
bottlenecks.
From the current generation A100 to the next generation H100, the FLOPS grow by
more than 6X, but memory bandwidth only grows by 1.65x. This has led to many
fears of low utilization for H100. The A100 required many tricks to get around the
memory wall, and more will need to be implemented with the H100.
The H100 brings distributed shared memory and L2 multicast to Hopper. The idea is
that different SMs (think cores) can write directly to another SM’s SRAM (shared
memory/L1 Cache). This effectively increases the size of the cache and reduces the
required bandwidth of DRAM read/writes. Future architectures will rely on sending
fewer operations to memory to minimize the impact of the memory wall. It should be
noted that larger models tend to achieve higher utilization rates as FLOPS demands
scale more exponentially whereas memory bandwidth and capacity demands tend to
scale more linearly.
https://horace.io/brrr_intro.html
Referring back to why PyTorch won, it was the increased flexibility and usability due
to Eager mode, but moving to Eager mode isn’t all sunshine and rainbows. When
executing in Eager mode, each operation is read from memory, computed, then sent
to memory before the next operation is handled. This significantly increases the
memory bandwidth demands if heavy optimizations aren’t done.
As such, one of the principal optimization methods for a model executed in Eager
mode is called operator fusion. Instead of writing each intermediate result to
memory, operations are fused, so multiple functions are computed in one pass to
minimize memory reads/writes. Operator fusion improves operator dispatch,
memory bandwidth, and memory size costs.
https://horace.io/brrr_intro.html
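As a simple sketch of the idea (GELU is just one example of a commonly fused pointwise chain):

```python
import torch

def gelu_unfused(x):
    # Naive composition: every op reads and writes a full tensor to memory.
    return 0.5 * x * (1.0 + torch.erf(x / 1.41421356))

x = torch.randn(1024, 1024)

# The built-in fused operator produces the same result with far fewer
# memory round-trips than the composed version above.
y_fused = torch.nn.functional.gelu(x)
print(torch.allclose(y_fused, gelu_unfused(x), atol=1e-6))
```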
This optimization often involves writing custom CUDA kernels, which is much more difficult than writing simple Python scripts. As a built-in compromise, PyTorch steadily implemented more and more operators natively over time. Many of these operators were simply multiple commonly used operations fused into a single, more complex function.
The increase in operators made both creating the model within PyTorch easier and the performance of Eager mode faster due to having fewer memory read/writes. The downside was that PyTorch ballooned to over 2,000 operators over a few years.
We would say software developers are lazy, but let’s be honest, almost all people are
lazy. If they get used to one of the new operators within PyTorch, they will continue
to use that. The developer may not even recognize the performance improvement but
instead use that operator because it means writing less code.
Additionally, not all operations can be fused. A significant amount of time is often
spent deciding which operations to fuse and which operations to assign to specific
compute resources at the chip and cluster levels. The strategy of which operations to
fuse where, although generally similar, does vary significantly depending on the
architecture.
Nvidia Is King
The growth in operators and position as the default has helped Nvidia as each
operator was quickly optimized for their architecture but not for any other hardware.
If an AI hardware startup wanted to fully implement PyTorch, that meant supporting
the growing list of 2,000 operators natively with high performance.
The talent level required to train a massive model with high FLOPS utilization on a
GPU grows increasingly higher because of all the tricks needed to extract maximum
performance. Eager mode execution plus operator fusion means that software,
techniques, and models that are developed are pushed to fit within the ratios of
compute and memory that the current generation GPU has.
Everyone developing machine learning chips is beholden to the same memory wall.
ASICs are beholden to supporting the most commonly used frameworks. ASICs are
beholden to the default development methodology, GPU-optimized PyTorch code
with a mix of Nvidia and external libraries. An architecture that eschews a GPU’s
various non-compute baggage in favor of more FLOPS and a stiffer programming
model makes very little sense in this context.
The only way to break the vicious cycle is for the software that runs models on
Nvidia GPUs to transfer seamlessly to other hardware with as little effort as possible.
As model architectures stabilize and abstractions from PyTorch 2.0, OpenAI Triton, and MLOps firms such as MosaicML become the default, the architecture and economics of the chip solution start to become the biggest drivers of the purchase, rather than the ease of use afforded by Nvidia's superior software.
PyTorch 2.0
The PyTorch Foundation was established and moved out from under the wings of
Meta just a few months ago. Alongside this change to an open development and
governance model, 2.0 has been released for early testing with full availability in
March. PyTorch 2.0 brings many changes, but the primary difference is that it adds a
compiled solution that supports a graph execution model. This shift will make
properly utilizing various hardware resources much easier.
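In practice, the compiled path is exposed through a single wrapper call; a minimal sketch (the toy model and shapes here are placeholders):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 1024),
)

# One-line opt-in to the PyTorch 2.0 compiled path: TorchDynamo captures the
# graph and TorchInductor, the default backend, generates fused kernels.
compiled_model = torch.compile(model)

x = torch.randn(64, 1024)
out = compiled_model(x)   # the first call compiles; later calls reuse the result
```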
PyTorch 2.0 brings an 86% performance improvement for training on Nvidia’s A100
and 26% on CPUs for inference! This dramatically reduces the compute time and cost
required for training a model. These benefits could extend to other GPUs and
accelerators from AMD, Intel, Tenstorrent, Luminous Computing, Tesla, Google,
Amazon, Microsoft, Marvell, Meta, Graphcore, Cerebras, SambaNova, etc.
The performance improvements from PyTorch 2.0 will be larger for currently
unoptimized hardware. Meta and other firms’ heavy contribution to PyTorch stems
from the fact that they want to make it easier to achieve higher FLOPS utilization
with less effort on their multi-billion-dollar training clusters made of GPUs. They are
also motivated to make their software stacks more portable to other hardware to
introduce competition to the machine learning space.
PyTorch 2.0 also brings advancements to distributed training with better API
support for data parallelism, sharding, pipeline parallelism, and tensor parallelism.
In addition, it supports dynamic shapes natively through the entire stack, which, among many other examples, makes varying sequence lengths for LLMs much easier
to support. This is the first time a major compiler supports Dynamic Shapes from
training to inference.
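A hedged sketch of what that looks like for a user (dynamic=True forces dynamic-shape tracing instead of relying on automatic detection):

```python
import torch

@torch.compile(dynamic=True)   # opt in to dynamic-shape compilation
def score(tokens):
    return torch.softmax(tokens @ tokens.T, dim=-1)

# Varying sequence lengths reuse the same compiled artifact instead of
# triggering a fresh compile for every new shape.
for seq_len in (128, 256, 512):
    score(torch.randn(seq_len, 64))
```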
TorchDynamo
Moving to graph mode requires a robust graph definition. Meta and PyTorch have
been attempting to work on implementing this for ~5 years, but every solution they
came up with had significant drawbacks. They finally cracked the puzzle with
TorchDynamo.
TorchDynamo will ingest any PyTorch user script, including those that call outside 3rd party libraries, and generate an FX graph.
Dynamo lowers all complex operations to the ~250 primitive operations in
PrimTorch. Once the graph is formed, unused operations are discarded, and the
graph determines which intermediate operations need to be stored or written to
memory and which can potentially be fused. This dramatically reduces the overhead
within a model while also being seamless for the user.
TorchDynamo already works for over 99% of the 7,000 PyTorch models tested,
including those from OpenAI, HuggingFace, Meta, Nvidia, Stability.AI, and more,
without any changes to the original code. The 7,000 models tested were
indiscriminately chosen from the most popular projects using PyTorch on GitHub.
Google’s TensorFlow/Jax and other graph mode execution pipelines generally require
the user to ensure their model fits into the compiler architecture so that the graph
can be captured. Dynamo changes this by enabling partial graph capture, guarded
graph capture, and just-in-time recapture.
Partial graph capture allows the model to include unsupported/non-Python constructs. When a graph cannot be generated for that portion of the model, a graph break is inserted, and the unsupported constructs will be executed in eager mode between the partial graphs (see the sketch after the next item).
Guarded graph capture checks whether the captured graph is valid for execution. A guard is a change that would require recompilation. This is important because running the same code multiple times won't recompile it multiple times.
Just-in-time recapture recaptures the graph if a previously captured graph is no longer valid for execution.
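As a rough illustration of a graph break (the shapes and the data-dependent Python branch here are arbitrary):

```python
import torch

def fn(x):
    x = torch.sin(x) * 2.0     # captured into an FX graph
    if x.sum() > 0:            # data-dependent Python branch -> graph break;
        x = torch.relu(x)      # this region falls back to eager between graphs
    return torch.cos(x)

compiled = torch.compile(fn)
print(compiled(torch.randn(8)))
# torch._dynamo.explain can report how many graphs and graph breaks were produced.
```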
PyTorch’s goal is to create a unified front end with a smooth UX that leverages
Dynamo to generate graphs. The user experience of this solution would be
unchanged, but the performance can be significantly improved. Capturing the graph
means execution can be parallelized more efficiently over a large base of compute
resources.
Dynamo and AOT Autograd then pass the optimized FX graphs to the PyTorch
native compiler level, TorchInductor. Hardware companies can also take this graph
and input it into their own backend compilers.
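A minimal sketch of that hand-off: torch.compile accepts a custom backend callable that receives the captured FX graph (this toy backend just prints the graph and runs it unoptimized):

```python
import torch
from typing import List

def inspect_backend(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    # A hardware vendor's compiler would lower this FX graph to its own IR here.
    gm.graph.print_tabular()
    return gm.forward          # fall back to running the captured graph as-is

def fn(x):
    return torch.relu(x @ x.T) + 1.0

compiled = torch.compile(fn, backend=inspect_backend)
compiled(torch.randn(16, 16))
```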
TorchInductor
TorchInductor is a Python-native deep learning compiler that generates fast code for multiple accelerators and backends. Inductor takes the FX graphs, which have ~250 operators, and lowers them to ~50 operators. Inductor then moves to a
scheduling phase where operators are fused, and memory planning is determined.
Inductor then goes to the “Wrapper Codegen,” which generates code that runs on the
CPU, GPU, or other AI accelerators. The wrapper codegen replaces the interpreter
part of a compiler stack and can call kernels and allocate memory. The backend code
generation portion leverages OpenAI Triton for GPUs and outputs PTX code. For
CPUs, an Intel compiler generates C++ (will work on non-Intel CPUs too).
More hardware will be supported going forward, but the key is that Inductor
dramatically reduces the amount of work a compiler team must do when making a
compiler for their AI hardware accelerator. Furthermore, the code is more optimized
for performance. There are significant reductions in memory bandwidth and capacity
requirements.
OpenAI Triton
OpenAI's Triton is very disruptive to Nvidia's closed-source software moat for machine learning. Triton takes in Python directly or feeds through the PyTorch Inductor stack. The latter will be the most common use case. Triton then converts the input to an LLVM intermediate representation and then generates code. In the case of Nvidia GPUs, it directly generates PTX code, skipping Nvidia's closed-source CUDA libraries, such as cuBLAS, in favor of open-source libraries, such as CUTLASS.
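For context, Triton kernels are written in Python with block-level semantics; the canonical vector-add example looks roughly like this:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # each program handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)   # compiled to PTX on first launch
    return out

# usage (requires a CUDA device):
# a = torch.randn(4096, device="cuda"); b = torch.randn(4096, device="cuda")
# c = add(a, b)
```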
OpenAI Triton only officially supports Nvidia GPUs today, but that is changing in
the near future. Multiple other hardware vendors will be supported in the future, and
this open-source project is gaining incredible steam. The ability for other hardware
accelerators to integrate directly into the LLVM IR that is part of Triton dramatically
reduces the time to build an AI compiler stack for a new piece of hardware.
Nvidia’s colossal software organization lacked the foresight to take their massive
advantage in ML hardware and software and become the default compiler for
machine learning. Their lack of focus on usability is what enabled outsiders at
OpenAI and Meta to create a software stack that is portable to other hardware. Why
aren’t they the one building a « simplified » CUDA like Triton for ML researchers?
Stuff like Flash Attention, why does it come out of Ph.D. students and not Nvidia?
The rest of this report will point out the specific hardware accelerator that has a
huge win at Microsoft, as well as multiple companies’ hardware that is quickly being
integrated into the PyTorch 2.0/OpenAI Triton software stack. Furthermore, it will
share the opposing view as a defense of Nvidia’s moat/strength in the AI training
market.