PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation

Jason Ansel (Meta), Edward Yang (Meta), Horace He (Meta), Natalia Gimelshein (OpenAI), Animesh Jain (Meta), Michael Voznesensky (Meta), Bin Bao (Meta), Peter Bell (Quansight), David Berard (Meta), Evgeni Burovski (Quansight), Geeta Chauhan (Meta), Anjali Chourdia (Meta), Will Constable (Meta), Alban Desmaison (Meta), Zachary DeVito (Meta), Elias Ellison (Meta), Will Feng (Meta), Jiong Gong (Intel), Michael Gschwind (Meta), Brian Hirsh (Meta), Sherlock Huang (Meta), Kshiteej Kalambarkar (Quansight), Laurent Kirsch (Meta), Michael Lazos (Meta), Mario Lezcano (Quansight), Yanbo Liang (Meta), Jason Liang (Meta), Yinghai Lu (Meta), CK Luk (Meta), Bert Maher (Meta), Yunjie Pan (University of Michigan), Christian Puhrsch (Meta), Matthias Reso (Meta), Mark Saroum (Meta), Marcos Yukio Siraichi (Quansight), Helen Suk (Meta), Michael Suo (Meta), Phil Tillet (OpenAI), Eikan Wang (Intel), Xiaodong Wang (Meta), William Wen (Meta), Shunting Zhang (Meta), Xu Zhao (Meta), Keren Zhou (OpenAI / George Mason University), Richard Zou (Meta), Ajit Mathews (Meta), Gregory Chanan (Meta), Peng Wu (Meta), Soumith Chintala (Meta)
Abstract

This paper introduces two extensions to the popular PyTorch machine learning framework, TorchDynamo and TorchInductor, which implement the torch.compile feature released in PyTorch 2. TorchDynamo is a Python-level just-in-time (JIT) compiler that enables graph compilation in PyTorch programs without sacrificing the flexibility of Python. It achieves this by dynamically modifying Python bytecode before execution and extracting sequences of PyTorch operations into an FX graph, which is then JIT compiled using one of many extensible backends. TorchInductor is the default compiler backend for TorchDynamo, which translates PyTorch programs into OpenAI's Triton for GPUs and C++ for CPUs. Results show that TorchDynamo is able to capture graphs more robustly than prior approaches while adding minimal overhead, and TorchInductor is able to provide a 2.27× inference and 1.41× training geometric mean speedup on an NVIDIA A100 GPU across 180+ real-world models, which outperforms six other compilers. These extensions provide a new way to apply optimizations through compilers in eager mode frameworks like PyTorch.

ACM Reference Format:
Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael Lazos, Mario Lezcano, Yanbo Liang, Jason Liang, Yinghai Lu, CK Luk, Bert Maher, Yunjie Pan, Christian Puhrsch, Matthias Reso, Mark Saroum, Marcos Yukio Siraichi, Helen Suk, Michael Suo, Phil Tillet, Eikan Wang, Xiaodong Wang, William Wen, Shunting Zhang, Xu Zhao, Keren Zhou, Richard Zou, Ajit Mathews, Gregory Chanan, Peng Wu, and Soumith Chintala. 2024. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '24), April 27-May 1, 2024, La Jolla, CA, USA. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3620665.3640366

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ASPLOS '24, April 27-May 1, 2024, La Jolla, CA, USA
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0385-0/24/04
https://doi.org/10.1145/3620665.3640366
1 Introduction

Modern machine learning frameworks can be divided into eager mode frameworks, such as PyTorch [32] and JAX [8], and graph mode frameworks, such as TensorFlow [1], Caffe [25], Theano [5], and CNTK [37]. Eager mode frameworks use an imperative define-by-run [47] approach, where a machine learning model is represented as code that is executed each time one wants to run the model. Graph mode frameworks take a more declarative define-and-run [47] approach, where they expose a graph building API that requires users to first construct a graph and then later execute that graph.

Users of machine learning frameworks, and especially researchers, have shown an overwhelming preference for the eager programming model [22]. The eager mode model is easier to understand and can be debugged using standard tools such as print and pdb in Python [23]. This user preference towards eager mode has caused traditionally graph mode frameworks to switch to eager mode programming models [4].
The downside of eager mode frameworks is that they make it harder to apply graph-level optimizations through compilers. The framework only has visibility of a single operator at a time, and thus cannot automatically perform optimizations, like fusion or scheduling, that cross operator boundaries. To address this, there have been attempts to allow graph capture in PyTorch through record/replay [17, 34], Python parsing [17], and lazy evaluation [39]. Unfortunately, these approaches have sacrificed much of the usability that draws users to PyTorch. Record/replay is unsound and can produce incorrect behavior [17]. Python parsing works for simple programs, but has not been able to replicate the complex semantics of all of Python, so results will show it fails on over half of real-world models. Lazy evaluation incurs high run-time overheads and adds latency to kernel launches. Additionally, an exclusively graph mode backend for PyTorch is intractable for some models. Due to the flexibility provided by PyTorch, many model authors take advantage of features that do not easily map to graphs, such as: dictionaries, lists, custom classes, third party libraries (numpy, logging, etc), disk/network, multiprocessing, exceptions, and handwritten kernels.

This paper presents two open source extensions to PyTorch: TorchDynamo and TorchInductor. These extensions are behind the torch.compile feature introduced in PyTorch 2 and officially released in March 2023. TorchDynamo is a Python-level JIT compiler designed to allow graph compilation in PyTorch programs while retaining the full flexibility of Python. TorchDynamo hooks into the Python frame evaluation API [9] in CPython to dynamically modify Python bytecode right before it is executed. It rewrites Python bytecode in order to extract sequences of PyTorch operations into an FX graph [34], which is then just-in-time compiled with many extensible backends. It creates this FX graph through bytecode analysis and is designed to generate smaller graph fragments that can be mixed with Python execution to get the best of both worlds: usability and performance.
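In user code, these extensions surface through the torch.compile API as a one-line change. The snippet below is a minimal sketch of ours; the function body and input are illustrative, not taken from the paper:

import torch

@torch.compile  # TorchDynamo rewrites this function's bytecode on first call
def fn(x):
    y = torch.sin(x) ** 2 + torch.cos(x) ** 2
    print("logging runs in regular Python")  # not representable in a graph: causes a graph break
    return y

fn(torch.randn(16))  # first call compiles; later calls reuse the cached compiled code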
TorchInductor is a new compiler backend for TorchDynamo. It translates PyTorch programs into OpenAI's Triton [46] for GPUs and C++/OpenMP [15] for CPUs. TorchInductor is able to support the flexibility and dynamism of PyTorch by using similar abstractions to PyTorch eager mode. It introduces a new define-by-run loop-level intermediate representation (IR) to make it easy to add new operator lowerings. Additionally, it is implemented in Python, so it is easy for PyTorch users to extend and modify to meet their needs.
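Backend selection is a single argument; a small sketch of ours (TorchInductor, registered as "inductor", is already the default, so naming it explicitly is purely illustrative):

import torch

def mm_relu(a, b):
    return torch.relu(a @ b)

# "inductor" is the default backend; on CPU it emits C++/OpenMP, on GPU Triton
compiled = torch.compile(mm_relu, backend="inductor")
compiled(torch.randn(64, 64), torch.randn(64, 64))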
Experimental results show that TorchDynamo is able to capture graphs more robustly than prior approaches while adding minimal overhead. TorchDynamo is able to capture a single whole-program graph for most models and can gracefully fall back to partial graphs when needed. Measurements show TorchInductor produces faster code on average than six other PyTorch compiler backends. Performance comparisons include both training and inference, CPU and GPU, float32 and float16, and three large benchmark suites containing 180+ full-sized models taken from real-world applications.
2 Prior Attempts at PyTorch Graph Capture

Graph capture in PyTorch presents unique challenges when compared to graph mode frameworks [1, 25, 5, 37], where the user is restricted to only using constructs that are representable in the graph. With PyTorch and other eager mode frameworks, the user is free to embed arbitrary code, including non-PyTorch libraries, inside their models. This results in frequent conversion from PyTorch Tensors to Python types (via .item(), .tolist(), etc), usage of external libraries (numpy, logging, etc), and usage of Python constructs (classes, closures, exceptions, control flow, etc) that do not map well to a fixed graph abstraction. Due to this mismatch between the flexibility provided by Python/PyTorch and the inflexibility of graph representations, prior attempts at graph capture in PyTorch have needed to place restrictions on the user experience. While this tension between flexibility and representation is solved by TorchDynamo, we examine prior art in the space to provide context and background.

2.1 torch.jit.trace

torch.jit.trace uses record/replay with example inputs to produce a TorchScript [17] graph. The recording is done at the PyTorch dispatcher level, which is inside the C++ portion of PyTorch and is used to dispatch operators to device-specific kernels and for autograd. Because the recording is done in C++, torch.jit.trace does not capture any control flow in Python. Consider this example:
def example1(x):
    if len(torch.nonzero(x)) > 1:
        return x + 1
    return x - 1

With example input torch.tensor([0, 0]), torch.jit.trace would capture a graph equivalent to:

def example1_incorrect_capture(x):
    torch.nonzero(x)
    return x - 1
Since the path through the program is specialized on the example input, a different input (such as torch.tensor([1,1])) will give incorrect results. Additionally, any non-PyTorch operators (such as external libraries, prints, logging, side effects, etc.) will be omitted from the captured graph.
2.2 torch.jit.script

torch.jit.script also constructs a TorchScript [17] graph, but does so by parsing the Python AST and performing static analysis. It is able to capture example1 above correctly and, unlike torch.jit.trace, it is a sound approach that should not produce incorrect results.
The major challenge torch.jit.script faces is that it is trying to reimplement all of Python as a static language. This approach is all or nothing: encountering an unimplemented component of Python makes the entire program unfit for capture. Emulating all of Python statically is a daunting task and, in practice, torch.jit.script only supports a subset of Python. Experimental results show that torch.jit.script works only about half the time on real-world models in the TorchBench benchmark suite, and anecdotally we have heard stories of it taking weeks or months to "torchscript" large models, which leads to a frustrating user experience.
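For contrast, a minimal sketch of the sound path, again reusing example1 from above; the commented outputs are our own reasoning from the code:

import torch

scripted = torch.jit.script(example1)  # parses the AST, so both branches are kept
scripted(torch.tensor([0, 0]))  # tensor([-1, -1]): the x - 1 branch
scripted(torch.tensor([1, 1]))  # tensor([2, 2]): the x + 1 branch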
2.3 Lazy Tensors

Lazy Tensors were introduced in the PyTorch/XLA [42, 39] project, which is primarily focused on supporting Google TPUs [26] with PyTorch. Lazy Tensors is a C++ level graph capture technology. Every iteration, it defers execution of operations to accumulate a graph and then sends the accumulated graph to the XLA [45] compiler. By hashing this graph, Lazy Tensors can avoid recompilation when the captured graph is identical across iterations. While this approach is effective and sound, it has a few major downsides:

• Higher overheads: Lazy Tensors incurs additional work when compared to PyTorch eager. Besides running the same Python code and PyTorch dispatcher stack that eager does, it must maintain additional graph data structures that incur added runtime costs.

• Introduced delays: PyTorch eager issues the first kernel on the first operation of the model, after which point host-side code is run in parallel with kernels on the GPU or accelerator, thus hiding overheads. In contrast, Lazy Tensors doesn't issue the first kernel until the model's code has finished executing, resulting in added delays before the first kernel is issued and after any operation that requires a round trip to the CPU (which are common in real-world models). Thus, Lazy Tensors often serializes host execution with GPU/accelerator utilization, which amplifies host side overheads. Models, loss logging, and optimizers need to be modified to work around this issue.

• Recompilation: Whenever the captured graph has a new hash, Lazy Tensors must recompile. This can lead to some pathological cases where recompilation happens frequently.

The PyTorch/XLA project has built [10] an integration with TorchDynamo which uses a hybrid of both Lazy Tensors and TorchDynamo. This integration hides the overheads of Lazy Tensors by only running Lazy Tensors once, rather than every iteration, and using TorchDynamo to figure out when recapture is needed. The PyTorch/XLA results later in the paper use this integration.
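The deferred-execution model is visible in the PyTorch/XLA API. A minimal sketch, assuming the torch_xla package and an XLA device such as a TPU are available:

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(4, 4, device=device)
y = torch.relu(x @ x)  # nothing executes yet; the operations accumulate in a graph
xm.mark_step()         # hashes the graph, compiles it via XLA if unseen, and executes it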
2.4 torch.fx.symbolic_trace

torch.fx.symbolic_trace [34] is the newest of these systems and introduced the FX graph format that is shared by TorchDynamo. It takes a similar record/replay-based approach to torch.jit.trace, but does its tracing at the Python level as opposed to at the PyTorch C++ dispatcher level. It runs the