support inplace buffer reuse #1486
Conversation
The approach seems right, but the test failures look real.
This reverts commit 33f9c61.
…rchdynamo into jgong5/inplace_buffers
This is weird. Can you give our minifier a try? https://github.com/pytorch/torchdynamo/blob/main/documentation/TROUBLESHOOTING.md, torchdynamo/torchdynamo/config.py line 121 in 8c9f11c
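For reference, a minimal sketch of turning the minifier on before re-running the failing script. The knob names below (`repro_after`, `repro_level`, and the `TORCHDYNAMO_REPRO_AFTER` environment variable) are assumptions based on the TROUBLESHOOTING.md document linked above and may differ at commit 8c9f11c:

```python
# Hedged sketch: enable torchdynamo's minifier before running the failing script.
# The exact config names are taken from documentation/TROUBLESHOOTING.md and should
# be treated as assumptions, not the definitive API at this commit.
import torchdynamo.config

torchdynamo.config.repro_after = "aot"   # dump a repro once the AOT/Inductor stage fails
torchdynamo.config.repro_level = 2       # 2: minify the failing graph; 4: also check accuracy

# Equivalently via environment variables (same caveat):
#   TORCHDYNAMO_REPRO_AFTER=aot TORCHDYNAMO_REPRO_LEVEL=2 python my_failing_script.py
```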
Thanks a lot for the tip, will give it a try! BTW, it seems trunk is broken? I got something like an argument-count mismatch in all CUDA-related tests...
@desertfire Can this troubleshoot accuracy issues, or does it only handle compilation/crash errors?
Requesting changes to take this off my review queue; re-request review when tests pass.
Thanks @desertfire @soumith for the hints on the minifier tool. It helped generate a repro, though not a small one, since the issue only happened on the first clean-cache run. I managed to narrow it down manually to a smaller repro: #1670. It seems to be an issue inside the Triton compiler, or in the way TorchInductor integrates Triton? @jansel
@jansel Are you ok to land this PR with `inplace_buffers` turned off by default?
Yes, sounds good to me.
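Since the feature lands disabled by default, a user who wants to try it would flip the config flag before compiling. A minimal opt-in sketch; the module paths (`torchdynamo`, `torchinductor`) are assumptions for the standalone repo at this point in time, and only the `inplace_buffers` flag itself comes from this PR:

```python
# Hedged sketch: opt in to in-place buffer reuse even though it is off by default.
# Module paths are assumptions; only config.inplace_buffers is named by this PR.
import torch
import torchdynamo
import torchinductor.config

torchinductor.config.inplace_buffers = True  # the feature flag this PR enables

@torchdynamo.optimize("inductor")
def f(x):
    # chained elementwise ops give the scheduler a chance to reuse buffers in place
    return (x + 1).relu() * 2

print(f(torch.randn(8, 8)))
```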
Enable the `inplace_buffers = True` feature. This addresses issue #823 and the second optimization mentioned in #1401. It is turned off by default for now due to #1670; currently, only one CI case (gmixer_24_224 training on CUDA) would fail because of that issue.

This PR makes the following fixes:
- Maintain `other_names` in `make_inplace` (common.py) to support a chain of inplace reuses.
- Update `make_buffer_reuse` (wrapper.py) accordingly.
- Add `V.graph.inplaced_to_remove`. For inplace buffers subject to removal, we don't actually remove them but put them in this dedicated set, which simplifies the life-cycle management of inplace buffers. The set is used to 1) avoid unnecessary stores in DeferredLine and 2) avoid alias var definitions in the kernel (graph.py/scheduler.py). A simplified sketch of this bookkeeping follows the list.
- Skip inplace reuse for `ExternKernelAlloc` and `MultiOutput` (wrapper.py) since these are pre-allocated.
- Skip inplace reuse in `allocate` (scheduler.py) for buffers with `(ir.MultiOutputLayout, ir.MutationLayout, ir.AliasedLayout)` layouts. This is a conservative choice and can be revisited later if we can still inplace-reuse buffers with some of these layouts.
- Update `available_buffer_names` in `codegen` (scheduler.py) to make sure `remaining_uses` is correct in `allocate` (scheduler.py).
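To make the bookkeeping in the list above concrete, here is a simplified, self-contained sketch. The names mirror the PR (`other_names`, `inplaced_to_remove`), but the classes and methods are illustrative only, not TorchInductor's actual code: a buffer reused in place keeps a chain of alias names, and buffers that would otherwise be removed are instead recorded in a dedicated set so later codegen stages can skip stores and alias definitions for them.

```python
# Simplified, hypothetical model of the bookkeeping described above; the classes
# and methods are illustrative, not TorchInductor's actual API.

class InplaceBuffer:
    def __init__(self, inner_name):
        self.inner_name = inner_name      # the physical allocation used for codegen
        self.other_names = []             # every logical buffer that aliases it

class FakeGraph:
    def __init__(self):
        self.inplaced = {}                # logical name -> InplaceBuffer
        self.inplaced_to_remove = set()   # names whose stores/aliases can be skipped

    def make_inplace(self, src_name, dst_name):
        """Reuse src's storage for dst, supporting chains like buf0 -> buf1 -> buf2."""
        buf = self.inplaced.get(src_name, InplaceBuffer(src_name))
        buf.other_names.append(dst_name)
        self.inplaced[src_name] = buf
        self.inplaced[dst_name] = buf     # dst joins the same reuse chain
        # Instead of deleting src outright, remember that it has been subsumed:
        self.inplaced_to_remove.add(src_name)

    def should_emit_store(self, name):
        # roughly what a DeferredLine-style check could look like
        return name not in self.inplaced_to_remove


graph = FakeGraph()
graph.make_inplace("buf0", "buf1")   # buf1 reuses buf0's storage
graph.make_inplace("buf1", "buf2")   # chain: buf0 -> buf1 -> buf2
print(graph.inplaced["buf2"].inner_name)   # buf0: the whole chain shares one allocation
print(graph.should_emit_store("buf1"))     # False: the store can be skipped
```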