
support inplace buffer reuse #1486


Merged (40 commits) on Oct 15, 2022

Conversation

@jgong5 commented Oct 5, 2022

Enable the inplace_buffers = True feature. This addresses issue #823 and the second optimization mentioned in #1401. It is turned off by default for now due to #1670; currently only one CI case, gmixer_24_224 training on CUDA, fails because of that issue.
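
For anyone who wants to experiment once this lands, here is a minimal sketch of flipping the flag and compiling a small function so the scheduler can attempt inplace reuse. The torchinductor.config path and the torchdynamo.optimize entry point follow the repo layout of this era and are assumptions here; only the inplace_buffers flag name comes from this PR.

```python
# Sketch only (not part of this PR's diff): enable inplace buffer reuse
# and compile a tiny function. Assumes the torchdynamo-era layout where
# TorchInductor's config lives in torchinductor.config.
import torch
import torchdynamo
import torchinductor.config

torchinductor.config.inplace_buffers = True  # left off by default by this PR

@torchdynamo.optimize("inductor")
def f(x, y):
    # a chain of pointwise ops gives the scheduler intermediate buffers
    # whose storage can potentially be reused in place
    a = x + y
    b = a * 2.0
    return b.relu()

f(torch.randn(1024), torch.randn(1024))
```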

This PR makes the following fixes:

  1. Support more than two other_names in make_inplace (common.py) to allow a chain of inplace reuses.
  2. Do not delete a buffer in make_buffer_reuse (wrapper.py) if it is a graph output.
  3. Add a new field V.graph.inplaced_to_remove. Inplace buffers subject to removal are not actually removed but are placed in this dedicated set, which simplifies the life-cycle management of inplace buffers. The set is used to 1) avoid unnecessary stores in DeferredLine and 2) avoid alias variable definitions in the kernel (graph.py/scheduler.py).
  4. Do not allocate buffers for ExternKernelAlloc and MultiOutput (wrapper.py) since these are pre-allocated.
  5. Make sure the reuse key matches on inplace buffer reuse in allocate (scheduler.py).
  6. Do not inplace-reuse buffers with the layouts ir.MultiOutputLayout, ir.MutationLayout, or ir.AliasedLayout. This is a conservative choice and can be revisited later if inplace reuse turns out to be safe for some of these layouts.
  7. Add kernel output buffers to available_buffer_names in codegen (scheduler.py) to make sure remaining_uses is correct in allocate (scheduler.py).

@jgong5 jgong5 marked this pull request as draft October 5, 2022 06:38
@jgong5 jgong5 marked this pull request as ready for review October 6, 2022 15:39
@jgong5 jgong5 requested review from jansel and EikanWang October 6, 2022 15:39
@jansel (Contributor) left a comment

The approach seems right, but the test failures look real.

@jgong5 jgong5 marked this pull request as draft October 9, 2022 03:58
@desertfire (Contributor)

Ah, I managed to repro it locally. It seems to only happen on a clean run without the code cache; if I run it a second time, the test passes with the cached code.

This is weird. Can you give our minifier a try? See https://github.com/pytorch/torchdynamo/blob/main/documentation/TROUBLESHOOTING.md and

repro_after = os.environ.get("TORCHDYNAMO_REPRO_AFTER", None)

If the minifier works smoothly, you should be able to get a smaller reproducible graph, which makes debugging much easier.
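
For reference, a hedged sketch of wiring the minifier into a driver script; the choice of "aot" here (vs "dynamo") and the stand-in function are assumptions, so follow the linked TROUBLESHOOTING.md for the authoritative instructions.

```python
# Sketch: point TorchDynamo's minifier at the failing run.
# The env var must be set before torchdynamo is imported, since config.py
# reads TORCHDYNAMO_REPRO_AFTER at import time (see the line quoted above).
import os
os.environ["TORCHDYNAMO_REPRO_AFTER"] = "aot"

import torch
import torchdynamo

@torchdynamo.optimize("inductor")
def failing_fn(x):
    # stand-in for the graph that fails on the first clean-cache run
    return (x + 1.0).relu()

failing_fn(torch.randn(8))
# On a compile failure the minifier dumps a standalone repro script to disk
# (file name/location varies by version) that can be re-run and shrunk.
```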

@jgong5 (Author) commented Oct 12, 2022

If the minifier works smoothly, you should be able to get a smaller reproducible graph, which makes debugging much easier.

Thanks a lot for the tip, I will give it a try! BTW, it seems trunk is broken? I got something like an argument-count mismatch in all CUDA-related tests...

@soumith (Member) commented Oct 12, 2022

@jgong5 I believe that is because main now requires a newer Triton: #1585. Please upgrade your Triton to the commit outlined there and you should be good.

@jgong5 (Author) commented Oct 12, 2022

If the minifier works smoothly, you should be able to get a smaller reproducible graph, which makes debugging much easier.

@desertfire Can this troubleshoot accuracy issues, or does it only handle compilation/crash errors?

@soumith (Member) commented Oct 12, 2022

@jgong5 it can minify accuracy issues. This feature was introduced in this PR: #1242
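
A hedged sketch of what the accuracy-minification setup along the lines of #1242 roughly looks like; the TORCHDYNAMO_REPRO_LEVEL=4 value (bisect on accuracy rather than crashes) is taken from the troubleshooting doc, and the stand-in model is made up, so double-check against your version.

```python
# Sketch: minify a wrong-result (accuracy) failure rather than a crash.
# Env vars are read at import time, so set them before importing torchdynamo;
# REPRO_LEVEL=4 asks the minifier to bisect on divergence from eager mode.
import os
os.environ["TORCHDYNAMO_REPRO_AFTER"] = "aot"
os.environ["TORCHDYNAMO_REPRO_LEVEL"] = "4"

import torch
import torchdynamo

@torchdynamo.optimize("inductor")
def model_under_test(x):
    # stand-in for the gmixer_24_224 forward pass; any graph whose compiled
    # output diverges from eager would trigger the accuracy minifier here
    return torch.nn.functional.gelu(x @ x.t())

model_under_test(torch.randn(64, 64))
```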

@jgong5 jgong5 marked this pull request as draft October 13, 2022 00:58
@jansel (Contributor) left a comment

Requesting changes to take this off my review queue; re-request review when tests pass.

@jgong5 (Author) commented Oct 14, 2022

Thanks @desertfire @soumith for the hints on the minifier tool. It helped generate a repro, though not a small one, since the issue only happens on the first clean-cache run. I managed to narrow it down manually to a smaller repro: #1670. It seems to be an issue inside the Triton compiler, or in the way TorchInductor integrates Triton? @jansel

@jgong5 jgong5 marked this pull request as ready for review October 14, 2022 23:18
@jgong5 (Author) commented Oct 14, 2022

@jansel Are you OK with landing this PR with inplace_buffers set to False first? If it is turned on with the same PR change, only gmixer_24_224 training fails the accuracy test, for the reason I explained in the description. I don't think the failure is caused by this change, but I will continue to look for the root cause. Landing it first allows others to experiment with inplace_buffers if needed. If you are OK with that, the PR is in good shape for review.

@jgong5 jgong5 requested a review from jansel October 14, 2022 23:50
@jansel (Contributor) commented Oct 15, 2022

Yes, sounds good to me.
